Observability
Intro
Observability is the ability to understand a system's internal state from its external outputs: metrics, logs, and traces. In distributed systems, failures are emergent, cross service boundaries, and rarely show up as a single obvious exception, so observability is how you move from symptoms to causes quickly. You cannot fix what you cannot see, and you cannot scale what you cannot measure. Reach for observability from day one: retrofitting it after incidents and growth is significantly harder because the missing telemetry was never emitted.
The Three Pillars
The three pillars are complementary signals, not competing tools.
Metrics
Metrics are numeric measurements over time that answer "how much" and "how often".
- Counter: cumulative value that only goes up (for example total requests, total errors).
- Gauge: current point-in-time value that can go up or down (for example queue length, active connections).
- Histogram: distribution of observed values into buckets (for example request latency) so your backend can estimate percentiles like p50, p95, and p99.
For service-level health, use the RED method:
- Rate: requests per second.
- Errors: failed requests per second or error percentage.
- Duration: latency distribution (not only average; percentiles matter — 99 requests at 10 ms plus one at 2 s average about 30 ms, yet p99 sits near 2 s).
For resource-level health, use the USE method:
- Utilization: how busy a resource is.
- Saturation: queued work / backlog (pressure).
- Errors: resource-specific failures.
Core interview metrics you should always name for APIs:
- Request rate
- Error rate
- Latency p50/p95/p99
- Saturation signals (CPU, thread pool queue, DB pool exhaustion)
Logs
Logs are structured event records that answer "what exactly happened" at a point in time.
- Unstructured logs (free text) are easy to write but hard to query.
- Structured logs (usually JSON with named properties) are queryable and aggregation-friendly.
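The difference shows up directly in .NET's `ILogger` message templates. A sketch (`orderId` and `tenantId` are hypothetical locals):

```csharp
// Unstructured: interpolation bakes values into free text, so operators
// can only grep for substrings.
logger.LogInformation($"Order {orderId} failed for tenant {tenantId}");

// Structured: named template properties become queryable fields
// (OrderId, TenantId) in any structured log backend.
logger.LogInformation("Order {OrderId} failed for tenant {TenantId}", orderId, tenantId);
```

The second form also deduplicates events by template, so "all failures of this kind" is a single query rather than a regex.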
Use log levels intentionally:
- Trace (Verbose in Serilog): very detailed diagnostics, usually disabled in production.
- Debug: development diagnostics.
- Information: normal business flow (request started/completed, key state transitions).
- Warning: degraded but recoverable behavior.
- Error: failed operation.
- Critical (Fatal in Serilog): process/service cannot continue safely.
In distributed systems, correlation IDs are essential in practice: every service should log the same request identifier so operators can reconstruct one end-to-end user request across many log streams.
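A minimal middleware sketch for this, assuming the conventional (non-standard) X-Correlation-ID header name:

```csharp
// Sketch of correlation-ID propagation in ASP.NET Core; the header name
// is a convention, not a standard -- any stable header works.
app.Use(async (context, next) =>
{
    const string HeaderName = "X-Correlation-ID";

    // Accept the incoming ID, or mint one at the edge of the system.
    var correlationId = context.Request.Headers.TryGetValue(HeaderName, out var incoming)
        ? incoming.ToString()
        : Guid.NewGuid().ToString("N");

    // Echo it back so clients can report it, and stash it for the rest of
    // the pipeline so outbound calls and log scopes can attach it.
    context.Response.Headers[HeaderName] = correlationId;
    context.Items[HeaderName] = correlationId;

    await next(context);
});
```

In OpenTelemetry-instrumented services the W3C trace id often plays this role instead, but an explicit correlation ID survives even when tracing is sampled away.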
Traces
Traces represent a single request journey across services and dependencies.
- A trace is the full end-to-end operation.
- A span is one timed unit of work within that trace.
- Parent-child span relationships encode causal flow between components.
- Trace context propagation (traceparent) carries trace id, parent span id, and trace flags across HTTP/gRPC boundaries; tracestate can carry vendor-specific context.
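A concrete traceparent header, using the example value from the W3C Trace Context specification (fields are version, trace id, parent span id, and flags):

```text
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
```

The trailing `01` flag marks the request as sampled, which is how a sampling decision made upstream propagates to downstream services.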
Distributed tracing reconstructs the critical path of a request so you can answer where latency is introduced, where errors originate, and which dependency is responsible.
sequenceDiagram
participant C as Client
participant A as API Gateway
participant B as Orders Service
participant D as Inventory Service
C->>A: HTTP request with trace id
activate A
A->>B: gRPC call child span
activate B
B->>D: HTTP call child span
activate D
D-->>B: stock result
deactivate D
B-->>A: order result
deactivate B
A-->>C: response
deactivate A
OpenTelemetry in Practice
OpenTelemetry is the vendor-neutral standard for telemetry instrumentation and export.
- API + SDK: consistent model to create metrics, logs, and traces.
- Instrumentation packages in .NET cover common libraries (ASP.NET Core, HttpClient, gRPC, EF Core, runtime/process metrics); runtime auto-instrumentation is a separate deployment model.
- Exporters let you route the same telemetry to different backends (for example Jaeger, Prometheus, Datadog, Azure Monitor) without changing application code.
Minimal .NET tracing and metrics setup:
using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;
var builder = WebApplication.CreateBuilder(args);
builder.Services
.AddOpenTelemetry()
.ConfigureResource(resource => resource.AddService("checkout-api"))
.WithTracing(tracing => tracing
.AddAspNetCoreInstrumentation()
.AddHttpClientInstrumentation()
.AddGrpcClientInstrumentation()
.AddEntityFrameworkCoreInstrumentation()
.AddSource("Checkout.Api")
.AddOtlpExporter())
.WithMetrics(metrics => metrics
.AddAspNetCoreInstrumentation()
.AddRuntimeInstrumentation()
.AddHttpClientInstrumentation()
.AddMeter("Checkout.Api")
.AddPrometheusExporter());
The key senior signal in interviews: instrument from day one with standard telemetry contracts, then choose backends based on team and platform constraints.
Prometheus export also needs a mapped scrape endpoint in ASP.NET Core:
var app = builder.Build();
app.MapPrometheusScrapingEndpoint();
If MapPrometheusScrapingEndpoint() is unavailable in your package version, use middleware instead:
app.UseOpenTelemetryPrometheusScrapingEndpoint();
Prometheus ASP.NET Core exporter support can be version-sensitive and may require prerelease packages. The exporter documentation recommends considering OTLP export for production scenarios, so teams often route metrics to an OpenTelemetry Collector and expose them to Prometheus there.
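A minimal Collector pipeline for that pattern might look like the following sketch (the 9464 port and endpoint values are example choices, not requirements):

```yaml
# Sketch: app sends OTLP to the Collector; Prometheus scrapes the Collector.
receivers:
  otlp:
    protocols:
      grpc:

exporters:
  prometheus:
    endpoint: "0.0.0.0:9464"   # scrape target for Prometheus

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```

This keeps the application on one stable contract (OTLP) while the scraping concern lives in infrastructure.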
.NET Implementation Patterns
Custom Metrics with Meter API
using System.Diagnostics;
using System.Diagnostics.Metrics;
public static class CheckoutMetrics
{
    private static readonly Meter Meter = new("Checkout.Api", "1.0.0");

    public static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders_created_total");

    public static readonly Histogram<double> CheckoutLatencyMs =
        Meter.CreateHistogram<double>("checkout_latency_ms");
}

public IResult CreateOrder(OrderRequest request)
{
    var startedAt = Stopwatch.GetTimestamp();
    // Business logic...
    CheckoutMetrics.OrdersCreated.Add(1, new KeyValuePair<string, object?>("tenant", request.TenantId));
    var elapsedMs = Stopwatch.GetElapsedTime(startedAt).TotalMilliseconds;
    CheckoutMetrics.CheckoutLatencyMs.Record(elapsedMs);
    return Results.Ok();
}
Custom Tracing with ActivitySource
using System.Diagnostics;
public static class Telemetry
{
public static readonly ActivitySource ActivitySource = new("Checkout.Api");
}
public async Task ReserveInventoryAsync(string sku, int quantity)
{
using var activity = Telemetry.ActivitySource.StartActivity("ReserveInventory");
activity?.SetTag("inventory.sku", sku);
activity?.SetTag("inventory.quantity", quantity);
await _inventoryClient.ReserveAsync(sku, quantity);
}
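Spans should also carry failure state so errors surface in the trace backend. A sketch, reusing the same Telemetry.ActivitySource (exception-recording helpers vary by OpenTelemetry package version, so only the built-in status API is shown):

```csharp
// Sketch: marking a span as failed when the dependency call throws.
// ActivityStatusCode.Error is what trace backends render as a failed span.
public async Task ReserveInventoryAsync(string sku, int quantity)
{
    using var activity = Telemetry.ActivitySource.StartActivity("ReserveInventory");
    activity?.SetTag("inventory.sku", sku);
    try
    {
        await _inventoryClient.ReserveAsync(sku, quantity);
    }
    catch (Exception ex)
    {
        activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
        throw;
    }
}
```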
Structured Logging with Serilog
using Serilog;
using Serilog.Formatting.Compact;
builder.Host.UseSerilog((_, config) => config
.Enrich.FromLogContext()
.Enrich.WithProperty("service", "checkout-api")
.WriteTo.Console(new RenderedCompactJsonFormatter()));
app.UseSerilogRequestLogging();
app.MapPost("/checkout", (CheckoutRequest request, ILogger<Program> logger) =>
{
logger.LogInformation(
"Checkout started for {CustomerId} with {ItemCount} items",
request.CustomerId,
request.Items.Count);
return Results.Accepted();
});
Example JSON event shape emitted by structured logging:
{
"@t": "2026-02-28T12:30:45.1234567Z",
"@i": "f2a8a4c1",
"@m": "Checkout started for 42 with 3 items",
"CustomerId": 42,
"ItemCount": 3
}
Pitfalls
Logging Everything or Logging Nothing
Logging every payload and every debug event explodes storage and query cost; logging almost nothing leaves teams blind during incidents. Use strategic sampling and retain high-value structured events at 100% while sampling noisy verbose events.
Unstructured Logs You Cannot Query
Free-form text logs block fast incident response because operators cannot reliably filter by tenant, endpoint, or correlation key. Prefer structured logs with stable property names and consistent schema across services.
Missing Correlation IDs Across Services
Without propagated trace and correlation IDs, each service log appears correct in isolation but impossible to stitch into one request narrative. Ensure incoming IDs are accepted, propagated, and included in all logs and spans.
Alert Fatigue from Noisy Metrics
If thresholds are too sensitive or static, teams get constant false positives and start ignoring alerts. Define SLO-based thresholds, use burn-rate style alerting where possible, and segment alerts by service criticality.
Tradeoffs
- High-cardinality labels on metrics
- Benefit: better drill-down during incident analysis.
- Cost or risk: expensive storage and slower queries.
- Decision rule: keep labels bounded; move highly variable identifiers to logs and traces.
- 100% trace sampling
- Benefit: full forensic visibility.
- Cost or risk: high ingest and storage cost.
- Decision rule: use sampling by default and increase sampling during incidents.
- Long log retention
- Benefit: better historical investigations.
- Cost or risk: compliance and storage burden.
- Decision rule: keep raw logs for short retention and archive aggregates longer.
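The trace-sampling decision rule above can be sketched with the OpenTelemetry .NET SDK's built-in samplers (the 0.10 ratio is an example value, not a recommendation):

```csharp
// Sketch: parent-based head sampling at 10%. ParentBasedSampler honors the
// upstream sampling decision from traceparent; TraceIdRatioBasedSampler
// decides for new root traces.
builder.Services.AddOpenTelemetry()
    .WithTracing(tracing => tracing
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());
```

Tail-based sampling (keeping every trace that errored or was slow) needs a Collector-side processor, since the decision depends on the finished trace.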
Questions
How would you diagnose a p95/p99 latency regression in a distributed system?
- Start from traces to find which span dominates p95/p99 latency on slow requests.
- Correlate with RED metrics per service to confirm whether latency is isolated or systemic.
- Check dependency spans (DB, cache, external API) to identify downstream propagation.
- Use structured logs with the same trace ID to validate edge-case payloads or retries.
- Why this answer is strong: it uses all three pillars together instead of guessing from one dashboard.
When do you apply the RED method versus the USE method?
- Use RED for customer-facing service endpoints and request pipelines.
- Use USE for underlying resources such as CPU, thread pools, queue depth, and DB connections.
- Combine both: RED shows symptom at service boundary; USE explains pressure source beneath it.
- Why this answer is strong: it separates service health from resource health and links them causally.
Why instrument from day one instead of retrofitting observability later?
- Early instrumentation bakes telemetry into contracts and code paths before scale complexity appears.
- Retrofitting often misses historical baselines and requires invasive code changes across many services.
- Vendor-neutral instrumentation preserves backend flexibility as platform needs evolve.
- Why this answer is strong: it explains operational risk, engineering cost, and architecture flexibility.
References
- OpenTelemetry for .NET (official docs)
- ASP.NET Core observability example with OpenTelemetry and Prometheus (Microsoft Learn)
- W3C Trace Context (traceparent and tracestate)
- Prometheus ASP.NET Core exporter README (OpenTelemetry .NET)
- Google SRE Book: Monitoring Distributed Systems (practitioner)