Observability

Intro

Observability is the ability to understand a system's internal state from its external outputs: metrics, logs, and traces. In distributed systems, failures are emergent, cross service boundaries, and rarely show up as a single obvious exception, so observability is how you move from symptoms to causes quickly. You cannot fix what you cannot see, and you cannot scale what you cannot measure. Reach for observability from day one: retrofitting it after incidents and growth is significantly harder because the missing telemetry was never emitted.

The Three Pillars

The three pillars are complementary signals, not competing tools.

Metrics

Metrics are numeric measurements over time that answer "how much" and "how often".

For service-level health, use the RED method:

- Rate: requests per second the service is handling.
- Errors: the number or percentage of requests that fail.
- Duration: how long requests take, tracked as percentiles (p50, p95, p99), not averages.

For resource-level health, use the USE method:

- Utilization: how busy the resource is (for example, CPU at 85%).
- Saturation: how much work is queued beyond what the resource can handle.
- Errors: error events on the resource itself (dropped packets, disk I/O failures).

Core interview metrics you should always name for APIs: latency percentiles (p95/p99), error rate, throughput (requests per second), and saturation of the limiting resource.
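As a sketch of RED in query form, assuming the service's request histogram lands in Prometheus under the (assumed, exporter-dependent) name http_server_request_duration_seconds; check your exporter's actual metric and label names:

```promql
# Rate: requests per second over the last 5 minutes
sum(rate(http_server_request_duration_seconds_count[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_server_request_duration_seconds_count{http_response_status_code=~"5.."}[5m]))
  / sum(rate(http_server_request_duration_seconds_count[5m]))

# Duration: p99 latency reconstructed from histogram buckets
histogram_quantile(0.99, sum(rate(http_server_request_duration_seconds_bucket[5m])) by (le))
```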

Logs

Logs are structured event records that answer "what exactly happened" at a point in time.

Use log levels intentionally:

- Trace/Debug: high-volume diagnostic detail, normally disabled in production.
- Information: business-significant events (order created, payment accepted).
- Warning: unexpected but handled conditions (retry succeeded, fallback used).
- Error: a failed operation that needs attention.
- Critical: a failure that threatens the whole service.

In distributed systems, correlation IDs are essential in practice: every service should log the same request identifier so operators can reconstruct one end-to-end user request across many log streams.
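A minimal ASP.NET Core sketch of that idea, assuming Serilog is configured with Enrich.FromLogContext (the X-Correlation-ID header name is a common convention, not a standard):

```csharp
app.Use(async (context, next) =>
{
    // Reuse the caller's correlation ID, or mint one at the edge.
    var correlationId = context.Request.Headers["X-Correlation-ID"].FirstOrDefault()
                        ?? Guid.NewGuid().ToString("N");
    context.Response.Headers["X-Correlation-ID"] = correlationId;

    // Every log event written inside this request now carries CorrelationId.
    using (Serilog.Context.LogContext.PushProperty("CorrelationId", correlationId))
    {
        await next();
    }
});
```

Downstream HTTP calls should forward the same header (or rely on W3C traceparent propagation, which instrumented HttpClient handles automatically).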

Traces

Traces represent a single request journey across services and dependencies.

Distributed tracing reconstructs the critical path of a request so you can answer where latency is introduced, where errors originate, and which dependency is responsible.

sequenceDiagram
    participant C as Client
    participant A as API Gateway
    participant B as Orders Service
    participant D as Inventory Service

    C->>A: HTTP request with trace id
    activate A
    A->>B: gRPC call child span
    activate B
    B->>D: HTTP call child span
    activate D
    D-->>B: stock result
    deactivate D
    B-->>A: order result
    deactivate B
    A-->>C: response
    deactivate A

OpenTelemetry in Practice

OpenTelemetry is the vendor-neutral standard for telemetry instrumentation and export.

Minimal .NET tracing and metrics setup:

using OpenTelemetry.Metrics;
using OpenTelemetry.Resources;
using OpenTelemetry.Trace;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddOpenTelemetry()
    .ConfigureResource(resource => resource.AddService("checkout-api"))
    .WithTracing(tracing => tracing
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddGrpcClientInstrumentation()
        .AddEntityFrameworkCoreInstrumentation()
        .AddSource("Checkout.Api")
        .AddOtlpExporter())
    .WithMetrics(metrics => metrics
        .AddAspNetCoreInstrumentation()
        .AddRuntimeInstrumentation()
        .AddHttpClientInstrumentation()
        .AddMeter("Checkout.Api")
        .AddPrometheusExporter());

The key senior signal in interviews: instrument from day one with standard telemetry contracts, then choose backends based on team and platform constraints.

Prometheus export also needs a mapped scrape endpoint in ASP.NET Core:

var app = builder.Build();
app.MapPrometheusScrapingEndpoint();

If MapPrometheusScrapingEndpoint() is unavailable in your package version, use middleware instead:

app.UseOpenTelemetryPrometheusScrapingEndpoint();

Prometheus ASP.NET Core exporter support can be version-sensitive and may require prerelease packages. The exporter documentation recommends considering OTLP export for production scenarios, so teams often route metrics to an OpenTelemetry Collector and expose them to Prometheus there.
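A minimal OpenTelemetry Collector sketch of that routing (receive OTLP from the app, expose a Prometheus scrape endpoint; ports are defaults, names are assumptions):

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  prometheus:
    endpoint: 0.0.0.0:9464

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
```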

.NET Implementation Patterns

Custom Metrics with Meter API

using System.Diagnostics;
using System.Diagnostics.Metrics;

public class OrderEndpoints
{
    // Meter name must match the AddMeter("Checkout.Api") registration above.
    private static readonly Meter Meter = new("Checkout.Api", "1.0.0");
    private static readonly Counter<long> OrdersCreated =
        Meter.CreateCounter<long>("orders_created_total");
    private static readonly Histogram<double> CheckoutLatencyMs =
        Meter.CreateHistogram<double>("checkout_latency_ms");

    public IResult CreateOrder(OrderRequest request)
    {
        var startedAt = Stopwatch.GetTimestamp();

        // Business logic...

        OrdersCreated.Add(1, new KeyValuePair<string, object?>("tenant", request.TenantId));
        var elapsedMs = Stopwatch.GetElapsedTime(startedAt).TotalMilliseconds;
        CheckoutLatencyMs.Record(elapsedMs);

        return Results.Ok();
    }
}
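Custom metrics can be sanity-checked locally without any backend by subscribing to the meter in-process; a standalone sketch using the built-in System.Diagnostics.Metrics.MeterListener:

```csharp
using System.Diagnostics.Metrics;

var meter = new Meter("Checkout.Api");
var counter = meter.CreateCounter<long>("orders_created_total");

long observed = 0;
using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments from our own meter.
    if (instrument.Meter.Name == "Checkout.Api")
        l.EnableMeasurementEvents(instrument);
};
listener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => observed += value);
listener.Start();

counter.Add(1);
// observed is now 1: the measurement was delivered to the listener.
```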

Custom Tracing with ActivitySource

using System.Diagnostics;

public static class Telemetry
{
    public static readonly ActivitySource ActivitySource = new("Checkout.Api");
}

public class InventoryService
{
    private readonly InventoryClient _inventoryClient;

    public InventoryService(InventoryClient inventoryClient) =>
        _inventoryClient = inventoryClient;

    public async Task ReserveInventoryAsync(string sku, int quantity)
    {
        using var activity = Telemetry.ActivitySource.StartActivity("ReserveInventory");
        activity?.SetTag("inventory.sku", sku);
        activity?.SetTag("inventory.quantity", quantity);

        await _inventoryClient.ReserveAsync(sku, quantity);
    }
}

Structured Logging with Serilog

using Serilog;
using Serilog.Formatting.Compact;

builder.Host.UseSerilog((_, config) => config
    .Enrich.FromLogContext()
    .Enrich.WithProperty("service", "checkout-api")
    .WriteTo.Console(new RenderedCompactJsonFormatter()));

app.UseSerilogRequestLogging();

app.MapPost("/checkout", (CheckoutRequest request, ILogger<Program> logger) =>
{
    logger.LogInformation(
        "Checkout started for {CustomerId} with {ItemCount} items",
        request.CustomerId,
        request.Items.Count);

    return Results.Accepted();
});

Example JSON event shape emitted by structured logging:

{
  "@t": "2026-02-28T12:30:45.1234567Z",
  "@i": "f2a8a4c1",
  "@m": "Checkout started for 42 with 3 items",
  "CustomerId": 42,
  "ItemCount": 3
}

Pitfalls

Logging Everything or Logging Nothing

Logging every payload and every debug event explodes storage and query cost; logging almost nothing leaves teams blind during incidents. Use strategic sampling and retain high-value structured events at 100% while sampling noisy verbose events.
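The same discipline applies to traces. In OpenTelemetry for .NET, head sampling is configured at setup time; a sketch (the 10% ratio is an arbitrary example, tune to your traffic):

```csharp
builder.Services
    .AddOpenTelemetry()
    .WithTracing(tracing => tracing
        // ParentBased keeps the whole request consistent: child spans
        // inherit the decision made for the trace ID at the entry point.
        .SetSampler(new ParentBasedSampler(new TraceIdRatioBasedSampler(0.10)))
        .AddAspNetCoreInstrumentation()
        .AddOtlpExporter());
```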

Unstructured Logs You Cannot Query

Free-form text logs block fast incident response because operators cannot reliably filter by tenant, endpoint, or correlation key. Prefer structured logs with stable property names and consistent schema across services.

Missing Correlation IDs Across Services

Without propagated trace and correlation IDs, each service log appears correct in isolation but impossible to stitch into one request narrative. Ensure incoming IDs are accepted, propagated, and included in all logs and spans.

Alert Fatigue from Noisy Metrics

If thresholds are too sensitive or static, teams get constant false positives and start ignoring alerts. Define SLO-based thresholds, use burn-rate style alerting where possible, and segment alerts by service criticality.
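A burn-rate style Prometheus alerting rule sketch, assuming a 99.9% availability SLO and a generic http_requests_total counter (metric and label names are assumptions):

```yaml
groups:
  - name: slo-burn-rate
    rules:
      - alert: HighErrorBudgetBurn
        # Fires when errors consume the 0.1% budget 14.4x faster than
        # allowed, i.e. a full month's budget would burn in about 2 days.
        expr: |
          (
            sum(rate(http_requests_total{code=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (14.4 * 0.001)
        for: 5m
        labels:
          severity: page
```

Production setups typically pair a fast window (page) with a slow window (ticket), per the multi-window multi-burn-rate pattern.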

Tradeoffs

Questions

References


What's next