# Load Balancing

## Intro

Load balancing distributes incoming traffic across multiple service instances so one instance does not become a bottleneck or a single point of failure. In system design interviews, this is usually the first infrastructure building block because it enables horizontal scale without changing client behavior. It matters for availability, failure isolation, and predictable latency under burst traffic. Reach for it as soon as a service runs on more than one instance, especially for AI APIs where request cost varies by prompt size and model path.

## Mechanism

Load balancers can operate at different layers of the network stack, and that layer choice drives what routing decisions are possible. A Layer 4 (L4) balancer forwards traffic using transport-level data such as IP addresses and TCP/UDP ports; a Layer 7 (L7) balancer terminates the connection and can route on HTTP content such as paths, headers, and cookies.

```mermaid
flowchart LR
    C[Client] --> LB[Load Balancer]
    LB --> A[Server A]
    LB --> B[Server B]
    LB --> D[Server C]
```

Practical interview rule: pick L4 for high-throughput, content-agnostic transport routing; pick L7 when routing decisions depend on request content such as paths, headers, or cookies.
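The distinction can be sketched in a few lines of code. Both routines below are illustrative (the function names and pool labels are hypothetical, not a real proxy API): the L4 picker sees only transport metadata, so it can do no better than spread connections, while the L7 picker can read the parsed request.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

// L4 view: only transport metadata is visible, so spread connections by
// hashing the connection 4-tuple across the backend set.
int PickBackendL4(string clientIp, int clientPort, string vip, int vipPort, int backendCount)
{
    byte[] digest = SHA256.HashData(Encoding.UTF8.GetBytes($"{clientIp}:{clientPort}->{vip}:{vipPort}"));
    return (int)(BitConverter.ToUInt32(digest, 0) % (uint)backendCount);
}

// L7 view: the HTTP request is parsed, so route on content such as the path.
string PickPoolL7(string path) =>
    path.StartsWith("/v1/completions") ? "gpu-pool"   // heavy inference traffic
    : path.StartsWith("/admin") ? "admin-pool"        // restricted surface
    : "web-pool";                                     // default stateless pool

Console.WriteLine(PickBackendL4("203.0.113.7", 52111, "10.0.0.1", 443, 3)); // an index in 0..2
Console.WriteLine(PickPoolL7("/v1/completions"));                           // gpu-pool
```

The L4 hash is deterministic for a given connection, which keeps packets of one TCP flow on one backend; the L7 pool split is what lets expensive AI inference traffic land on different hardware than ordinary web requests.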

## Algorithms

No single algorithm is best; choose based on workload shape and fairness goals.

| Algorithm | How it routes | Prefer when | Main risk |
| --- | --- | --- | --- |
| Round robin | Cycles requests evenly across instances. | Backend instances are similar and request cost is roughly uniform. | Slow instances still get an equal share and can queue up. |
| Weighted round robin | Round robin with per-instance weight multipliers. | Instance sizes differ, such as mixed VM sizes or mixed CPU generations. | Static weights drift from real capacity after noisy-neighbor effects or throttling. |
| Least connections | Picks the instance with the fewest active connections. | Connection duration varies, such as streaming or long-running AI completions. | Connection count may not reflect CPU or memory cost for short but expensive requests. |
| IP hash | Deterministically maps client IP to a backend. | You need simple affinity without external session storage. | NAT gateways can collapse many users to one IP and create hotspots. |
| Consistent hashing | Maps keys to a hash ring with minimal remapping when nodes change. | Cache locality, shard affinity, and gradual scale changes matter. | Too few virtual nodes or poor weights can skew ring ownership, and hot keys can still create hotspots even with a balanced ring. |
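As a concrete instance of the weighted round robin row, here is a minimal sketch of the smooth weighted variant (the scheme nginx's upstream module uses), which interleaves picks instead of sending long runs to one backend. The backend names and weights are illustrative.

```csharp
using System;
using System.Linq;

// Smooth weighted round robin: each pick, every backend gains its weight as
// "credit"; the backend with the most credit is chosen and pays back the total.
// Illustrative fleet: one 8-core box (weight 4) and two 2-core boxes (weight 1).
string[] backends = { "big-1", "small-1", "small-2" };
int[] weights = { 4, 1, 1 };
int[] current = new int[backends.Length];
int totalWeight = weights.Sum();

string PickNext()
{
    for (int i = 0; i < backends.Length; i++) current[i] += weights[i];
    int best = Array.IndexOf(current, current.Max()); // most accumulated credit
    current[best] -= totalWeight;                     // chosen backend pays back
    return backends[best];
}

// Over one full cycle of 6 picks, the counts match the 4:1:1 weight ratio.
var picks = Enumerable.Range(0, 6).Select(_ => PickNext()).ToList();
Console.WriteLine(string.Join(" ", picks));
```

Note how the main risk from the table shows up here: the weights are fixed at startup, so if `big-1` is throttled at runtime it still receives four of every six requests.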

For AI inference endpoints, request duration and compute cost vary heavily, so pick the algorithm by measuring p95 and p99 latency, error rate, and backend saturation under representative load instead of assuming one default winner.
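Consistent hashing is the table entry most worth being able to sketch on demand. A minimal ring with virtual nodes (server names, key format, and the vnode count below are all illustrative) demonstrates the key property: removing a server only remaps the keys that server owned.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;
using System.Text;

// Consistent hash ring with virtual nodes: each server owns many points on the
// ring so ownership is split smoothly and removal remaps only its own keys.
var ring = new SortedDictionary<uint, string>();
const int VirtualNodes = 128; // more vnodes -> smoother ownership split

uint Hash(string s) =>
    BitConverter.ToUInt32(SHA256.HashData(Encoding.UTF8.GetBytes(s)), 0);

void AddServer(string server)
{
    for (int v = 0; v < VirtualNodes; v++)
        ring[Hash($"{server}#{v}")] = server;
}

void RemoveServer(string server)
{
    foreach (var point in ring.Where(kv => kv.Value == server).Select(kv => kv.Key).ToList())
        ring.Remove(point);
}

// A key belongs to the first ring point clockwise from its hash (wrapping around).
string Lookup(string key)
{
    uint h = Hash(key);
    foreach (var kv in ring)
        if (kv.Key >= h) return kv.Value;
    return ring.First().Value;
}

AddServer("cache-a");
AddServer("cache-b");
AddServer("cache-c");

var keys = Enumerable.Range(0, 1000).Select(i => $"prompt:{i}").ToList();
var before = keys.ToDictionary(k => k, Lookup);

RemoveServer("cache-b");
int moved = keys.Count(k => Lookup(k) != before[k]);
Console.WriteLine($"keys remapped after removing cache-b: {moved} of {keys.Count}");
```

With naive modulo hashing (`hash % serverCount`), removing one of three servers would remap roughly two thirds of all keys; here only the keys previously owned by `cache-b` (about a third) move, which is what preserves cache locality during scale changes.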

## Health Checks

Health checks decide whether an instance should stay in the active pool.

Typical state transition:

  1. Instance fails probes or exceeds passive error thresholds.
  2. LB marks it unhealthy and removes it from new request routing.
  3. Existing requests are drained, failed, or retried according to policy.
  4. Instance must pass recovery criteria before re-entry.
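The state transition above can be sketched as a small probe tracker. The thresholds are illustrative, not a standard: evict after three consecutive failed probes, re-admit only after two consecutive successes.

```csharp
using System;

// Minimal healthy/unhealthy state machine for one backend instance.
const int FailThreshold = 3;    // consecutive failures before eviction
const int RecoverThreshold = 2; // consecutive successes before re-entry

bool healthy = true;
int consecutiveFailures = 0, consecutiveSuccesses = 0;

void ReportProbe(bool succeeded)
{
    if (succeeded)
    {
        consecutiveFailures = 0;
        if (!healthy && ++consecutiveSuccesses >= RecoverThreshold)
        {
            healthy = true;            // recovery criteria met; rejoin the pool
            consecutiveSuccesses = 0;
        }
    }
    else
    {
        consecutiveSuccesses = 0;
        if (healthy && ++consecutiveFailures >= FailThreshold)
            healthy = false;           // removed from new request routing
    }
}

ReportProbe(false); ReportProbe(false); ReportProbe(false);
Console.WriteLine($"after 3 failures healthy={healthy}");   // false: evicted
ReportProbe(true);
Console.WriteLine($"after 1 success healthy={healthy}");    // still false
ReportProbe(true);
Console.WriteLine($"after 2 successes healthy={healthy}");  // true: re-admitted
```

Requiring multiple consecutive successes before re-entry is what prevents a flapping instance from repeatedly rejoining the pool and failing user traffic.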

Important implementation point: a readiness endpoint must verify critical dependencies, not just return process-alive.

## .NET Context

ASP.NET Core services with Kestrel are commonly deployed behind a reverse proxy or managed load balancer.

Minimal readiness and liveness setup in ASP.NET Core.
The AddSqlServer and AddRedis probes below come from the community Xabaril health checks packages (AspNetCore.HealthChecks.SqlServer and AspNetCore.HealthChecks.Redis).
If your team prefers only Microsoft-maintained dependencies, use custom AddCheck implementations for dependency readiness.

```bash
dotnet add package AspNetCore.HealthChecks.SqlServer
dotnet add package AspNetCore.HealthChecks.Redis
```

```csharp
using Microsoft.AspNetCore.Diagnostics.HealthChecks;

var builder = WebApplication.CreateBuilder(args);

builder.Services
    .AddHealthChecks()
    .AddSqlServer(builder.Configuration.GetConnectionString("MainDb")!)
    .AddRedis(builder.Configuration.GetConnectionString("Redis")!);

var app = builder.Build();

// Liveness: the predicate matches no registered checks, so this endpoint only
// proves the process is up and able to respond.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = _ => false
});

// Readiness: runs every registered check (SQL Server and Redis here), so the
// load balancer only routes traffic once dependencies are reachable.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = _ => true
});

app.MapGet("/", () => Results.Ok("service-ready"));

app.Run();
```

Session affinity in .NET systems: prefer stateless instances backed by a shared store (for example, ASP.NET Core session state or cache entries over IDistributedCache with a Redis backend) rather than load-balancer sticky sessions. Affinity cookies such as Azure App Service's ARRAffinity cookie are best treated as a migration stopgap, not the target architecture.
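The stateless alternative can be sketched without any framework dependency. The in-memory dictionary below stands in for a shared distributed cache such as Redis, and the handler and its names are hypothetical; the point is that any instance can serve any request because state lives outside instance memory.

```csharp
using System;
using System.Collections.Generic;

// Stand-in for a shared distributed cache: sessionId -> cart contents.
// In a real .NET service this role is played by IDistributedCache over Redis.
var sharedStore = new Dictionary<string, string>();

string HandleRequest(string instance, string sessionId, string? addItem = null)
{
    sharedStore.TryGetValue(sessionId, out var cart);
    if (addItem != null)
        sharedStore[sessionId] = cart = cart == null ? addItem : $"{cart},{addItem}";
    return $"{instance} sees cart [{cart}]";
}

// The balancer can send each request to a different instance with no affinity;
// the session survives because state is read from the shared store every time.
Console.WriteLine(HandleRequest("server-a", "s1", addItem: "book"));
Console.WriteLine(HandleRequest("server-b", "s1", addItem: "pen"));
Console.WriteLine(HandleRequest("server-c", "s1")); // server-c sees cart [book,pen]
```

With sticky sessions, the same scenario pins `s1` to `server-a`, so a `server-a` failure loses the cart and scale-in must wait for the session to end.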

## Pitfalls

Sticky sessions can defeat balancing goals: pinning each client to one instance concentrates load on hot instances, blocks graceful scale-in, and turns a single instance failure into lost session state.

Health endpoint does not validate dependencies: a probe that only returns process-alive keeps instances in rotation even when their database or cache is unreachable, so users see errors the balancer never detects.

Thundering herd when recovering instances: a freshly re-admitted instance with cold caches and empty connection pools can receive a full share of traffic at once, saturate, and fail again.
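A common mitigation for this pitfall is slow start: ramp the recovered instance's effective weight from a small floor up to full over a warmup window. The window length and floor fraction below are illustrative values, not defaults from any particular balancer.

```csharp
using System;

// Slow-start ramp: scale an instance's routing weight by time since recovery,
// so cold caches and empty connection pools warm up under partial load.
double EffectiveWeight(double fullWeight, TimeSpan sinceRecovery, TimeSpan warmup)
{
    if (sinceRecovery >= warmup) return fullWeight;
    double floor = 0.1; // never route zero traffic, or the instance cannot warm up
    double ramp = sinceRecovery.TotalSeconds / warmup.TotalSeconds;
    return fullWeight * Math.Max(floor, ramp);
}

var warmupWindow = TimeSpan.FromSeconds(30);
Console.WriteLine(EffectiveWeight(10, TimeSpan.FromSeconds(3), warmupWindow));  // 1 (10% floor)
Console.WriteLine(EffectiveWeight(10, TimeSpan.FromSeconds(15), warmupWindow)); // 5 (halfway)
Console.WriteLine(EffectiveWeight(10, TimeSpan.FromSeconds(45), warmupWindow)); // 10 (full)
```

Plugging this into a weighted algorithm (such as the weighted round robin above) makes recovery gradual without any coordination between instances.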

TLS termination in the wrong place: terminating at the balancer simplifies certificate management but leaves east-west traffic unencrypted; if internal traffic must be protected, re-encrypt to the backends or pass TLS through.

## Tradeoffs

| Decision | Option A | Option B | How to choose |
| --- | --- | --- | --- |
| Layer | L4 | L7 | Need for content-aware routing and edge features versus lower-overhead data-plane routing. |
| Session model | Sticky sessions | Stateless with shared store | Migration speed versus long-term resilience and autoscaling quality. |
| TLS strategy | Terminate at LB | End-to-end encryption to service | Operational simplicity versus stricter east-west security requirements. |
| Health model | Active only | Active plus passive | Simplicity versus better detection of real user-facing failures. |

## Questions

## References


## What's next