Circuit Breaker

Intro

The Circuit Breaker pattern stops your service from repeatedly calling a dependency that is already failing, so your system fails fast instead of failing slowly. It matters in distributed systems because it prevents cascading failures: without a breaker, threads, sockets, and retries pile up until healthy parts of the system also degrade. You reach for it when calling external services such as payment providers, LLM APIs, and remote databases, where latency spikes and partial outages are normal. In production .NET systems, a circuit breaker is usually one layer of a resilience stack alongside timeout, retry, and fallback, not a standalone feature.

Mechanism

State model

stateDiagram-v2
    [*] --> Closed
    Closed --> Open: Failure ratio over threshold within sampling duration
    Open --> HalfOpen: Break duration elapsed
    HalfOpen --> Closed: Probe succeeds
    HalfOpen --> Open: Probe fails or failure ratio still high

How transitions are decided

While Closed, the breaker records successes and failures over a rolling sampling window and opens once the failure ratio crosses the configured threshold, subject to a minimum throughput so that one failure out of two calls does not trip it. While Open, every call is rejected immediately until the break duration elapses, after which the breaker moves to Half-Open and admits a probe. Tuning is the hard part: if you set thresholds too low, the breaker chatters (opens and closes too often); if you set them too high, you discover failures too late and still waste resources on doomed calls.
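
The trip decision can be sketched as a small counter over the sampling window. This is illustrative only, not Polly's implementation; the class and member names are invented for the sketch:

using System;

// Minimal sketch of ratio-based trip logic for a single sampling window.
public sealed class BreakerWindow
{
    private int _successes;
    private int _failures;
    private readonly double _failureRatio;
    private readonly int _minimumThroughput;

    public BreakerWindow(double failureRatio, int minimumThroughput)
    {
        _failureRatio = failureRatio;
        _minimumThroughput = minimumThroughput;
    }

    public void Record(bool success)
    {
        if (success) _successes++; else _failures++;
    }

    // Open only when BOTH conditions hold: enough samples to be meaningful,
    // and a failure ratio at or above the threshold.
    public bool ShouldOpen()
    {
        int total = _successes + _failures;
        if (total < _minimumThroughput) return false;
        return (double)_failures / total >= _failureRatio;
    }
}

With FailureRatio 0.25 and MinimumThroughput 20, five failures out of nineteen calls keep the circuit closed (too few samples); the twentieth call failing yields 6/20 = 0.3 and trips it.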

What should count as a failure

For interview depth, explicitly separate expected client errors from server-side dependency failure: a 400 or 404 means the caller sent a bad request and says nothing about the dependency's health, so it should not trip the breaker. Timeouts, connection failures, 5xx responses, and 429 throttling signal that the dependency itself is struggling, and those should count.
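
That classification can be captured in one small helper. A sketch with invented names; the full example below expresses the same rule through Polly's ShouldHandle predicate:

using System;
using System.Net.Http;

// Sketch: decide which outcomes feed the breaker's failure statistics.
// Expected client errors (400, 404, 422, ...) say nothing about dependency health.
// Throttling (429), server errors (5xx), timeouts, and connection failures do.
public static class FailureClassifier
{
    public static bool CountsAsDependencyFailure(int statusCode) =>
        statusCode == 429 || statusCode >= 500;

    public static bool CountsAsDependencyFailure(Exception ex) =>
        ex is HttpRequestException or TimeoutException;
}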

C# Example with Polly v8 in ASP.NET Core

This example uses the .NET HTTP resilience handler (AddResilienceHandler) with Polly v8 strategy options and tracks breaker state changes for telemetry.

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Microsoft.Extensions.Logging;
using Polly;
using Polly.CircuitBreaker;
using Polly.Fallback;
using Polly.Retry;
using Polly.Timeout;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient<LlmGateway>(client =>
{
    client.BaseAddress = new Uri("https://api.openai.com/");
    // Disable HttpClient's own timeout; the per-attempt timeout strategy below bounds requests.
    client.Timeout = Timeout.InfiniteTimeSpan;
})
.AddResilienceHandler("llm-api", (pipelineBuilder, context) =>
{
    var logger = context.ServiceProvider.GetRequiredService<ILogger<LlmGateway>>();

    // Outermost: fallback runs after inner resilience logic decides the call failed.
    pipelineBuilder.AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
    {
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<BrokenCircuitException>()
            .HandleResult(r => (int)r.StatusCode >= 500),
        FallbackAction = _ => Outcome.FromResultAsValueTask(
            new HttpResponseMessage(System.Net.HttpStatusCode.OK)
            {
                Content = new StringContent("{\"answer\":\"Provider unavailable. Serving cached response.\"}")
            })
    });

    // Retry wraps the breaker so retry attempts still flow through breaker checks.
    pipelineBuilder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 2,
        Delay = TimeSpan.FromMilliseconds(250),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(r => (int)r.StatusCode == 429 || (int)r.StatusCode >= 500)
    });

    // Breaker trips on sustained dependency instability.
    pipelineBuilder.AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.25,
        MinimumThroughput = 20,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(45),
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(r => (int)r.StatusCode == 429 || (int)r.StatusCode >= 500),
        OnOpened = args =>
        {
            logger.LogWarning(
                "Circuit opened for LLM API. Break duration: {BreakDuration}",
                args.BreakDuration);
            return default;
        },
        OnHalfOpened = _ =>
        {
            logger.LogInformation("Circuit half-open for LLM API. Sending probe requests.");
            return default;
        },
        OnClosed = _ =>
        {
            logger.LogInformation("Circuit closed for LLM API. Normal traffic restored.");
            return default;
        }
    });

    // Innermost: timeout is per attempt.
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(10)
    });
});

var app = builder.Build();
app.Run();

Use the resilient HttpClient in an LLM gateway

public sealed class LlmGateway
{
    private readonly HttpClient _httpClient;

    public LlmGateway(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public Task<HttpResponseMessage> CompleteAsync(HttpRequestMessage request, CancellationToken ct)
    {
        // Resilience handler is attached to this HttpClient instance.
        return _httpClient.SendAsync(request, ct);
    }
}

Integration with Other Resilience Patterns

For real production systems and AI provider calls, stack strategies deliberately, ordered outermost to innermost:

  1. Fallback outermost: final degraded path after inner strategies fail.
  2. Retry next: absorb short transient failures.
  3. Circuit Breaker next: fast-fail when sustained instability is detected.
  4. Timeout innermost: bound each attempt.

Interview nuance: teams often say "retry inside breaker" to mean retries must contribute to breaker decisions. In Polly's outer-to-inner execution model, that behavior is achieved by placing retry outside and breaker inside, so every retry attempt still passes through breaker evaluation.
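
The same ordering applies to a standalone (non-HTTP) pipeline. A sketch assuming the Polly v8 package; options are left at defaults for brevity and the delegate body is a placeholder:

using System;
using System.Threading.Tasks;
using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

// Strategies added first execute outermost, so every retry attempt
// re-enters the breaker; once it opens, remaining attempts fail fast
// with BrokenCircuitException instead of hitting the dependency.
ResiliencePipeline pipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions { MaxRetryAttempts = 2 })  // outer
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions())       // inner
    .AddTimeout(TimeSpan.FromSeconds(10))                         // innermost, per attempt
    .Build();

await pipeline.ExecuteAsync(async ct =>
{
    await Task.Delay(10, ct); // placeholder for the real dependency call
});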

Pitfalls

1) Breaking too aggressively on expected errors

Counting expected 4xx client errors as failures lets one buggy caller open the circuit and cut off healthy traffic. Only signals of dependency distress should feed the breaker.

2) Not distinguishing transient vs permanent failures

A permanent failure (invalid API key, 401/403) will not heal with time; retrying it and feeding it into breaker statistics wastes attempts and distorts the failure ratio. Fail those immediately.

3) Assuming one instance protects the whole fleet

Breaker state lives in-process. With 50 instances, each must independently accumulate enough failures to trip, so the dependency can still absorb many times the intended load before the whole fleet backs off.

4) Half-open allows too many probes

If half-open admits unbounded concurrent probes, a recovering dependency can be knocked over again. Polly's breaker limits trial executions while half-open; hand-rolled breakers often forget to.

Tradeoffs

| Choice | Benefit | Cost | Use when |
|---|---|---|---|
| Aggressive thresholds (opens quickly) | Protects resources early | More false opens, degraded UX | Dependency is expensive and failure blast radius is high |
| Conservative thresholds (opens slowly) | Fewer false positives | Slower protection during outage | Occasional noise is acceptable but hard failures are rare |
| Per-instance breakers only | Simple implementation | No fleet-wide coordination | Small deployments and low concurrency |
| Add centralized protection layers | Better global control | More operational complexity | High-scale multi-instance services |
