Retry and Timeout Patterns

Intro

Retry and timeout patterns are defensive reliability strategies for outbound calls: retry re-attempts operations that fail for transient reasons, and timeout bounds how long you wait before treating an attempt as failed. They matter because distributed systems regularly see short network loss, DNS hiccups, brief overload, and cold-start latency spikes that are recoverable seconds later. Without retry, you fail fast on recoverable faults; without timeout, a single hung dependency can hold connection pool slots and request capacity until your own service's latency and throughput collapse. Reach for both patterns on most request-response external dependency boundaries such as HTTP APIs, message brokers, databases, and cache services. For streaming and long-running background flows, use explicit deadline ownership and separate timeout and retry budgets. In modern .NET, the standard implementation is Polly v8 through Microsoft.Extensions.Http.Resilience.

Retry mechanism

Retry strategies

Why jitter matters

If 10,000 clients all fail at the same time and all retry at exactly 200 ms, then 400 ms, then 800 ms, they create synchronized request spikes that prolong outage recovery. Jitter decorrelates retry timing, turning one synchronized storm into a spread-out arrival pattern that gives the downstream service room to recover.

Exponential backoff formula

Use this as a conceptual model for exponential backoff:

delay(attempt) = random(0, 1) * min(baseDelay * 2^(attempt - 1), maxDelay), so the delay grows exponentially from a base value, is capped at a maximum, and jitter randomizes each attempt

Polly v8 exponential retry with UseJitter = true uses a decorrelated jitter approach, so treat the formula as intuition and verify exact delay behavior in the Polly retry docs. In practice, keep baseDelay small, cap max delay, and cap max attempts to stay within your latency SLO.
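
A minimal C# sketch of that conceptual model, not Polly's actual algorithm; the ComputeDelay helper and the 200 ms base / 5 s cap values are illustrative only:

// Conceptual model only: Polly v8 with UseJitter = true uses decorrelated
// jitter internally, so real delays will differ from this sketch.
static TimeSpan ComputeDelay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
{
    // Exponential growth from the base, capped at the maximum delay
    double cappedMs = Math.Min(
        baseDelay.TotalMilliseconds * Math.Pow(2, attempt - 1),
        maxDelay.TotalMilliseconds);

    // Full jitter: pick a uniformly random point below the capped delay
    return TimeSpan.FromMilliseconds(Random.Shared.NextDouble() * cappedMs);
}

for (int attempt = 1; attempt <= 3; attempt++)
{
    Console.WriteLine($"attempt {attempt}: {ComputeDelay(attempt, TimeSpan.FromMilliseconds(200), TimeSpan.FromSeconds(5))}");
}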

Max retry attempts

Cap retries on user-facing request paths. For long-running background workers, indefinite retries can be acceptable only when combined with cancellation support, max-delay caps, and monitoring that can stop unhealthy loops.
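
For the background-worker case, here is a hedged sketch using Polly v8's non-generic retry options. The one-second base delay, one-minute MaxDelay cap, and the ProcessBatchAsync / stoppingToken names are illustrative, and MaxDelay assumes a recent Polly v8 release:

using Polly;
using Polly.Retry;

// Sketch: a worker pipeline that retries indefinitely but keeps every wait
// capped and stays cancellable via the token passed to ExecuteAsync.
var workerPipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = int.MaxValue,       // effectively unbounded
        Delay = TimeSpan.FromSeconds(1),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        MaxDelay = TimeSpan.FromMinutes(1)     // cap growth of the backoff
    })
    .Build();

// stoppingToken would come from BackgroundService.ExecuteAsync in a real worker:
// await workerPipeline.ExecuteAsync(async ct => await ProcessBatchAsync(ct), stoppingToken);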

What to retry

Retry only failures that are plausibly transient: transport errors such as connection resets, per-attempt timeouts, and responses like 408 Request Timeout, 429 Too Many Requests, or 5xx server errors. Do not retry client errors such as 400 or 404, validation or authorization failures, or non-idempotent operations without an idempotency safeguard.

Retry flow

sequenceDiagram
    participant Client
    participant Service
    Client->>Service: Request attempt one
    Service-->>Client: Fail transient
    Client->>Client: Wait short delay with jitter
    Client->>Service: Request attempt two
    Service-->>Client: Fail transient
    Client->>Client: Wait longer delay with jitter
    Client->>Service: Request attempt three
    Service-->>Client: Success

Timeout mechanism

Per-attempt timeout

Per-attempt timeout bounds one call attempt. If the dependency hangs, the attempt is canceled and retry logic can decide whether to try again.

Overall timeout

Overall timeout bounds the total operation budget across all attempts, waits, and strategy overhead. It prevents retry loops from consuming request time indefinitely.

Why both are required

With only a per-attempt timeout, the retry loop itself is unbounded: several slow attempts plus backoff waits can exceed what the caller will tolerate. With only an overall timeout, a single hung attempt can burn the entire budget before any retry happens. Using both keeps each attempt tightly bounded while the whole operation stays inside an explicit end-to-end budget.
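
A minimal sketch of both boundaries using only CancellationTokenSource, assuming an existing httpClient and an incoming requestToken; the 2-second and 8-second budgets are illustrative:

// Sketch without a resilience library: httpClient, requestToken, and the
// 2 s / 8 s budgets are illustrative.
using var overallCts = CancellationTokenSource.CreateLinkedTokenSource(requestToken);
overallCts.CancelAfter(TimeSpan.FromSeconds(8));     // overall budget across all attempts

for (int attempt = 1; attempt <= 3; attempt++)
{
    using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(overallCts.Token);
    attemptCts.CancelAfter(TimeSpan.FromSeconds(2)); // per-attempt budget

    try
    {
        using var response = await httpClient.GetAsync("api/stock/sku-123", attemptCts.Token);
        if (response.IsSuccessStatusCode)
        {
            break; // success: stop retrying
        }
    }
    catch (OperationCanceledException) when (!overallCts.IsCancellationRequested)
    {
        // Per-attempt timeout fired; fall through and retry.
    }

    if (attempt < 3)
    {
        // Jittered exponential backoff, still bounded by the overall budget
        var backoff = TimeSpan.FromMilliseconds(
            200 * Math.Pow(2, attempt - 1) * Random.Shared.NextDouble());
        await Task.Delay(backoff, overallCts.Token);
    }
}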

.NET Polly v8 example

This ASP.NET Core example configures an HttpClient for an inventory dependency with an outer total timeout, a transient-fault retry policy using exponential backoff and jitter, and an inner per-attempt timeout.

using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;
using Polly.Retry;
using Polly.Timeout;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient<InventoryClient>(client =>
{
    client.BaseAddress = new Uri("https://inventory.internal/");
    // Disable HttpClient's own timeout so the resilience pipeline owns both budgets
    client.Timeout = Timeout.InfiniteTimeSpan;
})
.AddResilienceHandler("inventory-http", (pipelineBuilder, context) =>
{
    // Outermost total timeout for full operation budget
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(8)
    });

    // Retry transient failures with exponential backoff and jitter
    pipelineBuilder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(200),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(response =>
                response.StatusCode == System.Net.HttpStatusCode.RequestTimeout ||
                response.StatusCode == System.Net.HttpStatusCode.TooManyRequests ||
                (int)response.StatusCode >= 500)
    });

    // Innermost timeout for each individual attempt
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(2)
    });
});

var app = builder.Build();
app.Run();

public sealed class InventoryClient
{
    private readonly HttpClient _httpClient;

    public InventoryClient(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public Task<HttpResponseMessage> GetAvailabilityAsync(string sku, CancellationToken ct)
    {
        return _httpClient.GetAsync($"api/stock/{sku}", ct);
    }
}
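
As a usage sketch, a minimal API endpoint registered before app.Run() can pass the request's cancellation token straight into the client so that caller disconnects cancel the whole pipeline; the route and the pass-through response handling below are illustrative:

// Hypothetical endpoint, added before app.Run() in the example above.
app.MapGet("/availability/{sku}", async (string sku, InventoryClient inventory, CancellationToken ct) =>
{
    // Minimal APIs bind the CancellationToken parameter to HttpContext.RequestAborted
    using var response = await inventory.GetAvailabilityAsync(sku, ct);
    var body = await response.Content.ReadAsStringAsync(ct);
    return Results.Text(body, "application/json");
});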

Integration with other resilience patterns

For production systems, compose retry and timeout with neighboring patterns in a deliberate order from outermost to innermost:

  1. Total timeout outermost to cap full operation time.
  2. Fallback to provide a degraded response once the inner strategies have failed.
  3. Retry to absorb short transient failures.
  4. Circuit Breaker to fast-fail during sustained instability.
  5. Per-attempt timeout innermost to cap single attempt duration.

Use this pipeline together with Circuit Breaker and Rate Limiting to protect both dependency health and caller latency; a sketch of this composition follows below.
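
A minimal Polly v8 sketch of that ordering for an HttpResponseMessage-based call; the timeout values, default circuit-breaker options, and stubbed fallback payload are illustrative rather than recommended settings:

using System.Net;
using Polly;
using Polly.CircuitBreaker;
using Polly.Fallback;
using Polly.Retry;

ResiliencePipeline<HttpResponseMessage> pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    // 1. Total timeout: caps the whole operation, including waits and retries
    .AddTimeout(TimeSpan.FromSeconds(10))
    // 2. Fallback: degraded response once the inner strategies give up
    .AddFallback(new FallbackStrategyOptions<HttpResponseMessage>
    {
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<Exception>()
            .HandleResult(r => !r.IsSuccessStatusCode),
        FallbackAction = _ => Outcome.FromResultAsValueTask(
            new HttpResponseMessage(HttpStatusCode.OK)
            {
                Content = new StringContent("{\"available\":false}")
            })
    })
    // 3. Retry: absorbs short transient failures
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    // 4. Circuit breaker: fast-fails while the dependency stays unhealthy
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>())
    // 5. Per-attempt timeout: bounds each individual call
    .AddTimeout(TimeSpan.FromSeconds(2))
    .Build();

// Usage sketch (httpClient and cancellationToken are assumed to exist):
// var response = await pipeline.ExecuteAsync(
//     async ct => await httpClient.GetAsync("api/stock/sku-123", ct), cancellationToken);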

Pitfalls

Retrying non-idempotent operations

A retried request can execute twice when the first attempt actually succeeded but its response was lost, so a payment or order-creation call should only be retried if the operation is idempotent, for example via an idempotency key the server can deduplicate on.

No jitter in backoff

Backoff without jitter keeps failed clients synchronized, so they retry at the same instants and recreate the retry spikes described earlier, prolonging downstream recovery.

Missing timeout boundary

Retries without per-attempt and overall timeouts can hang on a stuck dependency, holding connection pool slots and the caller's entire latency budget.

Retry amplification across layers

When every layer retries, attempts multiply: three layers that each make up to four attempts can turn one user request into as many as 64 calls against the deepest dependency during an outage. Keep retries at one deliberate layer, usually the one closest to the dependency.

Tradeoffs

Strategy | Benefit | Cost | Use when
Immediate retry | Lowest added latency for short glitches | Highest risk of immediate re-pressure on an unstable dependency | Failure is likely a one-off transport hiccup and the dependency is lightly loaded
Fixed delay retry | Simple, predictable behavior | Can still synchronize clients and recovers slowly under heavy contention | You need straightforward behavior and traffic is moderate
Exponential backoff with jitter | Best protection against retry storms and downstream overload | Higher implementation complexity and longer tail latency on repeated failures | Dependency instability is common and the fleet is large
Per-attempt timeout only | Prevents a single attempt from hanging | Total operation can still run too long across retries | You have no retries and only need a per-call bound
Per-attempt plus overall timeout | Bounds both attempt and end-to-end latency | Requires careful budget tuning between layers | You run retries or multi-hop calls and have strict SLO targets

Decision rule: start with exponential backoff plus jitter and dual timeout boundaries, then tune attempt count and timeout budgets from observed latency percentiles and downstream error rates.

