Retry and Timeout Patterns
Intro
Retry and timeout patterns are defensive reliability strategies for outbound calls: retry re-attempts operations that fail for transient reasons, and timeout bounds how long you wait before treating an attempt as failed. They matter because distributed systems regularly see short network loss, DNS hiccups, brief overload, and cold-start latency spikes that are recoverable seconds later. Without retry, you fail fast on faults that would have recovered on their own; without timeout, a single hung dependency can hold connection pool slots and request capacity until throughput collapses. Reach for both patterns on most request-response external dependency boundaries such as HTTP APIs, message brokers, databases, and cache services. For streaming and long-running background flows, use explicit deadline ownership and separate timeout and retry budgets. In modern .NET, the standard implementation is Polly v8 through Microsoft.Extensions.Http.Resilience.
Retry mechanism
Retry strategies
- Immediate retry: run the next attempt with no delay; useful only for very short transient blips.
- Fixed delay: wait the same interval each time; simple and predictable, but can still synchronize clients.
- Exponential backoff: increase the wait after each failure to reduce pressure on an unhealthy dependency.
- Exponential backoff with jitter: add randomization to each delay so clients do not retry in lockstep (compare the Polly options sketched below).
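For orientation, here is a minimal sketch (not from the source text) of how each strategy maps onto Polly v8 RetryStrategyOptions; the delay and attempt values are illustrative, not recommendations:

using Polly;
using Polly.Retry;

// Immediate retry: no wait between attempts.
var immediate = new RetryStrategyOptions
{
    Delay = TimeSpan.Zero,
    MaxRetryAttempts = 2
};

// Fixed delay: the same wait every time.
var fixedDelay = new RetryStrategyOptions
{
    BackoffType = DelayBackoffType.Constant,
    Delay = TimeSpan.FromMilliseconds(200),
    MaxRetryAttempts = 3
};

// Exponential backoff: the wait grows after each failure.
var exponential = new RetryStrategyOptions
{
    BackoffType = DelayBackoffType.Exponential,
    Delay = TimeSpan.FromMilliseconds(200),
    MaxRetryAttempts = 3
};

// Exponential backoff with jitter: randomized waits decorrelate clients.
var exponentialWithJitter = new RetryStrategyOptions
{
    BackoffType = DelayBackoffType.Exponential,
    UseJitter = true,
    Delay = TimeSpan.FromMilliseconds(200),
    MaxRetryAttempts = 3
};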
Why jitter matters
If 10,000 clients all fail at the same time and all retry at exactly 200 ms, then 400 ms, then 800 ms, they create synchronized request spikes that prolong outage recovery. Jitter decorrelates retry timing, turning one synchronized storm into a spread-out arrival pattern that gives the downstream service room to recover.
Exponential backoff formula
Use this as a conceptual model for exponential backoff:
delay(attempt) = min(maxDelay, baseDelay × 2^attempt), with jitter randomizing each computed delay
Polly v8 exponential retry with UseJitter = true uses a decorrelated jitter approach, so treat the formula as intuition and verify exact delay behavior in the Polly retry docs. In practice, keep baseDelay small, cap max delay, and cap max attempts to stay within your latency SLO.
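As a rough numeric illustration of this model (a full-jitter sketch, not Polly's exact decorrelated algorithm):

// Conceptual backoff calculator: exponential growth, capped, then randomized.
var baseDelay = TimeSpan.FromMilliseconds(200);
var maxDelay = TimeSpan.FromSeconds(5);
var random = new Random();

TimeSpan NextDelay(int attempt)
{
    double exponentialMs = baseDelay.TotalMilliseconds * Math.Pow(2, attempt);
    double cappedMs = Math.Min(exponentialMs, maxDelay.TotalMilliseconds);
    return TimeSpan.FromMilliseconds(random.NextDouble() * cappedMs); // full jitter: [0, capped)
}

for (int attempt = 0; attempt < 4; attempt++)
{
    Console.WriteLine($"attempt {attempt}: wait {NextDelay(attempt).TotalMilliseconds:F0} ms");
}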
Max retry attempts
Cap retries on user-facing request paths. For long-running background workers, indefinite retries can be acceptable only when combined with cancellation support, max-delay caps, and monitoring that can stop unhealthy loops.
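For a background worker, a hedged sketch of such a bounded-but-indefinite loop might look like this (the retry options are real Polly v8 settings; the work handler and stop signal are hypothetical stand-ins):

using Polly;
using Polly.Retry;

// A stop signal an operator or host shutdown could trigger.
using var stopSource = new CancellationTokenSource();

var workerPipeline = new ResiliencePipelineBuilder()
    .AddRetry(new RetryStrategyOptions
    {
        MaxRetryAttempts = int.MaxValue,            // effectively indefinite
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        Delay = TimeSpan.FromSeconds(1),
        MaxDelay = TimeSpan.FromMinutes(1)          // cap growth so waits stay bounded
    })
    .Build();

// Cancellation flows into every attempt and every backoff wait.
await workerPipeline.ExecuteAsync(
    async ct => await ProcessQueueItemAsync(ct),
    stopSource.Token);

// Hypothetical stand-in for the real queue consumer.
static Task ProcessQueueItemAsync(CancellationToken ct) => Task.CompletedTask;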
What to retry
- Retry only transient failures: connection resets, timeouts, temporary DNS failures, HTTP 408, 429, and most 5xx responses.
- Do not retry client bugs and invalid requests (400, 401, 403, 404, validation errors).
- Do not retry non-idempotent operations unless you provide idempotency keys or equivalent deduplication (see the classifier sketch below).
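The rules above can be encoded as a small classifier; here is a sketch (names are illustrative, and note that TaskCanceledException also surfaces deliberate caller cancellation, so production code should distinguish the two):

// Hypothetical classifier: retry only what the rules above call transient.
static bool IsTransient(HttpResponseMessage? response, Exception? exception)
{
    // Network faults and timed-out attempts are usually safe to retry,
    // but TaskCanceledException also fires on deliberate caller cancellation.
    if (exception is HttpRequestException or TaskCanceledException)
        return true;

    if (response is null)
        return false;

    // 408, 429, and most 5xx are transient; other 4xx client errors are not.
    return response.StatusCode == System.Net.HttpStatusCode.RequestTimeout
        || response.StatusCode == System.Net.HttpStatusCode.TooManyRequests
        || (int)response.StatusCode >= 500;
}

Console.WriteLine(IsTransient(new HttpResponseMessage(System.Net.HttpStatusCode.BadRequest), null));         // False
Console.WriteLine(IsTransient(new HttpResponseMessage(System.Net.HttpStatusCode.ServiceUnavailable), null)); // True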
Retry flow
sequenceDiagram
participant Client
participant Service
Client->>Service: Request attempt one
Service-->>Client: Fail transient
Client->>Client: Wait short delay with jitter
Client->>Service: Request attempt two
Service-->>Client: Fail transient
Client->>Client: Wait longer delay with jitter
Client->>Service: Request attempt three
Service-->>Client: Success
Timeout mechanism
Per-attempt timeout
Per-attempt timeout bounds one call attempt. If the dependency hangs, the attempt is canceled and retry logic can decide whether to try again.
Overall timeout
Overall timeout bounds the total operation budget across all attempts, waits, and strategy overhead. It prevents retry loops from consuming request time indefinitely.
Why both are required
- Per-attempt timeout only: each attempt is bounded, but cumulative retries can still exceed acceptable end-to-end latency.
- Overall timeout only: one hung attempt can consume the full budget before retry gets a chance.
- Combined: each attempt is bounded and the full operation is also bounded, as the sketch below illustrates.
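A framework-free sketch of the combined budgets using linked cancellation tokens (the timeout values and the dependency call are illustrative):

// Overall budget bounds the whole operation, including waits between attempts.
using var overallCts = new CancellationTokenSource(TimeSpan.FromSeconds(8));

for (int attempt = 1; attempt <= 3; attempt++)
{
    // Per-attempt budget is linked, so whichever timeout fires first cancels the call.
    using var attemptCts = CancellationTokenSource.CreateLinkedTokenSource(overallCts.Token);
    attemptCts.CancelAfter(TimeSpan.FromSeconds(2));

    try
    {
        await CallDependencyAsync(attemptCts.Token);
        break; // success: stop retrying
    }
    catch (OperationCanceledException) when (overallCts.IsCancellationRequested)
    {
        throw; // overall budget exhausted: give up entirely
    }
    catch (OperationCanceledException)
    {
        // Per-attempt timeout: fall through to the next attempt.
    }
}

// Hypothetical dependency call used only for illustration.
static Task CallDependencyAsync(CancellationToken ct) => Task.Delay(100, ct);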
.NET Polly v8 example
This ASP.NET Core example configures an HttpClient for an inventory dependency with an outer total timeout, a transient-fault retry policy using exponential backoff and jitter, and an inner per-attempt timeout.
using Microsoft.Extensions.DependencyInjection;
using Microsoft.Extensions.Http.Resilience;
using Polly;
using Polly.Retry;
using Polly.Timeout;
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddHttpClient<InventoryClient>(client =>
{
    client.BaseAddress = new Uri("https://inventory.internal/");
    // Disable HttpClient's built-in timeout so the resilience pipeline owns all timeouts.
    client.Timeout = Timeout.InfiniteTimeSpan;
})
.AddResilienceHandler("inventory-http", (pipelineBuilder, context) =>
{
    // Outermost total timeout for the full operation budget
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(8)
    });

    // Retry transient failures with exponential backoff and jitter
    pipelineBuilder.AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(200),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(response =>
                response.StatusCode == System.Net.HttpStatusCode.RequestTimeout ||
                response.StatusCode == System.Net.HttpStatusCode.TooManyRequests ||
                (int)response.StatusCode >= 500)
    });

    // Innermost timeout for each individual attempt
    pipelineBuilder.AddTimeout(new TimeoutStrategyOptions
    {
        Timeout = TimeSpan.FromSeconds(2)
    });
});

var app = builder.Build();
app.Run();

public sealed class InventoryClient
{
    private readonly HttpClient _httpClient;

    public InventoryClient(HttpClient httpClient)
    {
        _httpClient = httpClient;
    }

    public Task<HttpResponseMessage> GetAvailabilityAsync(string sku, CancellationToken ct)
    {
        return _httpClient.GetAsync($"api/stock/{sku}", ct);
    }
}
Integration with other resilience patterns
For production systems, compose retry and timeout with neighboring patterns in a deliberate order from outermost to innermost:
- Total timeout: outermost, caps full operation time.
- Fallback: provides a degraded response after the inner strategies fail.
- Retry: absorbs short transient failures.
- Circuit breaker: fast-fails during sustained instability.
- Per-attempt timeout: innermost, caps single attempt duration.
Use this pipeline together with Circuit Breaker and Rate Limiting to protect both dependency health and caller latency.
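A hedged sketch of that ordering with Polly v8's standalone pipeline builder (circuit breaker thresholds are illustrative, and the fallback layer is omitted for brevity); strategies added first sit outermost:

using Polly;
using Polly.CircuitBreaker;
using Polly.Retry;

var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    // 1. Total timeout: outermost cap on the whole operation
    .AddTimeout(TimeSpan.FromSeconds(8))
    // 2. Retry: absorb short transient failures
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true,
        Delay = TimeSpan.FromMilliseconds(200)
    })
    // 3. Circuit breaker: fast-fail while the dependency is unstable
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.5,
        MinimumThroughput = 20,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(15)
    })
    // 4. Per-attempt timeout: innermost cap on each single attempt
    .AddTimeout(TimeSpan.FromSeconds(2))
    .Build();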
Pitfalls
Retrying non-idempotent operations
- What goes wrong: duplicate orders or duplicate payments happen when a non-idempotent write is retried after uncertain completion.
- Why it happens: the client cannot distinguish between failed execution and failed response delivery, so a second attempt may repeat a completed write.
- How to avoid it: use idempotency keys for write APIs and retry only operations that are explicitly safe to replay (see the sketch below).
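As a sketch, a retried POST can carry a caller-generated idempotency key so the server can deduplicate replays; the endpoint URL and the Idempotency-Key header are a common convention assumed here, not a universal standard, and the server must implement the deduplication:

// Generate one key per logical operation and reuse it on every retry attempt.
var idempotencyKey = Guid.NewGuid().ToString();

using var httpClient = new HttpClient { BaseAddress = new Uri("https://payments.example/") };
using var request = new HttpRequestMessage(HttpMethod.Post, "api/charges")
{
    Content = new StringContent("{\"amount\": 4200}")
};
request.Headers.Add("Idempotency-Key", idempotencyKey); // server dedups replays on this key

var response = await httpClient.SendAsync(request);
Console.WriteLine(response.StatusCode);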
No jitter in backoff
- What goes wrong: all clients retry at the same time and generate a retry storm that extends outage duration.
- Why it happens: deterministic delays synchronize retries across instances and across regions.
- How to avoid it: enable jitter and combine it with exponential backoff and capped attempt count.
Missing timeout boundary
- What goes wrong: a hung dependency call holds connection slots and request budget for minutes.
- Why it happens: only one timeout layer is configured or no timeout is configured at all.
- How to avoid it: configure both per-attempt timeout and overall timeout then align both with your service latency SLO.
Retry amplification across layers
- What goes wrong: one user request fans out into many downstream calls; for example, three attempts in service A multiplied by three attempts in service B can produce nine calls into service C.
- Why it happens: each layer retries independently without a shared retry budget.
- How to avoid it: define retry ownership per layer, cap total attempts end to end, and propagate deadlines so lower layers stop retrying when the budget is exhausted (see the sketch below).
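One way to honor a shared end-to-end budget is to track an absolute deadline at the edge and hand lower layers only the remaining time; a sketch (the budget value and downstream call are hypothetical):

// The edge layer owns the absolute deadline for the whole call path.
var deadline = DateTimeOffset.UtcNow + TimeSpan.FromSeconds(8);

TimeSpan remaining = deadline - DateTimeOffset.UtcNow;
if (remaining <= TimeSpan.Zero)
    throw new TimeoutException("Budget already exhausted; do not attempt or retry.");

// Downstream work gets a token that expires with the remaining budget,
// so its own retries stop when the end-to-end deadline passes.
using var cts = new CancellationTokenSource(remaining);
await CallDownstreamAsync(cts.Token);

// Hypothetical stand-in for the real downstream call.
static Task CallDownstreamAsync(CancellationToken ct) => Task.Delay(50, ct);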
Tradeoffs
| Strategy | Benefit | Cost | Use when |
|---|---|---|---|
| Immediate retry | Lowest added latency for short glitches | Highest risk of immediate re-pressure on an unstable dependency | Failure is likely a one-off transport hiccup and the dependency is lightly loaded |
| Fixed delay retry | Simple, predictable behavior | Can still synchronize clients and recovers slowly under heavy contention | You need straightforward behavior and traffic is moderate |
| Exponential backoff with jitter | Best protection against retry storms and downstream overload | More tuning complexity and longer tail latency on repeated failures | Dependency instability is common and the fleet is large |
| Per-attempt timeout only | Prevents a single attempt from hanging | Total operation can still run too long across retries | You have no retries and only need a per-call bound |
| Per-attempt plus overall timeout | Bounds both attempt and end-to-end latency | Requires careful budget tuning between layers | You run retries or multi-hop calls and have strict SLO targets |
Decision rule: start with exponential backoff plus jitter and dual timeout boundaries, then tune attempt counts and timeout budgets from observed latency percentiles and downstream error rates.
Questions
Why do retries without jitter prolong an outage, and what does jitter change?
- Without jitter, each client computes nearly identical retry times, so failures synchronize into periodic traffic spikes.
- Those spikes hit while the dependency is already degraded, which increases queue depth and recovery time.
- Jitter randomizes each delay so retries spread over time and reduce synchronized pressure.
- This improves recovery odds and stabilizes shared infrastructure such as load balancers and connection pools.
- Tradeoff: jitter reduces herd effects but increases per-request timing variance and makes behavior slightly harder to predict.
How do you prevent retry amplification across service layers?
- Assign retry ownership to one layer per call path, usually the edge caller or the service nearest the user boundary.
- Propagate cancellation tokens and deadlines so downstream services respect the remaining time budget.
- Keep retry counts low and combine them with circuit breakers and rate limits to avoid multiplicative pressure.
- Measure effective attempts per request in telemetry and alert when fan-out exceeds the budget.
- Tradeoff: centralizing retries improves control and cost, but can reduce local autonomy for service teams.
References
- Polly docs, retry strategy: official Polly v8 retry options, backoff types, jitter behavior, and ShouldHandle predicates.
- Polly docs, timeout strategy: official Polly v8 timeout behavior, cancellation semantics, and timeout strategy configuration.
- Microsoft Learn, .NET HTTP resilience: Microsoft.Extensions.Http.Resilience guidance for composing retry, timeout, circuit breaker, and fallback in HttpClient pipelines.
- Microsoft Learn, transient fault handling: cloud architecture guidance on identifying transient failures and choosing retry and timeout policies.
- AWS Architecture Blog, "Exponential Backoff and Jitter": Marc Brooker's explanation of why jitter reduces coordinated retries and improves system recovery under contention.