# Rate Limiting

## Intro
Rate limiting controls how many requests a client can make in a period of time so one caller cannot exhaust shared resources. It matters because it protects reliability, reduces abuse, and keeps cost predictable when downstream work is expensive, especially LLM inference and embedding calls billed per request or token. In system design interviews, rate limiting is usually a quota protection mechanism, not just a security feature: it keeps latency stable for well-behaved users when traffic spikes. Reach for it on public APIs, shared multi-tenant services, and any endpoint that fans out to costly dependencies.
In .NET systems, rate limiting is often a layered decision: edge gateway limits, app-level per-tenant limits, and provider-level limits from dependencies like OpenAI or Stripe. The algorithm you choose defines failure behavior under burst traffic, memory usage, and fairness.
## Why It Matters in Senior Design Discussions
- Reliability: shields thread pools, DB connections, and downstream APIs from overload.
- Cost control: caps spend for metered dependencies such as LLM completions and vector search.
- Fairness: prevents a noisy tenant from starving others in shared infrastructure.
- Backpressure signal: an explicit `429 Too Many Requests` tells clients when and how to retry.
## Core Algorithms
### Token Bucket
Token bucket maintains a bucket with capacity B tokens. Tokens are added at a refill rate R over time, and each request consumes one or more tokens. If tokens are available, the request is allowed; if not, it is rejected or queued.
Why teams like it:
- Smooth average throughput while still allowing short bursts up to bucket capacity.
- Easy to reason about for user-facing APIs where occasional bursts are expected.
- Good fit for "N requests per second with burst M" style requirements.
Tradeoffs:
- More state than fixed window (need token count plus last refill timestamp).
- Distributed implementations need atomic refill-and-consume logic per key.
When to prefer it:
- Public API with bursty clients.
- AI inference endpoints where tenants need short spikes but bounded sustained load.
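To make the refill-and-consume logic concrete, here is a minimal single-process sketch. It is illustrative only: the class name, capacity, and rate parameters are assumptions, and the ASP.NET Core middleware shown later is the production-ready path.

```csharp
using System;

// Minimal single-process token bucket sketch (illustrative, not the
// ASP.NET Core implementation shown later in this article).
public sealed class TokenBucket
{
    private readonly double _capacity;        // B: maximum burst size
    private readonly double _refillPerSecond; // R: sustained refill rate
    private double _tokens;
    private DateTime _lastRefill = DateTime.UtcNow;
    private readonly object _gate = new();

    public TokenBucket(double capacity, double refillPerSecond)
    {
        _capacity = capacity;
        _refillPerSecond = refillPerSecond;
        _tokens = capacity; // start full so an initial burst is allowed
    }

    public bool TryConsume(double cost = 1.0)
    {
        lock (_gate)
        {
            // Lazy refill: credit tokens for the time elapsed since the last call,
            // capped at bucket capacity.
            var now = DateTime.UtcNow;
            _tokens = Math.Min(_capacity,
                _tokens + (now - _lastRefill).TotalSeconds * _refillPerSecond);
            _lastRefill = now;

            if (_tokens < cost) return false; // out of tokens: reject (429)
            _tokens -= cost;                  // consume and allow
            return true;
        }
    }
}
```

The flowchart below traces the same decision path.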
```mermaid
flowchart LR
    Refill[Refill timer adds tokens] --> Bucket[Token bucket]
    Client[Client request] --> Check{Token available}
    Bucket --> Check
    Check -->|yes| Consume[Consume token and allow request]
    Check -->|no| Reject[Return 429 and retry hint]
    Consume --> Bucket
```

### Sliding Window Log
Sliding window log stores timestamps of recent requests per key and removes entries older than the window size. A new request is allowed only if the count of timestamps in the active window is below the limit.
Why it is precise:
- Exact count for any rolling interval.
- No boundary artifacts like fixed window.
Tradeoffs:
- Highest memory and write overhead, especially for high-cardinality keys.
- Cleanup cost grows with traffic.
When to prefer it:
- Low to moderate traffic where precision is more important than memory.
- Compliance-sensitive quotas where approximation is not acceptable.
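A minimal sketch of the log approach for a single key, assuming an in-process queue of timestamps; the class name, limit, and window values are illustrative:

```csharp
using System;
using System.Collections.Generic;

// Sliding window log sketch: one timestamp per accepted request,
// evicted once it falls outside the rolling window.
public sealed class SlidingWindowLog
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _accepted = new();
    private readonly object _gate = new();

    public SlidingWindowLog(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    public bool TryAccept()
    {
        lock (_gate)
        {
            var now = DateTime.UtcNow;

            // Drop timestamps that have slid out of the rolling window.
            while (_accepted.Count > 0 && now - _accepted.Peek() >= _window)
                _accepted.Dequeue();

            if (_accepted.Count >= _limit) return false; // exact rolling count hit
            _accepted.Enqueue(now);
            return true;
        }
    }
}
```

The queue makes the memory tradeoff visible: one entry per accepted request per key for the lifetime of the window.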
### Sliding Window Counter
Sliding window counter approximates rolling windows using two adjacent fixed buckets (current and previous), then weights the previous bucket based on elapsed time. It estimates requests in the active rolling window without storing every timestamp.
Why it is a strong default:
- Better fairness than fixed window.
- Lower memory than sliding log.
Tradeoffs:
- Approximation error near bucket boundaries.
- Slightly more implementation complexity than fixed window.
When to prefer it:
- High-throughput APIs that need near-rolling accuracy with controlled cost.
- Distributed rate limits where you want better fairness without log storage.
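A sketch of the two-bucket estimate for one key; the weighting line is the core of the algorithm. Names and bucket size are illustrative assumptions:

```csharp
using System;

// Sliding window counter sketch: a current and a previous fixed bucket,
// with the previous bucket weighted by how much of it the rolling
// window still overlaps.
public sealed class SlidingWindowCounter
{
    private readonly int _limit;
    private readonly TimeSpan _bucketSize;
    private long _bucketIndex;
    private int _currentCount;
    private int _previousCount;
    private readonly object _gate = new();

    public SlidingWindowCounter(int limit, TimeSpan bucketSize)
    {
        _limit = limit;
        _bucketSize = bucketSize;
    }

    public bool TryAccept()
    {
        lock (_gate)
        {
            var nowTicks = DateTime.UtcNow.Ticks;
            var index = nowTicks / _bucketSize.Ticks;

            if (index != _bucketIndex)
            {
                // Roll forward; if more than one bucket elapsed,
                // the previous bucket contributes nothing.
                _previousCount = index == _bucketIndex + 1 ? _currentCount : 0;
                _currentCount = 0;
                _bucketIndex = index;
            }

            // Fraction of the previous bucket still covered by the rolling window.
            var fractionElapsed = (double)(nowTicks % _bucketSize.Ticks) / _bucketSize.Ticks;
            var estimate = _currentCount + _previousCount * (1.0 - fractionElapsed);

            if (estimate >= _limit) return false; // approximate rolling limit hit
            _currentCount++;
            return true;
        }
    }
}
```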
### Fixed Window Counter
Fixed window tracks a simple counter per key for each discrete window (for example, one minute), and the counter resets when the window changes.
Why teams start with it:
- Easiest to implement and operate.
- Very low memory footprint.
Tradeoffs:
- Boundary spike: clients can send near-limit requests at the end of one window and immediately again at the start of the next.
- Effective burst can approach 2x configured rate at window edges.
When to prefer it:
- Internal services with predictable traffic.
- Early implementation where simplicity is the dominant requirement.
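For contrast, a fixed window sketch needs only a counter and a window index, which is also why the boundary spike described in the pitfalls section exists. Values are illustrative:

```csharp
using System;

// Fixed window counter sketch: one counter per discrete window.
// Note the hard reset at the boundary, the source of the 2x burst pitfall.
public sealed class FixedWindowCounter
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private long _windowIndex;
    private int _count;
    private readonly object _gate = new();

    public FixedWindowCounter(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    public bool TryAccept()
    {
        lock (_gate)
        {
            var index = DateTime.UtcNow.Ticks / _window.Ticks;
            if (index != _windowIndex)
            {
                _windowIndex = index;
                _count = 0; // hard reset at the window boundary
            }
            if (_count >= _limit) return false;
            _count++;
            return true;
        }
    }
}
```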
## Quick Comparison
| Algorithm | Burst support | Accuracy | Memory cost | Operational complexity | Typical fit |
|---|---|---|---|---|---|
| Fixed Window | Poor at edges | Low to medium | Low | Low | Simple internal quotas |
| Sliding Window Log | Limited by policy | High | High | Medium | Strict fairness and auditability |
| Sliding Window Counter | Medium | Medium to high | Medium | Medium | Balanced general purpose API limits |
| Token Bucket | Strong and controlled | Medium to high | Medium | Medium | Public APIs and tenant burst tolerance |
## ASP.NET Core Example
ASP.NET Core has first-class middleware support via `Microsoft.AspNetCore.RateLimiting`. You register policies in `AddRateLimiter` and attach a policy globally or per endpoint.
```csharp
using System.Threading.RateLimiting;
using Microsoft.AspNetCore.RateLimiting;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;

    // Fixed window: 100 requests per minute, excess requests rejected (no queue).
    options.AddFixedWindowLimiter("fixed-per-client", limiterOptions =>
    {
        limiterOptions.PermitLimit = 100;
        limiterOptions.Window = TimeSpan.FromMinutes(1);
        limiterOptions.QueueLimit = 0;
        limiterOptions.AutoReplenishment = true;
    });

    // Token bucket: burst up to 200, sustained 20 requests per second.
    options.AddTokenBucketLimiter("token-per-client", limiterOptions =>
    {
        limiterOptions.TokenLimit = 200;
        limiterOptions.TokensPerPeriod = 20;
        limiterOptions.ReplenishmentPeriod = TimeSpan.FromSeconds(1);
        limiterOptions.QueueLimit = 0;
        limiterOptions.AutoReplenishment = true;
    });
});

var app = builder.Build();

app.UseRateLimiter();

app.MapGet("/api/public", () => Results.Ok("ok"))
    .RequireRateLimiting("token-per-client");

app.MapGet("/api/admin", () => Results.Ok("ok"))
    .RequireRateLimiting("fixed-per-client");

app.Run();
```
## Per-Tenant Partitioning
For multi-tenant APIs, partition by tenant or API key, not only by IP address. ASP.NET Core supports partitioning with `PartitionedRateLimiter` so each key gets its own limiter state.
```csharp
using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.GlobalLimiter = PartitionedRateLimiter.Create<HttpContext, string>(httpContext =>
    {
        // Resolve the tenant from the authenticated user, falling back to a header.
        var tenantId = httpContext.User.FindFirst("tenant_id")?.Value
            ?? httpContext.Request.Headers["X-Tenant-Id"].ToString();

        if (string.IsNullOrWhiteSpace(tenantId))
        {
            tenantId = "anonymous";
        }

        // Each distinct tenant ID gets its own token bucket state.
        return RateLimitPartition.GetTokenBucketLimiter(
            partitionKey: tenantId,
            factory: _ => new TokenBucketRateLimiterOptions
            {
                TokenLimit = 120,
                TokensPerPeriod = 60,
                ReplenishmentPeriod = TimeSpan.FromMinutes(1),
                QueueLimit = 0,
                AutoReplenishment = true
            });
    });
});
```
Design note: partition key choice is part of domain design. For B2B SaaS, tenant key is usually correct for fairness and billing. For public anonymous APIs, IP plus user agent or a gateway-issued client ID can be more robust than raw IP alone.
## Distributed Rate Limiting
In-memory limiter state works only per process. With multiple instances behind a load balancer, each instance sees only a subset of requests, so a "100 req/min" limit can become roughly 100 x instance_count if state is not shared.
Single-instance in-memory:
- Lowest latency and simplest operations.
- Acceptable for monoliths or when a single gateway instance enforces limits.
- Not accurate when requests are distributed across replicas.
Redis-backed distributed counters:
- Shared state across all instances for accurate global enforcement.
- Typical patterns: atomic `INCR` with expiry for fixed windows, a Lua script for token bucket, sorted sets for sliding log.
- For sliding window counters, `MULTI`/`EXEC` can atomically group bucket reads and increments, but strict allow/deny decisions are safer in a single Lua script (or `WATCH` plus retry) to avoid race conditions.
Operational caveat: once limiter state is remote, availability of the limiter backend becomes part of your critical path. Define a clear failure mode upfront: fail-open for availability-sensitive consumer traffic, or fail-closed for security-sensitive operations.
```
# Simplified Redis transaction pattern for sliding window buckets
MULTI
INCR rl:tenant:{tenantId}:bucket:{currentBucket}
EXPIRE rl:tenant:{tenantId}:bucket:{currentBucket} 120
GET rl:tenant:{tenantId}:bucket:{previousBucket}
EXEC
```
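For the simpler fixed-window variant of this pattern, here is a C# sketch assuming the StackExchange.Redis client; the class name, key format, and retention choice are illustrative assumptions:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Distributed fixed-window check: all app instances share the same Redis
// counter, so the global limit holds across replicas.
public sealed class RedisFixedWindowLimiter
{
    private readonly IDatabase _db;
    private readonly int _limit;
    private readonly TimeSpan _window;

    public RedisFixedWindowLimiter(IConnectionMultiplexer redis, int limit, TimeSpan window)
    {
        _db = redis.GetDatabase();
        _limit = limit;
        _window = window;
    }

    public async Task<bool> TryAcceptAsync(string tenantId)
    {
        var bucket = DateTimeOffset.UtcNow.Ticks / _window.Ticks;
        var key = $"rl:tenant:{tenantId}:bucket:{bucket}";

        // INCR is atomic in Redis, so concurrent instances cannot double-count.
        var count = await _db.StringIncrementAsync(key);
        if (count == 1)
        {
            // First request in this window: set an expiry so stale buckets are reclaimed.
            await _db.KeyExpireAsync(key, _window + _window);
        }
        return count <= _limit;
    }
}
```

Note that the `INCR`-then-`EXPIRE` pair is not itself atomic; a Lua script closes that gap, which is why the bullet list above recommends scripts for strict decisions.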
## Pitfalls
1) Fixed window boundary spike
What goes wrong: with a limit of 100/minute, a client can send 100 requests at 12:00:59 and another 100 at 12:01:00, effectively 200 in two seconds.
Why it happens: counters reset on hard boundaries rather than rolling time.
Mitigation: prefer token bucket or sliding window counter for edge-exposed endpoints.
2) Wrong partition key
What goes wrong: limiting by IP can unfairly throttle many users behind one NAT, while bad actors rotate IPs to evade limits.
Why it happens: key does not reflect identity or billing unit.
Mitigation: choose key by business objective (API key, tenant, user, or composite key). Align limiter key with quota ownership.
3) Clock skew in distributed limiters
What goes wrong: nodes disagree on current time, leading to inconsistent window calculations and unfair accepts/rejects.
Why it happens: window math depends on timestamps from different hosts.
Mitigation: centralize time decisions in Redis scripts when possible, run NTP everywhere, and avoid client-provided timestamps.
4) Missing response metadata
What goes wrong: clients receive 429 without actionable retry guidance, causing aggressive blind retries and more load.
Why it happens: only status code is returned, no quota context.
Mitigation: include `Retry-After` and useful quota headers such as `X-RateLimit-Limit`, `X-RateLimit-Remaining`, and `X-RateLimit-Reset`.
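In ASP.NET Core this can be wired through the middleware's `OnRejected` callback. A sketch assuming the setup from the earlier examples (the response body text is an illustrative choice):

```csharp
using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
    options.OnRejected = async (context, cancellationToken) =>
    {
        // The limiter supplies a suggested retry delay when it can compute one.
        if (context.Lease.TryGetMetadata(MetadataName.RetryAfter, out var retryAfter))
        {
            context.HttpContext.Response.Headers["Retry-After"] =
                ((int)retryAfter.TotalSeconds).ToString();
        }
        await context.HttpContext.Response.WriteAsync(
            "Rate limit exceeded. Retry later.", cancellationToken);
    };
});
```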
## Interview Questions

### How would you enforce per-tenant rate limits for an API running on multiple instances?

Expected answer
- Use distributed shared state, usually Redis, because per-instance memory breaks global accuracy.
- Partition by tenant ID so quotas align with billing and fairness.
- Choose token bucket when tenants need controlled burst capacity with stable average throughput.
- Use atomic operations (Lua or transaction pattern) for refill and consume to avoid race conditions.
- Return `429` with `Retry-After` and remaining quota headers to support client backoff.
Why this question matters - It tests algorithm choice plus distributed systems correctness, not just definition recall.
### When would you pick a sliding window counter over a fixed window?

Expected answer
- Prefer sliding window counter when edge fairness matters and fixed window boundary bursts are unacceptable.
- It gives near-rolling behavior with lower memory than sliding log.
- Accept approximation error in exchange for better operational cost.
- Keep fixed window only where simplicity dominates and traffic patterns are predictable.
Why this question matters - It checks whether the candidate can justify tradeoffs under realistic constraints.
### Should a distributed rate limiter fail open or fail closed when its backend is unavailable?

Expected answer
- Decide by endpoint risk profile, not globally.
- Fail-open for low-risk endpoints when availability is the top priority.
- Fail-closed for sensitive operations where abuse or cost explosion is unacceptable.
- Document and test the behavior with chaos drills.
Why this question matters - It tests operational judgment and explicit risk tradeoff reasoning.
## References
- Rate limiting in ASP.NET Core
- System.Threading.RateLimiting namespace
- Redis transactions (MULTI and EXEC)
- Scaling your API with rate limiters (Stripe Engineering)
- Cloudflare rate limiting rules