Rate Limiting

Intro

Rate limiting controls how many requests a client can make in a period of time so one caller cannot exhaust shared resources. It matters because it protects reliability, reduces abuse, and keeps cost predictable when downstream work is expensive, especially LLM inference and embedding calls billed per request or token. In system design interviews, rate limiting is usually a quota protection mechanism, not just a security feature: it keeps latency stable for well-behaved users when traffic spikes. Reach for it on public APIs, shared multi-tenant services, and any endpoint that fans out to costly dependencies.

In .NET systems, rate limiting is often a layered decision: edge gateway limits, app-level per-tenant limits, and provider-level limits from dependencies like OpenAI or Stripe. The algorithm you choose defines failure behavior under burst traffic, memory usage, and fairness.

Core Algorithms

Token Bucket

Token bucket maintains a bucket with capacity B tokens. Tokens are added at a refill rate R over time, and each request consumes one or more tokens. If tokens are available, the request is allowed; if not, it is rejected or queued.

Why teams like it:

  • Allows controlled bursts up to capacity B while still enforcing an average rate of R over time.
  • O(1) memory per key: a token count plus a last-refill timestamp.
  • Refill math is simple and can be evaluated lazily on each request.

Tradeoffs:

  • A full bucket still permits a burst of B requests at once, which can spike downstream load.
  • Two tunables (B and R) must be sized against real traffic, not guessed.
  • In distributed setups, refill and consume must happen atomically to avoid races.

When to prefer it:

  • Public APIs and multi-tenant services where callers need burst tolerance with a stable average throughput.

flowchart LR
    Refill[Refill timer adds tokens] --> Bucket[Token bucket]
    Client[Client request] --> Check{Token available}
    Bucket --> Check
    Check -->|yes| Consume[Consume token and allow request]
    Check -->|no| Reject[Return 429 and retry hint]
    Consume --> Bucket
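
The flow above can be sketched as a minimal single-process token bucket. This is an illustrative sketch, not the ASP.NET Core implementation; the class name and the lazy-refill approach are assumptions.

```csharp
using System;

// Minimal single-process token bucket sketch (illustrative only).
public sealed class TokenBucket
{
    private readonly double _capacity;     // B: maximum tokens the bucket holds
    private readonly double _refillPerSec; // R: tokens added per second
    private double _tokens;
    private DateTime _lastRefill;

    public TokenBucket(double capacity, double refillPerSec)
    {
        _capacity = capacity;
        _refillPerSec = refillPerSec;
        _tokens = capacity;            // start full so initial bursts succeed
        _lastRefill = DateTime.UtcNow;
    }

    public bool TryConsume(double cost = 1.0)
    {
        lock (this)
        {
            var now = DateTime.UtcNow;
            // Lazy refill: add R * elapsed seconds, capped at capacity.
            _tokens = Math.Min(_capacity,
                _tokens + (now - _lastRefill).TotalSeconds * _refillPerSec);
            _lastRefill = now;

            if (_tokens < cost) return false; // reject: caller should back off
            _tokens -= cost;
            return true;
        }
    }
}
```

Refilling lazily on each call avoids a background timer per key, which is why this shape scales to many partition keys.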

Sliding Window Log

Sliding window log stores timestamps of recent requests per key and removes entries older than the window size. A new request is allowed only if the count of timestamps in the active window is below the limit.

Why it is precise:

  • It counts requests over a true rolling window, so there are no boundary artifacts.
  • Every decision is exact: a request is admitted only if fewer than the limit occurred in the last window.

Tradeoffs:

  • It stores one timestamp per admitted request per key, so memory grows with the limit and the number of keys.
  • Pruning expired timestamps adds per-request work.

When to prefer it:

  • Low-volume, high-value endpoints where strict fairness or auditability justifies the memory cost.
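
The log check can be sketched in a few lines. This is a single-process illustration; the class name and queue-based pruning are assumptions.

```csharp
using System;
using System.Collections.Generic;

// Sliding window log sketch: one timestamp per admitted request per key.
public sealed class SlidingWindowLog
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private readonly Queue<DateTime> _timestamps = new();

    public SlidingWindowLog(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    public bool TryAdmit(DateTime now)
    {
        // Evict entries that fell out of the rolling window.
        while (_timestamps.Count > 0 && now - _timestamps.Peek() >= _window)
            _timestamps.Dequeue();

        if (_timestamps.Count >= _limit) return false;
        _timestamps.Enqueue(now); // memory grows with the limit per key
        return true;
    }
}
```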

Sliding Window Counter

Sliding window counter approximates rolling windows using two adjacent fixed buckets (current and previous), then weights the previous bucket based on elapsed time. It estimates requests in the active rolling window without storing every timestamp.

Why it is a strong default:

  • Near-rolling accuracy with only two counters per key instead of a full timestamp log.
  • Constant memory and cheap arithmetic per decision.

Tradeoffs:

  • The weighting assumes requests were evenly distributed across the previous window, so the estimate can over- or under-admit slightly.

When to prefer it:

  • General-purpose API limits that need better edge behavior than fixed window without the memory cost of a log.
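
The weighted estimate reduces to one formula: estimate = currentCount + previousCount * (1 - fractionElapsed). A minimal sketch of the decision, with illustrative names:

```csharp
// Sliding window counter sketch: estimate the rolling-window count from
// two adjacent fixed buckets, weighting the previous bucket by how much
// of it still overlaps the rolling window.
public static class SlidingWindowCounter
{
    public static bool WouldAdmit(
        int previousCount,      // requests in the previous fixed window
        int currentCount,       // requests so far in the current fixed window
        double fractionElapsed, // 0..1 of the current window already elapsed
        int limit)
    {
        double weightedPrevious = previousCount * (1.0 - fractionElapsed);
        double estimate = currentCount + weightedPrevious;
        return estimate < limit;
    }
}
```

For example, with a limit of 100, 100 requests in the previous window, 20 so far, and 25% of the current window elapsed, the estimate is 20 + 100 * 0.75 = 95, so the request is admitted.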

Fixed Window Counter

Fixed window tracks a simple counter per key for each discrete window (for example, one minute). Counter resets when the window changes.

Why teams start with it:

  • Trivial to implement: one counter per key plus a TTL or reset.
  • Minimal memory and easy to reason about.

Tradeoffs:

  • Boundary bursts: a client can send up to twice the limit across a window edge (see Pitfalls).

When to prefer it:

  • Simple internal quotas and coarse limits where occasional boundary bursts are tolerable.
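
A minimal sketch, with the class name and window-index math as assumptions, makes the boundary behavior concrete: the counter resets the instant the window index changes.

```csharp
using System;

// Fixed window counter sketch: one counter per discrete window index.
public sealed class FixedWindowCounter
{
    private readonly int _limit;
    private readonly TimeSpan _window;
    private long _windowIndex = -1;
    private int _count;

    public FixedWindowCounter(int limit, TimeSpan window)
    {
        _limit = limit;
        _window = window;
    }

    public bool TryAdmit(DateTime now)
    {
        long index = now.Ticks / _window.Ticks; // which discrete window we are in
        if (index != _windowIndex)
        {
            _windowIndex = index; // boundary crossed: hard reset
            _count = 0;
        }
        if (_count >= _limit) return false;
        _count++;
        return true;
    }
}
```

Note how a rejected client regains the full limit one tick past the boundary, which is exactly the boundary-spike pitfall described later.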

Quick Comparison

Algorithm              | Burst support         | Accuracy       | Memory cost | Operational complexity | Typical fit
Fixed Window           | Poor at edges         | Low to medium  | Low         | Low                    | Simple internal quotas
Sliding Window Log     | Limited by policy     | High           | High        | Medium                 | Strict fairness and auditability
Sliding Window Counter | Medium                | Medium to high | Medium      | Medium                 | Balanced general purpose API limits
Token Bucket           | Strong and controlled | Medium to high | Medium      | Medium                 | Public APIs and tenant burst tolerance

ASP.NET Core Middleware

ASP.NET Core has first-class middleware support via Microsoft.AspNetCore.RateLimiting. You register policies in AddRateLimiter and attach a policy globally or per endpoint.

using System.Threading.RateLimiting;
using Microsoft.AspNetCore.RateLimiting;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;

    options.AddFixedWindowLimiter("fixed-per-client", limiterOptions =>
    {
        limiterOptions.PermitLimit = 100;
        limiterOptions.Window = TimeSpan.FromMinutes(1);
        limiterOptions.QueueLimit = 0;
        limiterOptions.AutoReplenishment = true;
    });

    options.AddTokenBucketLimiter("token-per-client", limiterOptions =>
    {
        limiterOptions.TokenLimit = 200;
        limiterOptions.TokensPerPeriod = 20;
        limiterOptions.ReplenishmentPeriod = TimeSpan.FromSeconds(1);
        limiterOptions.QueueLimit = 0;
        limiterOptions.AutoReplenishment = true;
    });
});

var app = builder.Build();

app.UseRateLimiter();

app.MapGet("/api/public", () => Results.Ok("ok"))
   .RequireRateLimiting("token-per-client");

app.MapGet("/api/admin", () => Results.Ok("ok"))
   .RequireRateLimiting("fixed-per-client");

app.Run();

Per Tenant Partitioning

For multi-tenant APIs, partition by tenant or API key, not only by IP address. ASP.NET Core supports partitioning with PartitionedRateLimiter so each key gets its own limiter state.

using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.GlobalLimiter = PartitionedRateLimiter.Create<HttpContext, string>(httpContext =>
    {
        var tenantId = httpContext.User.FindFirst("tenant_id")?.Value
                       ?? httpContext.Request.Headers["X-Tenant-Id"].ToString();

        if (string.IsNullOrWhiteSpace(tenantId))
        {
            tenantId = "anonymous";
        }

        return RateLimitPartition.GetTokenBucketLimiter(
            partitionKey: tenantId,
            factory: _ => new TokenBucketRateLimiterOptions
            {
                TokenLimit = 120,
                TokensPerPeriod = 60,
                ReplenishmentPeriod = TimeSpan.FromMinutes(1),
                QueueLimit = 0,
                AutoReplenishment = true
            });
    });
});

Design note: partition key choice is part of domain design. For B2B SaaS, tenant key is usually correct for fairness and billing. For public anonymous APIs, IP plus user agent or a gateway-issued client ID can be more robust than raw IP alone.

Distributed Rate Limiting

In-memory limiter state works only per process. With multiple instances behind a load balancer, each instance sees only a subset of requests, so a "100 req/min" limit can become roughly 100 x instance_count if state is not shared.

Single-instance in-memory:

  • Fast, no network hop, and no extra dependency.
  • Limiter state is lost on restart and drifts with instance count, so global limits are only approximate.

Redis-backed distributed counters:

  • Shared state gives globally accurate limits across all instances.
  • Adds a network round trip per decision and a new dependency on Redis.
  • Refill and consume must run atomically (Lua script or MULTI/EXEC) to avoid races.

Operational caveat: once limiter state is remote, availability of the limiter backend becomes part of your critical path. Define a clear failure mode upfront: fail-open for availability-sensitive consumer traffic, or fail-closed for security-sensitive operations.

# Simplified Redis transaction pattern for sliding window buckets.
# currentBucket and previousBucket are fixed-window indexes (e.g., unix_seconds / 60).
MULTI
INCR rl:tenant:{tenantId}:bucket:{currentBucket}
EXPIRE rl:tenant:{tenantId}:bucket:{currentBucket} 120
GET rl:tenant:{tenantId}:bucket:{previousBucket}
EXEC
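
Application code still has to compute the bucket indexes that feed the key pattern above. A small helper sketch, assuming a 60-second bucket and mirroring the key format of the simplified transaction:

```csharp
using System;

// Helper sketch: compute the current and previous bucket keys for a tenant.
// The 60-second bucket size and key format are assumptions that mirror the
// simplified Redis pattern in the text.
public static class BucketKeys
{
    public static (string Current, string Previous) For(
        string tenantId, DateTimeOffset now, int bucketSeconds = 60)
    {
        long current = now.ToUnixTimeSeconds() / bucketSeconds;
        long previous = current - 1;
        return ($"rl:tenant:{tenantId}:bucket:{current}",
                $"rl:tenant:{tenantId}:bucket:{previous}");
    }
}
```

The two counts fetched under these keys are then combined with the sliding window counter weighting to make the admit decision.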

Pitfalls

1) Fixed window boundary spike

What goes wrong: with a limit of 100/minute, a client can send 100 requests at 12:00:59 and another 100 at 12:01:00, effectively 200 in two seconds.

Why it happens: counters reset on hard boundaries rather than rolling time.

Mitigation: prefer token bucket or sliding window counter for edge-exposed endpoints.

2) Wrong partition key

What goes wrong: limiting by IP can unfairly throttle many users behind one NAT, while bad actors rotate IPs to evade limits.

Why it happens: key does not reflect identity or billing unit.

Mitigation: choose key by business objective (API key, tenant, user, or composite key). Align limiter key with quota ownership.

3) Clock skew in distributed limiters

What goes wrong: nodes disagree on current time, leading to inconsistent window calculations and unfair accepts/rejects.

Why it happens: window math depends on timestamps from different hosts.

Mitigation: centralize time decisions in Redis scripts when possible, run NTP everywhere, and avoid client-provided timestamps.

4) Missing response metadata

What goes wrong: clients receive 429 without actionable retry guidance, causing aggressive blind retries and more load.

Why it happens: only status code is returned, no quota context.

Mitigation: include Retry-After and useful quota headers such as X-RateLimit-Remaining, X-RateLimit-Limit, and X-RateLimit-Reset.
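
With the built-in ASP.NET Core limiter, retry guidance can be attached in the OnRejected callback. A sketch of the wiring, assuming it extends the AddRateLimiter setup shown earlier; OnRejected and MetadataName.RetryAfter are framework APIs, while the X-RateLimit-* header names are conventions rather than a standard:

```csharp
using System.Threading.RateLimiting;

builder.Services.AddRateLimiter(options =>
{
    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;

    // Attach a Retry-After header when the rejecting limiter exposes
    // retry metadata (e.g., the next token bucket replenishment).
    options.OnRejected = (context, cancellationToken) =>
    {
        if (context.Lease.TryGetMetadata(MetadataName.RetryAfter, out var retryAfter))
        {
            context.HttpContext.Response.Headers.RetryAfter =
                ((int)retryAfter.TotalSeconds).ToString();
        }
        return ValueTask.CompletedTask;
    };
});
```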

Interview Questions

Your AI service wraps OpenAI APIs with per-tenant limits and runs on 4 instances. How do you enforce limits accurately, and which algorithm do you choose?

Expected answer

  • Use distributed shared state, usually Redis, because per-instance memory breaks global accuracy.
  • Partition by tenant ID so quotas align with billing and fairness.
  • Choose token bucket when tenants need controlled burst capacity with stable average throughput.
  • Use atomic operations (Lua or transaction pattern) for refill and consume to avoid race conditions.
  • Return 429 with Retry-After and remaining quota headers to support client backoff.
Why this question matters

  • It tests algorithm choice plus distributed systems correctness, not just definition recall.

When would you prefer sliding window counter over fixed window in a public API?

Expected answer

  • Prefer sliding window counter when edge fairness matters and fixed window boundary bursts are unacceptable.
  • It gives near-rolling behavior with lower memory than sliding log.
  • Accept approximation error in exchange for better operational cost.
  • Keep fixed window only where simplicity dominates and traffic patterns are predictable.
Why this question matters

  • It checks whether the candidate can justify tradeoffs under realistic constraints.

What failure mode should you choose if Redis-based rate limiting is unavailable: fail-open or fail-closed?

Expected answer

  • Decide by endpoint risk profile, not globally.
  • Fail-open for low-risk endpoints when availability is the top priority.
  • Fail-closed for sensitive operations where abuse or cost explosion is unacceptable.
  • Document and test the behavior with chaos drills.
Why this question matters

  • It tests operational judgment and explicit risk tradeoff reasoning.
