CAP theorem
Intro
CAP theorem says that in a distributed data system, once a network partition happens, you can guarantee at most one of strong consistency or availability (while still tolerating the partition). This matters because real systems eventually hit partial failures: links drop, regions isolate, packets reorder, and suddenly nodes cannot communicate reliably. A partition is not "the whole system is down"; it is specifically "some nodes can still process requests, but they cannot exchange enough messages to maintain a single, current view of data." You reach for CAP when deciding failure behavior in system design: do we reject some operations to protect correctness, or accept operations and repair divergence later?
What CAP Actually Means
Definitions in operational terms
- Consistency (C): every successful read sees the most recent successful write (or an error), as if there is one up-to-date value.
- Availability (A): every request to a non-failed node receives a non-error response in finite time.
- Partition tolerance (P): the system continues operating despite message loss/delay between node groups.
The common "pick any 2 of 3" slogan is a simplification that often causes wrong design decisions. In modern distributed systems, partition tolerance is usually non-negotiable once data is replicated across machines, racks, zones, or regions. The real forced choice is:
- during a partition, choose C (reject/timeout some operations)
- or choose A (continue serving operations with possible staleness/conflicts)
When there is no partition, many systems can provide both consistency and availability for normal operation.
Mechanism: Why You Cannot Have C and A During Partition
Imagine two replicas, R1 and R2, serving the same key.
- Client writes x=5 to R1.
- A network partition isolates R1 from R2.
- Another client reads from R2.
If R2 answers immediately, it may return old x=4 (availability preserved, consistency broken). If R2 refuses/blocks until it can confirm latest state from R1, it preserves consistency but sacrifices availability for that request path.
That is the CAP tension: with no reliable communication path, a node cannot both always answer and always be globally current.
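The R1/R2 scenario above can be sketched as a tiny simulation. This is illustrative only (dicts standing in for replicas, a boolean standing in for the network), not a real replication protocol:

```python
# Two replicas and a partition flag, showing the forced choice:
# answer stale (AP) or refuse until the peer is reachable (CP).

class Replica:
    def __init__(self, name):
        self.name = name
        self.data = {"x": 4}  # both replicas start with the old value x=4

def read(replica, peer_reachable, mode):
    """mode='AP': answer from local state even when isolated.
       mode='CP': refuse unless the peer can confirm we are current."""
    if mode == "AP":
        return replica.data["x"]  # may be stale
    if not peer_reachable:
        raise TimeoutError("cannot confirm latest state; rejecting read")
    return replica.data["x"]

r1, r2 = Replica("R1"), Replica("R2")
r1.data["x"] = 5      # the write x=5 lands on R1 only
partitioned = True    # R1 and R2 cannot exchange messages

print(read(r2, peer_reachable=not partitioned, mode="AP"))  # prints stale 4
try:
    read(r2, peer_reachable=not partitioned, mode="CP")
except TimeoutError as e:
    print("CP:", e)  # availability sacrificed for this request path
```

Either branch demonstrates the theorem: the isolated replica can answer (staleness) or wait (unavailability), but not both answer and be globally current.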
```mermaid
flowchart TD
    A[Partition detected between replica groups] --> B{Request arrives}
    B --> C[CP path]
    B --> D[AP path]
    C --> E[Require quorum or leader confirmation]
    E --> F[If quorum unavailable then reject or timeout]
    F --> G[State remains strongly consistent]
    D --> H[Accept request on reachable replica]
    H --> I[Return possibly stale or conflicting value]
    I --> J[Reconcile later via repair or conflict resolution]
```
CP vs AP With Concrete Systems
CP behavior (consistency-first during partition)
Representative systems: ZooKeeper / etcd style coordination services, majority-quorum relational deployments.
- They require leader or quorum confirmation before committing writes.
- If a partition prevents quorum, writes are rejected or blocked.
- Reads may also be restricted if linearizability is required.
Concrete effect:
- Good: no split-brain writes, strong correctness for locks, config, leader election.
- Cost: reduced availability for some operations during partition.
ZooKeeper-style mindset: "If I cannot prove this write is globally safe, I will not accept it."
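The quorum rule these systems apply can be captured in a few lines. A minimal sketch, assuming a simple majority-quorum commit rule (the function name and structure are illustrative, not any system's API):

```python
# CP-style commit rule: accept a write only if a majority of all
# replicas acknowledges it; otherwise reject to avoid split-brain.

def commit_write(acks: int, total_replicas: int) -> bool:
    """Majority quorum: more than half of all replicas must acknowledge."""
    return acks >= total_replicas // 2 + 1

# Healthy cluster of 5: 3 acknowledgements are enough to commit.
assert commit_write(acks=3, total_replicas=5)

# A partition leaves only 2 replicas reachable: the write is rejected,
# trading availability for consistency exactly as the bullets describe.
assert not commit_write(acks=2, total_replicas=5)
```

Note the asymmetry: a minority side of a partition can never commit, which is precisely why split-brain writes cannot occur.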
AP behavior (availability-first during partition)
Representative systems: Amazon Dynamo (the design in the 2007 paper) and Dynamo-style stores such as Cassandra, plus many multi-region eventually consistent setups.
- Replicas accept writes on reachable nodes even when not fully coordinated.
- Divergent versions can exist temporarily.
- Background repair, vector clocks/timestamps, or app-level merge rules reconcile state.
Concrete effect:
- Good: service continues under partition, better uptime for user-facing traffic.
- Cost: clients may observe stale reads or conflict resolution artifacts.
Dynamo-style mindset: "Keep accepting traffic now, converge state later."
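The "reconcile later" step above needs a deterministic merge rule. A hedged sketch of one common choice, last-write-wins by logical timestamp (real systems may instead use vector clocks or application-level merge functions, as the bullets note; the version dicts here are illustrative):

```python
# AP-side reconciliation: each replica accepted writes during the
# partition; afterwards, a last-write-wins rule converges them.

def lww_merge(a: dict, b: dict) -> dict:
    """Pick the version with the higher (timestamp, node_id) pair.
       The node id breaks ties so all replicas pick the same winner."""
    return max(a, b, key=lambda v: (v["ts"], v["node"]))

# Divergent versions written on the two sides of the partition:
v_r1 = {"value": "dark-theme",  "ts": 102, "node": "R1"}
v_r2 = {"value": "light-theme", "ts": 105, "node": "R2"}

merged = lww_merge(v_r1, v_r2)
# Every replica applies the same rule, so all converge on "light-theme";
# the R1 write is silently discarded -- the cost of last-write-wins.
```

The discarded write is exactly the "conflict resolution artifact" clients may observe; that is why LWW is acceptable for preferences but dangerous for counters or money.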
CAP Is About Partition Time, Not Normal Time
This is one of the most important interview points:
- If links are healthy and quorum is reachable, a CP system can look both consistent and available.
- If links are healthy, an AP system can also look fully correct because replicas converge quickly.
- CAP only constrains guarantees when partition actually exists.
Practical implication: ask "What happens in the bad 0.1% network case?" rather than evaluating only happy-path latency graphs.
PACELC Extension (What You Face Daily)
CAP explains partition behavior, but most daily engineering happens without active partition alarms. PACELC extends the model:
- if Partition (P), choose Availability (A) or Consistency (C); Else (E), choose Latency (L) or Consistency (C).
So even without partitions, distributed databases still force a design choice:
- wait for more replicas/quorum to improve consistency
- or respond faster from local/near replicas with weaker freshness guarantees
This is why engineers spend so much time on read consistency levels, session guarantees, quorum sizes, and timeout policy tuning.
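The quorum-size tuning mentioned above follows simple arithmetic: with N replicas, a read quorum of size R and a write quorum of size W are guaranteed to overlap (so every read sees the latest committed write) exactly when R + W > N. A worked sketch:

```python
# Quorum overlap rule behind read/write consistency-level tuning:
# R + W > N forces every read quorum to intersect every write quorum.

def quorums_overlap(n: int, r: int, w: int) -> bool:
    return r + w > n

N = 3
# Fast but weak: read 1, write 1 -- lowest latency, but a read quorum
# can miss the replica that holds the latest write (PACELC "EL" side).
assert not quorums_overlap(N, r=1, w=1)

# Consistent but slower: read 2, write 2 -- any two quorums intersect,
# at the cost of waiting for an extra replica (PACELC "EC" side).
assert quorums_overlap(N, r=2, w=2)
```

This is the same dial behind settings like Cassandra's ONE vs QUORUM consistency levels: moving R or W up trades latency for freshness guarantees.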
.NET System Design Relevance
For senior .NET interviews, tie CAP/PACELC to concrete platform choices instead of abstract definitions.
SQL Server with Always On/synchronous replication (CP-leaning)
- CAP tradeoffs show up when SQL Server is deployed as a replicated system (for example, Always On Availability Groups), not as a single standalone instance.
- Strong transactional guarantees and synchronous commit patterns prioritize correctness when replicas must coordinate commit.
- Under replication or failover network issues, some operations may block/fail rather than return divergent committed state.
- Good fit for orders, payments, inventory reservation, ledger-like data.
Azure Cosmos DB (tunable consistency)
- You can select consistency models (Strong, Bounded Staleness, Session, Consistent Prefix, Eventual).
- This lets you pick different points on latency/freshness per workload.
- Interview signal: mention that one product can serve CP-like or AP-leaning behaviors depending on configuration and operation.
Redis (AP-leaning in cache usage patterns)
- In most architectures, Redis is used as a cache where temporary staleness or key loss is acceptable.
- During partitions/failover races, cache inconsistencies are tolerated because the database remains the source of truth.
- The business decision is explicit: keep low-latency serving path available, recover correctness from authoritative store.
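That "recover correctness from the authoritative store" decision is usually implemented as the cache-aside pattern. A minimal sketch, with plain dicts standing in for a Redis client and a database (names are illustrative):

```python
# Cache-aside: the cache is disposable acceleration; the database
# stays the source of truth, so losing the cache only costs latency.

cache: dict = {}
database = {"user:42": {"name": "Ada"}}

def get_user(key: str) -> dict:
    if key in cache:          # fast path: possibly slightly stale
        return cache[key]
    value = database[key]     # fall back to the authoritative store
    cache[key] = value        # repopulate after miss/eviction/failover
    return value

# Simulate a partition/failover wiping the cache entirely:
cache.clear()
assert get_user("user:42") == {"name": "Ada"}  # correctness recovered
```

This is why the Redis path can be AP-leaning without risk: any staleness window ends the moment the entry expires or is refreshed from the database.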
Mixed-store architecture is normal
Most production .NET systems are not globally CP or AP. They are operation-scoped:
- PlaceOrder path: CP-leaning store + strict idempotency + transactional guarantees.
- GetRecommendations path: AP-leaning cache/search index + eventual refresh.
- UserProfile path: session consistency may be enough.
That per-operation selection is usually what interviewers want to hear.
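One way to make that per-operation selection explicit rather than implicit: a policy table consulted by request handlers. The operation names mirror the examples above; the policy labels and defaulting rule are illustrative, not a real framework API:

```python
# Per-operation consistency policy, made explicit in one place
# instead of being scattered through handler code.

CONSISTENCY_POLICY = {
    "PlaceOrder":         "strong",    # CP-leaning: reject if quorum unsafe
    "GetRecommendations": "eventual",  # AP-leaning: serve stale if needed
    "UserProfile":        "session",   # read-your-own-writes is enough
}

def policy_for(operation: str) -> str:
    # Default unclassified operations to the safest behavior, so a new
    # endpoint must opt in to weaker guarantees deliberately.
    return CONSISTENCY_POLICY.get(operation, "strong")

assert policy_for("PlaceOrder") == "strong"
assert policy_for("NewUnreviewedEndpoint") == "strong"
```

The safe default is the interesting design choice: weakening consistency becomes a reviewed, per-operation decision instead of an accident.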
Pitfalls
Pitfall 1: "CAP means pick two of three"
- What goes wrong: teams assume they can permanently choose C and A while ignoring P.
- Why it is wrong: once replication spans unreliable networks, partitions will happen; P is not optional in practice.
- How to avoid it: restate CAP as "during partition, choose C or A" and design explicit failure policy for each critical operation.
Pitfall 2: Treating CAP choice as system-wide and static
- What goes wrong: architecture docs label entire platform "CP" or "AP," then apply one rule to all endpoints.
- Why it is risky: different endpoints have different correctness and UX budgets.
- How to avoid it: classify operations by business invariants and allowed stale window, then pick per-operation consistency/availability behavior.
Pitfall 3: Ignoring reconciliation design in AP paths
- What goes wrong: system accepts writes under partition but has weak conflict strategy.
- Why it is risky: silent data corruption appears later as duplicate orders or lost preference updates.
- How to avoid it: define merge policy, idempotency keys, causality/version metadata, and repair observability from day one.
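Of those controls, the idempotency key is the one that directly prevents the "duplicate orders" failure. A hedged sketch, with a dict standing in for what would be a durable store with a unique constraint in production:

```python
# Idempotency keys: retries of the same logical request (same key)
# must return the original result, never create a second order.

processed: dict = {}

def place_order(idempotency_key: str, order: dict) -> dict:
    if idempotency_key in processed:       # retry after timeout/partition
        return processed[idempotency_key]  # replay the original outcome
    result = {"order_id": len(processed) + 1, **order}
    processed[idempotency_key] = result
    return result

first = place_order("req-abc", {"sku": "widget", "qty": 1})
# Client timed out and retried with the same key -- a classic
# partition-era ambiguity ("did my write land?"):
retry = place_order("req-abc", {"sku": "widget", "qty": 1})
assert first == retry                      # no duplicate order created
```

The key must be generated by the client before the first attempt, so that the retry and the original are provably the same logical request.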
Interview Questions
Question 1: During a network partition, what should an order write path and a session cache path each do?
- Orders are correctness-critical, so the write path should favor consistency (reject/timeout if quorum or primary connectivity is not safe).
- The session cache can favor availability; stale/missing session values are tolerable and recoverable.
- This is an operation-level CAP decision, not a contradiction.
- The database remains the source of truth; the cache is disposable acceleration.
- Why it is asked: it tests whether you can map CAP to business impact instead of reciting definitions.
Question 2: There is no partition right now. Why do you still have to choose between reading locally and reading from a quorum?
- Because of PACELC: without a partition, there is still a latency vs consistency tradeoff.
- Waiting for more replicas/quorum improves freshness/ordering confidence but increases latency.
- Reading locally reduces latency but may return older data.
- Practical tuning depends on endpoint SLO and correctness requirements.
- Why it is asked: it separates textbook CAP knowledge from real operational design judgment.
Question 3: A teammate proposes an AP design for a user-facing feature. What do you probe before approving it?
- Ask for the exact conflict resolution policy (last-write-wins, merge function, user-assisted resolution).
- Ask how causality/version metadata is stored and how retries stay idempotent.
- Ask how reconciliation failures are detected (metrics, dead-letter, repair jobs, audits).
- Ask which entities are forbidden from AP behavior (money movement, inventory decrements, security state).
- Why it is asked: senior candidates must translate an AP choice into concrete safety controls.
References
- Brewer, "Towards Robust Distributed Systems" (PODC 2000 keynote) — the original CAP conjecture presentation by Eric Brewer.
- Gilbert and Lynch, "Brewer's Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services" — the formal proof of the CAP theorem with precise definitions of consistency and availability.
- Azure Cosmos DB consistency levels — practical example of a production system offering five tunable consistency levels, illustrating CAP tradeoffs in a real product.
- Amazon Dynamo paper (SOSP 2007) — canonical AP system design paper showing how Amazon chose availability over consistency and the engineering consequences.
- Abadi, "Consistency Tradeoffs in Modern Distributed Database System Design: CAP is only part of the story" (PACELC) — extends CAP with the PACELC model, adding latency vs consistency tradeoffs during normal operation.