CAP theorem

Intro

The CAP theorem says that in a distributed data system, once a network partition happens, you can guarantee at most one of strong consistency or availability (while still tolerating the partition). This matters because real systems eventually hit partial failures: links drop, regions become isolated, packets are lost or delayed, and suddenly nodes cannot communicate reliably. A partition is not "the whole system is down"; it is specifically "some nodes can still process requests, but they cannot exchange enough messages to maintain a single, current view of data." You reach for CAP when deciding failure behavior in system design: do we reject some operations to protect correctness, or accept operations and repair divergence later?

What CAP Actually Means

Definitions in operational terms

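The common "pick any 2 of 3" slogan is a simplification that often causes wrong design decisions. In modern distributed systems, partition tolerance is usually non-negotiable once data is replicated across machines, racks, zones, or regions. The real forced choice is what to give up while a partition is in effect: consistency (keep answering, possibly with stale data) or availability (reject or block requests that cannot be confirmed).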

When there is no partition, many systems can provide both consistency and availability for normal operation.

Mechanism: Why You Cannot Have C and A During Partition

Imagine two replicas, R1 and R2, serving the same key.

  1. Client writes x=5 to R1.
  2. A network partition isolates R1 from R2.
  3. Another client reads from R2.

If the partition hits before the write has replicated, R2 still holds the previous value (say x=4). If R2 answers immediately, it may return that stale x=4 (availability preserved, consistency broken). If R2 refuses or blocks until it can confirm the latest state from R1, it preserves consistency but sacrifices availability for that request path.

That is the CAP tension: with no reliable communication path, a node cannot both always answer and always be globally current.
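The R1/R2 scenario above can be sketched as a small simulation. This is a toy model, not any real database's behavior: the `Replica` class, the `"CP"`/`"AP"` mode labels, the `outbox` used to stand in for asynchronous replication, and the starting value x=4 are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


class Unavailable(Exception):
    """Raised when a replica refuses to answer rather than risk a stale read."""


@dataclass
class Replica:
    name: str
    mode: str  # "CP" or "AP" (hypothetical labels for this sketch)
    data: dict = field(default_factory=dict)
    outbox: list = field(default_factory=list)  # writes not yet replicated
    peer: Optional["Replica"] = field(default=None, repr=False, compare=False)
    partitioned: bool = False  # True when this replica cannot reach its peer

    def write(self, key, value):
        # Accept locally and queue the write for asynchronous replication.
        self.data[key] = value
        self.outbox.append((key, value))

    def flush(self):
        # Replication only succeeds while the link to the peer is healthy.
        if self.partitioned or self.peer is None:
            return
        for key, value in self.outbox:
            self.peer.data[key] = value
        self.outbox.clear()

    def read(self, key):
        if self.partitioned and self.mode == "CP":
            # Cannot confirm the latest state with the peer, so refuse.
            raise Unavailable(f"{self.name} cannot confirm latest value of {key!r}")
        # AP mode (or no partition): answer from local state, possibly stale.
        return self.data.get(key)


def run(mode):
    r1, r2 = Replica("R1", mode), Replica("R2", mode)
    r1.peer, r2.peer = r2, r1
    r1.write("x", 4)
    r1.flush()                               # x=4 reaches R2 while healthy
    r1.write("x", 5)                         # step 1: client writes x=5 to R1
    r1.partitioned = r2.partitioned = True   # step 2: partition isolates R1 from R2
    r1.flush()                               # replication attempt goes nowhere
    try:
        return r2.read("x")                  # step 3: another client reads from R2
    except Unavailable:
        return "unavailable"
```

Under these assumptions, `run("AP")` returns the stale value 4, while `run("CP")` returns `"unavailable"`: the same partition, two different failure behaviors.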

flowchart TD
    A[Partition detected between replica groups] --> B{Request arrives}
    B --> C[CP path]
    B --> D[AP path]
    C --> E[Require quorum or leader confirmation]
    E --> F[If quorum unavailable then reject or timeout]
    F --> G[State remains strongly consistent]
    D --> H[Accept request on reachable replica]
    H --> I[Return possibly stale or conflicting value]
    I --> J[Reconcile later via repair or conflict resolution]

CP vs AP With Concrete Systems

CP behavior (consistency-first during partition)

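Representative systems: ZooKeeper- and etcd-style coordination services, majority-quorum relational deployments.

Concrete effect: requests that cannot reach a quorum or the leader are rejected or time out, so the minority side of a partition stops accepting writes.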

ZooKeeper-style mindset: "If I cannot prove this write is globally safe, I will not accept it."
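A minimal sketch of that consistency-first rule, assuming a simple majority quorum. The function names, the `QuorumUnavailable` exception, and the idea of passing in a count of reachable nodes are hypothetical; real coordination services decide this through their consensus protocol rather than a single function call.

```python
class QuorumUnavailable(Exception):
    """Raised when this side of the partition cannot form a majority."""


def majority(cluster_size: int) -> int:
    # Smallest node count that is more than half of the cluster.
    return cluster_size // 2 + 1


def cp_write(log: list, entry, reachable_nodes: int, cluster_size: int) -> int:
    """Append entry only if a majority is reachable.

    reachable_nodes counts this node plus every peer it can currently contact.
    Returns the index of the appended entry.
    """
    needed = majority(cluster_size)
    if reachable_nodes < needed:
        # Consistency first: refuse rather than accept a write that the
        # other side of the partition might contradict.
        raise QuorumUnavailable(
            f"need {needed} of {cluster_size} nodes, only {reachable_nodes} reachable"
        )
    log.append(entry)
    return len(log) - 1
```

For a five-node cluster split 3/2, the three-node side still satisfies `majority(5) == 3` and keeps accepting writes, while the two-node side raises `QuorumUnavailable`. Because two disjoint groups cannot both hold a majority, at most one side can ever commit.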

AP behavior (availability-first during partition)

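Representative systems: Dynamo-style stores (Amazon's original Dynamo design, Cassandra at low consistency levels) and many multi-region eventually consistent setups. Amazon DynamoDB defaults to eventually consistent reads but also offers strongly consistent reads, so it is better described as tunable than as purely AP.

Concrete effect: reachable replicas keep accepting requests, reads may return stale or conflicting values, and divergence is reconciled after the partition heals.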

Dynamo-style mindset: "Keep accepting traffic now, converge state later."
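The "converge later" half needs an explicit merge rule. The sketch below uses last-writer-wins (LWW) keyed on a (timestamp, node id) pair, which is only one of several possible strategies and silently discards the losing concurrent write. The function names and the tuple layout are assumptions for illustration, not any particular store's API.

```python
from typing import Any, Dict, Tuple

# (logical timestamp, node id, value); the node id breaks timestamp ties.
Versioned = Tuple[int, str, Any]


def ap_write(store: Dict[str, Versioned], key: str, value: Any,
             timestamp: int, node_id: str) -> None:
    # Availability first: accept the write locally with no coordination.
    store[key] = (timestamp, node_id, value)


def reconcile(a: Dict[str, Versioned], b: Dict[str, Versioned]) -> Dict[str, Versioned]:
    """Merge two replica states after the partition heals (last-writer-wins)."""
    merged = dict(a)
    for key, candidate in b.items():
        current = merged.get(key)
        # Compare only (timestamp, node id); the higher pair wins.
        if current is None or candidate[:2] > current[:2]:
            merged[key] = candidate
    return merged


# Two replicas accept conflicting writes while partitioned.
left: Dict[str, Versioned] = {}
right: Dict[str, Versioned] = {}
ap_write(left, "cart", ["book"], timestamp=10, node_id="A")
ap_write(right, "cart", ["pen"], timestamp=12, node_id="B")
ap_write(left, "name", "ann", timestamp=5, node_id="A")
merged = reconcile(left, right)
```

Here the later write to `cart` wins and the `["book"]` version is dropped, which is exactly the data-loss risk LWW carries. Because the merge is deterministic and order-independent, both replicas reach the same state regardless of who merges first; data types where losing a write is unacceptable (a shopping cart is a classic case) usually call for a merge that preserves both sides instead.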

CAP Is About Partition Time, Not Normal Time

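This is one of the most important interview points: CAP constrains behavior only while a partition is in effect. In normal operation a system can offer both consistency and availability, so its CAP classification describes how it fails, not how it runs day to day.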

Practical implication: ask "What happens in the bad 0.1% network case?" rather than evaluating only happy-path latency graphs.

PACELC Extension (What You Face Daily)

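CAP explains partition behavior, but most daily engineering happens without active partition alarms. PACELC extends the model: if there is a Partition (P), choose between Availability (A) and Consistency (C); Else (E), in normal operation, choose between Latency (L) and Consistency (C).

So even without partitions, distributed databases still force a design choice: wait for more replicas to acknowledge (stronger consistency, higher latency) or answer from fewer replicas (lower latency, weaker consistency).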

This is why engineers spend so much time on read consistency levels, session guarantees, quorum sizes, and timeout policy tuning.
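Quorum sizing is the most mechanical of these knobs. With N replicas, a write acknowledged by W of them, and a read that consults R of them, every read quorum is guaranteed to overlap every write quorum when R + W > N. The helper below is a hypothetical sketch of that arithmetic for strict quorums only; it says nothing about sloppy quorums, concurrent writers, or clock issues.

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum intersects every write quorum (R + W > N)."""
    if not (1 <= r <= n and 1 <= w <= n):
        raise ValueError("r and w must each be between 1 and n")
    return r + w > n


# A few common settings for a three-replica group.
configs = {
    "R=2, W=2": quorums_overlap(3, 2, 2),  # overlap: reads see the latest acked write
    "R=1, W=1": quorums_overlap(3, 1, 1),  # no overlap: fastest, reads may be stale
    "R=1, W=3": quorums_overlap(3, 1, 3),  # overlap: cheap reads, expensive writes
}
```

The latency side of the PACELC trade-off shows up directly in these numbers: a larger R or W means waiting on more replicas per request, and W=3 also means a single unreachable replica blocks writes.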

.NET System Design Relevance

For senior .NET interviews, tie CAP/PACELC to concrete platform choices instead of abstract definitions.

SQL Server Always On availability groups with synchronous commit (CP-leaning)

Azure Cosmos DB (tunable consistency)

Redis (AP-leaning in cache usage patterns)

Mixed-store architecture is normal

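Most production .NET systems are not globally CP or AP. They are operation-scoped: each operation gets consistency-first or availability-first handling, depending on whether a stale answer or a rejected request is the more costly failure.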

That per-operation selection is usually what interviewers want to hear.
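One way to make per-operation selection concrete is a policy table that maps each operation to its partition-time behavior. The sketch below is written in Python for brevity rather than C#; the operation names, the `Mode` enum, and the `handle` function are invented for illustration and do not correspond to any framework.

```python
from enum import Enum


class Mode(Enum):
    CP = "reject when consistency cannot be confirmed"
    AP = "serve locally and reconcile later"


# Hypothetical operations; the right assignment depends on the business rules.
POLICY = {
    "charge_payment": Mode.CP,        # a double charge is worse than an error page
    "reserve_inventory": Mode.CP,     # overselling is worse than a retry
    "read_product_catalog": Mode.AP,  # slightly stale data is acceptable
    "update_cart": Mode.AP,           # merge conflicting carts afterwards
}


def handle(operation: str, partitioned: bool) -> str:
    # Unknown operations default to the safer, consistency-first behavior.
    mode = POLICY.get(operation, Mode.CP)
    if not partitioned:
        return "ok"
    return "rejected" if mode is Mode.CP else "ok (may be stale)"
```

During a partition, `handle("charge_payment", True)` is rejected while `handle("update_cart", True)` still succeeds with a staleness caveat. Keeping the table explicit also makes the trade-offs reviewable in one place instead of scattered through handlers.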

Pitfalls

Pitfall 1: "CAP means pick two of three"

Pitfall 2: Treating CAP choice as system-wide and static

Pitfall 3: Ignoring reconciliation design in AP paths

Interview Questions
