Distributed Transactions

Distributed Transactions

A distributed transaction coordinates a change that spans multiple services or databases, ensuring all participants either commit or roll back together. The challenge: unlike a local database transaction, there is no single transaction manager that can atomically commit across network boundaries. A network partition or crash between steps leaves the system in an inconsistent state unless you design for it explicitly.

Two main approaches exist: Two-Phase Commit (2PC) for strong consistency with synchronous coordination, and Saga for eventual consistency with compensating transactions. The Outbox pattern is a reliability primitive used alongside both.

Two-Phase Commit (2PC)

2PC uses a coordinator to drive all participants through two phases:

  1. Prepare: coordinator asks each participant "can you commit?" — each locks resources and votes yes/no.
  2. Commit/Abort: if all vote yes, coordinator sends commit; if any vote no, coordinator sends abort to all.
Coordinator → Participant A: PREPARE
Coordinator → Participant B: PREPARE
Participant A → Coordinator: VOTE YES (locks row)
Participant B → Coordinator: VOTE YES (locks row)
Coordinator → Participant A: COMMIT
Coordinator → Participant B: COMMIT

When it works: tightly coupled systems where all participants support XA transactions (e.g., two SQL databases in the same data center). MSDTC on Windows implements 2PC for SQL Server.

Why it fails in microservices: the coordinator is a single point of failure. If it crashes after sending PREPARE but before COMMIT, participants hold locks indefinitely. Network latency between services makes the prepare-to-commit window long, increasing lock contention. Most modern services (HTTP APIs, NoSQL stores) don't support XA.

Saga Pattern

A Saga breaks a distributed transaction into a sequence of local transactions, each with a compensating transaction that undoes its effect if a later step fails.

Step 1: Order Service → PlaceOrder (local commit)
Step 2: Payment Service → ChargeCard (local commit)
Step 3: Inventory Service → ReserveStock (local commit)

If Step 3 fails:
  Compensate Step 2: RefundCard
  Compensate Step 1: CancelOrder

Two coordination styles:

Choreography — each service publishes events and reacts to others' events. No central coordinator.

// OrderService publishes after local commit
await bus.PublishAsync(new OrderPlaced(orderId, amount), ct);

// PaymentService subscribes and reacts
public async Task HandleAsync(OrderPlaced evt, CancellationToken ct)
{
    await _payments.ChargeAsync(evt.OrderId, evt.Amount, ct);
    await _bus.PublishAsync(new PaymentCharged(evt.OrderId), ct);
}

Orchestration — a central saga orchestrator sends commands and tracks state.

public class OrderSaga : ISaga
{
    public async Task StartAsync(PlaceOrderCommand cmd, CancellationToken ct)
    {
        await _payments.SendAsync(new ChargeCardCommand(cmd.OrderId, cmd.Amount), ct);
    }

    public async Task HandleAsync(PaymentCharged evt, CancellationToken ct)
    {
        await _inventory.SendAsync(new ReserveStockCommand(evt.OrderId), ct);
    }

    public async Task HandleAsync(PaymentFailed evt, CancellationToken ct)
    {
        await _orders.SendAsync(new CancelOrderCommand(evt.OrderId), ct);
    }
}

Outbox Pattern

The Outbox pattern solves the "publish after commit" reliability problem: writing an event to the database in the same transaction as the domain change, then publishing from the outbox asynchronously.

// Single transaction: domain change + outbox entry
await using var tx = await db.Database.BeginTransactionAsync(ct);
db.Orders.Add(order);
db.OutboxMessages.Add(new OutboxMessage
{
    Type    = nameof(OrderPlaced),
    Payload = JsonSerializer.Serialize(new OrderPlaced(order.Id, order.Total))
});
await db.SaveChangesAsync(ct);
await tx.CommitAsync(ct);
// Background worker reads OutboxMessages and publishes to broker, retrying until acknowledged

Without the Outbox: if the broker publish fails after the DB commit, the event is lost. With the Outbox: the event is durable in the DB and will be published eventually.

Pitfalls

Compensating Transactions That Cannot Undo

What goes wrong: a compensation step fails or is impossible — e.g., you cannot "un-send" an email or "un-charge" a card if the payment provider has no refund API.

Why it happens: compensations are designed as happy-path reversals without considering external system limitations.

Mitigation: design compensations before implementing the forward path. For irreversible side effects (emails, SMS), use idempotent "cancel" semantics (send a cancellation email) rather than true undo. Accept that some compensations are best-effort.

Saga State Lost on Crash

What goes wrong: the orchestrator crashes mid-saga. On restart, it doesn't know which steps completed and which need compensation.

Why it happens: saga state is held in memory rather than persisted.

Mitigation: persist saga state to a database after each step. Use a saga framework (MassTransit, NServiceBus) that handles state persistence and retry automatically.

Tradeoffs

Approach Consistency Complexity Latency When to use
2PC Strong (ACID) Medium High (locks held during coordination) Same data center, XA-capable stores, low throughput
Saga (choreography) Eventual Low (no coordinator) Low Loosely coupled services, simple flows
Saga (orchestration) Eventual High (orchestrator state) Medium Complex multi-step flows, explicit visibility needed

Decision rule: avoid 2PC in microservices — the lock contention and coordinator SPOF make it impractical. Use Saga with choreography for simple 2-3 step flows. Use Saga with orchestration when you need explicit state tracking, retries, and visibility into long-running workflows. Always pair with the Outbox pattern for reliable event publishing.

Questions

References


Whats next