Design a Payment System: Transaction Processing

Key Insights

Payment systems require idempotency at every layer—duplicate charges destroy user trust faster than any other bug, and network failures make retries inevitable
State machines aren’t just elegant design; they’re your primary defense against invalid operations that could leave money in limbo between states
The saga pattern with compensating transactions beats two-phase commit for payment flows because you need to handle partial failures gracefully, not pretend they won’t happen

Introduction & System Requirements

Payment processing sits at the intersection of everything that makes distributed systems hard: you need exactly-once semantics in a world of at-least-once delivery, you’re coordinating with external providers you don’t control, and the cost of bugs is measured in real money and regulatory fines.

Before diving into architecture, let’s establish what we’re building.

Functional Requirements:

Process card payments (authorize, capture, refund, void)
Support multiple payment providers (Stripe, Adyen, Braintree)
Handle partial refunds and split payments
Provide real-time transaction status

Non-Functional Requirements:

99.99% uptime (at most about 52.6 minutes of downtime per year)
Low-latency authorization (p99 < 500 ms)
PCI DSS compliance (or scope reduction through tokenization)
Audit trail for every state change

The 99.99% target is particularly challenging because you depend on external providers who won’t give you that SLA. Your architecture must account for provider failures without failing customer transactions.
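A rough calculation shows why routing across several providers helps. The sketch below assumes provider outages are independent, which is optimistic because shared card networks and cloud regions correlate failures. The 99.9% per-provider figure is illustrative, not a quoted SLA, and the function names are made up for this example.

```typescript
// Availability of a pool of providers when any single one can serve a request.
// Assumes independent failures, which real providers only approximate.
function combinedAvailability(providerAvailabilities: number[]): number {
  const allDown = providerAvailabilities.reduce((acc, a) => acc * (1 - a), 1);
  return 1 - allDown;
}

// Yearly downtime, in minutes, implied by an availability figure.
function downtimeMinutesPerYear(availability: number): number {
  const minutesPerYear = 365 * 24 * 60; // 525,600
  return (1 - availability) * minutesPerYear;
}

const single = 0.999; // illustrative per-provider availability
const dual = combinedAvailability([single, single]);

console.log(downtimeMinutesPerYear(0.9999).toFixed(1)); // the 99.99% budget
console.log(downtimeMinutesPerYear(single).toFixed(1)); // one provider alone
console.log(downtimeMinutesPerYear(dual).toFixed(1));   // two providers with failover
```

Under these assumptions a single 99.9% provider burns roughly ten times the yearly budget, while two with failover fit comfortably inside it. The result only holds if your own failover path is itself reliable.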

High-Level Architecture

A production payment system needs clear separation between orchestration logic and provider-specific implementations. Here’s the component breakdown:

┌─────────────────────────────────────────────────────────────────┐
│                         API Gateway                              │
│              (Rate limiting, authentication, TLS)                │
└─────────────────────────┬───────────────────────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────────┐
│                   Payment Orchestrator                           │
│         (State machine, saga coordination, routing)              │
└──────┬──────────────────┬──────────────────────┬────────────────┘
       │                  │                      │
┌──────▼──────┐   ┌───────▼───────┐   ┌─────────▼─────────┐
│   Stripe    │   │    Adyen      │   │    Braintree      │
│   Adapter   │   │    Adapter    │   │    Adapter        │
└─────────────┘   └───────────────┘   └───────────────────┘
       │                  │                      │
       └──────────────────┴──────────────────────┘
                          │
┌─────────────────────────▼───────────────────────────────────────┐
│                     Event Bus (Kafka)                            │
│            (Transaction events, webhooks, reconciliation)        │
└─────────────────────────────────────────────────────────────────┘

The Payment Orchestrator is the brain—it manages transaction state, coordinates multi-step flows, and routes to the appropriate provider. Provider adapters translate our internal API to provider-specific formats.

interface PaymentService {
  authorize(request: AuthorizeRequest): Promise<AuthorizeResponse>;
  capture(transactionId: string, amount?: Money): Promise<CaptureResponse>;
  refund(transactionId: string, amount: Money, reason: string): Promise<RefundResponse>;
  void(transactionId: string): Promise<VoidResponse>;
  getTransaction(transactionId: string): Promise<Transaction>;
}

interface AuthorizeRequest {
  idempotencyKey: string;
  merchantId: string;
  amount: Money;
  paymentMethod: PaymentMethodToken;
  metadata?: Record<string, string>;
}

interface Money {
  amount: number;  // Minor units (cents)
  currency: string; // ISO 4217
}

Notice the idempotencyKey as a required field—not optional. This forces callers to think about retry behavior upfront.
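The adapter boundary mentioned above can be sketched as follows. Everything here is hypothetical: `ProviderAdapter`, `ExampleProviderAdapter`, `ProviderRouter`, and the wire format are invented for illustration and do not correspond to any real provider's API.

```typescript
// Money mirrors the interface shown earlier; the other types are assumptions.
interface Money { amount: number; currency: string }

interface ProviderAuthRequest {
  idempotencyKey: string;
  amount: Money;
  paymentMethodToken: string;
}

interface ProviderAuthResult {
  providerTransactionId: string;
  approved: boolean;
}

// Every provider sits behind the same narrow interface.
interface ProviderAdapter {
  readonly name: string;
  authorize(req: ProviderAuthRequest): Promise<ProviderAuthResult>;
}

// Adapter for an imaginary provider with a made-up wire format.
class ExampleProviderAdapter implements ProviderAdapter {
  readonly name = 'example';

  async authorize(req: ProviderAuthRequest): Promise<ProviderAuthResult> {
    const payload = this.toWireFormat(req);
    // A real adapter would send the payload over HTTPS here.
    return { providerTransactionId: `ex_${payload.idempotency_key}`, approved: true };
  }

  // Translate the internal request into the provider's field names.
  toWireFormat(req: ProviderAuthRequest): Record<string, string> {
    return {
      amount_minor: String(req.amount.amount),
      currency: req.amount.currency.toLowerCase(),
      source: req.paymentMethodToken,
      idempotency_key: req.idempotencyKey, // forwarded so the provider can dedupe too
    };
  }
}

// The orchestrator looks adapters up by name and never sees provider formats.
class ProviderRouter {
  private adapters = new Map<string, ProviderAdapter>();

  register(adapter: ProviderAdapter): void {
    this.adapters.set(adapter.name, adapter);
  }

  get(name: string): ProviderAdapter {
    const adapter = this.adapters.get(name);
    if (!adapter) throw new Error(`Unknown provider: ${name}`);
    return adapter;
  }
}
```

Keeping translation in one method per adapter means a provider API change touches one file, and the orchestrator can switch providers by name.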

Transaction State Machine

Every payment transaction moves through a defined set of states. Modeling this explicitly as a state machine prevents bugs like capturing an already-refunded transaction or refunding more than was captured.

                    ┌──────────────┐
                    │  INITIATED   │─────────────────────────┐
                    └──────┬───────┘                         │
                           │ authorize()                     │ fails
                    ┌──────▼───────┐                  ┌──────▼───────┐
              ┌─────│  AUTHORIZED  │─ fails/expires ─►│    FAILED    │
              │     └──────┬───────┘                  └──────────────┘
        void()│            │ capture()
       ┌──────▼───────┐    │
       │    VOIDED    │    │
       └──────────────┘    │
                    ┌──────▼───────┐
                    │   CAPTURED   │─────┐
                    └──────┬───────┘     │ refund() (partial)
                           │      ┌──────▼─────────────┐
           refund() (full) │      │ PARTIALLY_REFUNDED │
                           │      └──────┬─────────────┘
                    ┌──────▼───────┐     │ refund() (remainder)
                    │   REFUNDED   │◄────┘
                    └──────────────┘

Here’s a state machine implementation that enforces valid transitions:

enum TransactionState {
  INITIATED = 'INITIATED',
  AUTHORIZED = 'AUTHORIZED',
  CAPTURED = 'CAPTURED',
  VOIDED = 'VOIDED',
  REFUNDED = 'REFUNDED',
  PARTIALLY_REFUNDED = 'PARTIALLY_REFUNDED',
  FAILED = 'FAILED',
}

type TransactionEvent = 'AUTHORIZE' | 'CAPTURE' | 'VOID' | 'REFUND' | 'PARTIAL_REFUND' | 'FAIL';

const VALID_TRANSITIONS: Record<TransactionState, Partial<Record<TransactionEvent, TransactionState>>> = {
  [TransactionState.INITIATED]: {
    AUTHORIZE: TransactionState.AUTHORIZED,
    FAIL: TransactionState.FAILED,
  },
  [TransactionState.AUTHORIZED]: {
    CAPTURE: TransactionState.CAPTURED,
    VOID: TransactionState.VOIDED,
    FAIL: TransactionState.FAILED,
  },
  [TransactionState.CAPTURED]: {
    REFUND: TransactionState.REFUNDED,                    // Full refund
    PARTIAL_REFUND: TransactionState.PARTIALLY_REFUNDED,  // Part of the captured amount
  },
  [TransactionState.PARTIALLY_REFUNDED]: {
    REFUND: TransactionState.REFUNDED,                    // Refund the remaining balance
    PARTIAL_REFUND: TransactionState.PARTIALLY_REFUNDED,  // Another partial refund
  },
  [TransactionState.VOIDED]: {},
  [TransactionState.REFUNDED]: {},
  [TransactionState.FAILED]: {},
};

class TransactionStateMachine {
  constructor(private transaction: Transaction) {}

  canTransition(event: TransactionEvent): boolean {
    return VALID_TRANSITIONS[this.transaction.state][event] !== undefined;
  }

  transition(event: TransactionEvent): TransactionState {
    const nextState = VALID_TRANSITIONS[this.transaction.state][event];
    if (!nextState) {
      throw new InvalidTransitionError(
        `Cannot ${event} transaction in state ${this.transaction.state}`
      );
    }
    return nextState;
  }
}

The state machine becomes your source of truth. Before any operation, check canTransition(). This catches bugs at the application layer rather than relying on database constraints or provider rejections.
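An in-memory check alone does not stop two workers from validating the same snapshot and both proceeding. One common remedy is to make the state write conditional on the state and version that were read. The sketch below uses an in-memory map as a stand-in for a database table; `InMemoryTransactionStore` and `compareAndSetState` are hypothetical names, and the SQL in the comment is only indicative.

```typescript
type State = 'INITIATED' | 'AUTHORIZED' | 'CAPTURED';

interface StoredTransaction { id: string; state: State; version: number }

// In-memory stand-in for a transactions table (hypothetical).
class InMemoryTransactionStore {
  private rows = new Map<string, StoredTransaction>();

  insert(row: StoredTransaction): void {
    this.rows.set(row.id, { ...row });
  }

  // Returns a copy, like a row snapshot read by a worker.
  get(id: string): StoredTransaction | undefined {
    const row = this.rows.get(id);
    return row ? { ...row } : undefined;
  }

  // Equivalent in spirit to:
  //   UPDATE transactions SET state = :next, version = version + 1
  //   WHERE id = :id AND state = :expected AND version = :version
  // Returns true only if the row still matched what the caller read.
  compareAndSetState(id: string, expected: State, version: number, next: State): boolean {
    const row = this.rows.get(id);
    if (!row || row.state !== expected || row.version !== version) return false;
    row.state = next;
    row.version += 1;
    return true;
  }
}

const store = new InMemoryTransactionStore();
store.insert({ id: 'txn_1', state: 'AUTHORIZED', version: 1 });

// Two workers read the same snapshot, then both try to capture.
const seenByA = store.get('txn_1')!;
const seenByB = store.get('txn_1')!;
const aWon = store.compareAndSetState('txn_1', seenByA.state, seenByA.version, 'CAPTURED');
const bWon = store.compareAndSetState('txn_1', seenByB.state, seenByB.version, 'CAPTURED');
console.log(aWon, bWon); // true false
```

The losing worker should reload the transaction and re-run the transition check instead of calling the provider.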

Idempotency & Exactly-Once Processing

Network failures, timeouts, and client retries are inevitable. Without idempotency, a timeout during authorization could result in duplicate charges when the client retries.

class IdempotencyMiddleware {
  constructor(
    private cache: Redis,
    private ttlSeconds: number = 86400  // 24 hours
  ) {}

  async execute<T>(
    idempotencyKey: string,
    operation: () => Promise<T>
  ): Promise<T> {
    const cacheKey = `idempotency:${idempotencyKey}`;
    
    // Check for existing result
    const cached = await this.cache.get(cacheKey);
    if (cached) {
      const result = JSON.parse(cached);
      if (result.status === 'COMPLETED') {
        return result.response;
      }
      if (result.status === 'IN_PROGRESS') {
        throw new ConflictError('Request already in progress');
      }
    }

    // Mark as in-progress with NX (only set if not exists)
    const acquired = await this.cache.set(
      cacheKey,
      JSON.stringify({ status: 'IN_PROGRESS', startedAt: Date.now() }),
      'EX', this.ttlSeconds,
      'NX'
    );

    if (!acquired) {
      // Race condition: another request started between our GET and SET
      throw new ConflictError('Request already in progress');
    }

    try {
      const response = await operation();
      
      // Store successful result
      await this.cache.set(
        cacheKey,
        JSON.stringify({ status: 'COMPLETED', response }),
        'EX', this.ttlSeconds
      );
      
      return response;
    } catch (error) {
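      // Clear the in-progress marker on failure so retries work.
      // Caution: this also runs after ambiguous failures such as timeouts,
      // where the provider may already have processed the charge.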
      await this.cache.del(cacheKey);
      throw error;
    }
  }
}

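Critical implementation details: the NX flag makes the set atomic, so when two requests with the same key race past the initial GET, only one of them proceeds. Failed operations clear the key so legitimate retries succeed, but that is only safe when the failure is definitive; if the call timed out, the provider may still have charged the card, so forward the same idempotency key to the provider as well. The IN_PROGRESS marker shares the 24-hour TTL, which means a worker that crashes mid-operation blocks that key until it expires; giving the marker a much shorter TTL avoids this. For completed results, the 24-hour TTL balances memory usage against the retry window.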
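One gap the middleware leaves open is a client reusing an idempotency key with a different request body, in which case it would silently return the earlier response. A possible guard, sketched here as an assumption and not part of the middleware above, is to store a canonical fingerprint of the request next to the cached result and compare it on every hit. `canonicalize`, `fingerprint`, and `assertSameRequest` are hypothetical helpers. A production version would likely store a hash of the canonical string, not the string itself.

```typescript
// Stable serialization: object keys are sorted so property order
// does not change the fingerprint.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .filter((k) => obj[k] !== undefined)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${entries.join(',')}}`;
  }
  return JSON.stringify(value) ?? 'null';
}

// The canonical string doubles as the fingerprint in this sketch.
function fingerprint(request: unknown): string {
  return canonicalize(request);
}

// Run this when a cached entry already exists for the idempotency key.
function assertSameRequest(storedFingerprint: string, incoming: unknown): void {
  if (storedFingerprint !== fingerprint(incoming)) {
    throw new Error('Idempotency key reused with a different request body');
  }
}

const original = { merchantId: 'm1', amount: { amount: 500, currency: 'USD' } };
const reordered = { amount: { currency: 'USD', amount: 500 }, merchantId: 'm1' };
console.log(fingerprint(original) === fingerprint(reordered)); // true
```

Rejecting a mismatched body turns a silent wrong answer into an explicit client error.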

Handling Distributed Transactions

A typical payment flow involves multiple steps: authorize the card, run fraud checks, capture funds. If fraud check fails after authorization, you need to void the auth. This is the saga pattern in action.

Two-phase commit doesn’t work here because you can’t hold locks across external provider calls, and providers don’t support prepare/commit protocols.

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│  Authorize  │────►│ Fraud Check │────►│   Capture   │
└──────┬──────┘     └──────┬──────┘     └──────┬──────┘
       │                   │                   │
       │ Compensate:       │ Compensate:       │ Compensate:
       │ Void Auth         │ (none needed)     │ Refund
       ▼                   ▼                   ▼

interface SagaStep<T> {
  name: string;
  execute: (context: T) => Promise<T>;
  compensate: (context: T) => Promise<void>;
}

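class PaymentSaga {
  // Collaborators are injected; the interface names here are placeholders.
  constructor(
    private paymentProvider: PaymentProvider,
    private fraudService: FraudService,
    private alerting: Alerting
  ) {}

  private steps: SagaStep<PaymentContext>[] = [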
    {
      name: 'authorize',
      execute: async (ctx) => {
        const authResult = await this.paymentProvider.authorize(ctx.request);
        return { ...ctx, authorizationId: authResult.id };
      },
      compensate: async (ctx) => {
        if (ctx.authorizationId) {
          await this.paymentProvider.void(ctx.authorizationId);
        }
      },
    },
    {
      name: 'fraudCheck',
      execute: async (ctx) => {
        const fraudResult = await this.fraudService.check(ctx);
        if (fraudResult.risk > 0.8) {
          throw new FraudDetectedError('High risk transaction');
        }
        return { ...ctx, fraudCheckPassed: true };
      },
      compensate: async () => {}, // No compensation needed
    },
    {
      name: 'capture',
      execute: async (ctx) => {
        const captureResult = await this.paymentProvider.capture(ctx.authorizationId);
        return { ...ctx, captureId: captureResult.id };
      },
      compensate: async (ctx) => {
        if (ctx.captureId) {
          await this.paymentProvider.refund(ctx.captureId, ctx.request.amount);
        }
      },
    },
  ];

  async execute(request: PaymentRequest): Promise<PaymentResult> {
    let context: PaymentContext = { request };
    const completedSteps: SagaStep<PaymentContext>[] = [];

    for (const step of this.steps) {
      try {
        context = await step.execute(context);
        completedSteps.push(step);
      } catch (error) {
        // Compensate in reverse order
        for (const completedStep of completedSteps.reverse()) {
          try {
            await completedStep.compensate(context);
          } catch (compensationError) {
            // Log and continue—compensation failures need manual intervention
            this.alerting.critical('Compensation failed', { step: completedStep.name, error: compensationError });
          }
        }
        throw error;
      }
    }

    return { transactionId: context.captureId, status: 'SUCCESS' };
  }
}

Compensation failures are the hardest edge case. When a refund fails during compensation, you need alerting and manual reconciliation processes. Never silently swallow these errors.
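One way to keep a failed compensation from getting lost is to retry it a few times and then park it for manual reconciliation. The sketch below is an assumption-laden illustration: `compensateWithRetry` and `ParkedCompensation` are hypothetical names, the queue is a plain array where a real system would use a durable table or topic, and the compensation itself must be idempotent because it may run more than once.

```typescript
interface ParkedCompensation {
  step: string;
  attempts: number;
  lastError: string;
}

// Retries a compensation with exponential backoff. If every attempt fails,
// the work is parked for a human to reconcile and never dropped.
async function compensateWithRetry(
  step: string,
  compensate: () => Promise<void>,
  parked: ParkedCompensation[],
  maxAttempts = 3,
  baseDelayMs = 100,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<boolean> {
  let lastError = '';
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await compensate();
      return true;
    } catch (err) {
      lastError = err instanceof Error ? err.message : String(err);
      if (attempt < maxAttempts) {
        // Delays double each time: baseDelayMs, then 2x, then 4x, ...
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  parked.push({ step, attempts: maxAttempts, lastError });
  return false;
}
```

Inside the saga's compensation loop, a `false` return is the point to raise the critical alert, since the parked entry now represents money in an unknown state.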

Failure Handling & Reconciliation

Payment providers go down. When they do, you need circuit breakers to fail fast and reconciliation to fix inconsistencies.
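The circuit-breaker half can be sketched as a small three-state object wrapped around each provider adapter. This is a minimal illustration, not a production implementation: the class shape, the thresholds, and the injected clock are all assumptions, and callers must always report an outcome (for example in a `try`/`finally`) or the breaker stays half-open.

```typescript
type BreakerState = 'CLOSED' | 'OPEN' | 'HALF_OPEN';

class CircuitBreaker {
  private state: BreakerState = 'CLOSED';
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold = 5,
    private cooldownMs = 30000,
    private now: () => number = () => Date.now(), // injectable for tests
  ) {}

  // Call before each provider request; false means fail fast.
  allowRequest(): boolean {
    if (this.state === 'CLOSED') return true;
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.cooldownMs) return false;
      this.state = 'HALF_OPEN';
      return true; // a single probe request is let through
    }
    return false; // HALF_OPEN: the probe is still in flight
  }

  recordSuccess(): void {
    this.state = 'CLOSED';
    this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    this.consecutiveFailures += 1;
    // A failed probe, or too many failures in a row, opens the circuit.
    if (this.state === 'HALF_OPEN' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

let clock = 0;
const breaker = new CircuitBreaker(2, 1000, () => clock);
breaker.recordFailure();
breaker.recordFailure();               // threshold reached, circuit opens
console.log(breaker.allowRequest());   // false
clock = 1000;                          // cooldown elapsed
console.log(breaker.allowRequest());   // true
breaker.recordSuccess();
console.log(breaker.getState());       // CLOSED
```

When the breaker refuses a request, the orchestrator can route to another provider instead of failing the customer. Reconciliation, the other half, follows.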

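class PaymentReconciler {
  // Collaborators are injected; the interface names here are placeholders.
  constructor(
    private transactionStore: TransactionStore,
    private paymentProvider: PaymentProvider
  ) {}
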
  async reconcile(startDate: Date, endDate: Date): Promise<ReconciliationReport> {
    const localTransactions = await this.transactionStore.findByDateRange(startDate, endDate);
    const providerTransactions = await this.paymentProvider.listTransactions(startDate, endDate);
    
    const discrepancies: Discrepancy[] = [];
    const providerMap = new Map(providerTransactions.map(t => [t.id, t]));

    for (const local of localTransactions) {
      const provider = providerMap.get(local.providerTransactionId);
      
      if (!provider) {
        discrepancies.push({
          type: 'MISSING_AT_PROVIDER',
          localTransaction: local,
          action: local.state === TransactionState.CAPTURED ? 'INVESTIGATE' : 'MARK_FAILED',
        });
        continue;
      }

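      // Compare field by field: !== on two Money objects compares references,
      // not values. Assumes both sides carry amounts as Money.
      if (
        local.amount.amount !== provider.amount.amount ||
        local.amount.currency !== provider.amount.currency
      ) {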
        discrepancies.push({
          type: 'AMOUNT_MISMATCH',
          localTransaction: local,
          providerTransaction: provider,
          action: 'MANUAL_REVIEW',
        });
      }

      if (this.statesMismatch(local.state, provider.status)) {
        discrepancies.push({
          type: 'STATE_MISMATCH',
          localTransaction: local,
          providerTransaction: provider,
          action: 'SYNC_FROM_PROVIDER',
        });
      }

      providerMap.delete(local.providerTransactionId);
    }

    // Remaining provider transactions don't exist locally
    for (const orphan of providerMap.values()) {
      discrepancies.push({
        type: 'MISSING_LOCALLY',
        providerTransaction: orphan,
        action: 'CREATE_LOCAL_RECORD',
      });
    }

    return { discrepancies, processedCount: localTransactions.length };
  }
}

Run reconciliation continuously—not just daily. Provider webhooks can arrive out of order or fail entirely. The reconciler is your safety net.
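On the webhook side, a handler can at least refuse to move a transaction backwards when events arrive late or twice. The sketch below is one possible approach under simplifying assumptions: a linear progress ranking of three states, provider events that carry a unique `eventId`, and in-memory storage. `WebhookApplier` and its types are hypothetical.

```typescript
type PaymentState = 'AUTHORIZED' | 'CAPTURED' | 'REFUNDED';

// Progress rank: a webhook may only move a transaction forward.
const RANK: Record<PaymentState, number> = { AUTHORIZED: 1, CAPTURED: 2, REFUNDED: 3 };

interface WebhookEvent {
  eventId: string;        // assumed unique per provider event
  transactionId: string;
  state: PaymentState;
}

type ApplyResult = 'APPLIED' | 'DUPLICATE' | 'STALE';

class WebhookApplier {
  private seenEventIds = new Set<string>();
  private states = new Map<string, PaymentState>();

  apply(event: WebhookEvent): ApplyResult {
    // Redelivered events are acknowledged but not applied twice.
    if (this.seenEventIds.has(event.eventId)) return 'DUPLICATE';
    this.seenEventIds.add(event.eventId);

    // Events that would not advance the transaction are ignored.
    const current = this.states.get(event.transactionId);
    if (current !== undefined && RANK[event.state] <= RANK[current]) return 'STALE';

    this.states.set(event.transactionId, event.state);
    return 'APPLIED';
  }

  stateOf(transactionId: string): PaymentState | undefined {
    return this.states.get(transactionId);
  }
}

const applier = new WebhookApplier();
// The capture webhook overtakes the authorization webhook.
const first = applier.apply({ eventId: 'evt_2', transactionId: 'txn_1', state: 'CAPTURED' });
const second = applier.apply({ eventId: 'evt_1', transactionId: 'txn_1', state: 'AUTHORIZED' });
const third = applier.apply({ eventId: 'evt_2', transactionId: 'txn_1', state: 'CAPTURED' });
console.log(first, second, third); // APPLIED STALE DUPLICATE
```

This guard covers reordering and redelivery only. Webhooks that never arrive still have to be caught by the reconciler.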

Security & Compliance Considerations

PCI DSS compliance is non-negotiable for payment systems. The simplest path to compliance is scope reduction through tokenization: never let raw card numbers touch your servers.

Key security requirements:

Tokenization: Use provider tokens (Stripe’s tok_xxx) instead of storing card data
Encryption: TLS 1.3 in transit, AES-256 for sensitive data at rest
Audit logging: Every state change, every access, immutable logs
Key management: HSMs or cloud KMS for encryption keys

Payment security deserves its own deep-dive article. For now, the critical point is: design for PCI scope reduction from day one. Retrofitting security is expensive and error-prone.

Building a payment system that handles real money requires paranoid attention to edge cases. State machines, idempotency, and sagas aren’t optional patterns—they’re the minimum viable architecture for not losing money or customer trust.