Technical · January 20, 2024 · 15 min read

How a Database Deadlock Killed Our Production System (And How We Fixed It)

A deep dive into a critical production incident where database deadlocks brought down a high-traffic SaaS platform. Learn the root cause, our debugging process, and the architectural changes that prevented recurrence.

The Incident: 3 AM Wake-Up Call

It was 3:17 AM when PagerDuty woke me up. Our monitoring system detected that API response times had spiked from ~80ms to over 30 seconds. Within minutes, the entire platform became unresponsive. Over 15,000 active users were locked out, and our support channels exploded with complaints.

This is the story of how a subtle database deadlock pattern emerged under production load, how we diagnosed it under pressure, and the architectural changes we implemented to ensure it never happened again.

The Symptoms: What We Saw First

When I SSHed into our production Kubernetes cluster, the symptoms were clear but confusing:

  • API latency: p99 > 30s (normal: ~100ms)
  • Database connection pool: 100% exhausted (200/200 connections in use)
  • CPU and memory: completely normal (no resource saturation)
  • Error rate: 504 Gateway Timeouts spiking

The confusing part? CPU was at 15%, memory at 40%, disk I/O was minimal. This wasn't a resource exhaustion problem. Something else was blocking our database queries.

The Root Cause: A Hidden Deadlock Pattern

After connecting to our PostgreSQL instance and running SELECT * FROM pg_stat_activity WHERE state != 'idle';, I saw the smoking gun:

pid  | state  | wait_event_type | wait_event
-----|--------|-----------------|---------------
1234 | active | Lock            | transactionid
5678 | active | Lock            | transactionid
9012 | active | Lock            | transactionid
(... 197 more rows ...)

Nearly all 200 connections were waiting on transaction locks. Checking pg_locks revealed a circular dependency:

  • Transaction A held a lock on users table row ID 42, waiting for subscriptions row ID 89
  • Transaction B held a lock on subscriptions row ID 89, waiting for users row ID 42
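
If you ever need to surface blocking pairs like these yourself, PostgreSQL 9.6+ ships pg_blocking_pids(), which maps each waiting backend to the backends blocking it. A sketch, not the exact query I ran that night:

SELECT
  waiting.pid                   AS blocked_pid,
  waiting.query                 AS blocked_query,
  pg_blocking_pids(waiting.pid) AS blocked_by
FROM pg_stat_activity AS waiting
WHERE waiting.wait_event_type = 'Lock';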

Classic deadlock. But why was it happening at 3 AM, and why so catastrophically?

Debugging Under Pressure: The Detective Work

Step 1: Immediate Mitigation

First priority: restore service. I killed the deadlocked transactions:

SELECT pg_terminate_backend(pid) 
FROM pg_stat_activity 
WHERE wait_event = 'transactionid' 
AND state = 'active';

Within 30 seconds, the connection pool drained, latency dropped to normal, and users could access the platform again. Total downtime: 12 minutes.

Step 2: Understanding the Pattern

I pulled database logs for the past hour and reconstructed what happened. Our application had two critical code paths that could execute concurrently:

Code Path A (User Update Endpoint):

BEGIN;
UPDATE users SET last_active = NOW() WHERE id = ?;  -- Locks user row
UPDATE subscriptions SET usage_count = usage_count + 1 
  WHERE user_id = ?;  -- Then locks subscription row
COMMIT;

Code Path B (Subscription Webhook Handler):

BEGIN;
UPDATE subscriptions SET status = 'active' WHERE id = ?;  -- Locks subscription row
UPDATE users SET subscription_tier = 'premium' 
  WHERE id = ?;  -- Then locks user row
COMMIT;

The deadlock occurred when:

  1. Transaction A locked users.id=42
  2. Transaction B locked subscriptions.id=89 (for the same user)
  3. Transaction A tried to lock subscriptions.id=89 → blocked by B
  4. Transaction B tried to lock users.id=42 → blocked by A
  5. Deadlock.

Step 3: Why Now? The Traffic Pattern

This code had been deployed for 3 months. Why did it suddenly break at 3 AM?

Checking our analytics, we discovered that a new marketing campaign had driven a surge of premium subscription signups (triggering Code Path B frequently). Simultaneously, those new users were highly active, triggering Code Path A repeatedly.

The higher concurrency made the race condition inevitable. With 15,000 active users and subscription webhooks firing every few seconds, the probability of these two transactions overlapping increased from ~0.1% to ~15%. That was enough for deadlocks, and the lock queues piling up behind them, to exhaust the connection pool and take the platform down.

The Fix: Three Layers of Defense

Layer 1: Consistent Lock Ordering

The immediate fix: ensure all transactions acquire locks in the same order. We standardized on a single rule: always lock users first, then subscriptions:

-- BEFORE (Code Path B)
BEGIN;
UPDATE subscriptions SET status = 'active' WHERE id = ?;
UPDATE users SET subscription_tier = 'premium' WHERE id = ?;
COMMIT;

-- AFTER (Code Path B)
BEGIN;
-- CHANGED: Lock users first, even if we don't update it yet
SELECT id FROM users WHERE id = ? FOR UPDATE;
UPDATE subscriptions SET status = 'active' WHERE id = ?;
UPDATE users SET subscription_tier = 'premium' WHERE id = ?;
COMMIT;

This eliminated circular dependencies. If Transaction A locks users first, Transaction B will wait for that lock before locking subscriptions. No more cycles.
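
One way to keep that ordering from regressing as the codebase grows is to funnel every multi-table write through a small helper that always takes the users lock first. Here's a simplified sketch using node-postgres directly; the helper name and wiring are illustrative, not our exact code:

import { Pool, PoolClient } from 'pg';

const pool = new Pool(); // connection settings come from the usual PG* env vars

// Every transaction that touches both users and subscriptions goes through
// this helper, so the users row lock is always acquired first.
async function withUserLockedTransaction<T>(
  userId: number,
  work: (client: PoolClient) => Promise<T>
): Promise<T> {
  const client = await pool.connect();
  try {
    await client.query('BEGIN');
    // Canonical first lock: the users row.
    await client.query('SELECT id FROM users WHERE id = $1 FOR UPDATE', [userId]);
    const result = await work(client);
    await client.query('COMMIT');
    return result;
  } catch (err) {
    await client.query('ROLLBACK');
    throw err;
  } finally {
    client.release();
  }
}

Code review then has only one question to ask: does this write path go through the helper?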

Layer 2: Reduce Transaction Scope

We also realized that Code Path A didn't actually need to run in a transaction at all. The last_active timestamp update was purely informational and didn't require atomic consistency with the subscription update.

We split the operations:

// BEFORE: Single transaction
await db.transaction(async (tx) => {
  await tx.update(users).set({ lastActive: new Date() });
  await tx.update(subscriptions).set({ usageCount: sql`usage_count + 1` });
});

// AFTER: Separate operations
await db.update(users).set({ lastActive: new Date() });  // No transaction needed
await db.update(subscriptions).set({ usageCount: sql`usage_count + 1` });

This reduced lock duration by 60% and further decreased deadlock probability.

Layer 3: Deadlock Detection and Retry Logic

As a final safety net, we added deadlock detection with exponential backoff retry:

async function executeWithDeadlockRetry<T>(
  operation: () => Promise<T>,
  maxRetries = 3
): Promise<T> {
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await operation();
    } catch (error: any) {
      // PostgreSQL deadlock error code (40P01 = deadlock_detected)
      if (error.code === '40P01' && attempt < maxRetries - 1) {
        const backoffMs = Math.pow(2, attempt) * 100; // 100ms, then 200ms
        await new Promise(resolve => setTimeout(resolve, backoffMs));
        continue;
      }
      throw error;
    }
  }
  throw new Error('Max retries exceeded');
}

This catches the rare remaining deadlocks (PostgreSQL detects them and aborts one of the transactions) and retries with backoff. In practice, the first retry succeeds 99.9% of the time.
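
Wiring it in is mostly mechanical. Here's roughly what Code Path B looks like wrapped, using the same db client and the same shorthand as the earlier snippets (where clauses omitted):

// Code Path B wrapped in the retry helper: if PostgreSQL aborts the transaction
// with 40P01, the whole transaction is retried with backoff.
// The users update comes first, matching the canonical lock order from Layer 1.
await executeWithDeadlockRetry(() =>
  db.transaction(async (tx) => {
    await tx.update(users).set({ subscriptionTier: 'premium' });
    await tx.update(subscriptions).set({ status: 'active' });
  })
);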

Prevention: Monitoring and Alerting

To catch similar issues before they reach production, we added:

  1. Database lock monitoring: Alert if >10 queries are waiting on locks for >5 seconds
  2. Connection pool monitoring: Alert if pool utilization exceeds 80%
  3. Deadlock rate tracking: Log and graph deadlock occurrences (should be ~0 after our fix)
  4. Load testing with concurrency: Our CI now runs tests with 50 concurrent users to surface race conditions early
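
The first alert boils down to a single query against pg_stat_activity, run periodically (a sketch; the thresholds match the ones above):

SELECT count(*) AS lock_waiters
FROM pg_stat_activity
WHERE wait_event_type = 'Lock'
  AND state = 'active'
  AND now() - query_start > interval '5 seconds';
-- Alert if lock_waiters > 10

That wait_event_type = 'Lock' signature is exactly what we saw during the incident.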

The Results: What Changed

After deploying these fixes to production:

  • Zero deadlocks in 8 weeks of production traffic (previously: 2-5 per day)
  • p99 latency reduced from ~120ms to ~75ms (lower lock contention)
  • Connection pool utilization dropped from average 85% to 40%
  • Database CPU reduced by 25% (fewer blocked queries)

Key Lessons Learned

1. Race Conditions Are Probability Functions

This bug existed for months but only surfaced when traffic patterns changed. Low-probability race conditions become certainties at scale. Assume that if something can happen, it will happen under production load.

2. Consistent Lock Ordering Is Non-Negotiable

If your application uses transactions that lock multiple tables, enforce lock ordering in code reviews and linting rules. Document the canonical order (e.g., "always lock users, then subscriptions, then audit_logs").

3. Transaction Scope Should Be Minimal

Every line of code inside a transaction extends lock duration and increases deadlock risk. Aggressively minimize transaction scope. If an operation doesn't need atomicity, move it outside the transaction.

4. Monitoring Must Detect Lock Contention Early

By the time users report slow performance, you're already in crisis mode. Monitor pg_stat_activity and pg_locks proactively. Alert on abnormal lock wait times before they cascade into total failure.

5. Retries Are a Safety Net, Not a Solution

Deadlock retry logic is useful as a last line of defense, but if you're relying on it frequently, you haven't solved the underlying architectural problem. Fix the root cause first, then add retries for the remaining edge cases.

Conclusion: Production Reliability Is About Architecture

This incident reinforced a core principle: production reliability isn't about heroic debugging at 3 AM—it's about architectural discipline that prevents 3 AM wake-up calls in the first place.

The real fix wasn't the code changes. It was the systematic approach:

  1. Diagnose with data (logs, metrics, database introspection)
  2. Understand the root cause (not just the symptoms)
  3. Implement layered defenses (prevention + detection + mitigation)
  4. Add monitoring to catch recurrence early
  5. Load test to verify the fix under realistic concurrency
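
That last step doesn't need a heavy load-testing framework; even a small script that fires both code paths against the same user concurrently will shake out this class of bug. A sketch (endpoint paths and payloads are illustrative):

// Fire both code paths for the same user at once and fail loudly if anything errors.
async function concurrencySmokeTest(baseUrl: string, userId: number): Promise<void> {
  const requests = Array.from({ length: 50 }, (_, i) =>
    i % 2 === 0
      ? fetch(`${baseUrl}/api/users/${userId}/activity`, { method: 'POST' })
      : fetch(`${baseUrl}/api/webhooks/subscription`, {
          method: 'POST',
          headers: { 'content-type': 'application/json' },
          body: JSON.stringify({ userId, status: 'active' }),
        })
  );
  const responses = await Promise.all(requests);
  const failed = responses.filter((res) => !res.ok).length;
  if (failed > 0) {
    throw new Error(`${failed} of ${requests.length} concurrent requests failed`);
  }
}

The point isn't realistic load; it's forcing the two code paths to interleave on the same rows.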

If you're building a SaaS platform that processes real-world traffic, database deadlocks aren't a matter of "if" but "when." The difference between a minor incident and a catastrophic outage is whether you've designed your system to handle them gracefully.

Need help ensuring your production systems can handle scale without breaking? Let's talk. We specialize in diagnosing, fixing, and preventing exactly these kinds of production incidents.
