The lock_timeout Death Spiral: Why Every Postgres Migration Needs a Timeout
Your migration grabs an ACCESS EXCLUSIVE lock. It queues behind a long-running query. Every new connection piles up behind it. In 30 seconds, your entire database is frozen. Here's the fix.
Your migration is fine. The SQL is correct. The lock mode is known. The safe rewrite is applied.
And it still takes down production.
The scenario
You deploy a migration at 2:15 PM. It runs ALTER TABLE orders ADD COLUMN verified BOOLEAN, which needs an ACCESS EXCLUSIVE lock on the orders table.
But there’s a long-running analytics query that started at 2:14 PM. It holds an ACCESS SHARE lock on orders. Your DDL statement can’t acquire its lock, so it waits.
Here’s where it gets bad.
Every new query that arrives after your DDL statement tries to access orders. PostgreSQL is fair to the waiting DDL, so later conflicting queries line up behind your migration, not behind the original analytics query.
2:14:00 Analytics query starts (ACCESS SHARE on orders)
2:15:00 ALTER TABLE starts, waits for lock
2:15:01 SELECT * FROM orders ... waits behind ALTER TABLE
2:15:01 INSERT INTO orders ... waits behind ALTER TABLE
2:15:02 SELECT * FROM orders ... waits behind ALTER TABLE
2:15:02 UPDATE orders SET ... waits behind ALTER TABLE
...every query on `orders` is now queued
Within seconds, your connection pool is saturated. Healthchecks fail. Pods restart. The analytics query eventually finishes, your DDL executes in 200ms, and the queue drains. But the damage is done: 30 seconds of downtime caused by a migration that was technically correct.
This is the lock queue death spiral.
Why it happens
The death spiral has three ingredients:
- A DDL statement that needs a lock strong enough to block hot traffic, often ACCESS EXCLUSIVE
- A concurrent session holding a conflicting lock (a long query, an idle transaction, a forgotten
pg_dump) - No timeout on the DDL’s lock acquisition
Ingredients 1 and 2 are normal. You can’t avoid them. Production databases have concurrent activity. That’s the whole point.
Ingredient 3 is the one you control.
The fix: SET lock_timeout
SET lock_timeout = '2s';
ALTER TABLE orders ADD CONSTRAINT chk_status
CHECK (status IS NOT NULL) NOT VALID;
With lock_timeout = '2s', PostgreSQL will wait at most 2 seconds to acquire the lock. If it can’t get it, the statement fails with:
ERROR: canceling statement due to lock timeout
Your migration fails. But your database stays alive. You retry in a minute when the conflicting query finishes.
Two seconds of failed migration vs. thirty seconds of frozen database. That’s the tradeoff, and it’s not close.
What about statement_timeout?
statement_timeout limits how long a statement can execute, not how long it can wait for a lock. They solve different problems:
| Setting | Controls | Protects against |
|---|---|---|
lock_timeout | Lock acquisition wait time | Death spiral from queued locks |
statement_timeout | Total execution time | Runaway queries, infinite backfills |
You need both. A CREATE INDEX on a 100M-row table might legitimately need 10 minutes to build, but should never wait more than a few seconds for the initial lock.
SET lock_timeout = '2s';
SET statement_timeout = '10min';
CREATE INDEX CONCURRENTLY IF NOT EXISTS idx_orders_status ON orders(status);
The ordering trap
Setting lock_timeout is necessary. Setting it after the dangerous statement is useless.
-- BAD: the ALTER runs without protection
ALTER TABLE orders ADD COLUMN verified BOOLEAN;
SET lock_timeout = '2s'; -- too late
This is surprisingly common in auto-generated migrations. The ORM emits the DDL first, and a developer adds the timeout at the bottom of the file.
More policy checks that prevent incidents
The lock queue death spiral is the most common policy failure, but not the only one. Other session settings that are good defaults for production migrations:
SET idle_in_transaction_session_timeout = '30s'
Prevents orphaned transactions. If your migration starts a transaction and then hangs (network issue, crashed process), the open transaction holds its locks indefinitely. Other sessions queue behind it. Same death spiral, different cause.
SET application_name = 'pgfence-migration'
Does not prevent anything directly. But when you’re debugging a lock queue at 2 AM, pg_stat_activity showing pgfence-migration instead of unknown tells you instantly which process to kill.
CREATE INDEX CONCURRENTLY inside a transaction
This one doesn’t cause a death spiral. It just fails immediately:
ERROR: CREATE INDEX CONCURRENTLY cannot run inside a transaction block
But it fails at runtime, not at review time. If your CI doesn’t catch it, the first time you learn about it is during deployment.
What a migration header should look like
A strong default migration header is:
SET lock_timeout = '2s';
SET statement_timeout = '10min';
SET idle_in_transaction_session_timeout = '30s';
SET application_name = 'migration_20260320_add_status_check';
Then the DDL. This is the pattern that prevents the death spiral, caps runaway operations, cleans up orphaned transactions, and makes debugging possible.
Enforcing this automatically
You could add these lines manually. You will forget. Your teammate will forget. The new hire will definitely forget.
pgfence enforces all of these as policy checks:
$ pgfence analyze migrations/add_status_check.sql
Policy Violations:
ERROR Missing SET lock_timeout: lock queue death spiral risk
-> Add SET lock_timeout = '2s'; at the start of the migration
WARN Missing SET statement_timeout: long operations need a timeout
-> Add SET statement_timeout = '10min'; at the start
WARN Missing SET idle_in_transaction_session_timeout
-> Add SET idle_in_transaction_session_timeout = '30s';
WARN Missing SET application_name
-> Add SET application_name = 'migration_name';
In CI mode, missing-lock-timeout is an error (exit code 1). The migration can’t merge without it.
pgfence also catches:
lock_timeoutset to 0: equivalent to no timeout at all (the Postgres default)lock_timeoutset after dangerous DDL: ordering violation, the dangerous statement runs unprotectedlock_timeoutexceeding threshold: a 60-second lock_timeout defeats the purpose- NOT VALID + VALIDATE CONSTRAINT in the same transaction: the whole point of NOT VALID is to split the work across two transactions
This is an area pgfence treats as first-class migration safety input. The SQL text matters, but the session configuration matters too: a perfectly reasonable statement can still cause an incident if it waits forever or runs with the wrong transaction settings.
The rule
Every migration must start with SET lock_timeout. The value should be low enough that a failed acquisition is cheaper than a death spiral (2-5 seconds is typical). The statement must appear before any DDL.
This is not a best practice. It is a production requirement. Without it, every migration is one long-running query away from a full outage.
pgfence enforces it. Get started in 60 seconds.