How one ADD COLUMN migration took down our 12M-row table
A war story about a volatile-default footgun, why ORMs hide it from you, and the staged rollout pattern that prevents it.
Tuesday, 2:47 PM. Slack lights up.
ALERT: healthcheck failures on api-prod-3, api-prod-7. PgBouncer queue depth: 342.
The deploy had gone out fourteen minutes earlier. A TypeORM migration. One line:
ALTER TABLE users ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now();
Twelve million rows. ACCESS EXCLUSIVE lock. The entire users table was frozen while Postgres rewrote existing rows to materialize the non-constant default. Every query that touched users, which was basically every query, piled up behind it. PgBouncer’s connection pool filled. Healthchecks started failing. Kubernetes restarted pods, which opened more connections, which made it worse.
Forty-seven seconds of total lockout on a table that serves 2,400 requests per second.
What actually happened
Here’s the thing that trips people up: on Postgres 11+, ADD COLUMN ... DEFAULT <constant> is often instant. It’s a metadata-only operation. Postgres stores the default in pg_attrdef and fills it in lazily when rows are read.
But that is not true for expressions like now() or gen_random_uuid(). Those are non-constant defaults, so Postgres rewrites existing rows under ACCESS EXCLUSIVE. In our case, the column also needed NOT NULL, which meant the migration needed a staged rollout anyway.
The ORM hid this from us. TypeORM’s migration generator produced that single line. Nobody on the team knew the difference between a constant default like DEFAULT false (metadata-only on PG11+) and a non-constant default like DEFAULT now() (row rewrite under a strong lock). The PR looked clean. The reviewer approved it. CI passed.
The fix is boring
The safe pattern has been documented for years. It’s just tedious enough that people skip it:
-- Step 1: Add the column nullable (instant, metadata-only)
ALTER TABLE users ADD COLUMN IF NOT EXISTS created_at timestamptz;
-- Step 2: Backfill in batches (out-of-band, no long lock)
-- Do this in your application code or a script, not the migration
UPDATE users SET created_at = now()
WHERE id BETWEEN $start AND $end;
-- Step 3: Set the default for new rows
ALTER TABLE users ALTER COLUMN created_at SET DEFAULT now();
-- Step 4: Add the NOT NULL constraint safely
ALTER TABLE users ADD CONSTRAINT chk_created_at
CHECK (created_at IS NOT NULL) NOT VALID;
ALTER TABLE users VALIDATE CONSTRAINT chk_created_at;
ALTER TABLE users ALTER COLUMN created_at SET NOT NULL;
ALTER TABLE users DROP CONSTRAINT chk_created_at;
NOT VALID means Postgres doesn’t scan existing rows when adding the constraint. VALIDATE CONSTRAINT does the scan later under a weaker SHARE UPDATE EXCLUSIVE lock, so reads and writes continue normally. Then SET NOT NULL is instant because Postgres trusts the validated constraint.
Three migrations instead of one. Each schema step is brief, and the expensive work moves into the out-of-band backfill.
Why I built pgfence
After the incident, I went looking for something that could catch this in CI. The options were:
- strong_migrations: Ruby only. We’re a Node.js shop.
- Eugene: Rust, great tool, but no TypeORM/Knex extraction.
- Squawk: Rust, SQL-only linter.
I could not find anything in the TypeScript/Node ecosystem that could parse a TypeORM migration file, extract the SQL, and tell you “this is going to grab ACCESS EXCLUSIVE on your users table while it rewrites existing rows.”
So I built pgfence. It uses libpg-query, the actual PostgreSQL parser compiled to a C library, to produce a real AST. No regex. It knows the difference between a constant default that is metadata-only and a non-constant default that rewrites the table. It extracts SQL from TypeORM migrations, Knex migrations, Prisma migration files, Sequelize migrations, and plain SQL.
Run it in CI and it tells you exactly what lock each statement acquires, what it blocks, and what the safer rollout looks like:
pgfence analyze --ci --max-risk medium migrations/*.sql
If anything exceeds your risk threshold, the build fails. For patterns with a built rewrite, the safer sequence is right there in the output.
The lesson
Every team learns this lesson eventually. The question is whether you learn it from a postmortem or from a linter. I’d rather it be the linter.