Database ops is where most production incidents come from
A surprisingly large fraction of "the system is down" incidents trace to the database layer — bad migration, replica drift, backup failure, IO saturation, version upgrade gone wrong. The patterns below are what we apply to keep databases boring.

Backups, restored regularly
Already covered in detail in the Backup & Recovery thread, but database-specific:
- Logical backups (pg_dump for Postgres, mongodump for Mongo) — slow but portable
- Physical / block-level backups — fast, large, locked to the same engine version
- WAL archiving / continuous replication — point-in-time recovery
- Restore tested quarterly — minimum
The single biggest backup discipline gap we see: WAL archiving configured but never restore-tested. Test the restore.
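What "test the restore" can look like when automated, as a minimal sketch: it assumes Postgres, the stock pg_dump / pg_restore tooling, and a throwaway scratch database. Database names and the sanity-check table are illustrative, not from this post.

```python
# Restore test sketch: dump production, restore into a scratch DB, sanity-check.
import subprocess

PROD_DB = "appdb"             # hypothetical source database
SCRATCH_DB = "restore_check"  # hypothetical throwaway target
DUMP_FILE = "/tmp/appdb.dump"

def run(cmd):
    """Run a command and fail loudly, so a broken restore never passes silently."""
    subprocess.run(cmd, check=True)

# 1. Logical dump in custom format (compressed, restorable with pg_restore).
run(["pg_dump", "--format=custom", "--file", DUMP_FILE, PROD_DB])

# 2. Restore into a throwaway database.
run(["dropdb", "--if-exists", SCRATCH_DB])
run(["createdb", SCRATCH_DB])
run(["pg_restore", "--dbname", SCRATCH_DB, DUMP_FILE])

# 3. Sanity-check the restored data (illustrative table name).
run(["psql", "--dbname", SCRATCH_DB, "-c", "SELECT count(*) FROM important_table;"])
```

Wire something like this into a scheduled job that alerts on failure, and the quarterly restore test stops depending on someone remembering to do it.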
Replication strategies
- Async replica — primary writes, replica catches up. Standard. Risk: replica lags during high write load; reads from replica may show stale data.
- Sync replica — primary waits for replica acknowledgment. No data loss on failover, but write latency takes a hit.
- Multi-master — possible (Postgres BDR, MySQL Group Replication) but operationally complex. Conflict resolution is hard.
- Logical replication — granular, table-level replication. Useful for migrations, ETL.
For most workloads: a primary + 1-2 async replicas is enough. The sophistication should match the actual reliability requirement, not the diagrams.
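For the async-replica setup, replication lag is the number to watch. A minimal check against the primary, sketched assuming Postgres 10+ and psycopg2 (the DSN is illustrative):

```python
# Report how far each replica is behind the primary, in bytes of WAL.
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=appdb"  # hypothetical connection string

with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"{name}: {state}, {lag_bytes or 0} bytes behind")
```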
Failover: tested
Failover is where many "high availability" claims fall apart. The discipline:
- Tested failover procedure — automated where possible
- Connection layer that handles the failover (pgBouncer, ProxySQL, application connection pool)
- Documented manual failover steps for the case automation fails
- Quarterly failover drill — actual failover, not just simulation
The team that's never failed over the production database doesn't actually know what happens during a failover.
Schema migrations: the highest-risk regular activity
- Backward-compatible migrations — the schema change is deployed before the application code that uses it. Application can roll back without rolling back the schema.
- Two-phase additions — add new column / table, deploy code that writes both old and new, backfill, deploy code that reads new, deploy code that stops writing old, drop old. Long but safe.
- Lock-free changes where possible — `ALTER TABLE ... ADD COLUMN` is fast in modern Postgres; add constraints as `NOT VALID`, then run `VALIDATE CONSTRAINT` as a separate step.
- Long-running migrations in batches — backfill 1 000 rows at a time, sleep between batches, monitor (see the sketch below).
The deploy that locks the production table for four hours is the deploy that produces the multi-hour outage. Engineering effort spent on lock-free migration patterns pays back immediately.
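A minimal sketch of the batched-backfill pattern from the list above, assuming psycopg2; the `orders.new_status` column being filled from an existing `status` column is hypothetical:

```python
# Backfill in small batches: short transactions, short locks, breathing room for replicas.
import time
import psycopg2

DSN = "host=db-primary dbname=appdb"  # hypothetical
BATCH_SIZE = 1000
PAUSE_SECONDS = 0.5

conn = psycopg2.connect(DSN)
conn.autocommit = True  # each batch commits on its own, keeping locks short

with conn.cursor() as cur:
    while True:
        # Update at most BATCH_SIZE rows that still need backfilling.
        cur.execute("""
            UPDATE orders
            SET new_status = status
            WHERE id IN (
                SELECT id FROM orders
                WHERE new_status IS NULL
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            );
        """, (BATCH_SIZE,))
        if cur.rowcount == 0:
            break                   # nothing left to backfill
        time.sleep(PAUSE_SECONDS)   # let the primary and replicas catch up
conn.close()
```

`SKIP LOCKED` lets the backfill coexist with live writes instead of queueing behind them.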
Connection management
- Connection pooling at the application level (Hikari, pgBouncer)
- Server-side connection limits set conservatively
- Idle connection timeouts to prevent leaks
- Connection retry with exponential backoff in the application
- Circuit breaker for "database is down" → degrade gracefully, don't queue infinitely
A common production pattern: 50 application instances × a pool of 100 connections each = 5 000 connections, far beyond Postgres's default max_connections of 100. The fix: pgBouncer in front, multiplexing many client connections over a small pool of server connections.
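The retry-with-backoff item above as a sketch, assuming psycopg2. The DSN, timeouts, and retry limits are illustrative, and the final failure is left to whatever circuit breaker or degradation path the application already has:

```python
# Connect with jittered exponential backoff instead of hammering a recovering database.
import random
import time
import psycopg2

DSN = "host=pgbouncer dbname=appdb"  # hypothetical: connect via pgBouncer

def connect_with_backoff(max_attempts=5, base_delay=0.5, max_delay=10.0):
    """Retry the connection with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return psycopg2.connect(DSN, connect_timeout=3)
        except psycopg2.OperationalError:
            if attempt == max_attempts - 1:
                raise  # hand over to the circuit breaker / graceful degradation
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids a thundering herd

conn = connect_with_backoff()
```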
Observability
For databases, log and track:
- Slow query log (queries above a threshold)
- Lock waits
- Replication lag
- Connection counts
- Cache hit ratios
- Disk usage and IOPS
The team that learns to read these logs catches problems early. The team that runs blind discovers them through customer reports.
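Most of these are one query away in Postgres. A sketch of a few of the checks, assuming psycopg2 and the pg_stat_statements extension; note that `mean_exec_time` is the Postgres 13+ column name (older versions call it `mean_time`), and the DSN is illustrative:

```python
# A handful of health queries: slowest statements, cache hit ratio, connection count.
import psycopg2

DSN = "host=db-primary dbname=appdb"  # hypothetical

CHECKS = {
    "slowest queries": """
        SELECT query, calls, mean_exec_time
        FROM pg_stat_statements
        ORDER BY mean_exec_time DESC
        LIMIT 5;
    """,
    "cache hit ratio": """
        SELECT round(sum(blks_hit)::numeric
                     / nullif(sum(blks_hit) + sum(blks_read), 0), 4)
        FROM pg_stat_database;
    """,
    "connection count": "SELECT count(*) FROM pg_stat_activity;",
}

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    for name, sql in CHECKS.items():
        cur.execute(sql)
        print(name, cur.fetchall())
```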
Upgrades: planned, tested, rolled out
- Major version upgrades happen on a planned cadence — not "we'll do it eventually"
- Test on a staging copy of production data
- Schedule for low-traffic windows with a rollback plan
- Use logical replication for major version upgrades on Postgres (near-zero downtime; see the sketch below)
The team running an EOL major version is the team with mounting CVEs and increasing migration debt.
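The logical-replication upgrade path, compressed into a sketch: publish on the old-version primary, subscribe from the new-version cluster, wait for it to catch up, cut over. DSNs and object names are illustrative, and the schema is assumed to already exist on the new cluster (e.g. restored with `pg_dump --schema-only`):

```python
# Logical replication between an old and a new Postgres major version.
import psycopg2

OLD_DSN = "host=pg14-primary dbname=appdb"  # hypothetical old-version primary
NEW_DSN = "host=pg16-primary dbname=appdb"  # hypothetical new-version cluster

old = psycopg2.connect(OLD_DSN)
new = psycopg2.connect(NEW_DSN)
old.autocommit = True
new.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block

# 1. Publish every table on the old primary.
old.cursor().execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# 2. Subscribe from the new cluster; it copies existing data, then streams changes.
new.cursor().execute(
    "CREATE SUBSCRIPTION upgrade_sub "
    f"CONNECTION '{OLD_DSN}' PUBLICATION upgrade_pub;"
)

# 3. Watch lag on the old primary, then: stop writes, advance sequences on the new
#    cluster, repoint the application, drop the subscription.
```

Two caveats: sequences are not replicated logically (advance them at cutover), and DDL does not flow through the subscription, so schema changes need to be frozen or applied to both sides during the migration window.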
Scaling reads
- Read replicas for read-heavy workloads
- Caching at the application or query level
- Materialised views for expensive aggregate queries
- Denormalised secondary stores (Elasticsearch, etc.) for non-OLTP queries
Scaling reads is mostly tractable. Scaling writes is harder and architectural.
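As an example of the materialised-view option above (names are illustrative; the unique index is what allows `REFRESH ... CONCURRENTLY` to rebuild the view without blocking readers):

```python
# Create a materialised view for an expensive aggregate and refresh it without
# blocking reads. Autocommit keeps the DDL and the refresh out of one long transaction.
import psycopg2

DSN = "host=db-primary dbname=appdb"  # hypothetical

conn = psycopg2.connect(DSN)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_order_totals AS
    SELECT order_date, count(*) AS orders, sum(total) AS revenue
    FROM orders
    GROUP BY order_date;
""")
cur.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS daily_order_totals_pk
    ON daily_order_totals (order_date);
""")

# Run this periodically from a scheduler; readers keep hitting the old contents
# until the concurrent refresh completes.
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_totals;")
conn.close()
```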
Scaling writes
- Vertical scaling (bigger instance) — first answer, often enough
- Sharding — operationally heavy, do it after vertical scaling has hit its limit
- CQRS / event sourcing — split write and read models. Architectural decision, not a database tweak
- Specialised stores for specific write patterns (time-series → Timescale, key-value → Redis / Dynamo)
One pattern we'd warn about
"It's just a small migration." Production migrations need the same discipline as production deployments. Review, staging test, rollback plan, off-peak timing.
One pattern that always pays off
A documented "what to do when the database is slow" runbook. Slow queries to investigate first, who to escalate to, what diagnostic to run. The 3 AM page is not the time to be inventing process.
What's your DB ops practice? And — for the major-version-upgrade folks — has logical replication held up for big PG version jumps in production?