Database ops is where most production incidents come from
A surprisingly large fraction of "the system is down" incidents trace to the database layer — bad migration, replica drift, backup failure, IO saturation, version upgrade gone wrong. The patterns below are what we apply to keep databases boring.

Backups, restored regularly
Already covered in detail in the Backup & Recovery thread, but database-specific:
- Logical backups (pg_dump for Postgres, mongodump for Mongo) — slow but portable
- Physical / block-level backups — fast, large, locked to the same engine version
- WAL archiving / continuous replication — point-in-time recovery
- Restore tested quarterly — minimum
The single biggest backup discipline gap we see: WAL archiving configured but never restore-tested. Test the restore.
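What "test the restore" can look like when automated, as a minimal sketch: it assumes Postgres, the stock pg_dump / pg_restore tooling, and a throwaway scratch database. Database names and the sanity-check table are illustrative, not from this post.

```python
# Restore test sketch: dump production, restore into a scratch DB, sanity-check.
import subprocess

PROD_DB = "appdb"             # hypothetical source database
SCRATCH_DB = "restore_check"  # hypothetical throwaway target
DUMP_FILE = "/tmp/appdb.dump"

def run(cmd):
    """Run a command and fail loudly, so a broken restore never passes silently."""
    subprocess.run(cmd, check=True)

# 1. Logical dump in custom format (compressed, restorable with pg_restore).
run(["pg_dump", "--format=custom", "--file", DUMP_FILE, PROD_DB])

# 2. Restore into a throwaway database.
run(["dropdb", "--if-exists", SCRATCH_DB])
run(["createdb", SCRATCH_DB])
run(["pg_restore", "--dbname", SCRATCH_DB, DUMP_FILE])

# 3. Sanity-check the restored data (illustrative table name).
run(["psql", "--dbname", SCRATCH_DB, "-c", "SELECT count(*) FROM important_table;"])
```

Wire something like this into a scheduled job that alerts on failure, and the quarterly restore test stops depending on someone remembering to do it.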
Replication strategies
- Async replica — primary writes, replica catches up. Standard. Risk: replica lags during high write load; reads from replica may show stale data.
- Sync replica — primary waits for replica acknowledgment. No data loss on failover, but write latency takes a hit.
- Multi-master — possible (Postgres BDR, MySQL Group Replication) but operationally complex. Conflict resolution is hard.
- Logical replication — granular, table-level replication. Useful for migrations, ETL.
For most workloads: a primary + 1-2 async replicas is enough. The sophistication should match the actual reliability requirement, not the diagrams.
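For the async-replica setup, replication lag is the number to watch. A minimal check against the primary, sketched assuming Postgres 10+ and psycopg2 (the DSN is illustrative):

```python
# Report how far each replica is behind the primary, in bytes of WAL.
import psycopg2

PRIMARY_DSN = "host=db-primary dbname=appdb"  # hypothetical connection string

with psycopg2.connect(PRIMARY_DSN) as conn, conn.cursor() as cur:
    cur.execute("""
        SELECT application_name,
               state,
               pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
        FROM pg_stat_replication;
    """)
    for name, state, lag_bytes in cur.fetchall():
        print(f"{name}: {state}, {lag_bytes or 0} bytes behind")
```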
Failover: tested
Failover is where many "high availability" claims fall apart. The discipline:
- Tested failover procedure — automated where possible
- Connection layer that handles the failover (pgBouncer, ProxySQL, application connection pool)
- Documented manual failover steps for the case automation fails
- Quarterly failover drill — actual failover, not just simulation
The team that's never failed over the production database doesn't actually know what happens during a failover.
Schema migrations: the highest-risk regular activity
- Backward-compatible migrations — the schema change is deployed before the application code that uses it. Application can roll back without rolling back the schema.
- Two-phase additions — add new column / table, deploy code that writes both old and new, backfill, deploy code that reads new, deploy code that stops writing old, drop old. Long but safe.
- Lock-free changes where possible — `ALTER TABLE ... ADD COLUMN` is fast in modern Postgres; add constraints as `NOT VALID`, then run `VALIDATE CONSTRAINT` as a separate step.
- Long-running migrations in batches — backfill 1 000 rows at a time, sleep between batches, monitor (see the sketch below).
The deploy that locks the production table for four hours is the deploy that produces the multi-hour outage. Engineering effort spent on lock-free migration patterns pays back immediately.
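A minimal sketch of the batched-backfill pattern from the list above, assuming psycopg2; the `orders.new_status` column being filled from an existing `status` column is hypothetical:

```python
# Backfill in small batches: short transactions, short locks, breathing room for replicas.
import time
import psycopg2

DSN = "host=db-primary dbname=appdb"  # hypothetical
BATCH_SIZE = 1000
PAUSE_SECONDS = 0.5

conn = psycopg2.connect(DSN)
conn.autocommit = True  # each batch commits on its own, keeping locks short

with conn.cursor() as cur:
    while True:
        # Update at most BATCH_SIZE rows that still need backfilling.
        cur.execute("""
            UPDATE orders
            SET new_status = status
            WHERE id IN (
                SELECT id FROM orders
                WHERE new_status IS NULL
                LIMIT %s
                FOR UPDATE SKIP LOCKED
            );
        """, (BATCH_SIZE,))
        if cur.rowcount == 0:
            break                   # nothing left to backfill
        time.sleep(PAUSE_SECONDS)   # let the primary and replicas catch up
conn.close()
```

`SKIP LOCKED` lets the backfill coexist with live writes instead of queueing behind them.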
Connection management
- Connection pooling at the application level (Hikari, pgBouncer)
- Server-side connection limits set conservatively
- Idle connection timeouts to prevent leaks
- Connection retry with exponential backoff in the application
- Circuit breaker for "database is down" → degrade gracefully, don't queue infinitely
A common production pattern: 50 application instances × a pool of 100 connections each = 5 000 connections, far beyond Postgres's default max_connections of 100. The fix: pgBouncer in front, multiplexing many client connections over a small pool of server connections.
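The retry-with-backoff item above as a sketch, assuming psycopg2. The DSN, timeouts, and retry limits are illustrative, and the final failure is left to whatever circuit breaker or degradation path the application already has:

```python
# Connect with jittered exponential backoff instead of hammering a recovering database.
import random
import time
import psycopg2

DSN = "host=pgbouncer dbname=appdb"  # hypothetical: connect via pgBouncer

def connect_with_backoff(max_attempts=5, base_delay=0.5, max_delay=10.0):
    """Retry the connection with exponential backoff plus jitter."""
    for attempt in range(max_attempts):
        try:
            return psycopg2.connect(DSN, connect_timeout=3)
        except psycopg2.OperationalError:
            if attempt == max_attempts - 1:
                raise  # hand over to the circuit breaker / graceful degradation
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay + random.uniform(0, delay / 2))  # jitter avoids a thundering herd

conn = connect_with_backoff()
```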
Observability
For databases, log and track:
- Slow query log (queries above a threshold)
- Lock waits
- Replication lag
- Connection counts
- Cache hit ratios
- Disk usage and IOPS
The team that learns to read these logs catches problems early. The team that runs blind discovers them through customer reports.
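Most of these are one query away in Postgres. A sketch of a few of the checks, assuming psycopg2 and the pg_stat_statements extension; note that `mean_exec_time` is the Postgres 13+ column name (older versions call it `mean_time`), and the DSN is illustrative:

```python
# A handful of health queries: slowest statements, cache hit ratio, connection count.
import psycopg2

DSN = "host=db-primary dbname=appdb"  # hypothetical

CHECKS = {
    "slowest queries": """
        SELECT query, calls, mean_exec_time
        FROM pg_stat_statements
        ORDER BY mean_exec_time DESC
        LIMIT 5;
    """,
    "cache hit ratio": """
        SELECT round(sum(blks_hit)::numeric
                     / nullif(sum(blks_hit) + sum(blks_read), 0), 4)
        FROM pg_stat_database;
    """,
    "connection count": "SELECT count(*) FROM pg_stat_activity;",
}

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    for name, sql in CHECKS.items():
        cur.execute(sql)
        print(name, cur.fetchall())
```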
Upgrades: planned, tested, rolled out
- Major version upgrades happen on a planned cadence — not "we'll do it eventually"
- Test on a staging copy of production data
- Schedule for low-traffic windows with a rollback plan
- Use logical replication for major version upgrades on Postgres (near-zero downtime; see the sketch below)
The team running an EOL major version is the team with mounting CVEs and increasing migration debt.
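The logical-replication upgrade path, compressed into a sketch: publish on the old-version primary, subscribe from the new-version cluster, wait for it to catch up, cut over. DSNs and object names are illustrative, and the schema is assumed to already exist on the new cluster (e.g. restored with `pg_dump --schema-only`):

```python
# Logical replication between an old and a new Postgres major version.
import psycopg2

OLD_DSN = "host=pg14-primary dbname=appdb"  # hypothetical old-version primary
NEW_DSN = "host=pg16-primary dbname=appdb"  # hypothetical new-version cluster

old = psycopg2.connect(OLD_DSN)
new = psycopg2.connect(NEW_DSN)
old.autocommit = True
new.autocommit = True  # CREATE SUBSCRIPTION cannot run inside a transaction block

# 1. Publish every table on the old primary.
old.cursor().execute("CREATE PUBLICATION upgrade_pub FOR ALL TABLES;")

# 2. Subscribe from the new cluster; it copies existing data, then streams changes.
new.cursor().execute(
    "CREATE SUBSCRIPTION upgrade_sub "
    f"CONNECTION '{OLD_DSN}' PUBLICATION upgrade_pub;"
)

# 3. Watch lag on the old primary, then: stop writes, advance sequences on the new
#    cluster, repoint the application, drop the subscription.
```

Two caveats: sequences are not replicated logically (advance them at cutover), and DDL does not flow through the subscription, so schema changes need to be frozen or applied to both sides during the migration window.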
Scaling reads
- Read replicas for read-heavy workloads
- Caching at the application or query level
- Materialised views for expensive aggregate queries
- Denormalised secondary stores (Elasticsearch, etc.) for non-OLTP queries
Scaling reads is mostly tractable. Scaling writes is harder and architectural.
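As an example of the materialised-view option above (names are illustrative; the unique index is what allows `REFRESH ... CONCURRENTLY` to rebuild the view without blocking readers):

```python
# Create a materialised view for an expensive aggregate and refresh it without
# blocking reads. Autocommit keeps the DDL and the refresh out of one long transaction.
import psycopg2

DSN = "host=db-primary dbname=appdb"  # hypothetical

conn = psycopg2.connect(DSN)
conn.autocommit = True
cur = conn.cursor()

cur.execute("""
    CREATE MATERIALIZED VIEW IF NOT EXISTS daily_order_totals AS
    SELECT order_date, count(*) AS orders, sum(total) AS revenue
    FROM orders
    GROUP BY order_date;
""")
cur.execute("""
    CREATE UNIQUE INDEX IF NOT EXISTS daily_order_totals_pk
    ON daily_order_totals (order_date);
""")

# Run this periodically from a scheduler; readers keep hitting the old contents
# until the concurrent refresh completes.
cur.execute("REFRESH MATERIALIZED VIEW CONCURRENTLY daily_order_totals;")
conn.close()
```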
Scaling writes
- Vertical scaling (bigger instance) — first answer, often enough
- Sharding — operationally heavy, do it after vertical scaling has hit its limit
- CQRS / event sourcing — split write and read models. Architectural decision, not a database tweak
- Specialised stores for specific write patterns (time-series → Timescale, key-value → Redis / Dynamo)
One pattern we'd warn about
"It's just a small migration." Production migrations need the same discipline as production deployments. Review, staging test, rollback plan, off-peak timing.
One pattern that always pays off
A documented "what to do when the database is slow" runbook. Slow queries to investigate first, who to escalate to, what diagnostic to run. The 3 AM page is not the time to be inventing process.
What's your DB ops practice? And — for the major-version-upgrade folks — has logical replication held up for big PG version jumps in production?