Database operations that don't wake you up at 3 AM: backups, replicas, migrations

Aior
Administrator · Staff member

Database ops is where most production incidents come from

A surprisingly large fraction of "the system is down" incidents trace to the database layer — bad migration, replica drift, backup failure, IO saturation, version upgrade gone wrong. The patterns below are what we apply to keep databases boring.

Backups, restored regularly

Already covered in detail in the Backup & Recovery thread, but database-specific:
  • Logical backups (pg_dump for Postgres, mongodump for Mongo) — slow but portable
  • Physical / block-level backups — fast, large, locked to the same engine version
  • WAL archiving / continuous replication — point-in-time recovery
  • Restore tested quarterly — minimum

The single biggest backup discipline gap we see: WAL archiving configured but never restore-tested. Test the restore.
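
Restore tests catch the big failures; for day-to-day drift, a cheap guard on Postgres (assuming archive_mode is on) is to watch the built-in pg_stat_archiver view — a climbing failed_count means point-in-time recovery is already at risk:

```sql
-- Health check for WAL archiving. A growing failed_count or a recent
-- last_failed_time means archives are not landing where they should.
SELECT archived_count,
       last_archived_time,
       failed_count,
       last_failed_time
FROM pg_stat_archiver;
```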

Replication strategies
  • Async replica — primary writes, replica catches up. Standard. Risk: replica lags during high write load; reads from replica may show stale data.
  • Sync replica — primary waits for replica acknowledgment. No data loss on failover, but write latency takes a hit.
  • Multi-master — possible (Postgres BDR, MySQL Group Replication) but operationally complex. Conflict resolution is hard.
  • Logical replication — granular, table-level replication. Useful for migrations, ETL.

For most workloads, a primary plus 1-2 async replicas is enough. The sophistication should match the actual reliability requirement, not the ambition of the architecture diagram.
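
To keep the staleness risk of async replicas visible rather than theoretical, lag is one query away on Postgres 10+ (pg_stat_replication and pg_wal_lsn_diff are built-ins; alert thresholds are workload-specific):

```sql
-- On the primary: how far behind each replica is, in bytes of WAL.
SELECT client_addr,
       state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes
FROM pg_stat_replication;

-- On a replica: lag expressed as time since the last replayed transaction.
SELECT now() - pg_last_xact_replay_timestamp() AS replay_delay;
```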

Failover: tested
Failover is where many "high availability" claims fall apart. The discipline:
  • Tested failover procedure — automated where possible
  • Connection layer that handles the failover (pgBouncer, ProxySQL, application connection pool)
  • Documented manual failover steps for the case automation fails
  • Quarterly failover drill — actual failover, not just simulation

The team that's never failed over the production database doesn't actually know what happens during a failover.
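
For plain streaming replication on Postgres 12+, the promotion step of a drill is a single built-in call (the connection layer still has to redirect writes, and the old primary must be fenced off first):

```sql
-- On the chosen replica: confirm it is in recovery, then promote it.
SELECT pg_is_in_recovery();                          -- expect: true
SELECT pg_promote(wait => true, wait_seconds => 60);
SELECT pg_is_in_recovery();                          -- expect: false, node now takes writes
```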

Schema migrations: the highest-risk regular activity
  • Backward-compatible migrations — the schema change is deployed before the application code that uses it. Application can roll back without rolling back the schema.
  • Two-phase additions — add new column / table, deploy code that writes both old and new, backfill, deploy code that reads new, deploy code that stops writing old, drop old. Long but safe.
  • Lock-free changes where possible — `ALTER TABLE ADD COLUMN` is fast in modern Postgres. `ALTER TABLE ADD CONSTRAINT NOT VALID` then `VALIDATE CONSTRAINT` separately.
  • Long-running migrations in batches — backfill 1 000 rows at a time, sleep between batches, monitor.

The deploy that locks a production table for four hours is the deploy that produces the multi-hour outage. Engineering effort spent on lock-free migration patterns pays back immediately.
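
A sketch of the lock-friendly sequence in Postgres terms; `orders`, `customers`, and `legacy_customer_id` are hypothetical names, and exact lock behaviour varies by version:

```sql
-- Step 1: metadata-only change, effectively instant in modern Postgres.
ALTER TABLE orders ADD COLUMN customer_ref bigint;

-- Step 2: add the constraint without scanning the table...
ALTER TABLE orders
  ADD CONSTRAINT orders_customer_ref_fk
  FOREIGN KEY (customer_ref) REFERENCES customers (id) NOT VALID;

-- ...then validate separately: this scans, but under a much lighter lock.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_ref_fk;

-- Step 3: backfill in small batches; loop this from a script,
-- sleeping between runs and watching replication lag.
UPDATE orders
SET    customer_ref = legacy_customer_id
WHERE  id IN (SELECT id
              FROM orders
              WHERE customer_ref IS NULL
              LIMIT 1000);
```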

Connection management
  • Connection pooling at the application level (Hikari, pgBouncer)
  • Server-side connection limits set conservatively
  • Idle connection timeouts to prevent leaks
  • Connection retry with exponential backoff in the application
  • Circuit breaker for "database is down" → degrade gracefully, don't queue infinitely

A common production pattern: 50 application instances × a 100-connection pool each = 5,000 connections, far beyond Postgres's default max_connections of 100. The fix: pgBouncer in front, multiplexing many client connections over a small number of server connections.
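
Worth checking what the server actually sees before and after putting a pooler in front; pg_stat_activity is built in, and hundreds of idle connections are the usual tell:

```sql
-- Current connections per state; compare the total against max_connections.
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state
ORDER BY count(*) DESC;

SHOW max_connections;
```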

Observability
For databases, monitor at minimum:
  • Slow query log (queries above a threshold)
  • Lock waits
  • Replication lag
  • Connection counts
  • Cache hit ratios
  • Disk usage and IOPS

The team that learns to read these logs catches problems early. The team that runs blind discovers them through customer reports.
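
On Postgres, two of these signals are one query away, assuming the pg_stat_statements extension is installed (column names below are the PG 13+ ones):

```sql
-- Slowest queries by mean execution time.
SELECT calls,
       mean_exec_time,
       left(query, 80) AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;

-- Table-level cache hit ratio; sustained values well below ~0.99 suggest
-- the working set no longer fits in shared_buffers.
SELECT sum(heap_blks_hit)::float
       / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) AS heap_hit_ratio
FROM pg_statio_user_tables;
```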

Upgrades: planned, tested, rolled
  • Major version upgrades happen on a planned cadence — not "we'll do it eventually"
  • Test on a staging copy of production data
  • Schedule for low-traffic windows with a rollback plan
  • Use logical replication for major version upgrades on Postgres (near-zero downtime)

The team running an EOL major version is the team with mounting CVEs and increasing migration debt.
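
The Postgres logical-replication upgrade path, sketched; publication, subscription, and connection-string values here are placeholders, the publisher needs wal_level = logical, and the schema has to be loaded on the new server first (e.g. with pg_dump --schema-only):

```sql
-- On the old-version primary: publish everything.
CREATE PUBLICATION upgrade_pub FOR ALL TABLES;

-- On the new-version server, schema already in place: subscribe.
-- Initial data copy runs first, then changes stream in until cutover.
CREATE SUBSCRIPTION upgrade_sub
  CONNECTION 'host=old-primary dbname=app user=replicator'
  PUBLICATION upgrade_pub;
```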

Scaling reads
  • Read replicas for read-heavy workloads
  • Caching at the application or query level
  • Materialised views for expensive aggregate queries (sketched below)
  • Denormalised secondary stores (Elasticsearch, etc.) for non-OLTP queries

Scaling reads is mostly tractable. Scaling writes is harder and architectural.
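
For the materialised-view option, the Postgres mechanics look like this (view and table names are hypothetical; REFRESH ... CONCURRENTLY requires a unique index on the view):

```sql
-- Precompute an expensive aggregate once, query it cheaply.
CREATE MATERIALIZED VIEW daily_revenue AS
SELECT order_date, sum(amount) AS revenue
FROM orders
GROUP BY order_date;

-- Unique index so the view can be refreshed without blocking readers.
CREATE UNIQUE INDEX ON daily_revenue (order_date);

-- Re-run on a schedule; CONCURRENTLY keeps the view readable meanwhile.
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_revenue;
```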

Scaling writes
  • Vertical scaling (bigger instance) — first answer, often enough
  • Sharding — operationally heavy, do it after vertical scaling has hit its limit
  • CQRS / event sourcing — split write and read models. Architectural decision, not a database tweak
  • Specialised stores for specific write patterns (time-series → Timescale, key-value → Redis / Dynamo)

One pattern we'd warn about
"It's just a small migration." Production migrations need the same discipline as production deployments. Review, staging test, rollback plan, off-peak timing.

One pattern that always pays off
A documented "what to do when the database is slow" runbook: which queries to check first, who to escalate to, which diagnostics to run. The 3 AM page is not the time to be inventing process.
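
As a concrete "run this first" entry for such a runbook, one Postgres query that surfaces both long runners and lock waits (pg_stat_activity, PG 9.6+ column names):

```sql
-- What is running right now, for how long, and what it is waiting on.
SELECT pid,
       now() - query_start AS runtime,
       state,
       wait_event_type,
       wait_event,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC;
```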

What's your DB ops practice? And — for the major-version-upgrade folks — has logical replication held up for big PG version jumps in production?

 
