Backup Strategies

Backups protect against data loss. Without tested backups, every other reliability measure is incomplete — redundant systems guard against hardware failure, but only backups guard against accidental deletion, corruption, and ransomware.

The 3-2-1 Rule

The foundational principle: maintain 3 copies of your data, on 2 different media types, with 1 copy offsite. In practice, this means your production database, a local backup on separate storage, and a remote copy in a different geographic region. The rule is simple because the failure modes it protects against are not.

Recovery Objectives

Two numbers define your backup strategy:

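Recovery Point Objective (RPO) is how much data you can afford to lose, measured in time. An RPO of one hour means a backup must complete at least once an hour. An RPO of zero cannot be met by periodic backups; it requires synchronous replication, so that a write is acknowledged only once a second copy holds it.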

Recovery Time Objective (RTO) is how quickly you need to be operational again. A four-hour RTO means your restore process — including detection, decision-making, and verification — must complete within four hours.

These numbers drive every other decision: backup frequency, storage location, tooling, and automation.
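
As a rough illustration of how an RPO becomes something you can check, the sketch below compares the age of the newest backup against the objective. The function name backup_within_rpo and its argument order are assumptions made for this example, not part of any tool.

```sh
set -eu

# Succeed if the newest backup is no older than the RPO.
# Arguments: last backup time and current time (Unix epoch seconds),
# then the RPO in seconds.
backup_within_rpo() {
  local last_backup="$1" now="$2" rpo_seconds="$3"
  [ $(( now - last_backup )) -le "$rpo_seconds" ]
}
```

A monitoring job could feed it the modification time of the latest backup file and the current time, with 3600 as the third argument for a one-hour RPO, and raise an alert when the function fails.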

Backup Types

Full backups capture everything. They're the simplest to restore from but the most expensive to create and store. Most strategies use full backups as a periodic baseline.

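Incremental backups capture only what changed since the last backup of any type. They're fast and storage-efficient, but a restore needs the most recent full backup plus every incremental taken after it, applied in order. One missing or corrupt link makes everything after it unrecoverable.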
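
To make the restore chain concrete, here is a small sketch that works out which backups a restore needs. It assumes a hypothetical catalog format, one backup per line in chronological order, written as a type (full or incr) followed by a name; the function restore_chain is illustrative only.

```sh
set -eu

# Read "TYPE NAME" lines (oldest first) on stdin and print the names
# needed for a restore: the most recent full backup and every
# incremental taken after it, in the order they must be applied.
restore_chain() {
  awk '
    $1 == "full" { n = 0; seen_full = 1 }   # a new full backup starts a new chain
    seen_full    { chain[++n] = $2 }        # ignore incrementals with no full before them
    END          { for (i = 1; i <= n; i++) print chain[i] }
  '
}
```

Given a full backup on Sunday and Wednesday with incrementals on the other days, a restore on Friday needs only Wednesday's full backup and the incrementals from Thursday and Friday.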

Snapshots are point-in-time copies created at the filesystem or block storage level. They're nearly instantaneous to create and work well for consistent-state captures, but they're not a substitute for offsite backups — a snapshot stored on the same system it protects is vulnerable to the same failures.

Database Backups

Databases require special consideration because they maintain internal consistency guarantees that raw file copies can violate. See postgresql for Postgres-specific operational details.

Logical backups (like pg_dump) export the schema and data as SQL statements, or as an archive that pg_restore replays. They're portable, and the plain-text format is human-readable, but they're slow for large datasets. They're the right choice for small-to-medium databases where portability matters: migrating between providers, restoring individual tables, or seeding development environments.

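# Logical backup: custom format, compressed.
# pg_dump only runs in parallel (-j) with the directory format (-Fd),
# so -j is omitted here.
pg_dump -Fc -f backup.dump mydb

# Restore: pg_restore can load a custom-format archive in parallel
pg_restore -d mydb -j 4 backup.dump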

Physical backups (like pg_basebackup) copy the underlying data files. Combined with WAL (write-ahead log) archiving, they enable point-in-time recovery: restoring to any moment between the end of the base backup and the last archived WAL segment, not just the moment a backup was taken. This is the right strategy for production databases where RPO is measured in minutes or seconds.

# Physical backup: tar format (-Ft), gzip-compressed (-z),
# WAL streamed alongside the copy (-Xs), with progress reporting (-P)
pg_basebackup -D /backups/base -Ft -z -Xs -P
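
A base backup on its own restores only to the moment it was taken; point-in-time recovery also needs the WAL archive. The fragment below is a minimal sketch of the settings involved, assuming PostgreSQL 12 or later. The archive directory /mnt/server/archivedir and the target timestamp are placeholders, and a plain cp stands in for whatever tool actually ships WAL to your archive.

```conf
# postgresql.conf on the primary: archive each completed WAL segment
wal_level = replica
archive_mode = on
archive_command = 'test ! -f /mnt/server/archivedir/%f && cp %p /mnt/server/archivedir/%f'

# postgresql.conf on the restored copy: replay archived WAL up to a target time
restore_command = 'cp /mnt/server/archivedir/%f %p'
recovery_target_time = '2024-01-01 12:00:00+00'
```

Recovery begins when the server is started on the restored data directory with an empty recovery.signal file present in it.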

Object storage as a backup destination combines durability with cost efficiency. Ship backups to an S3-compatible store with lifecycle rules that transition older backups to cheaper storage tiers. See storage for the tradeoffs between storage types.

Automation

Manual backups don't happen reliably. Automate the full lifecycle:

Scheduled backups via cron, systemd timers, or your orchestrator's job scheduler
Upload to offsite storage immediately after completion
Retention policy enforcement — delete backups older than your retention window automatically
Alerting on failure — a missed backup should trigger an alert. The job_last_success_timestamp_seconds metric pattern from metrics works well here
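
Two of the steps above, retention enforcement and the success metric, can be sketched as small shell functions. This is an illustration under assumptions: backups are files ending in .dump in a single directory, GNU or BSD find is available, and something like a Prometheus textfile collector reads the metrics file. The function names are made up for the example.

```sh
set -eu

# Delete dump files older than the retention window.
# find counts whole 24-hour periods, so "+7" matches files at least 8 days old.
prune_old_backups() {
  local dir="$1" retention_days="$2"
  find "$dir" -maxdepth 1 -type f -name '*.dump' -mtime +"$retention_days" -delete
}

# Record the time of the last successful run in Prometheus text format.
# Writing to a temporary file and renaming it keeps readers from
# seeing a partially written file.
write_success_metric() {
  local metrics_file="$1" job="$2"
  printf 'job_last_success_timestamp_seconds{job="%s"} %s\n' "$job" "$(date +%s)" \
    > "$metrics_file.tmp"
  mv "$metrics_file.tmp" "$metrics_file"
}
```

A backup job would call both only after the dump and the offsite upload have succeeded, so that a failed run leaves the old timestamp in place and the alert fires once the timestamp is older than the backup interval.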

Testing

A backup that hasn't been restored is a hypothesis. Regular restore testing — automated where possible — is what turns backups from an insurance policy into an operational capability. Test the full path: retrieval, restoration, verification, and application reconnection.

A good restore test answers three questions: Can you retrieve the backup from offsite storage? Does the restored database pass integrity checks? Can the application connect and serve requests against the restored data? If any answer is no, the backup is not complete.
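
The three questions map naturally onto three checks run in order. The sketch below shows one possible shape for such a test. Every name in it is an assumption: the bucket URI, the scratch database, the health endpoint, and the function names are placeholders, and the SELECT 1 query only proves the restored database accepts connections. Real integrity checks, such as row counts or application invariants, depend on your data.

```sh
set -eu

# Placeholder settings; override them from the environment.
: "${BACKUP_URI:=s3://example-backups/latest.dump}"
: "${BACKUP_FILE:=/tmp/restore-test/backup.dump}"
: "${RESTORE_DB:=restore_test}"
: "${APP_HEALTH_URL:=http://localhost:8080/healthz}"

# 1. Can the backup be retrieved from offsite storage?
check_retrieval() { aws s3 cp --only-show-errors "$BACKUP_URI" "$BACKUP_FILE"; }

# 2. Does it restore into an empty scratch database that answers queries?
check_restore() {
  pg_restore --exit-on-error --dbname="$RESTORE_DB" "$BACKUP_FILE" &&
    psql --dbname="$RESTORE_DB" -Atc 'SELECT 1' >/dev/null
}

# 3. Can the application serve requests against the restored data?
check_application() { curl -fsS "$APP_HEALTH_URL" >/dev/null; }

# Run the given checks in order and stop at the first failure.
run_restore_test() {
  local check
  for check in "$@"; do
    if "$check"; then
      echo "PASS: $check"
    else
      echo "FAIL: $check"
      return 1
    fi
  done
}
```

A scheduled job would call run_restore_test check_retrieval check_restore check_application and alert on a non-zero exit status. Stopping at the first failure keeps the report pointed at the step that actually broke.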

For incident recovery procedures that depend on working backups, see incident-response.