Backup and Disaster Recovery for Startups: RTO/RPO Planning, AWS Backup, and Testing Your DR Plan
Introduction
Every startup says they have backups. Very few have actually tested restoring from them. Even fewer have a documented disaster recovery plan with defined recovery objectives. Then something goes wrong - a developer runs a destructive migration against production, a region goes down, ransomware encrypts your EBS volumes - and the team discovers their "backup strategy" was an untested assumption.
Disaster recovery does not have to be complicated or expensive, especially on AWS. But it does have to be intentional. This guide walks through building a real DR strategy: defining your recovery objectives, implementing the 3-2-1 backup rule, configuring AWS Backup, setting up cross-region replication, and most importantly, testing that everything actually works.
Understanding RTO and RPO
Before you configure a single backup, you need two numbers:
Recovery Point Objective (RPO): How much data can you afford to lose? If your RPO is 1 hour, you need backups at least every hour. If your RPO is zero, you need synchronous replication.
Recovery Time Objective (RTO): How long can your service be down before the business impact becomes unacceptable? If your RTO is 4 hours, you need a recovery process that can restore full service within 4 hours.
These numbers vary by system. Here is a realistic example for a SaaS startup:
| System | RPO | RTO | Strategy |
|---|---|---|---|
| Primary database (PostgreSQL) | 5 minutes | 30 minutes | Continuous WAL archiving + read replica |
| User uploads (S3) | Near 0 (replication is asynchronous) | 1 hour | Cross-region replication |
| Application servers | N/A (stateless) | 15 minutes | Auto-scaling group, multi-AZ |
| Redis cache | 1 hour | 15 minutes | Rebuild from DB if needed |
| Elasticsearch | 24 hours | 4 hours | Nightly snapshots to S3 |
| Configuration/secrets | 0 (no loss) | 30 minutes | Versioned in AWS Secrets Manager |
The key insight: not everything needs the same level of protection. Your production database needs 5-minute RPO. Your Elasticsearch cluster (which can be rebuilt from the database) can tolerate 24-hour RPO. Treating everything equally means overspending on DR for low-criticality systems.
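To make the RPO arithmetic concrete, here is a minimal sketch that checks whether a periodic backup schedule fits a given RPO. The function names are hypothetical helpers, not part of any AWS tooling, and the model is deliberately conservative: it assumes a backup only counts once it has finished, so the worst-case loss is the backup interval plus the time a backup takes to complete.

```shell
# rpo_check.sh: hypothetical helper functions for sanity-checking a schedule.

# worst_case_loss_minutes INTERVAL_MIN BACKUP_DURATION_MIN
# A failure just before a backup finishes loses everything since the previous
# backup started, so the worst case is the interval plus the backup duration.
worst_case_loss_minutes() {
  echo $(( $1 + $2 ))
}

# meets_rpo INTERVAL_MIN BACKUP_DURATION_MIN RPO_MIN
# Prints "ok" and returns 0 when the schedule fits the RPO,
# otherwise prints "violates" and returns 1.
meets_rpo() {
  if [ "$(worst_case_loss_minutes "$1" "$2")" -le "$3" ]; then
    echo "ok"
  else
    echo "violates"
    return 1
  fi
}
```

Under this model, hourly backups that take 10 minutes to complete (`meets_rpo 60 10 60`) would miss a strict 1-hour RPO, which is one reason to schedule backups more often than the RPO itself suggests.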
The 3-2-1 Backup Rule
The 3-2-1 rule is decades old and still the best framework for backup strategy:
- 3 copies of your data
- 2 different storage media/types
- 1 copy offsite (different region or provider)
In AWS terms:
- Copy 1: Production data (RDS instance, EBS volumes, S3 buckets)
- Copy 2: AWS Backup vault in the same region (different storage from production)
- Copy 3: Cross-region backup copy (different region entirely)
For truly critical data, consider a fourth copy outside AWS entirely (GCP bucket, Azure blob, or even Backblaze B2). This protects against the unlikely but possible scenario of an AWS account compromise.
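# Example: take a logical dump of the RDS database with pg_dump
# (pg_dump reads credentials from PGPASSWORD or ~/.pgpass)
DUMP_FILE="production-$(date +%Y%m%d).dump"
pg_dump -h your-rds-endpoint.region.rds.amazonaws.com \
  -U dbadmin -d production \
  --format=custom \
  --file="$DUMP_FILE"
# Encrypt (gpg prompts for a passphrase) and upload to offsite storage.
# As written, this targets an S3 bucket, which should live in a separate AWS account.
# For an S3-compatible external provider, add --endpoint-url and use a storage class it supports.
gpg --symmetric --cipher-algo AES256 "$DUMP_FILE"
aws s3 cp "$DUMP_FILE.gpg" s3://offsite-backups/ --storage-class GLACIER_IR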
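An offsite copy is only useful if it comes back intact. As a minimal sketch, the hypothetical helpers below record a SHA-256 checksum next to an encrypted dump before upload and verify it after download. They assume `sha256sum` from GNU coreutils (or a compatible implementation) is available, and that the file is referenced by the same path at both steps.

```shell
# write_checksum FILE
# Records a SHA-256 checksum in FILE.sha256; upload both files together.
write_checksum() {
  sha256sum "$1" > "$1.sha256"
}

# verify_checksum FILE
# Returns 0 only if FILE still matches the checksum recorded in FILE.sha256.
verify_checksum() {
  sha256sum -c "$1.sha256" >/dev/null 2>&1
}
```

A checksum catches corruption in transit or at rest. It does not prove the dump is restorable, so it complements a real restore test rather than replacing one.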
Configuring AWS Backup
AWS Backup provides a centralized service to manage backups across RDS, EBS, EFS, DynamoDB, S3, and more. Here is a baseline configuration using Terraform. It assumes the KMS keys, the backup IAM role, and an aws.dr_region provider alias are defined elsewhere:
# Backup vault
resource "aws_backup_vault" "main" {
  name        = "production-vault"
  kms_key_arn = aws_kms_key.backup.arn
  tags = {
    Environment = "production"
  }
}

# Cross-region vault for DR
resource "aws_backup_vault" "dr" {
  provider    = aws.dr_region
  name        = "production-vault-dr"
  kms_key_arn = aws_kms_key.backup_dr.arn
}

# Backup plan
resource "aws_backup_plan" "production" {
  name = "production-backup-plan"

  # Hourly backups retained for 24 hours
  rule {
    rule_name         = "hourly"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 * * * ? *)"
    start_window      = 60  # minutes
    completion_window = 120 # minutes
    lifecycle {
      delete_after = 1 # days
    }
  }

  # Daily backups retained for 30 days
  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 3 * * ? *)"
    start_window      = 60
    completion_window = 180
    lifecycle {
      # No cold_storage_after here: AWS Backup requires delete_after to be
      # at least 90 days greater than cold_storage_after
      delete_after = 30
    }
    # Copy to DR region
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
      lifecycle {
        delete_after = 30
      }
    }
  }

  # Weekly backups retained for 1 year
  rule {
    rule_name         = "weekly"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 4 ? * SUN *)"
    start_window      = 60
    completion_window = 300
    lifecycle {
      cold_storage_after = 30 # only applies to resource types that support cold storage
      delete_after       = 365
    }
  }
}

# Assign resources to backup plan
resource "aws_backup_selection" "production" {
  iam_role_arn = aws_iam_role.backup.arn
  name         = "production-resources"
  plan_id      = aws_backup_plan.production.id

  # Backup everything tagged with Backup=true
  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
Tag your resources to include them in the backup plan:
resource "aws_db_instance" "production" {
  # ... other config
  tags = {
    Backup      = "true"
    Environment = "production"
  }
}

resource "aws_ebs_volume" "data" {
  # ... other config
  tags = {
    Backup      = "true"
    Environment = "production"
  }
}
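Tag-based selection fails silently when someone forgets the tag, so it helps to compare what should be backed up against what is actually tagged. The sketch below is one way to do that. `tagged_arns` uses the Resource Groups Tagging API through the AWS CLI and needs credentials for the region you are checking. `missing_from_backup` is a hypothetical helper that compares two plain-text files of ARNs, and the list of expected ARNs is an inventory you would maintain yourself.

```shell
# tagged_arns
# Prints one ARN per line for every resource tagged Backup=true in the
# current region. Requires AWS CLI credentials.
tagged_arns() {
  aws resourcegroupstaggingapi get-resources \
    --tag-filters Key=Backup,Values=true \
    --query 'ResourceTagMappingList[].ResourceARN' \
    --output text | tr '\t' '\n'
}

# missing_from_backup EXPECTED_FILE TAGGED_FILE
# Prints every ARN listed in EXPECTED_FILE that does not appear in TAGGED_FILE.
# Both files hold one ARN per line.
missing_from_backup() {
  grep -vxFf "$2" "$1" || true
}

# Example usage (expected-arns.txt is your own inventory of critical resources):
#   tagged_arns > tagged-arns.txt
#   missing_from_backup expected-arns.txt tagged-arns.txt
```

Any ARN this prints is a resource you consider critical that the backup plan will never see.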
Cross-Region Replication
For services where you need faster recovery than restoring from backups, set up active replication:
RDS Cross-Region Read Replica:
# Primary in us-east-1
resource "aws_db_instance" "primary" {
  # ... other config
  identifier              = "production-primary"
  engine                  = "postgres"
  engine_version          = "16.2"
  instance_class          = "db.r6g.xlarge"
  multi_az                = true
  backup_retention_period = 7
  backup_window           = "03:00-04:00"
}

# Cross-region replica in us-west-2; promote it to a standalone primary during DR
resource "aws_db_instance" "dr_replica" {
  provider                   = aws.dr_region
  identifier                 = "production-dr-replica"
  replicate_source_db        = aws_db_instance.primary.arn # cross-region replicas reference the source by ARN
  instance_class             = "db.r6g.large" # Can be smaller to save cost; resize after promotion if needed
  auto_minor_version_upgrade = true
}
S3 Cross-Region Replication:
resource "aws_s3_bucket" "primary" {
  bucket = "myapp-uploads-primary"
}

resource "aws_s3_bucket_versioning" "primary" {
  bucket = aws_s3_bucket.primary.id
  versioning_configuration {
    status = "Enabled" # Required for replication
  }
}

resource "aws_s3_bucket_replication_configuration" "primary" {
  # Versioning must be enabled on the source and destination buckets first.
  # aws_s3_bucket.dr and the replication IAM role are assumed to be defined elsewhere.
  depends_on = [aws_s3_bucket_versioning.primary]

  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"
    destination {
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"
    }
  }
}
DynamoDB Global Tables:
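resource "aws_dynamodb_table" "sessions" {
  name         = "sessions"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "session_id"

  # Global table replicas rely on DynamoDB Streams with new and old images
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"

  attribute {
    name = "session_id"
    type = "S"
  }

  replica {
    region_name = "us-west-2"
  }
}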
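Replication only protects your RPO while the replica keeps up. As a rough sketch, the functions below read the `ReplicaLag` CloudWatch metric for the DR replica and compare it with an RPO in minutes. The function names are hypothetical; `replica_lag_seconds` requires AWS CLI credentials and assumes GNU `date` for the relative timestamp, and the instance identifier and region in the usage comment are the ones used in the Terraform above.

```shell
# replica_lag_seconds DB_INSTANCE_ID REGION
# Prints the highest ReplicaLag (in seconds) seen over the last 10 minutes,
# or "None" if CloudWatch has no datapoints. Assumes GNU date.
replica_lag_seconds() {
  aws cloudwatch get-metric-statistics \
    --region "$2" \
    --namespace AWS/RDS \
    --metric-name ReplicaLag \
    --dimensions "Name=DBInstanceIdentifier,Value=$1" \
    --start-time "$(date -u -d '10 minutes ago' +%Y-%m-%dT%H:%M:%SZ)" \
    --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
    --period 60 \
    --statistics Maximum \
    --query 'max(Datapoints[].Maximum)' \
    --output text
}

# lag_within_rpo LAG_SECONDS RPO_MINUTES
# Returns 0 if the lag fits the RPO, 1 if it does not, and 2 if LAG_SECONDS
# is not a number (for example "None" when there are no datapoints).
lag_within_rpo() {
  case "$1" in
    '' | *[!0-9.]*) return 2 ;;
  esac
  awk -v lag="$1" -v rpo="$2" 'BEGIN { exit !(lag <= rpo * 60) }'
}

# Example usage:
#   lag_within_rpo "$(replica_lag_seconds production-dr-replica us-west-2)" 5 \
#     || echo "Replica lag exceeds the 5-minute RPO (or no data)" >&2
```

Treating missing data as a distinct failure is a deliberate choice here: a replica that reports nothing should raise an alert, not pass the check.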
Testing Your DR Plan
This is where most organizations fail. A backup strategy without tested recovery procedures is not a strategy - it is a hope.
Quarterly DR drill checklist:
## DR Drill - Q2 2026
### Pre-drill
- [ ] Schedule 4-hour maintenance window
- [ ] Notify stakeholders
- [ ] Document current production state (record counts, latest transactions)
### Database Recovery
- [ ] Restore RDS from latest cross-region snapshot
- [ ] Verify restore completed without errors
- [ ] Compare row counts against production
- [ ] Run application smoke tests against restored DB
- [ ] Record actual restore time: _____ minutes (RTO target: 30 min)
- [ ] Record data loss window: _____ minutes (RPO target: 5 min)
### Application Recovery
- [ ] Deploy application stack in DR region using Terraform
- [ ] Verify DNS failover configuration works
- [ ] Run full integration test suite
- [ ] Record actual application recovery time: _____ minutes
### S3 Data Verification
- [ ] Compare object counts between primary and DR bucket
- [ ] Download and verify random sample of objects
- [ ] Verify replication lag: _____ seconds
### Post-drill
- [ ] Document any issues encountered
- [ ] Update runbooks based on findings
- [ ] File tickets for gaps identified
- [ ] Update RTO/RPO estimates based on actual times
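The "record actual time" items in the checklist are easier to keep honest when the numbers come from timestamps instead of memory. This is a small sketch of one way to log them; `drill_record` is a hypothetical helper, and the CSV layout and file name in the usage comment are assumptions.

```shell
# drill_record STEP START_EPOCH END_EPOCH TARGET_MIN
# Prints "step,actual_minutes,target_minutes,PASS|FAIL".
# Elapsed time is rounded up to whole minutes. The body runs in a subshell
# so its variables do not leak into the caller.
drill_record() (
  actual=$(( ($3 - $2 + 59) / 60 ))
  if [ "$actual" -le "$4" ]; then
    verdict=PASS
  else
    verdict=FAIL
  fi
  echo "$1,$actual,$4,$verdict"
)

# Example usage during a drill:
#   start=$(date +%s)
#   ... run the database restore step ...
#   end=$(date +%s)
#   drill_record db-restore "$start" "$end" 30 >> dr-drill-log.csv
```

Keeping one line per step per drill makes it straightforward to see whether recovery times are drifting away from the targets between quarters.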
Automate what you can. This script verifies your RDS backup is restorable:
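#!/bin/bash
# dr-test-rds-restore.sh
# Restores the latest snapshot to a temporary instance and checks it has data.
# Assumes psql can authenticate (PGPASSWORD or ~/.pgpass) and that the restored
# instance is reachable from where this script runs.
set -euo pipefail

PROD_ENDPOINT="prod-endpoint" # replace with your production endpoint
TEST_INSTANCE="dr-test-restore"

# Delete the temporary instance on exit, even if a later step fails
cleanup() {
  aws rds delete-db-instance \
    --db-instance-identifier "$TEST_INSTANCE" \
    --skip-final-snapshot >/dev/null || true
}

# Pick the newest snapshot that has finished creating
SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier production-primary \
  --query "reverse(sort_by(DBSnapshots[?Status=='available'], &SnapshotCreateTime))[0].DBSnapshotIdentifier" \
  --output text)

echo "Restoring from snapshot: $SNAPSHOT_ID"
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier "$TEST_INSTANCE" \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --db-instance-class db.t3.medium \
  --no-multi-az
trap cleanup EXIT

echo "Waiting for instance to be available..."
aws rds wait db-instance-available --db-instance-identifier "$TEST_INSTANCE"

echo "Running verification queries..."
ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier "$TEST_INSTANCE" \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

RESTORED_COUNT=$(psql -h "$ENDPOINT" -U dbadmin -d production -tAc "SELECT count(*) FROM users")
PROD_COUNT=$(psql -h "$PROD_ENDPOINT" -U dbadmin -d production -tAc "SELECT count(*) FROM users")

echo "Production users: $PROD_COUNT"
echo "Restored users: $RESTORED_COUNT"

# Production keeps changing after the snapshot, so the counts rarely match exactly.
# An empty table, however, means the restore is not usable.
if [ "$RESTORED_COUNT" -eq 0 ]; then
  echo "Restore verification FAILED: users table is empty" >&2
  exit 1
fi
echo "Restore verification passed"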
Common DR Mistakes to Avoid
1. Backing up data but not configuration. Your database backup is useless if you cannot recreate the VPC, security groups, IAM roles, and application configuration needed to run your service. Infrastructure as code (Terraform, CloudFormation) is a DR requirement, not a nice-to-have.
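2. Not encrypting backups. A backup is a complete copy of your data in one place, so an unprotected one is a security liability. S3 applies default encryption to new objects, but customer-managed KMS keys on all backup vaults and S3 buckets let you control and audit who can decrypt them.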
3. Using the same KMS key for primary and DR. KMS keys are regional. If your DR backups depend on a key in the primary region and that region's KMS becomes unavailable, you cannot decrypt them. Encrypt the DR vault with a separate key that lives in the DR region, or use multi-region KMS keys.
4. Ignoring DNS TTL. If you need to failover DNS to a DR region, a 24-hour TTL means clients will keep hitting the dead primary for up to 24 hours. Set TTLs to 60 seconds for any record that might need to change during DR.
5. No communication plan. When production goes down, people panic. Document who gets notified, how (Slack, PagerDuty, phone), and who has the authority to initiate DR procedures. Practice this too.
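The DNS TTL mistake (item 4) is easy to check before an incident. The sketch below reads the answer section that `dig` prints and reports the highest TTL it contains. The function names are hypothetical, `app.example.com` and the name server in the usage comment are placeholders, and the check assumes you query an authoritative name server, since a caching resolver reports the remaining TTL instead of the configured one.

```shell
# answer_section NAME [NAMESERVER]
# Prints the answer section for NAME, optionally asking a specific server.
answer_section() {
  if [ -n "${2:-}" ]; then
    dig "@$2" "$1" +noall +answer
  else
    dig "$1" +noall +answer
  fi
}

# max_answer_ttl
# Reads dig answer lines on stdin (name, TTL, class, type, data) and prints
# the largest TTL. Prints 0 when there are no answer lines.
max_answer_ttl() {
  awk '$2 + 0 > max { max = $2 + 0 } END { print max + 0 }'
}

# ttl_within LIMIT_SECONDS
# Reads dig answer lines on stdin; returns 0 if every TTL is at most LIMIT_SECONDS.
ttl_within() {
  [ "$(max_answer_ttl)" -le "$1" ]
}

# Example usage (placeholder names):
#   answer_section app.example.com ns-1.example-dns.net | ttl_within 60 \
#     || echo "TTL is above 60 seconds; failover will be slow" >&2
```

Because an empty answer yields 0 and passes the check, confirm the record resolves at all before trusting the result.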