Introduction

Every startup says they have backups. Very few have actually tested restoring from them. Even fewer have a documented disaster recovery plan with defined recovery objectives. Then something goes wrong - a developer runs a destructive migration against production, a region goes down, ransomware encrypts your EBS volumes - and the team discovers their "backup strategy" was an untested assumption.

Disaster recovery does not have to be complicated or expensive, especially on AWS. But it does have to be intentional. This guide walks through building a real DR strategy: defining your recovery objectives, implementing the 3-2-1 backup rule, configuring AWS Backup, setting up cross-region replication, and most importantly, testing that everything actually works.

Understanding RTO and RPO

Before you configure a single backup, you need two numbers:

Recovery Point Objective (RPO): How much data can you afford to lose? If your RPO is 1 hour, you need backups at least every hour. If your RPO is zero, you need synchronous replication.

Recovery Time Objective (RTO): How long can your service be down before the business impact becomes unacceptable? If your RTO is 4 hours, you need a recovery process that can restore full service within 4 hours.

These numbers vary by system. Here is a realistic example for a SaaS startup:

System	RPO	RTO	Strategy
Primary database (PostgreSQL)	5 minutes	30 minutes	Continuous WAL archiving + read replica
User uploads (S3)	Near zero (async)	1 hour	Versioning + cross-region replication
Application servers	N/A (stateless)	15 minutes	Auto-scaling group, multi-AZ
Redis cache	1 hour	15 minutes	Rebuild from DB if needed
Elasticsearch	24 hours	4 hours	Nightly snapshots to S3
Configuration/secrets	0 (no loss)	30 minutes	Versioned in AWS Secrets Manager

The key insight: not everything needs the same level of protection. Your production database needs a 5-minute RPO. Your Elasticsearch cluster, which can be rebuilt from the database, can tolerate a 24-hour RPO. Treating everything equally means overspending on DR for low-criticality systems.
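One way to sanity-check a table like this is to compare how often each system is actually backed up against its RPO. The sketch below is illustrative only: the function names and the sample numbers are assumptions, and it treats the backup interval as the worst-case data loss.

```shell
#!/usr/bin/env bash
# Hypothetical helper: a schedule can only meet an RPO if backups run at least
# as often as the RPO allows (worst case, you lose one full interval).
meets_rpo() {
  local rpo_minutes=$1 backup_interval_minutes=$2
  [ "$backup_interval_minutes" -le "$rpo_minutes" ]
}

# Prints one line per system: name, RPO (minutes), backup interval (minutes)
check_system() {
  local name=$1 rpo=$2 interval=$3
  if meets_rpo "$rpo" "$interval"; then
    echo "OK   $name: backup every ${interval}m meets RPO of ${rpo}m"
  else
    echo "GAP  $name: backup every ${interval}m exceeds RPO of ${rpo}m"
  fi
}

# Sample values for illustration, not measurements
check_system "primary-db" 5 5
check_system "search-index" 1440 1440
check_system "hourly-rpo-nightly-backup" 60 1440
```

The last sample line shows the failure this check is meant to catch: a stated RPO that the backup schedule cannot deliver.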

The 3-2-1 Backup Rule

The 3-2-1 rule is decades old and still the best framework for backup strategy:

3 copies of your data
2 different storage media/types
1 copy offsite (different region or provider)

In AWS terms:

Copy 1: Production data (RDS instance, EBS volumes, S3 buckets)
Copy 2: AWS Backup vault in the same region (different storage from production)
Copy 3: Cross-region backup copy (different region entirely)
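The rule is simple enough to check mechanically against an inventory of copies. The following is a minimal sketch, assuming a hypothetical plain-text inventory with one copy per line (name, storage type, location); `check_321` is not part of any AWS tooling.

```shell
#!/usr/bin/env bash
# Hypothetical 3-2-1 checker. Reads one copy per line on stdin:
#   <name> <storage-type> <location>
# and takes the primary location as its only argument.
# Exits 0 only if there are >= 3 copies, >= 2 storage types, and >= 1 offsite copy.
check_321() {
  local primary_location=$1
  awk -v primary="$primary_location" '
    NF >= 3 {
      copies++
      types[$2] = 1
      if ($3 != primary) offsite++
    }
    END {
      ntypes = 0
      for (t in types) ntypes++
      printf "copies=%d types=%d offsite=%d\n", copies, ntypes, offsite
      exit !(copies >= 3 && ntypes >= 2 && offsite >= 1)
    }'
}

# Inventory matching the three copies listed above
check_321 us-east-1 <<'EOF'
production  rds     us-east-1
vault       backup  us-east-1
vault-dr    backup  us-west-2
EOF
# prints: copies=3 types=2 offsite=1
```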

For truly critical data, consider a fourth copy outside AWS entirely (GCP bucket, Azure blob, or even Backblaze B2). This protects against the unlikely but possible scenario of an AWS account compromise.

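# Example: logical backup of an RDS PostgreSQL database with pg_dump,
# stored outside the production account
BACKUP_FILE="production-$(date +%Y%m%d).dump"

pg_dump -h your-rds-endpoint.region.rds.amazonaws.com \
  -U dbadmin -d production \
  --format=custom \
  --file="$BACKUP_FILE"

# Encrypt (gpg prompts for a passphrase), then upload.
# For this to count as offsite, s3://offsite-backups/ should live in a separate
# AWS account, or on an S3-compatible provider reached via --endpoint-url
# (drop --storage-class in that case).
gpg --symmetric --cipher-algo AES256 "$BACKUP_FILE"
aws s3 cp "$BACKUP_FILE.gpg" s3://offsite-backups/ --storage-class GLACIER_IR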

Configuring AWS Backup

AWS Backup is a centralized service for managing backups across RDS, EBS, EFS, DynamoDB, S3, and more. Here is a starting configuration using Terraform. It assumes the KMS keys, the backup IAM role, and the aws.dr_region provider alias are defined elsewhere:

# Backup vault
resource "aws_backup_vault" "main" {
  name        = "production-vault"
  kms_key_arn = aws_kms_key.backup.arn

  tags = {
    Environment = "production"
  }
}

# Cross-region vault for DR
resource "aws_backup_vault" "dr" {
  provider = aws.dr_region
  name     = "production-vault-dr"
  kms_key_arn = aws_kms_key.backup_dr.arn
}

# Backup plan
resource "aws_backup_plan" "production" {
  name = "production-backup-plan"

  # Hourly backups retained for 24 hours
  rule {
    rule_name         = "hourly"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 * * * ? *)"
    start_window      = 60
    completion_window  = 120

    lifecycle {
      delete_after = 1  # days
    }
  }

  # Daily backups retained for 30 days
  rule {
    rule_name         = "daily"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 3 * * ? *)"
    start_window      = 60
    completion_window  = 180

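    # No cold_storage_after here: AWS Backup keeps cold-storage backups for at
    # least 90 days, so delete_after must be >= cold_storage_after + 90
    lifecycle {
      delete_after = 30
    }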

    # Copy to DR region
    copy_action {
      destination_vault_arn = aws_backup_vault.dr.arn
      lifecycle {
        delete_after = 30
      }
    }
  }

  # Weekly backups retained for 1 year
  rule {
    rule_name         = "weekly"
    target_vault_name = aws_backup_vault.main.name
    schedule          = "cron(0 4 ? * SUN *)"
    start_window      = 60
    completion_window  = 300

    lifecycle {
      cold_storage_after = 30
      delete_after       = 365
    }
  }
}

# Assign resources to backup plan
resource "aws_backup_selection" "production" {
  iam_role_arn = aws_iam_role.backup.arn
  name         = "production-resources"
  plan_id      = aws_backup_plan.production.id

  # Backup everything tagged with Backup=true
  selection_tag {
    type  = "STRINGEQUALS"
    key   = "Backup"
    value = "true"
  }
}
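One constraint is easy to miss when writing lifecycle blocks: AWS Backup keeps a recovery point in cold storage for at least 90 days, so delete_after has to be at least cold_storage_after plus 90. The helper below is a hypothetical pre-flight check for that rule, with made-up sample rules; it is a sketch, not something AWS or Terraform ships.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for an AWS Backup lifecycle block.
# Pass 0 for cold_storage_after when the rule never moves backups to cold storage.
lifecycle_ok() {
  local cold_storage_after=$1 delete_after=$2
  if [ "$cold_storage_after" -eq 0 ]; then
    return 0
  fi
  [ "$delete_after" -ge $((cold_storage_after + 90)) ]
}

# Sample rules: <name> <cold_storage_after> <delete_after>, all in days
while read -r name cold_after delete_after; do
  if lifecycle_ok "$cold_after" "$delete_after"; then
    echo "$name: lifecycle is valid"
  else
    echo "$name: delete_after must be at least $((cold_after + 90)) days"
  fi
done <<'EOF'
hourly 0 1
weekly 30 365
example 7 30
EOF
```

Running a check like this before `terraform apply` surfaces the problem earlier than the API error would.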

Tag your resources to include them in the backup plan:

resource "aws_db_instance" "production" {
  # ... other config
  tags = {
    Backup      = "true"
    Environment = "production"
  }
}

resource "aws_ebs_volume" "data" {
  # ... other config
  tags = {
    Backup      = "true"
    Environment = "production"
  }
}

Cross-Region Replication

For services where you need faster recovery than restoring from backups, set up active replication:

RDS Cross-Region Read Replica:

# Primary in us-east-1
resource "aws_db_instance" "primary" {
  identifier     = "production-primary"
  engine         = "postgres"
  engine_version = "16.2"
  instance_class = "db.r6g.xlarge"
  multi_az       = true

  backup_retention_period = 7
  backup_window           = "03:00-04:00"
}

# Cross-region replica in us-west-2
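resource "aws_db_instance" "dr_replica" {
  provider = aws.dr_region

  identifier          = "production-dr-replica"
  replicate_source_db = aws_db_instance.primary.arn  # Cross-region replicas need the ARN
  instance_class      = "db.r6g.large"  # Can be smaller to save cost; scale up after promotion

  auto_minor_version_upgrade = true

  # During DR, promote this replica to a standalone primary.
  # If the source is encrypted, also set kms_key_id to a key in the DR region.
}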
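During an actual failover, the replica has to be promoted before it accepts writes. A minimal sketch of that step follows, using the replica identifier and region from the Terraform above. The `run` wrapper and the DRY_RUN variable are assumptions added for illustration; by default the script only prints the commands.

```shell
#!/usr/bin/env bash
# Sketch of a promotion step. DRY_RUN defaults to 1, so commands are only printed.
DRY_RUN=${DRY_RUN:-1}

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "DRY RUN: $*"
  else
    "$@"
  fi
}

promote_dr_replica() {
  local replica_id=$1 region=$2
  # Promotion is one-way: the replica becomes a standalone primary
  # and stops replicating from the source.
  run aws rds promote-read-replica \
    --db-instance-identifier "$replica_id" \
    --region "$region"
  run aws rds wait db-instance-available \
    --db-instance-identifier "$replica_id" \
    --region "$region"
}

promote_dr_replica production-dr-replica us-west-2
```

Set DRY_RUN=0 only in a drill or a real incident. After promotion, the application still needs to be pointed at the new endpoint.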

S3 Cross-Region Replication:

resource "aws_s3_bucket" "primary" {
  bucket = "myapp-uploads-primary"
}

resource "aws_s3_bucket_versioning" "primary" {
  bucket = aws_s3_bucket.primary.id
  versioning_configuration {
    status = "Enabled"  # Required for replication
  }
}

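resource "aws_s3_bucket_replication_configuration" "primary" {
  # Versioning must be enabled on the source before replication is configured
  depends_on = [aws_s3_bucket_versioning.primary]

  bucket = aws_s3_bucket.primary.id
  role   = aws_iam_role.replication.arn

  rule {
    id     = "replicate-all"
    status = "Enabled"

    destination {
      # The DR bucket (defined elsewhere) also needs versioning enabled
      bucket        = aws_s3_bucket.dr.arn
      storage_class = "STANDARD_IA"
    }
  }
}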

DynamoDB Global Tables:

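resource "aws_dynamodb_table" "sessions" {
  name         = "sessions"
  billing_mode = "PAY_PER_REQUEST"
  hash_key     = "session_id"

  # Global table replicas require streams with both old and new images
  stream_enabled   = true
  stream_view_type = "NEW_AND_OLD_IMAGES"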

  attribute {
    name = "session_id"
    type = "S"
  }

  replica {
    region_name = "us-west-2"
  }
}

Testing Your DR Plan

This is where most organizations fail. A backup strategy without tested recovery procedures is not a strategy - it is a hope.

Quarterly DR drill checklist:

## DR Drill - Q2 2026

### Pre-drill
- [ ] Schedule 4-hour maintenance window
- [ ] Notify stakeholders
- [ ] Document current production state (record counts, latest transactions)

### Database Recovery
- [ ] Restore RDS from latest cross-region snapshot
- [ ] Verify restore completed without errors
- [ ] Compare row counts against production
- [ ] Run application smoke tests against restored DB
- [ ] Record actual restore time: _____ minutes (RTO target: 30 min)
- [ ] Record data loss window: _____ minutes (RPO target: 5 min)

### Application Recovery
- [ ] Deploy application stack in DR region using Terraform
- [ ] Verify DNS failover configuration works
- [ ] Run full integration test suite
- [ ] Record actual application recovery time: _____ minutes

### S3 Data Verification
- [ ] Compare object counts between primary and DR bucket
- [ ] Download and verify random sample of objects
- [ ] Verify replication lag: _____ seconds

### Post-drill
- [ ] Document any issues encountered
- [ ] Update runbooks based on findings
- [ ] File tickets for gaps identified
- [ ] Update RTO/RPO estimates based on actual times
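The measured times from the checklist are only useful if they are compared against the targets. A small sketch of that comparison follows; `evaluate` is a hypothetical helper and the numbers are placeholders, not real drill results.

```shell
#!/usr/bin/env bash
# Hypothetical helper for the drill write-up: compare a measured time
# against its target and report the margin. All values are in minutes.
evaluate() {
  local label=$1 actual=$2 target=$3
  if [ "$actual" -le "$target" ]; then
    echo "PASS $label: ${actual} min (target ${target} min)"
  else
    echo "FAIL $label: ${actual} min exceeds target by $((actual - target)) min"
  fi
}

# Placeholder measurements for illustration
evaluate "database RTO" 42 30
evaluate "database RPO" 4 5
```

A FAIL line is a finding for the post-drill section: either the recovery process gets faster or the stated target gets revised.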

Automate what you can. This script verifies your RDS backup is restorable:

#!/bin/bash
# dr-test-rds-restore.sh
set -euo pipefail

SNAPSHOT_ID=$(aws rds describe-db-snapshots \
  --db-instance-identifier production-primary \
  --query 'reverse(sort_by(DBSnapshots, &SnapshotCreateTime))[0].DBSnapshotIdentifier' \
  --output text)

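echo "Restoring from snapshot: $SNAPSHOT_ID"

# Always delete the test instance on exit, even if a later step fails
cleanup() {
  echo "Cleaning up test instance..."
  aws rds delete-db-instance \
    --db-instance-identifier dr-test-restore \
    --skip-final-snapshot > /dev/null || true
}
trap cleanup EXIT

START_TIME=$(date +%s)

# Add --db-subnet-group-name and --vpc-security-group-ids if the restored
# instance must be reachable from where this script runs
aws rds restore-db-instance-from-db-snapshot \
  --db-instance-identifier dr-test-restore \
  --db-snapshot-identifier "$SNAPSHOT_ID" \
  --db-instance-class db.t3.medium \
  --no-multi-az

echo "Waiting for instance to be available..."
aws rds wait db-instance-available --db-instance-identifier dr-test-restore
echo "Restore took $(( ($(date +%s) - START_TIME) / 60 )) minutes"

echo "Running verification queries..."
ENDPOINT=$(aws rds describe-db-instances \
  --db-instance-identifier dr-test-restore \
  --query 'DBInstances[0].Endpoint.Address' \
  --output text)

# psql reads the password from PGPASSWORD or ~/.pgpass; prod-endpoint is a placeholder
RESTORED_COUNT=$(psql -h "$ENDPOINT" -U dbadmin -d production -tAc "SELECT count(*) FROM users")
PROD_COUNT=$(psql -h prod-endpoint -U dbadmin -d production -tAc "SELECT count(*) FROM users")

echo "Production users: $PROD_COUNT"
echo "Restored users: $RESTORED_COUNT"

# The snapshot is older than production, so a slightly lower count is expected;
# an empty table is not
if [ "$RESTORED_COUNT" -eq 0 ]; then
  echo "FAIL: restored database has no users" >&2
  exit 1
fi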

Common DR Mistakes to Avoid

1. Backing up data but not configuration. Your database backup is useless if you cannot recreate the VPC, security groups, IAM roles, and application configuration needed to run your service. Infrastructure as code (Terraform, CloudFormation) is a DR requirement, not a nice-to-have.

2. Not encrypting backups. Unencrypted backups in S3 are a security liability. Use KMS encryption on all backup vaults and S3 buckets.

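3. Depending on a primary-region KMS key for DR. KMS keys are regional, so anything in the DR region that still needs the primary region's key cannot be decrypted during a regional outage. Encrypt DR copies with a key that lives in the DR region (as the DR vault above does with aws_kms_key.backup_dr), or use multi-Region KMS keys.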

4. Ignoring DNS TTL. If you need to fail over DNS to a DR region, a 24-hour TTL means clients can keep hitting the dead primary for up to 24 hours. Set TTLs to 60 seconds for any record that might need to change during DR.

5. No communication plan. When production goes down, people panic. Document who gets notified, how (Slack, PagerDuty, phone), and who has the authority to initiate DR procedures. Practice this too.
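Mistake 4 is one of the easier ones to catch automatically. The sketch below flags DNS records whose TTL is above a limit. It assumes input in the format printed by `dig +noall +answer`; `flag_high_ttls` and the example.com records are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical TTL check. Reads `dig +noall +answer` style lines on stdin
# (name, TTL, class, type, value) and flags records above the allowed TTL.
# Exits non-zero if any record is flagged.
flag_high_ttls() {
  local max_ttl=$1
  awk -v max="$max_ttl" '
    NF >= 5 && $2 + 0 > max {
      printf "%s %s TTL %d exceeds %d\n", $1, $4, $2, max
      bad = 1
    }
    END { exit (bad ? 1 : 0) }'
}

# Sample records for illustration (documentation IP range)
flag_high_ttls 60 <<'EOF' || echo "Lower these TTLs before you need to fail over"
app.example.com.   86400  IN  A  192.0.2.10
api.example.com.   60     IN  A  192.0.2.11
EOF
```

Against live DNS this would be fed from something like `dig +noall +answer app.example.com | flag_high_ttls 60`. A recursive resolver reports the remaining TTL rather than the configured one, so querying the authoritative name server gives a more reliable reading.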


Backup and Disaster Recovery for Startups: RTO/RPO Planning, AWS Backup, and Testing Your DR Plan
