SRE: Disaster Recovery and Business Continuity

2026-04-03 | Gabriel Garrido | 27 min read

Introduction

Throughout this SRE series we have built a comprehensive toolkit for running reliable systems. We covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, dependency management, database reliability, release engineering, and security as code. We have metrics, alerts, incident response, and chaos experiments in place. But there is one question we have not fully addressed yet: what happens when everything goes down at once?


“Hope is not a strategy” is a saying you hear often in SRE circles, and nowhere does it apply more than in disaster recovery. A single availability zone going dark, a botched cluster upgrade, a ransomware attack, or even an accidental kubectl delete namespace production can wipe out your entire workload. The question is not if a disaster will happen, but when, and whether you will be ready for it.


In this article we will cover everything you need to build a solid disaster recovery (DR) and business continuity plan for Kubernetes environments. We will go from defining RPO and RTO targets all the way to Velero backups, etcd recovery, multi-region strategies, DR drills, communication plans, and step-by-step runbooks for full cluster recovery.


Let’s get into it.


RPO and RTO: defining your recovery targets

Before you can plan for disaster recovery, you need to answer two fundamental questions:


  • RPO (Recovery Point Objective): How much data can you afford to lose? If your RPO is 1 hour, you need backups at least every hour. If your RPO is zero, you need synchronous replication.
  • RTO (Recovery Time Objective): How long can your service be down? If your RTO is 15 minutes, you need automated failover. If your RTO is 4 hours, manual recovery might be acceptable.
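The arithmetic behind RPO is simple but worth making explicit: worst case, you lose everything written since the last successful backup. A quick sanity check, with illustrative numbers rather than figures from any real service:

```shell
#!/bin/bash
# rpo-sanity-check.sh - the backup interval must not exceed the RPO target,
# because worst case you lose everything since the last good backup.
# Numbers are illustrative.

rpo_minutes=60                # target: at most 1 hour of data loss
backup_interval_minutes=360   # current schedule: every 6 hours

if [ "$backup_interval_minutes" -le "$rpo_minutes" ]; then
  verdict="OK: worst-case loss of ${backup_interval_minutes}m is within the ${rpo_minutes}m RPO"
else
  verdict="VIOLATION: backups every ${backup_interval_minutes}m cannot meet a ${rpo_minutes}m RPO"
fi
echo "$verdict"
```

Run this mentally against every service in your inventory; it is surprising how often a stated RPO and the actual backup cron disagree.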

These targets are not technical decisions, they are business decisions. You need to sit down with stakeholders and understand the actual cost of downtime and data loss for each service. A payment processing system has very different requirements than an internal wiki.


Here is a simple business impact analysis template to guide those conversations:


# dr-plan/business-impact-analysis.yaml
services:
  - name: payment-api
    tier: critical
    rpo: "0 minutes"         # Zero data loss
    rto: "5 minutes"         # Automated failover required
    data_classification: pci
    revenue_impact_per_hour: "$50,000"
    dependencies:
      - postgresql-primary
      - redis-sessions
      - stripe-api
    backup_strategy: synchronous-replication
    failover_strategy: active-active

  - name: user-api
    tier: high
    rpo: "15 minutes"
    rto: "30 minutes"
    data_classification: pii
    revenue_impact_per_hour: "$10,000"
    dependencies:
      - postgresql-primary
      - redis-cache
    backup_strategy: streaming-replication
    failover_strategy: active-passive

  - name: blog
    tier: medium
    rpo: "24 hours"
    rto: "4 hours"
    data_classification: public
    revenue_impact_per_hour: "$0"
    dependencies:
      - postgresql-primary
    backup_strategy: daily-snapshots
    failover_strategy: rebuild-from-backup

  - name: internal-tools
    tier: low
    rpo: "24 hours"
    rto: "24 hours"
    data_classification: internal
    revenue_impact_per_hour: "$500"
    dependencies:
      - postgresql-primary
    backup_strategy: daily-snapshots
    failover_strategy: rebuild-from-backup

The key insight here is that not every service needs the same level of protection. Over-engineering DR for a low-tier service wastes money, while under-engineering it for a critical service creates real risk. Tier your services and plan accordingly.
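One way to ground the tiering conversation is to put numbers on it. Here is a rough sketch of the avoided-loss math for the payment-api figures above; the 4-hour rebuild-from-backup estimate is an assumption for illustration:

```shell
#!/bin/bash
# tier-cost-check.sh - compare revenue at risk in a single incident with and
# without automated failover, using the BIA numbers above. The 4-hour
# rebuild-from-backup RTO is an illustrative assumption.

revenue_per_hour=50000      # payment-api revenue_impact_per_hour
rto_hours_without_dr=4      # assumed manual rebuild from backups
rto_hours_with_dr=0         # active-active failover takes minutes, rounded to 0

loss_without=$(( revenue_per_hour * rto_hours_without_dr ))
loss_with=$(( revenue_per_hour * rto_hours_with_dr ))
avoided=$(( loss_without - loss_with ))

echo "avoided loss per incident: \$${avoided}"
```

If the avoided loss dwarfs the cost of running a second region, active-active pays for itself. Run the same math for the blog tier and it tells you daily snapshots are plenty.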


DR plan template

Every organization needs a documented, tested, and regularly updated DR plan. Here is a structured template that covers the essentials:


# dr-plan/disaster-recovery-plan.yaml
metadata:
  version: "2.1"
  last_updated: "2026-03-15"
  next_review: "2026-06-15"
  owner: "platform-team"
  approver: "vp-engineering"

scope:
  environments:
    - production
    - staging
  regions:
    - primary: us-east-1
    - secondary: eu-west-1
  clusters:
    - prod-primary (us-east-1)
    - prod-secondary (eu-west-1)

roles_and_responsibilities:
  incident_commander:
    name: "Rotating on-call lead"
    responsibilities:
      - Declare disaster
      - Coordinate recovery
      - Authorize failover decisions
      - Communicate with leadership

  dr_lead:
    name: "Senior SRE on-call"
    responsibilities:
      - Execute recovery runbooks
      - Verify backup integrity
      - Coordinate infrastructure recovery
      - Run post-recovery validation

  communications_lead:
    name: "Engineering manager on-call"
    responsibilities:
      - Update status page
      - Notify customers
      - Coordinate with support team
      - Send internal updates

  database_lead:
    name: "DBA on-call"
    responsibilities:
      - Verify database backups
      - Execute database recovery
      - Validate data integrity
      - Monitor replication lag

activation_criteria:
  - "Complete loss of primary region availability"
  - "Primary Kubernetes cluster unrecoverable"
  - "Data corruption affecting critical services"
  - "Security breach requiring infrastructure rebuild"
  - "Cloud provider outage exceeding 30 minutes"

communication_channels:
  primary: "Slack #incident-war-room"
  secondary: "PagerDuty conference bridge"
  tertiary: "Personal phone numbers (see emergency contacts doc)"
  status_page: "https://status.example.com"

recovery_priority:
  - tier: 1
    services: [payment-api, auth-service]
    target_rto: "5 minutes"
    action: "Automated DNS failover to secondary region"
  - tier: 2
    services: [user-api, notification-service]
    target_rto: "30 minutes"
    action: "Restore from replica in secondary region"
  - tier: 3
    services: [blog, docs, internal-tools]
    target_rto: "4 hours"
    action: "Rebuild from backups and GitOps repo"

Notice that the plan has a version, an owner, and a scheduled review date. A DR plan that was written two years ago and never updated is worse than no plan at all because it gives you false confidence. Review your DR plan quarterly and update it every time your infrastructure changes.
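That review cadence is easy to enforce mechanically. A minimal sketch that flags a stale plan, assuming GNU date and the next_review value from the template above; the "today" value is pinned so the example is deterministic, but in CI you would use the real clock and read the field with yq:

```shell
#!/bin/bash
# dr-plan-freshness.sh - fail CI when the DR plan's next_review has passed.
# Sketch with hardcoded values; a real check would parse the plan YAML with
# yq and use the actual current date (date +%F).
set -euo pipefail

next_review="2026-06-15"   # metadata.next_review from the plan
today="2026-03-20"         # pinned for the example

review_epoch=$(date -d "$next_review" +%s)
now_epoch=$(date -d "$today" +%s)

if [ "$now_epoch" -gt "$review_epoch" ]; then
  status="STALE: DR plan review was due on ${next_review}"
else
  status="OK: next DR plan review is ${next_review}"
fi
echo "$status"
```

Wire this into the same pipeline that deploys your infrastructure and a stale DR plan becomes a failing build instead of a silent liability.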


Velero for Kubernetes backup

Velero is the standard tool for backing up Kubernetes resources and persistent volumes. It can back up your entire cluster state (or specific namespaces) and restore it to the same or a different cluster.


Install Velero with the AWS plugin (works with S3-compatible storage including MinIO):


# Install Velero CLI
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-arm64.tar.gz
tar -xvf velero-v1.13.0-linux-arm64.tar.gz
sudo mv velero-v1.13.0-linux-arm64/velero /usr/local/bin/

# Install Velero in the cluster
velero install \
  --provider aws \
  --plugins velero/velero-plugin-for-aws:v1.9.0 \
  --bucket velero-backups \
  --secret-file ./credentials-velero \
  --backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=https://s3.us-east-1.amazonaws.com \
  --snapshot-location-config region=us-east-1 \
  --use-node-agent \
  --default-volumes-to-fs-backup

Now set up scheduled backups. The key is to have different backup schedules for different tiers of services:


# velero/backup-schedule-critical.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: critical-services-hourly
  namespace: velero
spec:
  schedule: "0 * * * *"  # Every hour
  template:
    includedNamespaces:
      - payment-system
      - auth-system
    includedResources:
      - deployments
      - services
      - configmaps
      - secrets
      - persistentvolumeclaims
      - persistentvolumes
      - ingresses
      - horizontalpodautoscalers
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 168h  # Keep for 7 days
    metadata:
      labels:
        tier: critical
        backup-type: scheduled

# velero/backup-schedule-standard.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: standard-services-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"  # Daily at 2 AM
  template:
    includedNamespaces:
      - default
      - blog
      - monitoring
      - ingress-nginx
    excludedResources:
      - events
      - pods
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 720h  # Keep for 30 days
    metadata:
      labels:
        tier: standard
        backup-type: scheduled

# velero/backup-schedule-full.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: full-cluster-weekly
  namespace: velero
spec:
  schedule: "0 3 * * 0"  # Every Sunday at 3 AM
  template:
    includedNamespaces:
      - "*"
    excludedNamespaces:
      - velero
      - kube-system
    excludedResources:
      - events
      - pods
    defaultVolumesToFsBackup: true
    storageLocation: default
    ttl: 2160h  # Keep for 90 days
    metadata:
      labels:
        backup-type: full-cluster

To restore from a Velero backup, first check what backups are available:


# List available backups
velero backup get

# Describe a specific backup to see what it contains
velero backup describe critical-services-hourly-20260328120000

# Restore to a new namespace (for testing)
velero restore create test-restore \
  --from-backup critical-services-hourly-20260328120000 \
  --namespace-mappings payment-system:payment-system-restored

# Restore to the original namespace (for actual DR)
velero restore create dr-restore \
  --from-backup critical-services-hourly-20260328120000

# Check restore status
velero restore describe dr-restore
velero restore logs dr-restore

One critical thing people miss: you need to regularly test your backups by actually restoring them. A backup that has never been tested is not a backup, it is a hope. Set up a weekly job that restores your latest backup to a test namespace and validates the resources:


# velero/backup-validation-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: validate-velero-backups
  namespace: velero
spec:
  schedule: "0 6 * * 1"  # Every Monday at 6 AM
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: velero-validator
          containers:
            - name: validator
              # NOTE: the script below needs kubectl, jq, AND the velero CLI;
              # bitnami/kubectl alone does not ship velero, so build a small
              # custom image that bundles all three
              image: bitnami/kubectl:1.29
              command:
                - /bin/bash
                - -c
                - |
                  set -euo pipefail

                  echo "=== Velero Backup Validation ==="
                  LATEST_BACKUP=$(velero backup get -o json | \
                    jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .metadata.name')

                  echo "Latest backup: ${LATEST_BACKUP}"

                  # Create a test restore
                  velero restore create validation-${LATEST_BACKUP} \
                    --from-backup ${LATEST_BACKUP} \
                    --namespace-mappings default:validation-test

                  # Wait for the restore to finish (poll instead of a fixed sleep,
                  # since restore duration varies with backup size)
                  RESTORE_STATUS="InProgress"
                  for i in $(seq 1 30); do
                    RESTORE_STATUS=$(velero restore get validation-${LATEST_BACKUP} -o json | \
                      jq -r '.status.phase')
                    if [ "$RESTORE_STATUS" != "InProgress" ]; then
                      break
                    fi
                    sleep 10
                  done

                  if [ "$RESTORE_STATUS" = "Completed" ]; then
                    echo "PASS: Restore completed successfully"
                  else
                    echo "FAIL: Restore status is ${RESTORE_STATUS}"
                    # Send alert to PagerDuty or Slack
                    curl -X POST "$SLACK_WEBHOOK" \
                      -H 'Content-Type: application/json' \
                      -d "{\"text\": \"Velero backup validation FAILED for ${LATEST_BACKUP}\"}"
                    # Clean up, then exit non-zero so the Job is marked failed
                    kubectl delete namespace validation-test --ignore-not-found=true
                    exit 1
                  fi

                  # Clean up the test namespace
                  kubectl delete namespace validation-test --ignore-not-found=true
              env:
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook
                      key: url
          restartPolicy: OnFailure

etcd backup and restore

etcd is the brain of your Kubernetes cluster. It stores all cluster state, including deployments, services, secrets, configmaps, and RBAC policies. If you lose etcd and you do not have a backup, you lose your entire cluster. Everything else can be rebuilt from GitOps, but etcd is the one piece that holds the live state.


Here is a script for automated etcd snapshots:


#!/bin/bash
# etcd-backup.sh - Automated etcd snapshot backup
# Run this as a CronJob on one of the control plane nodes

set -euo pipefail

BACKUP_DIR="/var/backups/etcd"
S3_BUCKET="s3://etcd-backups-prod"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="${BACKUP_DIR}/etcd-snapshot-${TIMESTAMP}.db"

# Create backup directory
mkdir -p "${BACKUP_DIR}"

echo "[$(date)] Starting etcd backup..."

# Take the snapshot
ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT_FILE}" \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key

# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT_FILE}" \
  --write-out=table

# Get the snapshot size for logging
SNAPSHOT_SIZE=$(du -h "${SNAPSHOT_FILE}" | cut -f1)
echo "[$(date)] Snapshot created: ${SNAPSHOT_FILE} (${SNAPSHOT_SIZE})"

# Upload to S3
aws s3 cp "${SNAPSHOT_FILE}" \
  "${S3_BUCKET}/etcd-snapshot-${TIMESTAMP}.db" \
  --storage-class STANDARD_IA

echo "[$(date)] Snapshot uploaded to ${S3_BUCKET}"

# Calculate checksum and upload it alongside the snapshot
sha256sum "${SNAPSHOT_FILE}" > "${SNAPSHOT_FILE}.sha256"
aws s3 cp "${SNAPSHOT_FILE}.sha256" \
  "${S3_BUCKET}/etcd-snapshot-${TIMESTAMP}.db.sha256"

# Clean up old local backups
find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +${RETENTION_DAYS} -delete
find "${BACKUP_DIR}" -name "etcd-snapshot-*.sha256" -mtime +${RETENTION_DAYS} -delete

# Clean up old S3 backups using lifecycle policies
echo "[$(date)] etcd backup completed successfully"

Schedule this as a CronJob on your control plane:


# etcd/backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: etcd-backup
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # Every 6 hours
  concurrencyPolicy: Forbid
  successfulJobsHistoryLimit: 3
  failedJobsHistoryLimit: 3
  jobTemplate:
    spec:
      template:
        spec:
          nodeName: control-plane-1  # Pin to a control plane node
          hostNetwork: true
          containers:
            - name: etcd-backup
              # NOTE: the script also needs the aws CLI for the S3 upload;
              # registry.k8s.io/etcd only ships etcdctl, so build a small
              # custom image or ship the snapshot from a sidecar instead
              image: registry.k8s.io/etcd:3.5.12-0
              command:
                - /bin/sh
                - /scripts/etcd-backup.sh
              volumeMounts:
                - name: etcd-certs
                  mountPath: /etc/kubernetes/pki/etcd
                  readOnly: true
                - name: backup-scripts
                  mountPath: /scripts
                - name: backup-storage
                  mountPath: /var/backups/etcd
          volumes:
            - name: etcd-certs
              hostPath:
                path: /etc/kubernetes/pki/etcd
            - name: backup-scripts
              configMap:
                name: etcd-backup-scripts
                defaultMode: 0755
            - name: backup-storage
              hostPath:
                path: /var/backups/etcd
          restartPolicy: OnFailure
          tolerations:
            - key: node-role.kubernetes.io/control-plane
              operator: Exists
              effect: NoSchedule

Now for the critical part: restoring etcd. This is the procedure you follow when your cluster is completely gone:


#!/bin/bash
# etcd-restore.sh - Restore etcd from a snapshot
# WARNING: This replaces ALL cluster state. Only use during disaster recovery.

set -euo pipefail

SNAPSHOT_FILE="${1:-}"   # default to empty so set -u doesn't abort before the usage message
DATA_DIR="/var/lib/etcd-restored"

if [ -z "${SNAPSHOT_FILE}" ]; then
  echo "Usage: $0 <snapshot-file>"
  exit 1
fi

echo "WARNING: This will replace ALL etcd data!"
echo "Snapshot: ${SNAPSHOT_FILE}"
echo "Press Ctrl+C to abort, or wait 10 seconds to continue..."
sleep 10

# Stop the kubelet (which manages etcd as a static pod)
systemctl stop kubelet

# Stop etcd if running
crictl ps | grep -q etcd && crictl stop $(crictl ps -q --name etcd) || true  # || true so set -e doesn't abort when etcd is already stopped

# Verify the snapshot integrity
echo "Verifying snapshot integrity..."
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT_FILE}" \
  --write-out=table

# Restore the snapshot to a new data directory
ETCDCTL_API=3 etcdctl snapshot restore "${SNAPSHOT_FILE}" \
  --data-dir="${DATA_DIR}" \
  --name=control-plane-1 \
  --initial-cluster=control-plane-1=https://10.0.1.10:2380 \
  --initial-cluster-token=etcd-cluster-1 \
  --initial-advertise-peer-urls=https://10.0.1.10:2380

# Back up the old data directory
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
if [ -d /var/lib/etcd ]; then
  mv /var/lib/etcd "/var/lib/etcd-old-${TIMESTAMP}"
fi

# Move the restored data into place
mv "${DATA_DIR}" /var/lib/etcd

# Fix ownership
chown -R etcd:etcd /var/lib/etcd 2>/dev/null || true

# Start kubelet (which will start etcd as a static pod)
systemctl start kubelet

echo "Waiting for etcd to become healthy..."
for i in $(seq 1 60); do
  if ETCDCTL_API=3 etcdctl endpoint health \
    --endpoints=https://127.0.0.1:2379 \
    --cacert=/etc/kubernetes/pki/etcd/ca.crt \
    --cert=/etc/kubernetes/pki/etcd/server.crt \
    --key=/etc/kubernetes/pki/etcd/server.key 2>/dev/null; then
    echo "etcd is healthy!"
    break
  fi
  echo "Waiting... ($i/60)"
  sleep 5
done

echo "etcd restore completed. Verify cluster state with: kubectl get nodes"

One important note about etcd restores: when you restore from a snapshot, you get the cluster state at the time the snapshot was taken. Any resources created after the snapshot will be gone. This is why your RPO for cluster state is determined by your etcd snapshot frequency. If you snapshot every 6 hours, your worst-case data loss for cluster state is 6 hours of changes. However, if you are using GitOps (and you should be), you can re-apply all your manifests from the Git repository to bring the cluster back to current state.
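That re-apply from Git should follow your recovery priority tiers so critical services converge first. A sketch of the loop with the ArgoCD call replaced by an echo; app names are illustrative, and in a real recovery you would run argocd app sync where the comment indicates:

```shell
#!/bin/bash
# post-restore-reconcile.sh - after an etcd restore, re-sync applications
# from Git in recovery-tier order so critical services converge first.
# App names are illustrative; the sync command is echoed, not executed.
set -euo pipefail

tier1="payment-api auth-service"
tier2="user-api notification-service"
tier3="blog docs internal-tools"

synced=""
for app in $tier1 $tier2 $tier3; do
  # Real command: argocd app sync "$app" --prune --timeout 300
  echo "syncing ${app}"
  synced="${synced}${app} "
done

echo "sync order: ${synced}"
```

The ordering matters: if ArgoCD tries to converge everything at once on a freshly restored cluster, tier-3 workloads compete with payment-api for scheduler and image-pull bandwidth.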


Multi-region and multi-cluster strategies

For services that need very low RTO, you need your workloads running in multiple regions or clusters simultaneously. There are two main approaches:


Active-Active: Both regions serve traffic simultaneously. If one goes down, the other absorbs all traffic. This gives you the lowest possible RTO (just the time for DNS or load balancer health checks to detect the failure) but it is also the most complex to set up and operate.


Active-Passive: One region serves all traffic, the other is on standby. When the active region fails, you fail over to the passive region. This is simpler but has a longer RTO because you need to detect the failure, make the failover decision, and potentially warm up the passive region.
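For the active-passive case you can put a hard floor on the achievable RTO: detection time plus DNS TTL. A quick sketch using typical Route53 health-check defaults (30-second interval, three consecutive failures, both assumptions for illustration) and a 60-second record TTL:

```shell
#!/bin/bash
# failover-rto-floor.sh - DNS failover cannot beat detection time plus
# record TTL, so this bounds the RTO from below for active-passive setups.
# Health-check values are typical Route53 defaults, used illustratively.

health_check_interval_s=30   # seconds between health checks
failure_threshold=3          # consecutive failures before "unhealthy"
record_ttl_s=60              # DNS record TTL

detection_s=$(( health_check_interval_s * failure_threshold ))
rto_floor_s=$(( detection_s + record_ttl_s ))

echo "RTO floor: ${rto_floor_s}s (${detection_s}s detection + ${record_ttl_s}s TTL)"
```

If your tier-1 RTO target is 5 minutes, this math shows you have almost no slack: a longer TTL or a lazier health check quietly blows the target before any human is even paged.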


Here is a DNS-based failover configuration using external-dns and health checks:


# multi-region/dns-failover.yaml
# Primary region health check
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-primary
  namespace: default
  annotations:
    external-dns.alpha.kubernetes.io/aws-region: us-east-1
spec:
  endpoints:
    - dnsName: app.example.com
      recordTTL: 60
      recordType: A
      targets:
        - 10.0.1.100  # Primary region LB
      setIdentifier: primary
      providerSpecific:
        - name: aws/failover
          value: PRIMARY
        - name: aws/health-check-id
          value: "hc-primary-12345"

---
# Secondary region (failover target)
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
  name: app-secondary
  namespace: default
  annotations:
    external-dns.alpha.kubernetes.io/aws-region: eu-west-1
spec:
  endpoints:
    - dnsName: app.example.com
      recordTTL: 60
      recordType: A
      targets:
        - 10.1.1.100  # Secondary region LB
      setIdentifier: secondary
      providerSpecific:
        - name: aws/failover
          value: SECONDARY
        - name: aws/health-check-id
          value: "hc-secondary-67890"

For multi-cluster management, here is a configuration sync setup using ArgoCD ApplicationSets:


# multi-region/argocd-multi-cluster.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: critical-services
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            environment: production
        values:
          region: "{{metadata.labels.region}}"
  template:
    metadata:
      name: "critical-services-{{name}}"
    spec:
      project: production
      source:
        repoURL: https://github.com/example/k8s-manifests
        targetRevision: main
        path: "clusters/{{values.region}}/critical-services"
      destination:
        server: "{{server}}"
        namespace: production
      syncPolicy:
        automated:
          prune: true
          selfHeal: true
        syncOptions:
          - CreateNamespace=true

With this setup, ArgoCD automatically deploys your critical services to every production cluster. When you add a new cluster, the services get deployed automatically. This is where GitOps really shines for DR: your entire desired state is in Git, and ArgoCD ensures every cluster matches it.


Database DR: cross-region PostgreSQL replication

Databases are usually the hardest part of disaster recovery because they hold state. For PostgreSQL, here is a setup using streaming replication with pgBackRest for cross-region backups:


# database/postgresql-dr.yaml
# Primary PostgreSQL configuration for DR
apiVersion: v1
kind: ConfigMap
metadata:
  name: postgresql-dr-config
  namespace: database
data:
  postgresql.conf: |
    # Replication settings for DR
    wal_level = replica
    max_wal_senders = 10
    wal_keep_size = 1024      # Keep 1GB of WAL for replication lag tolerance
    synchronous_commit = on
    synchronous_standby_names = 'standby_eu_west'

    # Archive settings for point-in-time recovery
    archive_mode = on
    archive_command = 'pgbackrest --stanza=main archive-push %p'
    archive_timeout = 60      # Archive at least every 60 seconds

  pg_hba.conf: |
    # Replication access from secondary region
    hostssl replication replicator 10.1.0.0/16 scram-sha-256
    hostssl replication replicator 10.0.0.0/16 scram-sha-256
    hostssl all all 10.0.0.0/8 scram-sha-256

  pgbackrest.conf: |
    [global]
    repo1-type=s3
    repo1-s3-bucket=pg-backups-primary
    repo1-s3-region=us-east-1
    repo1-s3-endpoint=s3.us-east-1.amazonaws.com
    repo1-retention-full=4
    repo1-retention-diff=14

    # Cross-region backup for DR
    repo2-type=s3
    repo2-s3-bucket=pg-backups-dr
    repo2-s3-region=eu-west-1
    repo2-s3-endpoint=s3.eu-west-1.amazonaws.com
    repo2-retention-full=4
    repo2-retention-diff=14

    [main]
    pg1-path=/var/lib/postgresql/data

And the backup schedule:


# database/pgbackrest-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgbackrest-full-backup
  namespace: database
spec:
  schedule: "0 1 * * 0"  # Full backup every Sunday at 1 AM
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: pgbackrest/pgbackrest:2.50
              command:
                - /bin/bash
                - -c
                - |
                  echo "Starting full backup to both repos..."

                  # Backup to primary region
                  pgbackrest --stanza=main --type=full --repo=1 backup
                  echo "Primary region backup complete"

                  # Backup to DR region
                  pgbackrest --stanza=main --type=full --repo=2 backup
                  echo "DR region backup complete"

                  # Verify both backups
                  pgbackrest --stanza=main --repo=1 info
                  pgbackrest --stanza=main --repo=2 info
          restartPolicy: OnFailure

---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: pgbackrest-diff-backup
  namespace: database
spec:
  schedule: "0 */4 * * *"  # Differential backup every 4 hours
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: backup
              image: pgbackrest/pgbackrest:2.50
              command:
                - /bin/bash
                - -c
                - |
                  pgbackrest --stanza=main --type=diff --repo=1 backup
                  pgbackrest --stanza=main --type=diff --repo=2 backup
                  echo "Differential backups completed"
          restartPolicy: OnFailure

The restore procedure for PostgreSQL when your primary is gone:


#!/bin/bash
# database/pg-dr-restore.sh
# Restore PostgreSQL from pgBackRest backup in DR region

set -euo pipefail

DR_REPO=2  # Use the DR region repository
TARGET_TIME="${1:-}"  # Optional: point-in-time recovery target

echo "=== PostgreSQL DR Restore ==="
echo "Using repository: repo${DR_REPO} (DR region)"

# List available backups
echo "Available backups:"
pgbackrest --stanza=main --repo=${DR_REPO} info

# Stop PostgreSQL if running
pg_ctl stop -D /var/lib/postgresql/data -m fast 2>/dev/null || true

# Clear the data directory
rm -rf /var/lib/postgresql/data/*

if [ -n "${TARGET_TIME}" ]; then
  echo "Restoring to point-in-time: ${TARGET_TIME}"
  pgbackrest --stanza=main --repo=${DR_REPO} \
    --type=time \
    --target="${TARGET_TIME}" \
    --target-action=promote \
    restore
else
  echo "Restoring latest backup..."
  pgbackrest --stanza=main --repo=${DR_REPO} \
    --type=default \
    restore
fi

# Start PostgreSQL
pg_ctl start -D /var/lib/postgresql/data

# Wait for recovery to complete
echo "Waiting for recovery..."
until pg_isready; do
  sleep 2
done

echo "PostgreSQL restored and ready"

# Verify data integrity
psql -c "SELECT count(*) as total_tables FROM information_schema.tables WHERE table_schema = 'public';"
psql -c "SELECT pg_size_pretty(pg_database_size(current_database())) as db_size;"

DR testing and drills

The best DR plan in the world is worthless if you have never tested it. DR drills are how you turn a theoretical plan into a proven capability. There are three levels of DR testing:


  1. Tabletop exercises: The team walks through the DR plan on paper. No actual systems are affected. This is good for finding gaps in documentation and communication plans.
  2. Component drills: You test individual components of the plan, like restoring a Velero backup or failing over DNS. This validates that the tools and procedures work.
  3. Full DR simulation: You simulate a complete disaster and execute the full recovery plan. This is the gold standard, and it is scary, which is exactly why you need to do it.

Here is a tabletop exercise template:


# dr-drills/tabletop-exercise.yaml
exercise:
  name: "Q1 2026 DR Tabletop Exercise"
  date: "2026-03-20"
  duration: "2 hours"
  facilitator: "Senior SRE"
  participants:
    - platform-team
    - database-team
    - application-team
    - engineering-management

scenario:
  description: |
    At 2:30 AM on a Tuesday, the primary cloud region (us-east-1)
    experiences a complete outage. All services in the region are
    unreachable. The cloud provider estimates 4-6 hours for recovery.
    Your payment-api is processing $5,000 per hour in transactions.

  timeline:
    - time: "T+0"
      event: "PagerDuty fires alerts for all services in us-east-1"
      question: "Who gets paged? What is the escalation path?"

    - time: "T+5min"
      event: "On-call engineer confirms the region is down"
      question: "What is the first action? Who makes the failover decision?"

    - time: "T+10min"
      event: "Incident commander declares disaster, initiates DR plan"
      question: "What communication goes out? To whom? Through which channels?"

    - time: "T+15min"
      event: "DR lead begins failover procedure"
      question: "What are the exact steps? Walk through the runbook."

    - time: "T+30min"
      event: "DNS failover complete for tier-1 services"
      question: "How do you verify services are healthy in the DR region?"

    - time: "T+1hr"
      event: "Tier-2 services restored from replicas"
      question: "What data was lost? How do you reconcile?"

    - time: "T+4hr"
      event: "Primary region comes back online"
      question: "Do you fail back immediately? What is the failback procedure?"

  discussion_questions:
    - "Where are the gaps in our current DR plan?"
    - "Do we have all the access and credentials needed for DR?"
    - "What would happen if the person who knows how to do X is unavailable?"
    - "Are our backups actually restorable? When did we last test?"
    - "What is our communication plan for customers?"

For live DR drills, here is a structured approach:


# dr-drills/live-drill-plan.yaml
drill:
  name: "Q1 2026 Live DR Drill"
  date: "2026-03-25"
  time: "10:00 AM - 2:00 PM"
  type: "component"  # Options: tabletop, component, full
  environment: "staging"  # Always start with staging

  pre_drill_checklist:
    - "All participants confirmed and available"
    - "Stakeholders notified about potential staging impact"
    - "Monitoring dashboards open for staging environment"
    - "Rollback procedures reviewed and ready"
    - "DR region/cluster verified accessible"
    - "Latest backups verified available"
    - "Communication channels tested"

  scenarios:
    - name: "Velero backup restore"
      objective: "Verify we can restore a namespace from Velero backup"
      steps:
        - "Delete the test-app namespace in staging"
        - "Restore from latest Velero backup"
        - "Verify all resources are recreated"
        - "Verify the application is functional"
      success_criteria:
        - "All deployments running with correct replica count"
        - "All services and ingresses recreated"
        - "Application responds to health checks"
        - "Persistent data is present and correct"
      max_duration: "30 minutes"

    - name: "etcd snapshot restore"
      objective: "Verify we can restore etcd from a snapshot"
      steps:
        - "Take a fresh etcd snapshot"
        - "Create some test resources (deployment, service, configmap)"
        - "Restore from the snapshot (before the test resources)"
        - "Verify test resources are gone (proving the restore worked)"
        - "Verify pre-existing resources are intact"
      success_criteria:
        - "etcd restore completes without errors"
        - "Cluster is functional after restore"
        - "Test resources are absent (proving point-in-time restore)"
      max_duration: "45 minutes"

    - name: "Database failover"
      objective: "Verify PostgreSQL failover to read replica"
      steps:
        - "Verify replication lag is zero"
        - "Simulate primary failure (stop primary pod)"
        - "Promote read replica to primary"
        - "Update application connection strings"
        - "Verify application writes succeed on new primary"
      success_criteria:
        - "Failover completes within RTO target"
        - "No data loss (RPO target met)"
        - "Application functions normally on new primary"
      max_duration: "30 minutes"

  post_drill:
    - "Restore staging to normal state"
    - "Document all findings"
    - "Create issues for any failures or gaps found"
    - "Update DR plan based on findings"
    - "Share results with the broader team"
    - "Schedule next drill"
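The database failover scenario above boils down to one decision: is the standby caught up enough to promote without blowing the RPO? Here is a sketch of that decision logic with the psql query stubbed out; during a real drill you would replace the stub with the commented query and run pg_ctl promote on the standby:

```shell
#!/bin/bash
# pg-failover-drill.sh - promote the standby only when replication lag is
# within the RPO budget. The psql call is stubbed so this runs standalone;
# the commented query is what you would actually run.
set -euo pipefail

max_lag_bytes=0   # an RPO of zero means the standby must be fully caught up

replication_lag_bytes() {
  # Real query (on the primary, or from monitoring history if it is dead):
  #   psql -Atc "SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)
  #              FROM pg_stat_replication
  #              WHERE application_name = 'standby_eu_west';"
  echo 0
}

lag=$(replication_lag_bytes)
if [ "$lag" -le "$max_lag_bytes" ]; then
  decision="PROMOTE"
  # Real command on the standby: pg_ctl promote -D /var/lib/postgresql/data
else
  decision="HOLD: standby is ${lag} bytes behind, promoting now loses data"
fi
echo "$decision"
```

Encoding the promote/hold decision as a script means the drill tests the same logic you would lean on at 2:30 AM, not a slightly different mental version of it.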

You should also tie DR drills into your chaos engineering practice. A chaos experiment that simulates a zone failure is essentially a lightweight DR drill. If you are already running chaos experiments regularly (as we discussed in the chaos engineering article), you are building the muscle memory your team needs for real disasters.


Runbook for full cluster recovery

This is the big one: your cluster is gone and you need to rebuild from scratch. Here is a step-by-step runbook that covers the full recovery process:


# runbooks/full-cluster-recovery.yaml
runbook:
  name: "Full Kubernetes Cluster Recovery"
  version: "1.3"
  last_tested: "2026-03-15"
  estimated_time: "2-4 hours"
  prerequisites:
    - "Access to cloud provider console/CLI"
    - "Access to etcd backup storage (S3)"
    - "Access to Velero backup storage (S3)"
    - "Access to GitOps repository"
    - "Access to container registry"
    - "DNS management access"
    - "TLS certificates or cert-manager configuration"

  phases:
    - phase: 1
      name: "Infrastructure provisioning"
      estimated_time: "30-60 minutes"
      steps:
        - step: 1.1
          action: "Provision new compute nodes"
          command: |
            # Using Terraform (assuming state is in remote backend)
            cd infrastructure/terraform/kubernetes
            terraform plan -var="cluster_name=prod-recovery"
            terraform apply -auto-approve
          verification: |
            # Verify nodes are provisioned
            kubectl get nodes
            # Expected: all nodes in Ready state

        - step: 1.2
          action: "Verify networking"
          command: |
            # Check CNI and cluster DNS are functional
            kubectl run nettest-dns --image=busybox --rm -it --restart=Never -- nslookup kubernetes.default
            # Check external connectivity (distinct pod name so it does not
            # collide with the previous test pod while it terminates)
            kubectl run nettest-egress --image=busybox --rm -it --restart=Never -- wget -qO- https://hub.docker.com
          verification: "DNS resolution and external connectivity working"

        - step: 1.3
          action: "Verify storage provisioner"
          command: |
            kubectl get storageclass
            # Create a test PVC
            kubectl apply -f - <<EOF
            apiVersion: v1
            kind: PersistentVolumeClaim
            metadata:
              name: test-pvc
            spec:
              accessModes: [ReadWriteOnce]
              resources:
                requests:
                  storage: 1Gi
            EOF
            kubectl get pvc test-pvc
          verification: "PVC transitions to Bound state"

    - phase: 2
      name: "Core infrastructure recovery"
      estimated_time: "20-30 minutes"
      steps:
        - step: 2.1
          action: "Restore etcd from backup (if applicable)"
          command: |
            # Download latest snapshot from S3
            aws s3 cp s3://etcd-backups-prod/latest/etcd-snapshot.db /tmp/
            # Verify snapshot
            ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-snapshot.db
            # Restore (see etcd-restore.sh)
            bash /scripts/etcd-restore.sh /tmp/etcd-snapshot.db
          verification: "kubectl get nodes returns expected node list"

        - step: 2.2
          action: "Install ArgoCD"
          command: |
            kubectl create namespace argocd
            kubectl apply -n argocd -f \
              https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
            # Wait for ArgoCD to be ready
            kubectl wait --for=condition=available deployment/argocd-server \
              -n argocd --timeout=300s
            # Configure the GitOps repository
            argocd repo add https://github.com/example/k8s-manifests \
              --username git --password "${GIT_TOKEN}"
          verification: "ArgoCD UI accessible, repository connected"

        - step: 2.3
          action: "Deploy cert-manager"
          command: |
            helm repo add jetstack https://charts.jetstack.io
            helm install cert-manager jetstack/cert-manager \
              --namespace cert-manager --create-namespace \
              --set installCRDs=true
            # Apply ClusterIssuer
            kubectl apply -f manifests/cert-manager/cluster-issuer.yaml
          verification: "cert-manager pods running, ClusterIssuer ready"

        - step: 2.4
          action: "Deploy ingress controller"
          command: |
            helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
            helm install ingress-nginx ingress-nginx/ingress-nginx \
              --namespace ingress-nginx --create-namespace \
              --values manifests/ingress-nginx/values.yaml
          verification: "Ingress controller has external IP assigned"

    - phase: 3
      name: "Data recovery"
      estimated_time: "30-60 minutes"
      steps:
        - step: 3.1
          action: "Restore databases from backup"
          command: |
            # Deploy PostgreSQL operator
            kubectl apply -f manifests/database/operator.yaml
            # Wait for operator
            kubectl wait --for=condition=available deployment/postgres-operator \
              --timeout=300s
            # Restore from pgBackRest backup
            bash /scripts/pg-dr-restore.sh
          verification: |
            psql -c "SELECT count(*) FROM users;"
            # Compare with expected count from backup manifest

        - step: 3.2
          action: "Restore Velero and recover persistent volumes"
          command: |
            # Install Velero
            velero install --provider aws ...
            # Restore critical namespaces
            velero restore create dr-critical \
              --from-backup critical-services-hourly-latest
            # Verify restore
            velero restore describe dr-critical
          verification: "All PVCs bound, data verified"

    - phase: 4
      name: "Application recovery"
      estimated_time: "30-45 minutes"
      steps:
        - step: 4.1
          action: "Sync all ArgoCD applications"
          command: |
            # Apply the app-of-apps pattern
            kubectl apply -f manifests/argocd/app-of-apps.yaml
            # Force sync all applications
            argocd app sync --all --prune
            # Wait for all apps to be healthy
            argocd app wait --all --health --timeout 600
          verification: "All ArgoCD applications in Synced and Healthy state"

        - step: 4.2
          action: "Verify tier-1 services"
          command: |
            # Check payment-api
            curl -f https://payment-api.example.com/health
            # Check auth-service
            curl -f https://auth.example.com/health
            # Run integration tests against recovered services
            ./scripts/integration-tests.sh --target=production
          verification: "All health checks passing, integration tests green"

        - step: 4.3
          action: "Verify tier-2 and tier-3 services"
          command: |
            # Check all remaining services
            for svc in user-api notifications blog docs; do
              curl -f "https://${svc}.example.com/health" || echo "WARN: ${svc} not ready"
            done
          verification: "All services responding"

    - phase: 5
      name: "DNS and traffic cutover"
      estimated_time: "10-15 minutes"
      steps:
        - step: 5.1
          action: "Update DNS to point to recovered cluster"
          command: |
            # Update Route53 records
            aws route53 change-resource-record-sets \
              --hosted-zone-id Z1234567890 \
              --change-batch file://dns-changes.json

            # Verify DNS propagation
            for domain in app auth payment-api; do
              dig +short ${domain}.example.com
            done
          verification: "DNS resolving to new cluster IPs"

        - step: 5.2
          action: "Gradually increase traffic"
          command: |
            # If using weighted routing, gradually shift traffic
            # Start with 10%, then 50%, then 100%
            aws route53 change-resource-record-sets \
              --hosted-zone-id Z1234567890 \
              --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"app.example.com","Type":"A","SetIdentifier":"recovered","Weight":10,"TTL":60,"ResourceRecords":[{"Value":"NEW_IP"}]}}]}'
          verification: "Traffic flowing to recovered cluster, no errors"

    - phase: 6
      name: "Post-recovery validation"
      estimated_time: "30 minutes"
      steps:
        - step: 6.1
          action: "Run full smoke test suite"
          command: |
            ./scripts/smoke-tests.sh --environment=production
          verification: "All smoke tests passing"

        - step: 6.2
          action: "Verify monitoring and alerting"
          command: |
            # Check Prometheus is scraping
            curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
            # Verify Grafana dashboards
            curl -f http://grafana:3000/api/health
            # Check alert rules are loaded
            curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups | length'
          verification: "Monitoring stack fully operational"

        - step: 6.3
          action: "Document recovery results"
          command: |
            # Create a post-recovery report
            echo "Recovery completed at: $(date)"
            echo "Total recovery time: X hours Y minutes"
            echo "Data loss window: etcd snapshot age + WAL gap"
            echo "Services recovered: all / partial"
            echo "Issues encountered: ..."
          verification: "Report shared with stakeholders"

The runbook is long, and it should be. Every step includes a verification because during a disaster you cannot afford to skip ahead and hope things work: confirm each step before moving to the next.
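One way to enforce that discipline is a small wrapper that runs each step's command and then its verification, and refuses to report success unless both pass. A sketch (the example step commands are placeholders, not part of the runbook above):

```shell
#!/usr/bin/env bash
# run_step NAME COMMAND VERIFICATION
# Runs a recovery step, then its verification check; returns non-zero
# if either fails so the runbook driver can stop instead of plowing on.
run_step() {
  local name="$1" cmd="$2" verify="$3"
  echo "==> STEP: ${name}"
  if ! bash -c "${cmd}"; then
    echo "    FAILED (command): ${name}" >&2
    return 1
  fi
  if ! bash -c "${verify}"; then
    echo "    FAILED (verification): ${name}" >&2
    return 1
  fi
  echo "    OK: ${name}"
}

# Example usage (real steps would call kubectl, velero, etc.):
# run_step "nodes ready" \
#   "kubectl get nodes" \
#   "kubectl get nodes | grep -q ' Ready '" || exit 1
```

Chaining the calls with `|| exit 1` makes the gate explicit: a failed verification halts the recovery instead of letting later phases run against a broken foundation.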


Communication during disasters

Communication is often the weakest link during a disaster. People are stressed, multiple teams are involved, and customers are impacted. Having pre-written communication templates saves valuable time and ensures nothing important gets missed.


Here is a set of communication templates:


# communication/disaster-templates.yaml
templates:
  internal_declaration:
    channel: "#incident-war-room"
    template: |
      @here DISASTER DECLARED - DR Plan Activated

      What happened: [Brief description of the failure]
      Impact: [Which services are affected]
      Severity: [SEV-1]
      Incident Commander: [Name]
      DR Lead: [Name]
      Communications Lead: [Name]

      Current status: Executing DR plan phase 1 (infrastructure provisioning)
      Expected recovery time: [X hours based on RTO targets]

      War room: [Link to video call]
      Status page: https://status.example.com
      DR runbook: [Link to runbook]

      Updates will be posted every 15 minutes in this channel.

  customer_initial:
    channel: "status page"
    template: |
      Title: Service Disruption - [Affected Services]
      Status: Investigating

      We are currently experiencing a disruption affecting
      [list affected services]. Our team has been engaged and is
      actively working on recovery.

      We will provide an update within 30 minutes.

      Affected services:
      - [Service 1]: [Status]
      - [Service 2]: [Status]

  customer_update:
    channel: "status page"
    template: |
      Title: Service Disruption - Update
      Status: Identified / Recovering

      Update: We have identified the issue as [brief, non-technical
      description]. Our team is executing our disaster recovery plan.

      Current progress:
      - Infrastructure: [Restored / In progress]
      - Critical services: [Restored / In progress]
      - All services: [Restored / In progress]

      Estimated time to full recovery: [X hours]
      Next update: [Time]

  customer_resolved:
    channel: "status page"
    template: |
      Title: Service Disruption - Resolved
      Status: Resolved

      The service disruption that began at [start time] has been
      fully resolved as of [resolution time].

      Root cause: [Brief, non-technical description]
      Duration: [X hours Y minutes]
      Data impact: [None / Transactions between X and Y may need review]

      We will be publishing a detailed post-incident report within
      5 business days. We apologize for the disruption and are taking
      steps to prevent similar issues in the future.

  internal_update_cadence:
    description: "How often to post updates during DR"
    schedule:
      - phase: "First hour"
        frequency: "Every 15 minutes"
      - phase: "Hours 2-4"
        frequency: "Every 30 minutes"
      - phase: "After hour 4"
        frequency: "Every hour"
      - phase: "Post-recovery"
        frequency: "Final summary within 1 hour of resolution"

A few key points about disaster communication:


  • Do not wait until you have all the answers to communicate. “We are aware of the issue and investigating” is infinitely better than silence.
  • Use pre-written templates. During a disaster, your brain is not at its best. Templates prevent you from forgetting important details or saying the wrong thing.
  • Separate internal and external communication. Internal messages can be technical and detailed. External messages should be clear, non-technical, and empathetic.
  • Set a cadence and stick to it. Saying “next update in 30 minutes” and then going silent for 2 hours destroys trust. If you have nothing new to say, post “No significant change, still working on recovery.”
  • Assign a dedicated communications person. The people doing the recovery should not also be writing status page updates. Split those responsibilities.
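To make the templates faster to use under stress, a tiny helper can substitute the bracketed placeholders so the communications lead only supplies the facts. A sketch (the function name and placeholders are made up for illustration):

```shell
#!/usr/bin/env bash
# fill TEMPLATE [KEY VALUE]... -- replaces every [KEY] in TEMPLATE with VALUE
fill() {
  local text="$1"; shift
  while [ "$#" -ge 2 ]; do
    text="${text//\[$1\]/$2}"   # bash pattern substitution, literal brackets
    shift 2
  done
  printf '%s\n' "$text"
}

# Example: render an initial customer-facing update
template="We are currently experiencing a disruption affecting [services].
We will provide an update within [window]."
fill "$template" services "checkout and auth" window "30 minutes"
```

Any placeholder you forget to fill stays visibly bracketed in the output, which is itself a useful pre-send check.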

Putting it all together: a DR maturity model

Just like we discussed chaos engineering maturity levels in the chaos engineering article, here is a maturity model for disaster recovery:


  • Level 0 - Hope: No DR plan, no backups, no idea what would happen. (Surprisingly common)
  • Level 1 - Documented: DR plan exists on paper but has never been tested. Backups exist but have never been restored.
  • Level 2 - Tested components: Individual DR components (backup restore, DNS failover) have been tested. Tabletop exercises completed.
  • Level 3 - Drilled: Full DR simulations have been run. The team has practiced the entire recovery process. RTO and RPO targets have been validated.
  • Level 4 - Automated: DR failover is automated and can be triggered with a single command. Regular automated DR tests validate the plan continuously.

Most teams are at Level 1 or Level 2. Getting to Level 3 is where the real confidence comes from. You do not need full automation (Level 4) to be prepared, but you absolutely need to have practiced the process at least once.


Closing notes

Disaster recovery is not glamorous work. Nobody gets excited about writing backup scripts and communication templates. But when disaster strikes, and it will eventually, the difference between a team that has practiced recovery and a team that has not is the difference between a few hours of downtime and a catastrophic, company-threatening event.


The key takeaways from this article are:


  • Define RPO and RTO targets based on business impact, not technical convenience.
  • Back up everything and store backups in a different region than your primary infrastructure.
  • Test your backups regularly. A backup that has never been restored is not a backup.
  • Write detailed runbooks with verification steps for every action.
  • Practice, practice, practice. Run DR drills at least quarterly.
  • Prepare communication templates before you need them.

Start small. If you have no DR plan today, start by setting up Velero backups and etcd snapshots. Then write a basic runbook. Then test it. Then iterate. Each step makes you more prepared than you were before, and being slightly prepared is infinitely better than not being prepared at all.
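A minimal starting point could look like this Velero Schedule (a sketch; the cron expression, TTL, and storage location are assumptions to adapt to your environment):

```yaml
# backup/daily-schedule.yaml (illustrative)
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-full
  namespace: velero
spec:
  schedule: "0 3 * * *"        # every day at 03:00 UTC
  template:
    includedNamespaces: ["*"]  # back up everything to start with
    storageLocation: default   # should point at a different region
    ttl: 720h0m0s              # keep 30 days of backups
```

Pair it with a cron job that runs `etcdctl snapshot save` on the control plane and ships the snapshot off-cluster, and you have covered both workload state and cluster state.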


Hope you found this useful and enjoyed reading it, until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

You can also check the blog's source code and change history here.


