SRE: Disaster Recovery and Business Continuity
Introduction
Throughout this SRE series we have built a comprehensive toolkit for running reliable systems. We covered SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, dependency management, database reliability, release engineering, and security as code. We have metrics, alerts, incident response, and chaos experiments in place. But there is one question we have not fully addressed yet: what happens when everything goes down at once?
“Hope is not a strategy” is a saying you hear often in SRE circles, and nowhere does it apply more than in disaster recovery. A single availability zone going dark, a botched cluster upgrade, a ransomware attack, or even an accidental kubectl delete namespace production can wipe out your entire workload. The question is not if a disaster will happen, but when, and whether you will be ready for it.
In this article we will cover everything you need to build a solid disaster recovery (DR) and business continuity plan for Kubernetes environments. We will go from defining RPO and RTO targets all the way to Velero backups, etcd recovery, multi-region strategies, DR drills, communication plans, and step-by-step runbooks for full cluster recovery.
Let’s get into it.
RPO and RTO: defining your recovery targets
Before you can plan for disaster recovery, you need to answer two fundamental questions:
- RPO (Recovery Point Objective): How much data can you afford to lose? If your RPO is 1 hour, you need backups at least every hour. If your RPO is zero, you need synchronous replication.
- RTO (Recovery Time Objective): How long can your service be down? If your RTO is 15 minutes, you need automated failover. If your RTO is 4 hours, manual recovery might be acceptable.
These targets are not technical decisions; they are business decisions. You need to sit down with stakeholders and understand the actual cost of downtime and data loss for each service. A payment processing system has very different requirements than an internal wiki.
Here is a simple business impact analysis template to guide those conversations:
# dr-plan/business-impact-analysis.yaml
services:
- name: payment-api
tier: critical
rpo: "0 minutes" # Zero data loss
rto: "5 minutes" # Automated failover required
data_classification: pci
revenue_impact_per_hour: "$50,000"
dependencies:
- postgresql-primary
- redis-sessions
- stripe-api
backup_strategy: synchronous-replication
failover_strategy: active-active
- name: user-api
tier: high
rpo: "15 minutes"
rto: "30 minutes"
data_classification: pii
revenue_impact_per_hour: "$10,000"
dependencies:
- postgresql-primary
- redis-cache
backup_strategy: streaming-replication
failover_strategy: active-passive
- name: blog
tier: medium
rpo: "24 hours"
rto: "4 hours"
data_classification: public
revenue_impact_per_hour: "$0"
dependencies:
- postgresql-primary
backup_strategy: daily-snapshots
failover_strategy: rebuild-from-backup
- name: internal-tools
tier: low
rpo: "24 hours"
rto: "24 hours"
data_classification: internal
revenue_impact_per_hour: "$500"
dependencies:
- postgresql-primary
backup_strategy: daily-snapshots
failover_strategy: rebuild-from-backup
The key insight here is that not every service needs the same level of protection. Over-engineering DR for a low-tier service wastes money, while under-engineering it for a critical service creates real risk. Tier your services and plan accordingly.
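Once the tiers are written down, you can keep them honest by checking backup freshness against RPO mechanically. Here is a minimal sketch of such a check; the timestamps and thresholds are illustrative, and in practice the backup age would come from your backup tooling (for example, `velero backup get -o json`):

```shell
#!/bin/bash
# rpo-check.sh - Flag a service whose latest backup is older than its RPO.
# Illustrative sketch: in practice the backup timestamp would come from your
# backup tooling rather than a hardcoded value.
set -euo pipefail

# rpo_ok <backup-unix-timestamp> <rpo-minutes>
# Prints OK/VIOLATION and returns 0/1 accordingly.
rpo_ok() {
  local backup_epoch="$1" rpo_minutes="$2"
  local now age_minutes
  now=$(date +%s)
  age_minutes=$(( (now - backup_epoch) / 60 ))
  if [ "${age_minutes}" -le "${rpo_minutes}" ]; then
    echo "OK: backup is ${age_minutes}m old (RPO ${rpo_minutes}m)"
  else
    echo "VIOLATION: backup is ${age_minutes}m old (RPO ${rpo_minutes}m)"
    return 1
  fi
}

# Example: a backup taken 10 minutes ago, checked against a 15-minute RPO
rpo_ok "$(( $(date +%s) - 600 ))" 15
```

Run on a schedule, a check like this turns the RPO column of your business impact analysis from documentation into an alert.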
DR plan template
Every organization needs a documented, tested, and regularly updated DR plan. Here is a structured template that covers the essentials:
# dr-plan/disaster-recovery-plan.yaml
metadata:
version: "2.1"
last_updated: "2026-03-15"
next_review: "2026-06-15"
owner: "platform-team"
approver: "vp-engineering"
scope:
environments:
- production
- staging
regions:
- primary: us-east-1
- secondary: eu-west-1
clusters:
- prod-primary (us-east-1)
- prod-secondary (eu-west-1)
roles_and_responsibilities:
incident_commander:
name: "Rotating on-call lead"
responsibilities:
- Declare disaster
- Coordinate recovery
- Authorize failover decisions
- Communicate with leadership
dr_lead:
name: "Senior SRE on-call"
responsibilities:
- Execute recovery runbooks
- Verify backup integrity
- Coordinate infrastructure recovery
- Run post-recovery validation
communications_lead:
name: "Engineering manager on-call"
responsibilities:
- Update status page
- Notify customers
- Coordinate with support team
- Send internal updates
database_lead:
name: "DBA on-call"
responsibilities:
- Verify database backups
- Execute database recovery
- Validate data integrity
- Monitor replication lag
activation_criteria:
- "Complete loss of primary region availability"
- "Primary Kubernetes cluster unrecoverable"
- "Data corruption affecting critical services"
- "Security breach requiring infrastructure rebuild"
- "Cloud provider outage exceeding 30 minutes"
communication_channels:
primary: "Slack #incident-war-room"
secondary: "PagerDuty conference bridge"
tertiary: "Personal phone numbers (see emergency contacts doc)"
status_page: "https://status.example.com"
recovery_priority:
- tier: 1
services: [payment-api, auth-service]
target_rto: "5 minutes"
action: "Automated DNS failover to secondary region"
- tier: 2
services: [user-api, notification-service]
target_rto: "30 minutes"
action: "Restore from replica in secondary region"
- tier: 3
services: [blog, docs, internal-tools]
target_rto: "4 hours"
action: "Rebuild from backups and GitOps repo"
Notice that the plan has a version, an owner, and a scheduled review date. A DR plan that was written two years ago and never updated is worse than no plan at all because it gives you false confidence. Review your DR plan quarterly and update it every time your infrastructure changes.
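That review cadence is easy to enforce mechanically. Here is a small sketch that fails (for example, in CI) once the plan's next_review date has passed; the sample plan file and its dates are illustrative:

```shell
#!/bin/bash
# dr-plan-freshness.sh - Fail (for example, in CI) once the DR plan's
# next_review date has passed. The sample plan file below is illustrative.
set -euo pipefail

# check_plan_freshness <plan.yaml>
# Expects a line like: next_review: "YYYY-MM-DD" (as in the template above).
check_plan_freshness() {
  local plan="$1" next_review today
  next_review=$(grep -E 'next_review:' "${plan}" \
    | sed -E 's/.*"([0-9]{4}-[0-9]{2}-[0-9]{2})".*/\1/')
  today=$(date +%Y-%m-%d)
  # ISO dates compare correctly as plain strings.
  if [[ "${today}" > "${next_review}" ]]; then
    echo "STALE: DR plan review was due ${next_review} (today is ${today})"
    return 1
  fi
  echo "FRESH: DR plan review due ${next_review}"
}

# Demo against a sample plan with a far-future review date
printf 'metadata:\n  next_review: "2099-01-01"\n' > /tmp/sample-plan.yaml
check_plan_freshness /tmp/sample-plan.yaml
```

Wiring this into the pipeline that deploys your infrastructure means a stale DR plan blocks changes until someone reviews it.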
Velero for Kubernetes backup
Velero is the standard tool for backing up Kubernetes resources and persistent volumes. It can back up your entire cluster state (or specific namespaces) and restore it to the same or a different cluster.
Install Velero with the AWS plugin (works with S3-compatible storage including MinIO):
# Install Velero CLI
# (pick the tarball matching your OS and CPU architecture; this example uses linux-arm64)
wget https://github.com/vmware-tanzu/velero/releases/download/v1.13.0/velero-v1.13.0-linux-arm64.tar.gz
tar -xvf velero-v1.13.0-linux-arm64.tar.gz
sudo mv velero-v1.13.0-linux-arm64/velero /usr/local/bin/
# Install Velero in the cluster
velero install \
--provider aws \
--plugins velero/velero-plugin-for-aws:v1.9.0 \
--bucket velero-backups \
--secret-file ./credentials-velero \
--backup-location-config region=us-east-1,s3ForcePathStyle=true,s3Url=https://s3.us-east-1.amazonaws.com \
--snapshot-location-config region=us-east-1 \
--use-node-agent \
--default-volumes-to-fs-backup
Now set up scheduled backups. The key is to have different backup schedules for different tiers of services:
# velero/backup-schedule-critical.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: critical-services-hourly
namespace: velero
spec:
schedule: "0 * * * *" # Every hour
template:
includedNamespaces:
- payment-system
- auth-system
includedResources:
- deployments
- services
- configmaps
- secrets
- persistentvolumeclaims
- persistentvolumes
- ingresses
- horizontalpodautoscalers
defaultVolumesToFsBackup: true
storageLocation: default
ttl: 168h # Keep for 7 days
metadata:
labels:
tier: critical
backup-type: scheduled
# velero/backup-schedule-standard.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: standard-services-daily
namespace: velero
spec:
schedule: "0 2 * * *" # Daily at 2 AM
template:
includedNamespaces:
- default
- blog
- monitoring
- ingress-nginx
excludedResources:
- events
- pods
defaultVolumesToFsBackup: true
storageLocation: default
ttl: 720h # Keep for 30 days
metadata:
labels:
tier: standard
backup-type: scheduled
# velero/backup-schedule-full.yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
name: full-cluster-weekly
namespace: velero
spec:
schedule: "0 3 * * 0" # Every Sunday at 3 AM
template:
includedNamespaces:
- "*"
excludedNamespaces:
- velero
- kube-system
excludedResources:
- events
- pods
defaultVolumesToFsBackup: true
storageLocation: default
ttl: 2160h # Keep for 90 days
metadata:
labels:
backup-type: full-cluster
To restore from a Velero backup, first check what backups are available:
# List available backups
velero backup get
# Describe a specific backup to see what it contains
velero backup describe critical-services-hourly-20260328120000
# Restore to a new namespace (for testing)
velero restore create test-restore \
--from-backup critical-services-hourly-20260328120000 \
--namespace-mappings payment-system:payment-system-restored
# Restore to the original namespace (for actual DR)
velero restore create dr-restore \
--from-backup critical-services-hourly-20260328120000
# Check restore status
velero restore describe dr-restore
velero restore logs dr-restore
One critical thing people miss: you need to regularly test your backups by actually restoring them. A backup that has never been tested is not a backup; it is a hope. Set up a weekly job that restores your latest backup to a test namespace and validates the resources. Note that the job needs a container image that ships both the velero and kubectl CLIs, so you may need to build a small custom image:
# velero/backup-validation-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: validate-velero-backups
namespace: velero
spec:
schedule: "0 6 * * 1" # Every Monday at 6 AM
jobTemplate:
spec:
template:
spec:
serviceAccountName: velero-validator
containers:
- name: validator
image: bitnami/kubectl:1.29
command:
- /bin/bash
- -c
- |
set -euo pipefail
echo "=== Velero Backup Validation ==="
LATEST_BACKUP=$(velero backup get -o json | \
jq -r '.items | sort_by(.metadata.creationTimestamp) | last | .metadata.name')
echo "Latest backup: ${LATEST_BACKUP}"
# Create a test restore
velero restore create validation-${LATEST_BACKUP} \
--from-backup ${LATEST_BACKUP} \
--namespace-mappings default:validation-test
# Wait for restore to complete
sleep 120
# Check restore status
RESTORE_STATUS=$(velero restore get validation-${LATEST_BACKUP} -o json | \
jq -r '.status.phase')
if [ "$RESTORE_STATUS" = "Completed" ]; then
echo "PASS: Restore completed successfully"
else
echo "FAIL: Restore status is ${RESTORE_STATUS}"
# Send alert to PagerDuty or Slack
curl -X POST "$SLACK_WEBHOOK" \
-H 'Content-Type: application/json' \
-d "{\"text\": \"Velero backup validation FAILED for ${LATEST_BACKUP}\"}"
fi
# Clean up the test namespace
kubectl delete namespace validation-test --ignore-not-found=true
env:
- name: SLACK_WEBHOOK
valueFrom:
secretKeyRef:
name: slack-webhook
key: url
restartPolicy: OnFailure
etcd backup and restore
etcd is the brain of your Kubernetes cluster. It stores all cluster state, including deployments, services, secrets, configmaps, and RBAC policies. If you lose etcd and you do not have a backup, you lose your entire cluster. Everything else can be rebuilt from GitOps, but etcd is the one piece that holds the live state.
Here is a script for automated etcd snapshots:
#!/bin/bash
# etcd-backup.sh - Automated etcd snapshot backup
# Run this as a CronJob on one of the control plane nodes
set -euo pipefail
BACKUP_DIR="/var/backups/etcd"
S3_BUCKET="s3://etcd-backups-prod"
RETENTION_DAYS=30
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
SNAPSHOT_FILE="${BACKUP_DIR}/etcd-snapshot-${TIMESTAMP}.db"
# Create backup directory
mkdir -p "${BACKUP_DIR}"
echo "[$(date)] Starting etcd backup..."
# Take the snapshot
ETCDCTL_API=3 etcdctl snapshot save "${SNAPSHOT_FILE}" \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Verify the snapshot
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT_FILE}" \
--write-out=table
# Get the snapshot size for logging
SNAPSHOT_SIZE=$(du -h "${SNAPSHOT_FILE}" | cut -f1)
echo "[$(date)] Snapshot created: ${SNAPSHOT_FILE} (${SNAPSHOT_SIZE})"
# Upload to S3
aws s3 cp "${SNAPSHOT_FILE}" \
"${S3_BUCKET}/etcd-snapshot-${TIMESTAMP}.db" \
--storage-class STANDARD_IA
echo "[$(date)] Snapshot uploaded to ${S3_BUCKET}"
# Calculate checksum and upload it alongside the snapshot
sha256sum "${SNAPSHOT_FILE}" > "${SNAPSHOT_FILE}.sha256"
aws s3 cp "${SNAPSHOT_FILE}.sha256" \
"${S3_BUCKET}/etcd-snapshot-${TIMESTAMP}.db.sha256"
# Clean up old local backups
find "${BACKUP_DIR}" -name "etcd-snapshot-*.db" -mtime +${RETENTION_DAYS} -delete
find "${BACKUP_DIR}" -name "etcd-snapshot-*.sha256" -mtime +${RETENTION_DAYS} -delete
# Clean up old S3 backups using lifecycle policies
echo "[$(date)] etcd backup completed successfully"
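For the S3 side of retention that the script leaves to lifecycle policies, a bucket lifecycle rule along these lines would work; the prefix and day count are illustrative and should match your bucket layout:

```json
{
  "Rules": [
    {
      "ID": "expire-old-etcd-snapshots",
      "Status": "Enabled",
      "Filter": { "Prefix": "etcd-snapshot-" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```

You would apply this once with aws s3api put-bucket-lifecycle-configuration against the backup bucket, rather than deleting objects from the backup job itself.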
Schedule this as a CronJob on your control plane. Keep in mind that the stock etcd image does not include the aws CLI the script calls, so either bake a custom image containing both etcdctl and the aws CLI, or move the S3 upload into a separate step:
# etcd/backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: etcd-backup
namespace: kube-system
spec:
schedule: "0 */6 * * *" # Every 6 hours
concurrencyPolicy: Forbid
successfulJobsHistoryLimit: 3
failedJobsHistoryLimit: 3
jobTemplate:
spec:
template:
spec:
nodeName: control-plane-1 # Pin to a control plane node
hostNetwork: true
containers:
- name: etcd-backup
image: registry.k8s.io/etcd:3.5.12-0
command:
- /bin/sh
- /scripts/etcd-backup.sh
volumeMounts:
- name: etcd-certs
mountPath: /etc/kubernetes/pki/etcd
readOnly: true
- name: backup-scripts
mountPath: /scripts
- name: backup-storage
mountPath: /var/backups/etcd
volumes:
- name: etcd-certs
hostPath:
path: /etc/kubernetes/pki/etcd
- name: backup-scripts
configMap:
name: etcd-backup-scripts
defaultMode: 0755
- name: backup-storage
hostPath:
path: /var/backups/etcd
restartPolicy: OnFailure
tolerations:
- key: node-role.kubernetes.io/control-plane
operator: Exists
effect: NoSchedule
Now for the critical part: restoring etcd. This is the procedure you follow when your cluster is completely gone:
#!/bin/bash
# etcd-restore.sh - Restore etcd from a snapshot
# WARNING: This replaces ALL cluster state. Only use during disaster recovery.
set -euo pipefail
SNAPSHOT_FILE="${1:-}" # default to empty so the usage check below fires under set -u
DATA_DIR="/var/lib/etcd-restored"
if [ -z "${SNAPSHOT_FILE}" ]; then
echo "Usage: $0 <snapshot-file>"
exit 1
fi
echo "WARNING: This will replace ALL etcd data!"
echo "Snapshot: ${SNAPSHOT_FILE}"
echo "Press Ctrl+C to abort, or wait 10 seconds to continue..."
sleep 10
# Stop the kubelet (which manages etcd as a static pod)
systemctl stop kubelet
# Stop etcd if running
if crictl ps | grep -q etcd; then
  crictl stop $(crictl ps -q --name etcd)
fi
# Verify the snapshot integrity
echo "Verifying snapshot integrity..."
ETCDCTL_API=3 etcdctl snapshot status "${SNAPSHOT_FILE}" \
--write-out=table
# Restore the snapshot to a new data directory
ETCDCTL_API=3 etcdctl snapshot restore "${SNAPSHOT_FILE}" \
--data-dir="${DATA_DIR}" \
--name=control-plane-1 \
--initial-cluster=control-plane-1=https://10.0.1.10:2380 \
--initial-cluster-token=etcd-cluster-1 \
--initial-advertise-peer-urls=https://10.0.1.10:2380
# Back up the old data directory
TIMESTAMP=$(date +%Y%m%d-%H%M%S)
if [ -d /var/lib/etcd ]; then
mv /var/lib/etcd "/var/lib/etcd-old-${TIMESTAMP}"
fi
# Move the restored data into place
mv "${DATA_DIR}" /var/lib/etcd
# Fix ownership
chown -R etcd:etcd /var/lib/etcd 2>/dev/null || true
# Start kubelet (which will start etcd as a static pod)
systemctl start kubelet
echo "Waiting for etcd to become healthy..."
for i in $(seq 1 60); do
if ETCDCTL_API=3 etcdctl endpoint health \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key 2>/dev/null; then
echo "etcd is healthy!"
break
fi
echo "Waiting... ($i/60)"
sleep 5
done
echo "etcd restore completed. Verify cluster state with: kubectl get nodes"
One important note about etcd restores: when you restore from a snapshot, you get the cluster state at the time the snapshot was taken. Any resources created after the snapshot will be gone. This is why your RPO for cluster state is determined by your etcd snapshot frequency. If you snapshot every 6 hours, your worst-case data loss for cluster state is 6 hours of changes. However, if you are using GitOps (and you should be), you can re-apply all your manifests from the Git repository to bring the cluster back to current state.
Multi-region and multi-cluster strategies
For services that need very low RTO, you need your workloads running in multiple regions or clusters simultaneously. There are two main approaches:
Active-Active: Both regions serve traffic simultaneously. If one goes down, the other absorbs all traffic. This gives you the lowest possible RTO (just the time for DNS or load balancer health checks to detect the failure) but it is also the most complex to set up and operate.
Active-Passive: One region serves all traffic, the other is on standby. When the active region fails, you failover to the passive region. This is simpler but has a longer RTO because you need to detect the failure, make the failover decision, and potentially warm up the passive region.
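A useful sanity check before picking a strategy is to estimate the RTO floor that DNS-based active-passive failover can achieve: roughly the health-check detection time plus the record TTL that resolvers may still be caching. A quick back-of-the-envelope, using illustrative values (30 seconds and a threshold of 3 are the Route 53 health check defaults):

```shell
# Rough lower bound on active-passive DNS failover time: detection
# (health-check interval x consecutive failures) plus the record TTL that
# resolvers may still be caching. Values are illustrative.
HC_INTERVAL=30    # seconds between health checks
HC_THRESHOLD=3    # consecutive failures before the endpoint is marked unhealthy
RECORD_TTL=60     # seconds

DETECTION=$(( HC_INTERVAL * HC_THRESHOLD ))
WORST_CASE=$(( DETECTION + RECORD_TTL ))
echo "Detection: ${DETECTION}s, worst-case client failover: ${WORST_CASE}s"
```

With these numbers the floor is about two and a half minutes, which is why single-digit-minute RTO targets tend to push you toward active-active.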
Here is a DNS-based failover configuration using external-dns and health checks:
# multi-region/dns-failover.yaml
# Primary region health check
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
name: app-primary
namespace: default
annotations:
external-dns.alpha.kubernetes.io/aws-region: us-east-1
spec:
endpoints:
- dnsName: app.example.com
recordTTL: 60
recordType: A
targets:
- 10.0.1.100 # Primary region LB
setIdentifier: primary
providerSpecific:
- name: aws/failover
value: PRIMARY
- name: aws/health-check-id
value: "hc-primary-12345"
---
# Secondary region (failover target)
apiVersion: externaldns.k8s.io/v1alpha1
kind: DNSEndpoint
metadata:
name: app-secondary
namespace: default
annotations:
external-dns.alpha.kubernetes.io/aws-region: eu-west-1
spec:
endpoints:
- dnsName: app.example.com
recordTTL: 60
recordType: A
targets:
- 10.1.1.100 # Secondary region LB
setIdentifier: secondary
providerSpecific:
- name: aws/failover
value: SECONDARY
- name: aws/health-check-id
value: "hc-secondary-67890"
For multi-cluster management, here is a configuration sync setup using ArgoCD ApplicationSets:
# multi-region/argocd-multi-cluster.yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
name: critical-services
namespace: argocd
spec:
generators:
- clusters:
selector:
matchLabels:
environment: production
values:
region: "{{metadata.labels.region}}"
template:
metadata:
name: "critical-services-{{name}}"
spec:
project: production
source:
repoURL: https://github.com/example/k8s-manifests
targetRevision: main
path: "clusters/{{values.region}}/critical-services"
destination:
server: "{{server}}"
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
syncOptions:
- CreateNamespace=true
With this setup, ArgoCD automatically deploys your critical services to every production cluster. When you add a new cluster, the services get deployed automatically. This is where GitOps really shines for DR: your entire desired state is in Git, and ArgoCD ensures every cluster matches it.
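For the clusters generator above to discover a cluster at all, the cluster must be registered in ArgoCD with the matching labels. A sketch of what that registration secret might look like; the names, labels, and server URL are illustrative, and the credentials inside config are omitted:

```yaml
# Sketch of registering a secondary cluster so the generator above picks it up.
# Cluster name, labels, and server URL are illustrative.
apiVersion: v1
kind: Secret
metadata:
  name: prod-secondary-cluster
  namespace: argocd
  labels:
    argocd.argoproj.io/secret-type: cluster  # marks this secret as a cluster
    environment: production                  # matched by the generator selector
    region: eu-west-1                        # read as {{metadata.labels.region}}
type: Opaque
stringData:
  name: prod-secondary
  server: https://prod-secondary.example.com:6443
  config: |
    {
      "tlsClientConfig": {
        "insecure": false
      }
    }
```

ArgoCD discovers clusters through secrets labeled argocd.argoproj.io/secret-type: cluster, and the generator exposes any extra labels as template values.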
Database DR: cross-region PostgreSQL replication
Databases are usually the hardest part of disaster recovery because they hold state. For PostgreSQL, here is a setup using streaming replication with pgBackRest for cross-region backups:
# database/postgresql-dr.yaml
# Primary PostgreSQL configuration for DR
apiVersion: v1
kind: ConfigMap
metadata:
name: postgresql-dr-config
namespace: database
data:
postgresql.conf: |
# Replication settings for DR
wal_level = replica
max_wal_senders = 10
wal_keep_size = 1024 # Keep 1GB of WAL for replication lag tolerance
synchronous_commit = on
synchronous_standby_names = 'standby_eu_west'
# Archive settings for point-in-time recovery
archive_mode = on
archive_command = 'pgbackrest --stanza=main archive-push %p'
archive_timeout = 60 # Archive at least every 60 seconds
pg_hba.conf: |
# Replication access from secondary region
hostssl replication replicator 10.1.0.0/16 scram-sha-256
hostssl replication replicator 10.0.0.0/16 scram-sha-256
hostssl all all 10.0.0.0/8 scram-sha-256
pgbackrest.conf: |
[global]
repo1-type=s3
repo1-s3-bucket=pg-backups-primary
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.us-east-1.amazonaws.com
repo1-retention-full=4
repo1-retention-diff=14
# Cross-region backup for DR
repo2-type=s3
repo2-s3-bucket=pg-backups-dr
repo2-s3-region=eu-west-1
repo2-s3-endpoint=s3.eu-west-1.amazonaws.com
repo2-retention-full=4
repo2-retention-diff=14
[main]
pg1-path=/var/lib/postgresql/data
And the backup schedule:
# database/pgbackrest-backup-cronjob.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
name: pgbackrest-full-backup
namespace: database
spec:
schedule: "0 1 * * 0" # Full backup every Sunday at 1 AM
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: pgbackrest/pgbackrest:2.50
command:
- /bin/bash
- -c
- |
echo "Starting full backup to both repos..."
# Backup to primary region
pgbackrest --stanza=main --type=full --repo=1 backup
echo "Primary region backup complete"
# Backup to DR region
pgbackrest --stanza=main --type=full --repo=2 backup
echo "DR region backup complete"
# Verify both backups
pgbackrest --stanza=main --repo=1 info
pgbackrest --stanza=main --repo=2 info
restartPolicy: OnFailure
---
apiVersion: batch/v1
kind: CronJob
metadata:
name: pgbackrest-diff-backup
namespace: database
spec:
schedule: "0 */4 * * *" # Differential backup every 4 hours
jobTemplate:
spec:
template:
spec:
containers:
- name: backup
image: pgbackrest/pgbackrest:2.50
command:
- /bin/bash
- -c
- |
pgbackrest --stanza=main --type=diff --repo=1 backup
pgbackrest --stanza=main --type=diff --repo=2 backup
echo "Differential backups completed"
restartPolicy: OnFailure
The restore procedure for PostgreSQL when your primary is gone:
#!/bin/bash
# database/pg-dr-restore.sh
# Restore PostgreSQL from pgBackRest backup in DR region
set -euo pipefail
DR_REPO=2 # Use the DR region repository
TARGET_TIME="${1:-}" # Optional: point-in-time recovery target
echo "=== PostgreSQL DR Restore ==="
echo "Using repository: repo${DR_REPO} (DR region)"
# List available backups
echo "Available backups:"
pgbackrest --stanza=main --repo=${DR_REPO} info
# Stop PostgreSQL if running
pg_ctl stop -D /var/lib/postgresql/data -m fast 2>/dev/null || true
# Clear the data directory
rm -rf /var/lib/postgresql/data/*
if [ -n "${TARGET_TIME}" ]; then
echo "Restoring to point-in-time: ${TARGET_TIME}"
pgbackrest --stanza=main --repo=${DR_REPO} \
--type=time \
--target="${TARGET_TIME}" \
--target-action=promote \
restore
else
echo "Restoring latest backup..."
pgbackrest --stanza=main --repo=${DR_REPO} \
--type=default \
restore
fi
# Start PostgreSQL
pg_ctl start -D /var/lib/postgresql/data
# Wait for recovery to complete
echo "Waiting for recovery..."
until pg_isready; do
sleep 2
done
echo "PostgreSQL restored and ready"
# Verify data integrity
psql -c "SELECT count(*) as total_tables FROM information_schema.tables WHERE table_schema = 'public';"
psql -c "SELECT pg_size_pretty(pg_database_size(current_database())) as db_size;"
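Since streaming replication is what keeps the DR copy current, replication lag is effectively your live RPO for the standby and is worth monitoring continuously. A minimal sketch of evaluating lag against an RPO budget; the threshold is illustrative, and in practice the lag value would come from the standby (for example, via pg_last_xact_replay_timestamp()):

```shell
# Evaluate replication lag against an RPO budget. Illustrative sketch: in
# practice the lag value would come from the standby, for example:
#   psql -t -c "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())"
check_replication_lag() {
  local lag_seconds="$1" rpo_seconds="$2"
  if [ "${lag_seconds}" -gt "${rpo_seconds}" ]; then
    echo "ALERT: replication lag ${lag_seconds}s exceeds RPO budget ${rpo_seconds}s"
    return 1
  fi
  echo "OK: replication lag ${lag_seconds}s within RPO budget ${rpo_seconds}s"
}

# Example: 30s of lag checked against a 15-minute (900s) RPO
check_replication_lag 30 900
```

Alerting on this continuously means a failover decision never has to start with the question of how far behind the standby is.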
DR testing and drills
The best DR plan in the world is worthless if you have never tested it. DR drills are how you turn a theoretical plan into a proven capability. There are three levels of DR testing:
- Tabletop exercises: The team walks through the DR plan on paper. No actual systems are affected. This is good for finding gaps in documentation and communication plans.
- Component drills: You test individual components of the plan, like restoring a Velero backup or failing over DNS. This validates that the tools and procedures work.
- Full DR simulation: You simulate a complete disaster and execute the full recovery plan. This is the gold standard, and it is scary, which is exactly why you need to do it.
Here is a tabletop exercise template:
# dr-drills/tabletop-exercise.yaml
exercise:
name: "Q1 2026 DR Tabletop Exercise"
date: "2026-03-20"
duration: "2 hours"
facilitator: "Senior SRE"
participants:
- platform-team
- database-team
- application-team
- engineering-management
scenario:
description: |
At 2:30 AM on a Tuesday, the primary cloud region (us-east-1)
experiences a complete outage. All services in the region are
unreachable. The cloud provider estimates 4-6 hours for recovery.
Your payment-api is processing $5,000 per hour in transactions.
timeline:
- time: "T+0"
event: "PagerDuty fires alerts for all services in us-east-1"
question: "Who gets paged? What is the escalation path?"
- time: "T+5min"
event: "On-call engineer confirms the region is down"
question: "What is the first action? Who makes the failover decision?"
- time: "T+10min"
event: "Incident commander declares disaster, initiates DR plan"
question: "What communication goes out? To whom? Through which channels?"
- time: "T+15min"
event: "DR lead begins failover procedure"
question: "What are the exact steps? Walk through the runbook."
- time: "T+30min"
event: "DNS failover complete for tier-1 services"
question: "How do you verify services are healthy in the DR region?"
- time: "T+1hr"
event: "Tier-2 services restored from replicas"
question: "What data was lost? How do you reconcile?"
- time: "T+4hr"
event: "Primary region comes back online"
question: "Do you fail back immediately? What is the failback procedure?"
discussion_questions:
- "Where are the gaps in our current DR plan?"
- "Do we have all the access and credentials needed for DR?"
- "What would happen if the person who knows how to do X is unavailable?"
- "Are our backups actually restorable? When did we last test?"
- "What is our communication plan for customers?"
For live DR drills, here is a structured approach:
# dr-drills/live-drill-plan.yaml
drill:
name: "Q1 2026 Live DR Drill"
date: "2026-03-25"
time: "10:00 AM - 2:00 PM"
type: "component" # Options: tabletop, component, full
environment: "staging" # Always start with staging
pre_drill_checklist:
- "All participants confirmed and available"
- "Stakeholders notified about potential staging impact"
- "Monitoring dashboards open for staging environment"
- "Rollback procedures reviewed and ready"
- "DR region/cluster verified accessible"
- "Latest backups verified available"
- "Communication channels tested"
scenarios:
- name: "Velero backup restore"
objective: "Verify we can restore a namespace from Velero backup"
steps:
- "Delete the test-app namespace in staging"
- "Restore from latest Velero backup"
- "Verify all resources are recreated"
- "Verify the application is functional"
success_criteria:
- "All deployments running with correct replica count"
- "All services and ingresses recreated"
- "Application responds to health checks"
- "Persistent data is present and correct"
max_duration: "30 minutes"
- name: "etcd snapshot restore"
objective: "Verify we can restore etcd from a snapshot"
steps:
- "Take a fresh etcd snapshot"
- "Create some test resources (deployment, service, configmap)"
- "Restore from the snapshot (before the test resources)"
- "Verify test resources are gone (proving the restore worked)"
- "Verify pre-existing resources are intact"
success_criteria:
- "etcd restore completes without errors"
- "Cluster is functional after restore"
- "Test resources are absent (proving point-in-time restore)"
max_duration: "45 minutes"
- name: "Database failover"
objective: "Verify PostgreSQL failover to read replica"
steps:
- "Verify replication lag is zero"
- "Simulate primary failure (stop primary pod)"
- "Promote read replica to primary"
- "Update application connection strings"
- "Verify application writes succeed on new primary"
success_criteria:
- "Failover completes within RTO target"
- "No data loss (RPO target met)"
- "Application functions normally on new primary"
max_duration: "30 minutes"
post_drill:
- "Restore staging to normal state"
- "Document all findings"
- "Create issues for any failures or gaps found"
- "Update DR plan based on findings"
- "Share results with the broader team"
- "Schedule next drill"
You should also tie DR drills into your chaos engineering practice. A chaos experiment that simulates a zone failure is essentially a lightweight DR drill. If you are already running chaos experiments regularly (as we discussed in the chaos engineering article), you are building the muscle memory your team needs for real disasters.
Runbook for full cluster recovery
This is the big one: your cluster is gone and you need to rebuild from scratch. Here is a step-by-step runbook that covers the full recovery process:
# runbooks/full-cluster-recovery.yaml
runbook:
name: "Full Kubernetes Cluster Recovery"
version: "1.3"
last_tested: "2026-03-15"
estimated_time: "2-4 hours"
prerequisites:
- "Access to cloud provider console/CLI"
- "Access to etcd backup storage (S3)"
- "Access to Velero backup storage (S3)"
- "Access to GitOps repository"
- "Access to container registry"
- "DNS management access"
- "TLS certificates or cert-manager configuration"
phases:
- phase: 1
name: "Infrastructure provisioning"
estimated_time: "30-60 minutes"
steps:
- step: 1.1
action: "Provision new compute nodes"
command: |
# Using Terraform (assuming state is in remote backend)
cd infrastructure/terraform/kubernetes
terraform plan -var="cluster_name=prod-recovery"
terraform apply -auto-approve
verification: |
# Verify nodes are provisioned
kubectl get nodes
# Expected: all nodes in Ready state
- step: 1.2
action: "Verify networking"
command: |
# Check CNI is functional
kubectl run nettest --image=busybox --rm -it -- nslookup kubernetes.default
# Check external connectivity
kubectl run nettest --image=busybox --rm -it -- wget -qO- https://hub.docker.com
verification: "DNS resolution and external connectivity working"
- step: 1.3
action: "Verify storage provisioner"
command: |
kubectl get storageclass
# Create a test PVC
kubectl apply -f - <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: test-pvc
spec:
accessModes: [ReadWriteOnce]
resources:
requests:
storage: 1Gi
EOF
kubectl get pvc test-pvc
verification: "PVC transitions to Bound state"
- phase: 2
    name: "Core infrastructure recovery"
    estimated_time: "20-30 minutes"
    steps:
      - step: 2.1
        action: "Restore etcd from backup (if applicable)"
        command: |
          # Download latest snapshot from S3
          aws s3 cp s3://etcd-backups-prod/latest/etcd-snapshot.db /tmp/
          # Verify snapshot
          ETCDCTL_API=3 etcdctl snapshot status /tmp/etcd-snapshot.db
          # Restore (see etcd-restore.sh)
          bash /scripts/etcd-restore.sh /tmp/etcd-snapshot.db
        verification: "kubectl get nodes returns expected node list"
      - step: 2.2
        action: "Install ArgoCD"
        command: |
          kubectl create namespace argocd
          kubectl apply -n argocd -f \
            https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml
          # Wait for ArgoCD to be ready
          kubectl wait --for=condition=available deployment/argocd-server \
            -n argocd --timeout=300s
          # Configure the GitOps repository
          argocd repo add https://github.com/example/k8s-manifests \
            --username git --password "${GIT_TOKEN}"
        verification: "ArgoCD UI accessible, repository connected"
      - step: 2.3
        action: "Deploy cert-manager"
        command: |
          helm repo add jetstack https://charts.jetstack.io
          helm install cert-manager jetstack/cert-manager \
            --namespace cert-manager --create-namespace \
            --set installCRDs=true
          # Apply ClusterIssuer
          kubectl apply -f manifests/cert-manager/cluster-issuer.yaml
        verification: "cert-manager pods running, ClusterIssuer ready"
      - step: 2.4
        action: "Deploy ingress controller"
        command: |
          helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
          helm install ingress-nginx ingress-nginx/ingress-nginx \
            --namespace ingress-nginx --create-namespace \
            --values manifests/ingress-nginx/values.yaml
        verification: "Ingress controller has external IP assigned"
  - phase: 3
    name: "Data recovery"
    estimated_time: "30-60 minutes"
    steps:
      - step: 3.1
        action: "Restore databases from backup"
        command: |
          # Deploy PostgreSQL operator
          kubectl apply -f manifests/database/operator.yaml
          # Wait for operator
          kubectl wait --for=condition=available deployment/postgres-operator \
            --timeout=300s
          # Restore from pgBackRest backup
          bash /scripts/pg-dr-restore.sh
        verification: |
          psql -c "SELECT count(*) FROM users;"
          # Compare with expected count from backup manifest
      - step: 3.2
        action: "Restore Velero and recover persistent volumes"
        command: |
          # Install Velero
          velero install --provider aws ...
          # Restore critical namespaces
          velero restore create dr-critical \
            --from-backup critical-services-hourly-latest
          # Verify restore
          velero restore describe dr-critical
        verification: "All PVCs bound, data verified"
  - phase: 4
    name: "Application recovery"
    estimated_time: "30-45 minutes"
    steps:
      - step: 4.1
        action: "Sync all ArgoCD applications"
        command: |
          # Apply the app-of-apps pattern
          kubectl apply -f manifests/argocd/app-of-apps.yaml
          # Force sync all applications
          argocd app sync --all --prune
          # Wait for all apps to be healthy
          argocd app wait --all --health --timeout 600
        verification: "All ArgoCD applications in Synced and Healthy state"
      - step: 4.2
        action: "Verify tier-1 services"
        command: |
          # Check payment-api
          curl -f https://payment-api.example.com/health
          # Check auth-service
          curl -f https://auth.example.com/health
          # Run integration tests against recovered services
          ./scripts/integration-tests.sh --target=production
        verification: "All health checks passing, integration tests green"
      - step: 4.3
        action: "Verify tier-2 and tier-3 services"
        command: |
          # Check all remaining services
          for svc in user-api notifications blog docs; do
            curl -f "https://${svc}.example.com/health" || echo "WARN: ${svc} not ready"
          done
        verification: "All services responding"
  - phase: 5
    name: "DNS and traffic cutover"
    estimated_time: "10-15 minutes"
    steps:
      - step: 5.1
        action: "Update DNS to point to recovered cluster"
        command: |
          # Update Route53 records
          aws route53 change-resource-record-sets \
            --hosted-zone-id Z1234567890 \
            --change-batch file://dns-changes.json
          # Verify DNS propagation
          for domain in app auth payment-api; do
            dig +short ${domain}.example.com
          done
        verification: "DNS resolving to new cluster IPs"
      - step: 5.2
        action: "Gradually increase traffic"
        command: |
          # If using weighted routing, gradually shift traffic
          # Start with 10%, then 50%, then 100%
          aws route53 change-resource-record-sets \
            --hosted-zone-id Z1234567890 \
            --change-batch '{"Changes":[{"Action":"UPSERT","ResourceRecordSet":{"Name":"app.example.com","Type":"A","SetIdentifier":"recovered","Weight":10,"TTL":60,"ResourceRecords":[{"Value":"NEW_IP"}]}}]}'
        verification: "Traffic flowing to recovered cluster, no errors"
  - phase: 6
    name: "Post-recovery validation"
    estimated_time: "30 minutes"
    steps:
      - step: 6.1
        action: "Run full smoke test suite"
        command: |
          ./scripts/smoke-tests.sh --environment=production
        verification: "All smoke tests passing"
      - step: 6.2
        action: "Verify monitoring and alerting"
        command: |
          # Check Prometheus is scraping
          curl -s http://prometheus:9090/api/v1/targets | jq '.data.activeTargets | length'
          # Verify Grafana dashboards
          curl -f http://grafana:3000/api/health
          # Check alert rules are loaded
          curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups | length'
        verification: "Monitoring stack fully operational"
      - step: 6.3
        action: "Document recovery results"
        command: |
          # Create a post-recovery report
          echo "Recovery completed at: $(date)"
          echo "Total recovery time: X hours Y minutes"
          echo "Data loss window: etcd snapshot age + WAL gap"
          echo "Services recovered: all / partial"
          echo "Issues encountered: ..."
        verification: "Report shared with stakeholders"
The runbook is long, and it should be. Every step includes a verification check because during a disaster you cannot afford to skip ahead and hope things work: each step must be confirmed before moving on to the next.
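That "verify before proceeding" rule is easier to follow when it is mechanical rather than a matter of discipline under stress. Here is a minimal sketch of a runbook step executor in bash; the step names and commands below are placeholders for illustration, not the real runbook commands:

```shell
#!/usr/bin/env bash
# Sketch: run each runbook step's command, then its verification,
# and refuse to continue if either fails.
set -euo pipefail

run_step() {
  local name="$1" cmd="$2" verify="$3"
  echo "== ${name} =="
  if ! bash -c "${cmd}"; then
    echo "FAIL: ${name} command failed, stopping runbook" >&2
    return 1
  fi
  if ! bash -c "${verify}"; then
    echo "FAIL: ${name} verification failed, stopping runbook" >&2
    return 1
  fi
  echo "OK: ${name} verified"
}

# Example usage with placeholder commands (real steps would call
# kubectl, velero, etc. and verify their actual output):
run_step "2.2 install-argocd" "echo applying manifests" "echo argocd responds" || exit 1
```

Driving the steps through a tiny wrapper like this also gives you a timestamped log of exactly what ran and what was verified, which feeds directly into the post-recovery report.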
Communication during disasters
Communication is often the weakest link during a disaster. People are stressed, multiple teams are involved, and customers are impacted. Having pre-written communication templates saves valuable time and ensures nothing important gets missed.
Here is a set of communication templates:
# communication/disaster-templates.yaml
templates:
  internal_declaration:
    channel: "#incident-war-room"
    template: |
      @here DISASTER DECLARED - DR Plan Activated
      What happened: [Brief description of the failure]
      Impact: [Which services are affected]
      Severity: [SEV-1]
      Incident Commander: [Name]
      DR Lead: [Name]
      Communications Lead: [Name]
      Current status: Executing DR plan phase 1 (infrastructure provisioning)
      Expected recovery time: [X hours based on RTO targets]
      War room: [Link to video call]
      Status page: https://status.example.com
      DR runbook: [Link to runbook]
      Updates will be posted every 15 minutes in this channel.
  customer_initial:
    channel: "status page"
    template: |
      Title: Service Disruption - [Affected Services]
      Status: Investigating
      We are currently experiencing a disruption affecting
      [list affected services]. Our team has been engaged and is
      actively working on recovery.
      We will provide an update within 30 minutes.
      Affected services:
      - [Service 1]: [Status]
      - [Service 2]: [Status]
  customer_update:
    channel: "status page"
    template: |
      Title: Service Disruption - Update
      Status: Identified / Recovering
      Update: We have identified the issue as [brief, non-technical
      description]. Our team is executing our disaster recovery plan.
      Current progress:
      - Infrastructure: [Restored / In progress]
      - Critical services: [Restored / In progress]
      - All services: [Restored / In progress]
      Estimated time to full recovery: [X hours]
      Next update: [Time]
  customer_resolved:
    channel: "status page"
    template: |
      Title: Service Disruption - Resolved
      Status: Resolved
      The service disruption that began at [start time] has been
      fully resolved as of [resolution time].
      Root cause: [Brief, non-technical description]
      Duration: [X hours Y minutes]
      Data impact: [None / Transactions between X and Y may need review]
      We will be publishing a detailed post-incident report within
      5 business days. We apologize for the disruption and are taking
      steps to prevent similar issues in the future.
  internal_update_cadence:
    description: "How often to post updates during DR"
    schedule:
      - phase: "First hour"
        frequency: "Every 15 minutes"
      - phase: "Hours 2-4"
        frequency: "Every 30 minutes"
      - phase: "After hour 4"
        frequency: "Every hour"
      - phase: "Post-recovery"
        frequency: "Final summary within 1 hour of resolution"
A few key points about disaster communication:
- Do not wait until you have all the answers to communicate. “We are aware of the issue and investigating” is infinitely better than silence.
- Use pre-written templates. During a disaster, your brain is not at its best. Templates prevent you from forgetting important details or saying the wrong thing.
- Separate internal and external communication. Internal messages can be technical and detailed. External messages should be clear, non-technical, and empathetic.
- Set a cadence and stick to it. Saying “next update in 30 minutes” and then going silent for 2 hours destroys trust. If you have nothing new to say, post “No significant change, still working on recovery.”
- Assign a dedicated communications person. The people doing the recovery should not also be writing status page updates. Split those responsibilities.
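To make the templates above usable under pressure, the communications lead should only have to fill in facts, never write prose. A minimal sketch in bash of filling the bracketed placeholders (the placeholder names here are illustrative, not a fixed schema):

```shell
#!/usr/bin/env bash
# Sketch: substitute [KEY] placeholders in a communication template.
# Each argument after the template is a KEY=value pair.
set -euo pipefail

render_template() {
  local out="$1"; shift
  local pair key value
  for pair in "$@"; do
    key="${pair%%=*}"
    value="${pair#*=}"
    # Replace every occurrence of the literal [KEY] token
    out="${out//\[${key}\]/${value}}"
  done
  printf '%s\n' "${out}"
}

# Example with a fragment of the internal declaration template:
tmpl='DISASTER DECLARED - Incident Commander: [IC], Expected recovery: [ETA]'
render_template "${tmpl}" "IC=Jane Doe" "ETA=2 hours"
# prints: DISASTER DECLARED - Incident Commander: Jane Doe, Expected recovery: 2 hours
```

The rendered text can then be piped to a chat webhook or status page API, so posting an update takes seconds instead of minutes.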
Putting it all together: a DR maturity model
Just as we defined maturity levels in the chaos engineering article, here is a maturity model for disaster recovery:
- Level 0 - Hope: No DR plan, no backups, no idea what would happen. (Surprisingly common)
- Level 1 - Documented: DR plan exists on paper but has never been tested. Backups exist but have never been restored.
- Level 2 - Tested components: Individual DR components (backup restore, DNS failover) have been tested. Tabletop exercises completed.
- Level 3 - Drilled: Full DR simulations have been run. The team has practiced the entire recovery process. RTO and RPO targets have been validated.
- Level 4 - Automated: DR failover is automated and can be triggered with a single command. Regular automated DR tests validate the plan continuously.
Most teams are at Level 1 or Level 2. Getting to Level 3 is where the real confidence comes from. You do not need full automation (Level 4) to be prepared, but you absolutely need to have practiced the process at least once.
Closing notes
Disaster recovery is not glamorous work. Nobody gets excited about writing backup scripts and communication templates. But when disaster strikes, and it will eventually, the difference between a team that has practiced recovery and a team that has not is the difference between a few hours of downtime and a catastrophic, company-threatening event.
The key takeaways from this article are:
- Define RPO and RTO targets based on business impact, not technical convenience.
- Back up everything and store backups in a different region than your primary infrastructure.
- Test your backups regularly. A backup that has never been restored is not a backup.
- Write detailed runbooks with verification steps for every action.
- Practice, practice, practice. Run DR drills at least quarterly.
- Prepare communication templates before you need them.
Start small. If you have no DR plan today, start by setting up Velero backups and etcd snapshots. Then write a basic runbook. Then test it. Then iterate. Each step makes you more prepared than you were before, and being even slightly prepared is infinitely better than not being prepared at all.
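For the Velero half of that first step, a single scheduled backup is enough to move you off Level 0. A minimal sketch of a Velero Schedule manifest (the name, scope, and retention below are assumptions to adjust for your environment):

```yaml
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-backup
  namespace: velero
spec:
  schedule: "0 2 * * *"        # daily at 02:00 UTC, cron syntax
  template:
    includedNamespaces: ["*"]  # back up everything to start; narrow later
    ttl: 720h                  # keep backups for 30 days
```

Apply it with kubectl, confirm the first backup completes with `velero backup get`, and then, before trusting it, restore it into a scratch cluster at least once.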
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and its change history here