SRE: Chaos Engineering, Breaking Things on Purpose

2026-03-02 | Gabriel Garrido | 15 min read


Introduction

In the previous articles we covered SLIs and SLOs, incident management, and observability. You have metrics, alerts, traces, runbooks, and postmortem processes. But how do you know any of it actually works before a real incident hits?


That is where chaos engineering comes in. The idea is simple: intentionally inject failures into your system to verify that your resilience mechanisms, monitoring, alerting, and incident response processes work as expected. It is like a fire drill, but for your infrastructure.


In this article we will cover the principles of chaos engineering, how to set up Litmus and Chaos Mesh in Kubernetes, how to plan and run game days, and how to build a culture where breaking things on purpose is not just accepted but encouraged.


Let’s get into it.


Why break things on purpose?

Complex systems fail in complex ways. You cannot predict every failure mode by reading code or architecture diagrams. The only way to truly understand how your system behaves under failure is to actually make it fail.


Chaos engineering helps you:


  • Discover unknown failure modes before they bite you in production at 3am
  • Validate your monitoring and alerting: does your SLO alert actually fire when latency spikes?
  • Test your runbooks: can the on-call engineer actually follow them under pressure?
  • Build confidence: knowing your system can handle a pod crash or network partition makes you sleep better
  • Reduce MTTR: practicing incident response makes you faster when real incidents happen

The Netflix engineering team, which pioneered chaos engineering with Chaos Monkey, put it best: “The best way to avoid failure is to fail constantly.”


The chaos engineering process

Chaos engineering is not just randomly killing pods. It is a disciplined process:


  1. Define steady state: What does “normal” look like? Use your SLIs (from article 1) as the baseline.
  2. Hypothesize: “If we kill one pod, the remaining pods should handle the load and the SLO should not be violated.”
  3. Inject failure: Actually kill the pod (or whatever failure you are testing).
  4. Observe: Watch your metrics, traces, and logs. Did the system behave as expected?
  5. Learn: If it did not behave as expected, you found a weakness. Fix it before a real failure finds it for you.

Always start small. Kill one pod, not the whole deployment. Add 100ms of latency, not 30 seconds. The goal is controlled experiments, not uncontrolled chaos.
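The five steps above are scriptable. Here is a minimal sketch in shell, assuming a Prometheus endpoint at http://prometheus:9090 and the sli:availability:ratio_rate5m recording rule from article 1 (all names are illustrative, adjust to your setup):

```shell
#!/bin/sh
# Sketch of the five-step chaos loop. within_slo is the core logic:
# a float comparison done with awk so it works in any POSIX shell.

# Returns 0 when the measured value meets or exceeds the target.
within_slo() {
  awk -v v="$1" -v t="$2" 'BEGIN { exit (v >= t) ? 0 : 1 }'
}

# Pull the current SLI value out of the Prometheus JSON response.
current_sli() {
  curl -s "http://prometheus:9090/api/v1/query?query=sli:availability:ratio_rate5m" \
    | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)".*/\1/p'
}

run_experiment() {
  # 1-2. Steady state + hypothesis: refuse to inject if already unhealthy
  within_slo "$(current_sli)" 0.999 || { echo "not in steady state, aborting"; return 1; }
  # 3. Inject: apply the experiment (time-bounded by its own duration)
  kubectl apply -f chaos/pod-kill.yaml
  # 4. Observe: wait out the experiment, then re-measure
  sleep 90
  # 5. Learn: a FAIL here is a weakness found before production found it
  if within_slo "$(current_sli)" 0.999; then
    echo "PASS: SLO held during chaos"
  else
    echo "FAIL: SLO violated, investigate before re-running"
  fi
}
```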


Chaos Mesh: chaos engineering for Kubernetes

Chaos Mesh is a CNCF project that provides a comprehensive set of chaos experiments for Kubernetes. It is easy to install and has a nice web UI for managing experiments.


Install it with Helm:


helm repo add chaos-mesh https://charts.chaos-mesh.org
helm repo update

helm install chaos-mesh chaos-mesh/chaos-mesh \
  --namespace chaos-mesh \
  --create-namespace \
  --set chaosDaemon.runtime=containerd \
  --set chaosDaemon.socketPath=/run/containerd/containerd.sock

Now let’s define some experiments. All experiments are Kubernetes custom resources, so they fit perfectly into a GitOps workflow with ArgoCD.


1. Pod failure: kill a random pod


# chaos/pod-kill.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: tr-web-pod-kill
  namespace: default
spec:
  action: pod-kill
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  # Note: this in-spec scheduler is the Chaos Mesh 1.x syntax; in 2.x it
  # was removed in favor of wrapping the experiment in a Schedule resource
  scheduler:
    cron: "@every 2h"  # Kill a pod every 2 hours
  duration: "60s"

This kills one random tr-web pod every 2 hours. If your deployment has multiple replicas and a proper readiness probe, users should not notice anything. If they do, you found a problem.
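That "proper readiness probe" is doing a lot of work in the previous sentence. For reference, a minimal sketch of what the probe on the tr-web Deployment might look like (the /health path and port 4000 are assumptions, use whatever your app exposes):

```yaml
# Excerpt from the tr-web Deployment's container spec
readinessProbe:
  httpGet:
    path: /health   # assumed health endpoint
    port: 4000      # assumed app port
  initialDelaySeconds: 5
  periodSeconds: 10
  failureThreshold: 3
```

Without this, the Service keeps routing traffic to the replacement pod before it can serve requests, and the experiment will (correctly) flag user-visible errors.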


2. Network latency: add artificial delay


# chaos/network-delay.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: tr-web-network-delay
  namespace: default
spec:
  action: delay
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  delay:
    latency: "200ms"
    jitter: "50ms"
    correlation: "25"
  direction: to
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: postgresql
    mode: all
  duration: "5m"

This adds 200ms of latency (with 50ms jitter) between your web pods and the database for 5 minutes. This is incredibly useful for testing timeout configurations and retry logic.
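A quick way to see the delay from the user's side is to sample end-to-end latency while the experiment runs. A small helper, assuming the service is reachable in-cluster at http://tr-web.default.svc (an assumed URL):

```shell
#!/bin/sh
# Print one end-to-end latency sample per line while chaos runs.
# Usage: sample_latency <url> [samples] [interval_seconds]
sample_latency() {
  url=$1 samples=${2:-10} interval=${3:-5}
  for _ in $(seq 1 "$samples"); do
    # %{time_total} is curl's total transfer time in seconds
    curl -s -o /dev/null -w "%{time_total}\n" "$url"
    sleep "$interval"
  done
}
```

Run it before applying chaos/network-delay.yaml to get a baseline, then again during the experiment; the delta should be roughly your injected latency if the request path hits the database once per request.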


3. Network partition: isolate a service


# chaos/network-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: tr-web-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: postgresql
    mode: all
  duration: "2m"

This completely cuts network traffic between your web pods and the database. Does your app crash? Does it show a friendly error page? Does it recover when the network comes back? These are important questions.


4. CPU stress: simulate resource contention


# chaos/cpu-stress.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: tr-web-cpu-stress
  namespace: default
spec:
  mode: one
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  stressors:
    cpu:
      workers: 2
      load: 80
  duration: "5m"

This burns 80% CPU in one pod. With proper resource limits and HPA, your cluster should handle this gracefully.
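"Proper resource limits and HPA" assumes those exist. A minimal HPA sketch for reference (the 3-6 replica range and 70% target are placeholders; scaling on CPU also requires requests to be set on the container):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tr-web
  namespace: default
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: tr-web
  minReplicas: 3
  maxReplicas: 6
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

During the stress experiment you should see the HPA scale out, then settle back once the 5 minutes are up.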


5. DNS failure: break name resolution


# chaos/dns-failure.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: DNSChaos
metadata:
  name: tr-web-dns-failure
  namespace: default
spec:
  action: error
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
  patterns:
    - "api.github.com"
  duration: "5m"

This makes DNS resolution fail for api.github.com from your web pods. Remember how we fixed the GitHub sponsors API issue with a dedicated Hackney pool? This experiment verifies that the fix actually works: database connections should not be affected even when GitHub is unreachable.


Litmus: experiment workflows

Litmus is another CNCF chaos engineering project that focuses on experiment workflows. While Chaos Mesh is great for individual experiments, Litmus excels at orchestrating multi-step chaos scenarios.


Install Litmus:


helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm
helm repo update

helm install litmus litmuschaos/litmus \
  --namespace litmus \
  --create-namespace

A Litmus workflow lets you chain multiple chaos experiments together with validation steps:


# litmus/workflow-resilience-test.yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: tr-web-resilience-test
  namespace: litmus
spec:
  entrypoint: resilience-test
  templates:
    - name: resilience-test
      steps:
        # Step 1: Verify steady state
        - - name: verify-baseline
            template: check-slo

        # Step 2: Kill a pod
        - - name: pod-kill
            template: pod-kill-experiment

        # Step 3: Verify SLO is still met
        - - name: verify-after-pod-kill
            template: check-slo

        # Step 4: Add network latency
        - - name: network-delay
            template: network-delay-experiment

        # Step 5: Verify latency SLO
        - - name: verify-after-delay
            template: check-latency-slo

        # Step 6: Clean up and final check
        - - name: final-verification
            template: check-slo

    - name: check-slo
      container:
        # curlimages/curl does not ship jq or bc, and `(( ))` is a
        # bashism, so parse with sed and compare with awk instead
        image: curlimages/curl:latest
        command:
          - /bin/sh
          - -c
          - |
            # Query Prometheus for the current SLI
            AVAILABILITY=$(curl -s "http://prometheus:9090/api/v1/query?query=sli:availability:ratio_rate5m" \
              | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)".*/\1/p')

            echo "Current availability SLI: $AVAILABILITY"

            if awk -v v="$AVAILABILITY" 'BEGIN { exit (v < 0.999) ? 0 : 1 }'; then
              echo "FAIL: Availability below SLO target"
              exit 1
            fi

            echo "PASS: Availability within SLO target"

    - name: check-latency-slo
      container:
        # Same trick as above: no jq/bc in this image, use sed/awk
        image: curlimages/curl:latest
        command:
          - /bin/sh
          - -c
          - |
            LATENCY=$(curl -s "http://prometheus:9090/api/v1/query?query=sli:latency:ratio_rate5m" \
              | sed -n 's/.*"value":\[[^,]*,"\([^"]*\)".*/\1/p')

            echo "Current latency SLI: $LATENCY"

            # During chaos, we allow a slightly relaxed SLO
            if awk -v v="$LATENCY" 'BEGIN { exit (v < 0.95) ? 0 : 1 }'; then
              echo "FAIL: Latency severely degraded during chaos"
              exit 1
            fi

            echo "PASS: Latency within acceptable range during chaos"

    - name: pod-kill-experiment
      container:
        image: litmuschaos/litmus-checker:latest
        # ... pod kill configuration

    - name: network-delay-experiment
      container:
        image: litmuschaos/litmus-checker:latest
        # ... network delay configuration

This workflow verifies that your service stays within SLO targets even while being subjected to chaos. If any verification step fails, you know you have a resilience gap to fix.


Game days: structured chaos

A game day is a scheduled event where the team intentionally injects failures and practices incident response. It is like a fire drill, but everyone knows it is happening (mostly).


Here is how to plan and run a game day:


Before the game day (1 week ahead)


  • Choose a date and time (during business hours, never on a Friday)
  • Define the scenarios you want to test (2-3 per game day, no more)
  • Notify stakeholders that things might break
  • Assign roles: facilitator, chaos operator, observers
  • Prepare the experiments (have the YAML files ready)
  • Review runbooks for the scenarios you will test

Game day checklist template:


# game-days/2026-02-25-checklist.md
# Game Day: February 25, 2026

## Pre-game
- [ ] All participants confirmed
- [ ] Stakeholders notified
- [ ] Monitoring dashboards open
- [ ] Runbooks accessible
- [ ] Rollback procedures ready
- [ ] Communication channel created (#gameday-2026-02-25)

## Scenario 1: Pod failure recovery
- **Hypothesis**: Killing 1 of 3 tr-web pods should not cause any user-visible errors
- **Experiment**: `chaos/pod-kill.yaml`
- **Success criteria**: Availability SLI stays above 99.9%
- **Duration**: 10 minutes
- **Results**: [ PASS / FAIL ]
- **Notes**: ___

## Scenario 2: Database latency spike
- **Hypothesis**: 200ms extra latency to DB should trigger the latency SLO alert but not the availability alert
- **Experiment**: `chaos/network-delay.yaml`
- **Success criteria**: Latency alert fires within 5 minutes, app remains functional
- **Duration**: 15 minutes
- **Results**: [ PASS / FAIL ]
- **Notes**: ___

## Scenario 3: External dependency failure
- **Hypothesis**: GitHub API being unreachable should not affect blog page load times
- **Experiment**: `chaos/dns-failure.yaml`
- **Success criteria**: Blog pages load normally, only sponsor section is empty
- **Duration**: 10 minutes
- **Results**: [ PASS / FAIL ]
- **Notes**: ___

## Post-game
- [ ] All experiments cleaned up
- [ ] Systems back to steady state
- [ ] Game day retro completed
- [ ] Action items created as GitHub issues
- [ ] Results shared with the team

During the game day


  • The facilitator keeps time and coordinates
  • The chaos operator applies experiments
  • Observers watch dashboards and logs (using the observability stack from article 3)
  • The on-call engineer responds as if it were a real incident
  • Everyone takes notes

After the game day


Run a retro (just like a postmortem but for the exercise). What worked? What did not? What surprised you? Create action items for anything that needs fixing.


Steady state validation with automated chaos

Once you are comfortable with game days, you can start running automated chaos experiments in production. This is the advanced level of chaos engineering.


The key is to tie chaos experiments to your SLO monitoring. If an experiment causes an SLO violation, it stops automatically:


# chaos/continuous-chaos.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: continuous-pod-kill
  namespace: default
spec:
  schedule: "0 */4 * * *"  # Every 4 hours
  type: PodChaos
  historyLimit: 5
  concurrencyPolicy: Forbid
  podChaos:
    action: pod-kill
    mode: one
    selector:
      namespaces:
        - default
      labelSelectors:
        app: tr-web
    duration: "30s"

Combine this with an Alertmanager silence that suppresses the chaos-related page alert, but still tracks the SLO impact:


# Only silence the page alert, not the SLO recording
# This way you can see the SLO impact without getting paged
amtool silence add --alertmanager.url=http://alertmanager:9093 \
  --author="chaos-bot" \
  --comment="Scheduled chaos experiment" \
  --duration="5m" \
  alertname="TrWebPodKilled"
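To keep the silence and the experiment in lockstep, a small wrapper can derive the silence window from the experiment's duration plus a buffer. A sketch, reusing the chaos-bot and TrWebPodKilled names from above (the helper names are made up):

```shell
#!/bin/sh
# Silence the page alert for the blast window, then apply the chaos.

# Convert an experiment duration like "30s" or "5m" into a silence
# duration with a 2-minute buffer so the silence outlives the chaos.
silence_duration() {
  case "$1" in
    *s) echo "$(( ${1%s} / 60 + 2 ))m" ;;
    *m) echo "$(( ${1%m} + 2 ))m" ;;
    *)  echo "10m" ;;  # conservative fallback for unknown formats
  esac
}

run_silenced() {
  exp_file=$1 exp_duration=$2
  amtool silence add --alertmanager.url=http://alertmanager:9093 \
    --author="chaos-bot" \
    --comment="Scheduled chaos experiment ($exp_file)" \
    --duration="$(silence_duration "$exp_duration")" \
    alertname="TrWebPodKilled"
  kubectl apply -f "$exp_file"
}

# Example: run_silenced chaos/pod-kill.yaml 30s
```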

Chaos engineering for Elixir/BEAM applications

The BEAM VM has some unique characteristics that affect chaos engineering:


Supervision trees handle many failures automatically. When you kill an Elixir process, the supervisor restarts it. This is great for resilience but means you need to test harder failures (like network partitions or resource exhaustion) to find real issues.


Hot code reloading can mask deployment issues. If your app uses hot code reloading in production, you should also test cold restarts.


Distribution (Erlang clustering) is sensitive to network issues. If your nodes are clustered (like our app with RELEASE_DISTRIBUTION=name), test what happens when nodes lose connectivity:


# chaos/cluster-partition.yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: NetworkChaos
metadata:
  name: beam-cluster-partition
  namespace: default
spec:
  action: partition
  mode: all
  selector:
    namespaces:
      - default
    labelSelectors:
      app: tr-web
    fieldSelectors:
      metadata.name: tr-web-0
  direction: both
  target:
    selector:
      namespaces:
        - default
      labelSelectors:
        app: tr-web
      fieldSelectors:
        metadata.name: tr-web-1
    mode: all
  duration: "5m"

This partitions two nodes of your Erlang cluster. Does your app handle the netsplit gracefully? Does it recover when connectivity returns? These are important questions for clustered BEAM applications.
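To answer those questions with data rather than vibes, check cluster membership from each node while the partition is active. A sketch, assuming a mix release named tr_web whose rpc subcommand is available (both are assumptions, adjust to your release):

```shell
#!/bin/sh
# Print the Erlang nodes a given pod is currently connected to.
connected_nodes() {
  pod=$1
  kubectl exec -n default "$pod" -- bin/tr_web rpc "IO.inspect(Node.list())"
}

# Before, during, and after the partition:
#   connected_nodes tr-web-0
#   connected_nodes tr-web-1
```

While the partition is active, each node should stop listing the other; once the experiment ends, libcluster (or whatever reconnects your nodes) should bring the list back.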


What to test first

If you are just starting with chaos engineering, here is a prioritized list:


  1. Single pod failure: Can your service handle losing one instance? (This is the minimum)
  2. Dependency timeout: What happens when an external service responds slowly?
  3. DNS failure: Can your app handle name resolution failures gracefully?
  4. Resource exhaustion: What happens when you hit CPU or memory limits?
  5. Network partition: Can your service handle being cut off from a dependency?
  6. Disk pressure: What happens when disk space runs low?
  7. Clock skew: What happens when time drifts between nodes?

Start with #1 and work your way down. Each experiment should be repeated regularly, not just once.


Safety guardrails

Chaos engineering can go wrong if you are not careful. Here are non-negotiable safety rules:


  • Always have a kill switch. Every experiment must be stoppable immediately.
  • Start in staging. Never run a new experiment in production for the first time.
  • Blast radius control. Affect one pod, not all pods. One service, not all services.
  • Time-bounded. Every experiment has a duration. No open-ended chaos.
  • Monitor continuously. If SLOs are violated beyond acceptable thresholds, abort.
  • Business hours only (for manual experiments). Do not do game days on Fridays at 5pm.
  • Communicate. Everyone who needs to know should know that chaos is happening.

Putting it all together

Here is the chaos engineering maturity model:


  • Level 0 - No chaos: You hope things work. (Most teams start here)
  • Level 1 - Manual game days: Quarterly game days with pre-planned scenarios
  • Level 2 - Automated chaos in staging: Regular chaos experiments run automatically in staging
  • Level 3 - Automated chaos in production: Continuous chaos in production with SLO-based guardrails
  • Level 4 - Chaos as CI: Chaos experiments run as part of your deployment pipeline

You do not need to reach Level 4 to get value. Even Level 1 (quarterly game days) will dramatically improve your team’s confidence and incident response speed.


Closing notes

Chaos engineering is not about breaking things for fun. It is about building confidence that your systems can handle the failures that will inevitably occur. Every experiment that passes tells you “this failure mode is handled.” Every experiment that fails tells you “fix this before a real failure finds it.”


The tools we covered (Chaos Mesh, Litmus, game day checklists) are all free and work great in Kubernetes. Start with a simple pod-kill experiment in staging and build from there. The hardest part is not the tooling; it is getting organizational buy-in to intentionally break things. But once you show the team the first bug you found through chaos, they will be convinced.


This wraps up our four-part SRE series. We went from measuring reliability (SLIs/SLOs) to responding to failures (incident management) to seeing what is happening (observability) to proactively finding weaknesses (chaos engineering). Together, these practices give you a solid foundation for running reliable systems.


Hope you found this useful and enjoyed reading it, until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

Also, you can check the source code and changes in the sources here


