SRE: Release Engineering and Progressive Delivery
Introduction
Throughout this SRE series we have covered a lot of ground: SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, dependency management, and database reliability. We have SLOs, alerts, runbooks, observability pipelines, chaos experiments, and GitOps workflows. But none of that matters if your deployments keep causing outages.
Deployments are the number one cause of incidents in most organizations. Every time you push new code to production, you are introducing change, and change is where failures live. Release engineering is the discipline of making deployments safe, predictable, and boring. Progressive delivery takes that further by gradually rolling out changes to small subsets of users, validating at each step, and automatically rolling back when something goes wrong.
In this article we will cover canary deployments with Argo Rollouts, blue-green deployments, feature flags in Elixir, automatic rollback, deployment SLOs, ArgoCD sync hooks, GitOps-driven releases, and release cadence policies.
Let’s get into it.
Canary deployments with Argo Rollouts
A canary deployment sends a small percentage of traffic to the new version first. If the canary stays healthy, you gradually increase traffic. If it gets sick, you pull it back before anyone else is affected.
Argo Rollouts is a Kubernetes controller that replaces the standard Deployment with a Rollout CRD, giving you fine-grained control over the rollout process. Install it first:
# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml
# Install the kubectl plugin
brew install argoproj/tap/kubectl-argo-rollouts
Now define a canary Rollout for our Elixir application:
# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: tr-web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
      containers:
        - name: tr-web
          image: kainlite/tr:v1.2.0
          ports:
            - containerPort: 4000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
  strategy:
    canary:
      canaryService: tr-web-canary
      stableService: tr-web-stable
      trafficRouting:
        nginx:
          stableIngress: tr-web-ingress
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: canary-success-rate
            args:
              - name: service-name
                value: tr-web-canary
        - setWeight: 20
        - pause: { duration: 3m }
        - analysis:
            templates:
              - templateName: canary-success-rate
            args:
              - name: service-name
                value: tr-web-canary
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100
The steps section defines the rollout process:
- 5% of traffic goes to the new version, then pause for 2 minutes
- Analysis runs, checking the error rate against our SLO
- 20% of traffic if the analysis passed, then pause for 3 minutes
- 50% of traffic for 5 minutes
- 100% of traffic, full promotion if everything still looks healthy
You also need stable and canary services:
# services.yaml
apiVersion: v1
kind: Service
metadata:
  name: tr-web-stable
spec:
  selector:
    app: tr-web
  ports:
    - port: 80
      targetPort: 4000
---
apiVersion: v1
kind: Service
metadata:
  name: tr-web-canary
spec:
  selector:
    app: tr-web
  ports:
    - port: 80
      targetPort: 4000
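The nginx trafficRouting in the Rollout above references an Ingress named tr-web-ingress, which must route to the stable Service; Argo Rollouts then generates a second canary Ingress and adjusts its weight annotation. A minimal sketch (the host and ingress class here are assumptions):

```yaml
# ingress.yaml (sketch; host and ingress class are assumptions)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: tr-web-ingress
  annotations:
    kubernetes.io/ingress.class: nginx
spec:
  rules:
    - host: segfault.pw
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: tr-web-stable
                port:
                  number: 80
```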
To manage rollouts use the kubectl plugin:
# Watch the rollout
kubectl argo rollouts get rollout tr-web --watch
# Manually promote a paused rollout
kubectl argo rollouts promote tr-web
# Abort and go back to stable
kubectl argo rollouts abort tr-web
Blue-green deployments
Blue-green runs two complete environments side by side. “Blue” is the current version, “green” is the new one. You deploy green, test it, and switch all traffic at once. If something breaks, you switch back to blue.
The tradeoff versus canary is simplicity: there is no gradual traffic shifting, but you need double the resources during the deployment and all users move at once. Here is a blue-green Rollout:
# blue-green-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: tr-web-bluegreen
spec:
  replicas: 4
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
      containers:
        - name: tr-web
          image: kainlite/tr:v1.2.0
          ports:
            - containerPort: 4000
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
  strategy:
    blueGreen:
      activeService: tr-web-active
      previewService: tr-web-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 30
      prePromotionAnalysis:
        templates:
          - templateName: bluegreen-smoke-test
        args:
          - name: service-name
            value: tr-web-preview
      postPromotionAnalysis:
        templates:
          - templateName: canary-success-rate
        args:
          - name: service-name
            value: tr-web-active
When you update the image tag:
- New pods are created alongside existing ones
- Preview service points to new pods for testing
- Pre-promotion analysis runs smoke tests against preview
- Manual promotion is required since autoPromotionEnabled is false
- Traffic switches all at once from blue to green
- Old pods scale down after scaleDownDelaySeconds
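As with the canary setup, the blue-green Rollout needs its two Services, with names matching activeService and previewService; both select the same app label, and Argo Rollouts rewrites their selectors to target the right ReplicaSet:

```yaml
# bluegreen-services.yaml
apiVersion: v1
kind: Service
metadata:
  name: tr-web-active
spec:
  selector:
    app: tr-web
  ports:
    - port: 80
      targetPort: 4000
---
apiVersion: v1
kind: Service
metadata:
  name: tr-web-preview
spec:
  selector:
    app: tr-web
  ports:
    - port: 80
      targetPort: 4000
```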
Feature flags
Feature flags let you decouple deployment from release. You deploy code but the feature is hidden behind a flag you can toggle at runtime without a new deployment.
Here is a simple feature flag system in Elixir using ETS:
# lib/tr/feature_flags.ex
defmodule Tr.FeatureFlags do
  use GenServer

  @table :feature_flags

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def enabled?(feature) when is_atom(feature) do
    case :ets.lookup(@table, feature) do
      [{^feature, %{enabled: true, percentage: 100}}] -> true
      [{^feature, %{enabled: true, percentage: pct}}] -> :rand.uniform(100) <= pct
      _ -> false
    end
  end

  def enabled?(feature, user_id) when is_atom(feature) do
    case :ets.lookup(@table, feature) do
      [{^feature, %{enabled: true, percentage: 100}}] ->
        true

      [{^feature, %{enabled: true, percentage: pct}}] ->
        hash = :erlang.phash2({feature, user_id}, 100)
        hash < pct

      _ ->
        false
    end
  end

  def enable(feature, percentage \\ 100) when is_atom(feature) do
    GenServer.call(__MODULE__, {:enable, feature, percentage})
  end

  def disable(feature) when is_atom(feature) do
    GenServer.call(__MODULE__, {:disable, feature})
  end

  @impl true
  def init(_opts) do
    table = :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    load_defaults()
    {:ok, %{table: table}}
  end

  @impl true
  def handle_call({:enable, feature, percentage}, _from, state) do
    :ets.insert(@table, {feature, %{enabled: true, percentage: percentage}})
    {:reply, :ok, state}
  end

  @impl true
  def handle_call({:disable, feature}, _from, state) do
    :ets.insert(@table, {feature, %{enabled: false, percentage: 0}})
    {:reply, :ok, state}
  end

  defp load_defaults do
    defaults = Application.get_env(:tr, :feature_flags, [])

    Enum.each(defaults, fn {name, config} ->
      :ets.insert(@table, {name, config})
    end)
  end
end
Configure defaults and use it in your views:
# config/config.exs
config :tr, :feature_flags, [
  new_search_ui: %{enabled: false, percentage: 0},
  dark_mode: %{enabled: true, percentage: 100},
  experimental_editor: %{enabled: true, percentage: 10}
]

# In a LiveView
def render(assigns) do
  ~H"""
  <%= if Tr.FeatureFlags.enabled?(:new_search_ui) do %>
    <.new_search_component />
  <% else %>
    <.legacy_search_component />
  <% end %>
  """
end
The enabled?/2 variant uses consistent hashing, so a given user always gets the same answer at a given percentage, and users enabled at 25% stay enabled when you raise it to 50%.
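The same bucketing idea can be sketched outside Elixir: hash the feature/user pair into a stable 0-99 bucket and compare it against the rollout percentage. Here cksum stands in for :erlang.phash2, so the bucket values differ from the Elixir module's, but the stability property is the same:

```shell
# Stable 0-99 bucket per (feature, user); the flag is on when bucket < percentage
bucket() {
  printf '%s' "$1:$2" | cksum | awk '{ print $1 % 100 }'
}

b1=$(bucket new_search_ui 42)
b2=$(bucket new_search_ui 42)
echo "user 42 bucket: $b1 (same on every call: $b2)"
```

Because the hash is deterministic, a user never flips between variants across requests or nodes.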
You can progressively roll out:
Tr.FeatureFlags.enable(:new_search_ui, 25) # 25% of users
Tr.FeatureFlags.enable(:new_search_ui, 50) # 50% of users
Tr.FeatureFlags.enable(:new_search_ui, 100) # everyone
Tr.FeatureFlags.disable(:new_search_ui) # kill switch
Rollback automation
The fastest way to recover from a bad deployment is to roll back. With proper automation, this can happen in under a minute without human intervention.
With Argo Rollouts, rollback is automatic when analysis fails. The rollout is aborted and traffic shifts back to the stable version. For ArgoCD deployments, you can automate rollback in your CI pipeline:
# .github/workflows/deploy.yaml
name: Deploy

on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: |
          kubectl set image deployment/tr-web \
            tr-web=kainlite/tr:${{ github.sha }}

      - name: Wait for rollout
        id: rollout
        continue-on-error: true
        run: kubectl rollout status deployment/tr-web --timeout=180s

      - name: Run smoke tests
        id: smoke
        if: steps.rollout.outcome == 'success'
        continue-on-error: true
        run: |
          for i in $(seq 1 5); do
            STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
              https://segfault.pw/healthz)
            if [ "$STATUS" != "200" ]; then exit 1; fi
            sleep 2
          done

      - name: Rollback on failure
        if: steps.rollout.outcome == 'failure' || steps.smoke.outcome == 'failure'
        run: |
          echo "Deployment failed, rolling back..."
          kubectl rollout undo deployment/tr-web
          exit 1
You can also use kubectl directly for quick rollbacks:
# Kubernetes native rollback
kubectl rollout undo deployment/tr-web
# ArgoCD rollback to previous revision
argocd app history tr-web
argocd app rollback tr-web <previous-revision>
The key principle is that rollbacks should be automatic, fast, and require zero human decision-making.
Deployment SLOs
In the SLIs and SLOs article we defined SLOs for our services. Now we use those same SLOs as deployment gates. If a canary violates the SLO, the deployment stops.
Argo Rollouts uses AnalysisTemplates to query Prometheus and decide whether a deployment is healthy:
# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(
              http_requests_total{service="{{args.service-name}}", status!~"5.."}[2m]
            )) /
            sum(rate(
              http_requests_total{service="{{args.service-name}}"}[2m]
            ))
This template:
- Queries Prometheus every 30 seconds for the success rate
- Runs 5 measurements for enough data to decide
- Requires 99% success rate matching our SLO
- Allows 2 failures before marking analysis as failed
You can also gate deployments on error budget. If less than 20% of your 30-day error budget remains, block the deployment:
# analysis-error-budget.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-budget-gate
spec:
  metrics:
    - name: error-budget-remaining
      interval: 1m
      count: 1
      successCondition: result[0] > 0.2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            1 - (
              (1 - (
                sum(rate(http_requests_total{service="tr-web", status!~"5.."}[30d])) /
                sum(rate(http_requests_total{service="tr-web"}[30d]))
              )) / (1 - 0.999)
            )
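To sanity-check the expression: the remaining budget is one minus the ratio of observed error rate to allowed error rate. With a 99.9% SLO and a measured 99.95% success rate over 30 days, half the budget is spent. The same arithmetic in shell (the numbers are illustrative):

```shell
# remaining budget = 1 - (observed error rate / allowed error rate)
awk -v slo=0.999 -v ok=0.9995 \
  'BEGIN { printf "%.2f\n", 1 - ((1 - ok) / (1 - slo)) }'
# prints 0.50: half the 30-day budget remains, which clears the 0.2 gate
```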
Combine multiple analyses in your rollout steps for comprehensive validation:
steps:
  - setWeight: 10
  - pause: { duration: 2m }
  - analysis:
      templates:
        - templateName: canary-success-rate
        - templateName: canary-latency
      args:
        - name: service-name
          value: tr-web-canary
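The canary-latency template is referenced above but not defined; a sketch of what it could look like, assuming a Prometheus histogram and a 500ms p95 threshold (both the metric name and the threshold are assumptions):

```yaml
# analysis-latency.yaml (sketch; metric name and threshold are assumptions)
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-latency
spec:
  args:
    - name: service-name
  metrics:
    - name: p95-latency
      interval: 30s
      count: 5
      # p95 latency must stay under 500ms
      successCondition: result[0] <= 0.5
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            histogram_quantile(0.95, sum(rate(
              http_request_duration_seconds_bucket{service="{{args.service-name}}"}[2m]
            )) by (le))
```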
Pre and post sync hooks
ArgoCD supports resource hooks that run at specific points during sync. These are perfect for database migrations before deployment, smoke tests after, and notifications at various stages.
Pre-sync hook for database migrations:
# migration-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tr-web-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: kainlite/tr:v1.2.0
          command: ["/app/bin/tr"]
          args: ["eval", "Tr.Release.migrate()"]
          envFrom:
            - secretRef:
                name: tr-web-env
      restartPolicy: Never
  backoffLimit: 3
Post-sync hook for smoke tests:
# smoke-test-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tr-web-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: smoke-test
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://tr-web-stable/healthz)
              if [ "$STATUS" != "200" ]; then exit 1; fi
              echo "Smoke tests passed!"
      restartPolicy: Never
  backoffLimit: 1
Failure notification hook:
# notification-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tr-web-notify
  annotations:
    argocd.argoproj.io/hook: SyncFail
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: notify
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              curl -X POST "${SLACK_WEBHOOK_URL}" \
                -H 'Content-Type: application/json' \
                -d '{"text": "Sync FAILED for tr-web in production!"}'
          envFrom:
            - secretRef:
                name: slack-webhook
      restartPolicy: Never
The available hook types are:
- PreSync runs before sync (migrations, backups)
- Sync runs during sync alongside other resources
- PostSync runs after all resources are synced and healthy
- SyncFail runs when sync fails (alert notifications)
GitOps-driven releases
With GitOps, every deployment is a git commit. This gives you a complete audit trail and the ability to use git revert as a rollback mechanism.
The ArgoCD Image Updater detects new container images and updates the git repository automatically:
# argocd-image-updater annotations on the Application
metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: tr=kainlite/tr
    argocd-image-updater.argoproj.io/tr.update-strategy: semver
    argocd-image-updater.argoproj.io/tr.allow-tags: regexp:^v[0-9]+\.[0-9]+\.[0-9]+$
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main
For a PR-based flow with review before production, use a GitHub Action that creates a promotion PR:
# .github/workflows/promote.yaml
name: Promote to Production

on:
  workflow_run:
    workflows: ["Build and Push"]
    types: [completed]
    branches: [main]

jobs:
  promote:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: kainlite/tr-infra
          token: ${{ secrets.INFRA_REPO_TOKEN }}

      - name: Update image tag
        run: |
          cd k8s/overlays/production
          kustomize edit set image \
            kainlite/tr=kainlite/tr:${{ github.event.workflow_run.head_sha }}

      - name: Create PR
        uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "chore: bump tr-web to ${{ github.event.workflow_run.head_sha }}"
          title: "Deploy tr-web ${{ github.event.workflow_run.head_sha }}"
          branch: deploy/tr-web-${{ github.event.workflow_run.head_sha }}
          base: main
Use Kustomize overlays for environment promotion:
# k8s/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: kainlite/tr
    newTag: abc123-staging
namespace: staging

# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: kainlite/tr
    newTag: v1.2.0
namespace: default
The full workflow:
- Developer pushes code to the application repo
- CI builds and tests, pushes a container image
- Image updater detects new image and updates staging
- Staging tests pass including canary analysis
- PR is created to promote to production
- Team reviews and merges the PR
- ArgoCD syncs with the Argo Rollout strategy
- Canary analysis validates against SLOs
- Full rollout completes if healthy
Every step is traceable through git. If something goes wrong, git revert the promotion PR and ArgoCD rolls back.
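The revert flow can be rehearsed locally in miniature, with a throwaway repo standing in for the infra repo (in the real flow you would push the revert and let ArgoCD sync it):

```shell
set -euo pipefail

# Throwaway repo standing in for the infra repo
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email ci@example.com && git config user.name ci

# Two promotion commits: v1.1.0 then v1.2.0
echo "newTag: v1.1.0" > kustomization.yaml
git add . && git commit -qm "deploy v1.1.0"
echo "newTag: v1.2.0" > kustomization.yaml
git commit -qam "deploy v1.2.0"

# v1.2.0 misbehaves: revert the promotion commit
git revert --no-edit HEAD > /dev/null
grep newTag kustomization.yaml
# prints newTag: v1.1.0 -- the previous image tag is back in git
```

The revert is itself a commit, so the rollback shows up in the same audit trail as the deployment it undoes.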
Release cadence and freezes
Great tooling is important, but you also need policies around when you deploy. ArgoCD supports sync windows:
# argocd-project.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  syncWindows:
    # Allow syncs Monday-Thursday, 9am to 4pm UTC
    - kind: allow
      schedule: "0 9 * * 1-4"
      duration: 7h
      applications: ["*"]
    # No Friday afternoon deploys
    - kind: deny
      schedule: "0 14 * * 5"
      duration: 10h
      applications: ["*"]
    # End of year freeze (Dec 20 to Jan 1)
    - kind: deny
      schedule: "0 0 20 12 *"
      duration: 288h
      applications: ["*"]
    # Always allow manual syncs for emergencies
    - kind: allow
      schedule: "* * * * *"
      duration: 24h
      applications: ["*"]
      manualSync: true
Practical guidelines:
- Deploy often, deploy small: smaller changes are easier to debug
- No Friday afternoon deploys: unless you enjoy weekend pages
- Holiday freezes: plan them in advance, communicate clearly
- Emergency exceptions: always have a process for critical hotfixes
- Deploy windows: deploy only when someone is around to watch
You can also enforce this in CI:
# check-deploy-window.sh
#!/bin/bash
set -euo pipefail

HOUR=$(date -u +%H)
DAY=$(date -u +%u) # 1=Monday, 7=Sunday

if [ "$DAY" -ge 6 ]; then
  echo "Deploy blocked: no weekend deployments"; exit 1
fi

if [ "$DAY" -eq 5 ] && [ "$HOUR" -ge 14 ]; then
  echo "Deploy blocked: no Friday afternoon deployments"; exit 1
fi

if [ "$HOUR" -lt 9 ] || [ "$HOUR" -ge 16 ]; then
  echo "Deploy blocked: outside window (09:00-16:00 UTC)"; exit 1
fi

echo "Deploy window open, proceeding..."
The balance is between safety and velocity. Too many restrictions and your team stops deploying, which actually makes deployments riskier because each one contains more changes.
Closing notes
Release engineering is about making deployments boring. When you have canary deployments that validate against your SLOs, blue-green strategies with instant rollback, feature flags for decoupling deployment from release, and GitOps pipelines with full audit trails, deployments become routine operations instead of scary events.
Start with one piece, maybe canary deployments with a simple error rate analysis, and build from there. The goal is not zero deployments, it is zero deployment-caused incidents. Ship fast, ship safely, and let automation catch problems before your users do.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here