SRE: Release Engineering and Progressive Delivery

2026-03-21 | Gabriel Garrido | 11 min read

Introduction

Throughout this SRE series we have covered a lot of ground: SLIs and SLOs, incident management, observability, chaos engineering, capacity planning, GitOps, secrets management, cost optimization, dependency management, and database reliability. All of that gives us SLOs, alerts, runbooks, observability pipelines, chaos experiments, and GitOps workflows, but none of it matters if your deployments keep causing outages.


Deployments are among the leading causes of incidents in most organizations. Every time you push new code to production, you are introducing change, and change is where failures live. Release engineering is the discipline of making deployments safe, predictable, and boring. Progressive delivery takes that further by gradually rolling out changes to small subsets of users, validating at each step, and automatically rolling back when something goes wrong.


In this article we will cover canary deployments with Argo Rollouts, blue-green deployments, feature flags in Elixir, automatic rollback, deployment SLOs, ArgoCD sync hooks, GitOps-driven releases, and release cadence policies.


Let’s get into it.


Canary deployments with Argo Rollouts

A canary deployment sends a small percentage of traffic to the new version first. If the canary stays healthy, you gradually increase traffic. If it gets sick, you pull it back before anyone else is affected.


Argo Rollouts is a Kubernetes controller that replaces the standard Deployment with a Rollout CRD, giving you fine-grained control over the rollout process. Install it first:


# Install Argo Rollouts
kubectl create namespace argo-rollouts
kubectl apply -n argo-rollouts -f https://github.com/argoproj/argo-rollouts/releases/latest/download/install.yaml

# Install the kubectl plugin
brew install argoproj/tap/kubectl-argo-rollouts

Now define a canary Rollout for our Elixir application:


# rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: tr-web
spec:
  replicas: 4
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
      containers:
        - name: tr-web
          image: kainlite/tr:v1.2.0
          ports:
            - containerPort: 4000
          resources:
            requests:
              cpu: "250m"
              memory: "256Mi"
            limits:
              cpu: "1000m"
              memory: "512Mi"
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
            initialDelaySeconds: 10
  strategy:
    canary:
      canaryService: tr-web-canary
      stableService: tr-web-stable
      trafficRouting:
        nginx:
          stableIngress: tr-web-ingress
      steps:
        - setWeight: 5
        - pause: { duration: 2m }
        - analysis:
            templates:
              - templateName: canary-success-rate
            args:
              - name: service-name
                value: tr-web-canary
        - setWeight: 20
        - pause: { duration: 3m }
        - analysis:
            templates:
              - templateName: canary-success-rate
        - setWeight: 50
        - pause: { duration: 5m }
        - setWeight: 100

The steps section defines the rollout process:


  1. 5% of traffic goes to the new version, then pause for 2 minutes
  2. Analysis runs, checking the error rate against our SLO
  3. If the analysis passed, traffic bumps to 20%, pauses for 3 minutes, and the analysis runs again
  4. Traffic holds at 50% for 5 minutes
  5. Full promotion to 100% if everything looks good

You also need stable and canary services:


# services.yaml
apiVersion: v1
kind: Service
metadata:
  name: tr-web-stable
spec:
  selector:
    app: tr-web
  ports:
    - port: 80
      targetPort: 4000
---
apiVersion: v1
kind: Service
metadata:
  name: tr-web-canary
spec:
  selector:
    app: tr-web
  ports:
    - port: 80
      targetPort: 4000

To manage rollouts, use the kubectl plugin:


# Watch the rollout
kubectl argo rollouts get rollout tr-web --watch

# Manually promote a paused rollout
kubectl argo rollouts promote tr-web

# Abort and go back to stable
kubectl argo rollouts abort tr-web

Blue-green deployments

Blue-green runs two complete environments side by side. “Blue” is the current version, “green” is the new one. You deploy green, test it, and switch all traffic at once. If something breaks, you switch back to blue.


The tradeoff versus canary is simplicity (no gradual traffic shifting), but you need double the resources during the deployment and all users move at once. Here is a blue-green Rollout:


# blue-green-rollout.yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: tr-web-bluegreen
spec:
  replicas: 4
  selector:
    matchLabels:
      app: tr-web
  template:
    metadata:
      labels:
        app: tr-web
    spec:
      containers:
        - name: tr-web
          image: kainlite/tr:v1.2.0
          ports:
            - containerPort: 4000
          readinessProbe:
            httpGet:
              path: /healthz
              port: 4000
  strategy:
    blueGreen:
      activeService: tr-web-active
      previewService: tr-web-preview
      autoPromotionEnabled: false
      scaleDownDelaySeconds: 30
      prePromotionAnalysis:
        templates:
          - templateName: bluegreen-smoke-test
        args:
          - name: service-name
            value: tr-web-preview
      postPromotionAnalysis:
        templates:
          - templateName: canary-success-rate
        args:
          - name: service-name
            value: tr-web-active

When you update the image tag:


  1. New pods are created alongside existing ones
  2. Preview service points to new pods for testing
  3. Pre-promotion analysis runs smoke tests against preview
  4. Manual promotion required since autoPromotionEnabled is false
  5. Traffic switches all at once from blue to green
  6. Old pods scale down after scaleDownDelaySeconds

Feature flags

Feature flags let you decouple deployment from release. You deploy the code, but the feature stays hidden behind a flag you can toggle at runtime, without a new deployment.


Here is a simple feature flag system in Elixir using ETS:


# lib/tr/feature_flags.ex
defmodule Tr.FeatureFlags do
  use GenServer

  @table :feature_flags

  def start_link(opts \\ []) do
    GenServer.start_link(__MODULE__, opts, name: __MODULE__)
  end

  def enabled?(feature) when is_atom(feature) do
    case :ets.lookup(@table, feature) do
      [{^feature, %{enabled: true, percentage: 100}}] -> true
      [{^feature, %{enabled: true, percentage: pct}}] -> :rand.uniform(100) <= pct
      _ -> false
    end
  end

  def enabled?(feature, user_id) when is_atom(feature) do
    case :ets.lookup(@table, feature) do
      [{^feature, %{enabled: true, percentage: 100}}] -> true
      [{^feature, %{enabled: true, percentage: pct}}] ->
        hash = :erlang.phash2({feature, user_id}, 100)
        hash < pct
      _ -> false
    end
  end

  def enable(feature, percentage \\ 100) when is_atom(feature) do
    GenServer.call(__MODULE__, {:enable, feature, percentage})
  end

  def disable(feature) when is_atom(feature) do
    GenServer.call(__MODULE__, {:disable, feature})
  end

  @impl true
  def init(_opts) do
    table = :ets.new(@table, [:named_table, :set, :public, read_concurrency: true])
    load_defaults()
    {:ok, %{table: table}}
  end

  @impl true
  def handle_call({:enable, feature, percentage}, _from, state) do
    :ets.insert(@table, {feature, %{enabled: true, percentage: percentage}})
    {:reply, :ok, state}
  end

  @impl true
  def handle_call({:disable, feature}, _from, state) do
    :ets.insert(@table, {feature, %{enabled: false, percentage: 0}})
    {:reply, :ok, state}
  end

  defp load_defaults do
    defaults = Application.get_env(:tr, :feature_flags, [])
    Enum.each(defaults, fn {name, config} ->
      :ets.insert(@table, {name, config})
    end)
  end
end

Configure defaults and use it in your views:


# config/config.exs
config :tr, :feature_flags, [
  new_search_ui: %{enabled: false, percentage: 0},
  dark_mode: %{enabled: true, percentage: 100},
  experimental_editor: %{enabled: true, percentage: 10}
]

# In a LiveView
def render(assigns) do
  ~H"""
  <%= if Tr.FeatureFlags.enabled?(:new_search_ui) do %>
    <.new_search_component />
  <% else %>
    <.legacy_search_component />
  <% end %>
  """
end

The enabled?/2 variant uses consistent hashing, so a given user always gets the same result at a given percentage, and users enabled at 25% stay enabled when you raise it to 50%. You can progressively roll out:


Tr.FeatureFlags.enable(:new_search_ui, 25)  # 25% of users
Tr.FeatureFlags.enable(:new_search_ui, 50)  # 50% of users
Tr.FeatureFlags.enable(:new_search_ui, 100) # everyone
Tr.FeatureFlags.disable(:new_search_ui)     # kill switch
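The same bucketing idea works with any stable hash. Here is a sketch in shell (using cksum as a stand-in for :erlang.phash2; the in_rollout helper is illustrative, not part of the codebase) that also demonstrates the monotonicity property:

```shell
#!/bin/sh
# Percentage bucketing with a stable hash: the same (feature, user) pair
# always lands in the same bucket in [0, 100), so raising the percentage
# only ever adds users, it never flips someone who was already enabled.
in_rollout() {  # usage: in_rollout <feature> <user_id> <percentage>
  bucket=$(( $(printf '%s:%s' "$1" "$2" | cksum | cut -d' ' -f1) % 100 ))
  [ "$bucket" -lt "$3" ]
}

# Every user included at 25% must still be included at 50%
for u in $(seq 1 200); do
  if in_rollout new_search_ui "$u" 25 && ! in_rollout new_search_ui "$u" 50; then
    echo "monotonicity violated for user $u"; exit 1
  fi
done
echo "all users stable across percentage increases"
```

The kill switch falls out for free: at percentage 0 no bucket passes the check, so every user is excluded immediately.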

Rollback automation

The fastest way to recover from a bad deployment is to roll back. With proper automation, this can happen in under a minute without human intervention.


With Argo Rollouts, rollback is automatic when analysis fails. The rollout is aborted and traffic shifts back to the stable version. For ArgoCD deployments, you can automate rollback in your CI pipeline:


# .github/workflows/deploy.yaml
name: Deploy
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to production
        run: |
          kubectl set image deployment/tr-web \
            tr-web=kainlite/tr:${{ github.sha }}

      - name: Wait for rollout
        id: rollout
        continue-on-error: true
        run: kubectl rollout status deployment/tr-web --timeout=180s

      - name: Run smoke tests
        id: smoke
        if: steps.rollout.outcome == 'success'
        continue-on-error: true
        run: |
          for i in $(seq 1 5); do
            STATUS=$(curl -s -o /dev/null -w '%{http_code}' \
              https://segfault.pw/healthz)
            if [ "$STATUS" != "200" ]; then exit 1; fi
            sleep 2
          done

      - name: Rollback on failure
        if: steps.rollout.outcome == 'failure' || steps.smoke.outcome == 'failure'
        run: |
          echo "Deployment failed, rolling back..."
          kubectl rollout undo deployment/tr-web
          exit 1

You can also use kubectl directly for quick rollbacks:


# Kubernetes native rollback
kubectl rollout undo deployment/tr-web

# ArgoCD rollback to previous revision
argocd app history tr-web
argocd app rollback tr-web <previous-revision>

The key principle is that rollbacks should be automatic, fast, and require zero human decision-making.


Deployment SLOs

In the SLIs and SLOs article we defined SLOs for our services. Now we use those same SLOs as deployment gates. If a canary violates the SLO, the deployment stops.


Argo Rollouts uses AnalysisTemplates to query Prometheus and decide whether a deployment is healthy:


# analysis-template.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: canary-success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 30s
      count: 5
      successCondition: result[0] >= 0.99
      failureLimit: 2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            sum(rate(
              http_requests_total{service="{{args.service-name}}", status!~"5.."}[2m]
            )) /
            sum(rate(
              http_requests_total{service="{{args.service-name}}"}[2m]
            ))

This template:


  • Queries Prometheus every 30 seconds for the success rate
  • Takes 5 measurements, enough data to decide
  • Requires a 99% success rate, matching our SLO
  • Allows up to 2 failed measurements before marking the analysis as failed
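To make the gate semantics above concrete, here is a shell sketch of the pass/fail decision (analysis_passes is an illustrative helper mirroring the documented behavior, not Argo's actual implementation):

```shell
#!/bin/sh
# A measurement fails when it misses the success threshold; the analysis
# as a whole fails once more than failure_limit measurements have failed.
analysis_passes() {  # usage: analysis_passes <threshold> <failure_limit> <m1> <m2> ...
  threshold=$1; limit=$2; shift 2
  failures=0
  for m in "$@"; do
    # awk handles the floating-point comparison; exit 0 means "failed measurement"
    if awk -v m="$m" -v t="$threshold" 'BEGIN { exit !(m < t) }'; then
      failures=$((failures + 1))
    fi
  done
  [ "$failures" -le "$limit" ]
}

analysis_passes 0.99 2 0.999 0.995 0.991 0.998 0.997 && echo "healthy: passes"
analysis_passes 0.99 2 0.98 0.999 0.97 0.995 0.999 && echo "2 failures: tolerated"
analysis_passes 0.99 2 0.98 0.97 0.96 0.999 0.999 || echo "3 failures: rollout aborts"
```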

You can also gate deployments on error budget. If less than 20% of your 30-day error budget remains, block the deployment:


# analysis-error-budget.yaml
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: error-budget-gate
spec:
  metrics:
    - name: error-budget-remaining
      interval: 1m
      count: 1
      successCondition: result[0] > 0.2
      provider:
        prometheus:
          address: http://prometheus.monitoring.svc:9090
          query: |
            1 - (
              (1 - (
                sum(rate(http_requests_total{service="tr-web", status!~"5.."}[30d])) /
                sum(rate(http_requests_total{service="tr-web"}[30d]))
              )) / (1 - 0.999)
            )
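The formula in that query is worth sanity-checking by hand. A small shell sketch of the same arithmetic (budget_remaining is an illustrative helper name):

```shell
#!/bin/sh
# budget_remaining = 1 - (1 - success_rate) / (1 - slo)
# 1.0 means the budget is untouched, 0.0 means it is fully spent,
# and a negative value means the SLO itself has been violated.
budget_remaining() {  # usage: budget_remaining <success_rate> <slo>
  awk -v sr="$1" -v slo="$2" 'BEGIN { printf "%.4f\n", 1 - (1 - sr) / (1 - slo) }'
}

budget_remaining 1.0     0.999   # prints 1.0000 (no errors at all)
budget_remaining 0.999   0.999   # prints 0.0000 (exactly at the SLO)
budget_remaining 0.99975 0.999   # prints 0.7500 (a quarter of the budget spent)
```

With successCondition result[0] > 0.2, the gate blocks new deployments once more than 80% of the 30-day budget is gone.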

Combine multiple analyses in your rollout steps for comprehensive validation:


steps:
  - setWeight: 10
  - pause: { duration: 2m }
  - analysis:
      templates:
        - templateName: canary-success-rate
        - templateName: canary-latency
      args:
        - name: service-name
          value: tr-web-canary

Pre and post sync hooks

ArgoCD supports resource hooks that run at specific points during sync. These are perfect for database migrations before deployment, smoke tests after, and notifications at various stages.


Pre-sync hook for database migrations:


# migration-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tr-web-migrate
  annotations:
    argocd.argoproj.io/hook: PreSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: migrate
          image: kainlite/tr:v1.2.0
          command: ["/app/bin/tr"]
          args: ["eval", "Tr.Release.migrate()"]
          envFrom:
            - secretRef:
                name: tr-web-env
      restartPolicy: Never
  backoffLimit: 3

Post-sync hook for smoke tests:


# smoke-test-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tr-web-smoke-test
  annotations:
    argocd.argoproj.io/hook: PostSync
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: smoke-test
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              STATUS=$(curl -s -o /dev/null -w '%{http_code}' http://tr-web-stable/healthz)
              if [ "$STATUS" != "200" ]; then exit 1; fi
              echo "Smoke tests passed!"
      restartPolicy: Never
  backoffLimit: 1

Failure notification hook:


# notification-hook.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: tr-web-notify
  annotations:
    argocd.argoproj.io/hook: SyncFail
    argocd.argoproj.io/hook-delete-policy: HookSucceeded
spec:
  template:
    spec:
      containers:
        - name: notify
          image: curlimages/curl:latest
          command: ["/bin/sh", "-c"]
          args:
            - |
              curl -X POST "${SLACK_WEBHOOK_URL}" \
                -H 'Content-Type: application/json' \
                -d '{"text": "Sync FAILED for tr-web in production!"}'
          envFrom:
            - secretRef:
                name: slack-webhook
      restartPolicy: Never

The available hook types are:


  • PreSync runs before sync (migrations, backups)
  • Sync runs during sync alongside other resources
  • PostSync runs after all resources are synced and healthy
  • SyncFail runs when sync fails (alert notifications)

GitOps-driven releases

With GitOps, every deployment is a git commit. This gives you a complete audit trail and the ability to use git revert as a rollback mechanism.


The ArgoCD Image Updater detects new container images and updates the git repository automatically:


# argocd-image-updater annotations on the Application
metadata:
  annotations:
    argocd-image-updater.argoproj.io/image-list: tr=kainlite/tr
    argocd-image-updater.argoproj.io/tr.update-strategy: semver
    argocd-image-updater.argoproj.io/tr.allow-tags: regexp:^v[0-9]+\.[0-9]+\.[0-9]+$
    argocd-image-updater.argoproj.io/write-back-method: git
    argocd-image-updater.argoproj.io/git-branch: main

For a PR-based flow with review before production, use a GitHub Action that creates a promotion PR:


# .github/workflows/promote.yaml
name: Promote to Production
on:
  workflow_run:
    workflows: ["Build and Push"]
    types: [completed]
    branches: [main]

jobs:
  promote:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - uses: actions/checkout@v4
        with:
          repository: kainlite/tr-infra
          token: ${{ secrets.INFRA_REPO_TOKEN }}

      - name: Update image tag
        run: |
          cd k8s/overlays/production
          kustomize edit set image \
            kainlite/tr=kainlite/tr:${{ github.event.workflow_run.head_sha }}

      - name: Create PR
        uses: peter-evans/create-pull-request@v6
        with:
          commit-message: "chore: bump tr-web to ${{ github.event.workflow_run.head_sha }}"
          title: "Deploy tr-web ${{ github.event.workflow_run.head_sha }}"
          branch: deploy/tr-web-${{ github.event.workflow_run.head_sha }}
          base: main

Use Kustomize overlays for environment promotion:


# k8s/overlays/staging/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: kainlite/tr
    newTag: abc123-staging
namespace: staging

# k8s/overlays/production/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
images:
  - name: kainlite/tr
    newTag: v1.2.0
namespace: default

The full workflow:


  1. Developer pushes code to the application repo
  2. CI builds and tests, pushes a container image
  3. Image updater detects new image and updates staging
  4. Staging tests pass including canary analysis
  5. PR is created to promote to production
  6. Team reviews and merges the PR
  7. ArgoCD syncs with the Argo Rollout strategy
  8. Canary analysis validates against SLOs
  9. Full rollout completes if healthy

Every step is traceable through git. If something goes wrong, git revert the promotion PR and ArgoCD rolls back.
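To make the revert path concrete, here is a self-contained throwaway-repo sketch (file names and tags are made up) showing that reverting the promotion commit restores the previous image tag; in a real setup you would push the revert and let ArgoCD sync the cluster back:

```shell
#!/bin/sh
set -eu
# Throwaway repo standing in for the infra repository
repo=$(mktemp -d) && cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo "newTag: v1.1.0" > kustomization.yaml      # current production tag
git add . && git commit -q -m "deploy tr-web v1.1.0"

echo "newTag: v1.2.0" > kustomization.yaml      # the bad promotion
git add . && git commit -q -m "deploy tr-web v1.2.0"

git revert --no-edit HEAD                        # undo the promotion commit
cat kustomization.yaml                           # back to: newTag: v1.1.0
```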


Release cadence and freezes

Great tooling is important, but you also need policies around when you deploy. ArgoCD supports sync windows:


# argocd-project.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: production
  namespace: argocd
spec:
  syncWindows:
    # Allow syncs Monday-Thursday, 9am to 4pm UTC
    - kind: allow
      schedule: "0 9 * * 1-4"
      duration: 7h
      applications: ["*"]

    # No Friday afternoon deploys
    - kind: deny
      schedule: "0 14 * * 5"
      duration: 10h
      applications: ["*"]

    # End of year freeze (Dec 20 to Jan 1)
    - kind: deny
      schedule: "0 0 20 12 *"
      duration: 288h
      applications: ["*"]

    # Always allow manual syncs for emergencies
    - kind: allow
      schedule: "* * * * *"
      duration: 24h
      applications: ["*"]
      manualSync: true

Practical guidelines:


  • Deploy often, deploy small: smaller changes are easier to debug
  • No Friday afternoon deploys: unless you enjoy weekend pages
  • Holiday freezes: plan them in advance, communicate clearly
  • Emergency exceptions: always have a process for critical hotfixes
  • Deploy windows: deploy only when someone is around to watch

You can also enforce this in CI:


#!/bin/bash
# check-deploy-window.sh
set -euo pipefail

HOUR=$(date -u +%H)
DAY=$(date -u +%u)  # 1=Monday, 7=Sunday

if [ "$DAY" -ge 6 ]; then
  echo "Deploy blocked: no weekend deployments"; exit 1
fi

if [ "$DAY" -eq 5 ] && [ "$HOUR" -ge 14 ]; then
  echo "Deploy blocked: no Friday afternoon deployments"; exit 1
fi

if [ "$HOUR" -lt 9 ] || [ "$HOUR" -ge 16 ]; then
  echo "Deploy blocked: outside window (09:00-16:00 UTC)"; exit 1
fi

echo "Deploy window open, proceeding..."

The balance is between safety and velocity. Too many restrictions and your team stops deploying, which actually makes deployments riskier because each one contains more changes.


Closing notes

Release engineering is about making deployments boring. When you have canary deployments that validate against your SLOs, blue-green strategies with instant rollback, feature flags for decoupling deployment from release, and GitOps pipelines with full audit trails, deployments become routine operations instead of scary events.


Start with one piece, maybe canary deployments with a simple error rate analysis, and build from there. The goal is not zero deployments, it is zero deployment-caused incidents. Ship fast, ship safely, and let automation catch problems before your users do.


Hope you found this useful and enjoyed reading it, until next time!


Errata

If you spot any error or have any suggestion, please send me a message so it gets fixed.

Also, you can check the source code and changes in the sources here


