SRE: SLIs, SLOs, and Automations That Actually Help
Introduction
In this article we will explore the practical side of Site Reliability Engineering (SRE), specifically how to define Service Level Indicators (SLIs) and Service Level Objectives (SLOs) as code, deploy them using ArgoCD, and leverage MCP servers and automations to make the whole process less painful.
If you have been doing operations or platform engineering for a while, you probably already know that monitoring alone is not enough. Having a dashboard full of green lights does not mean your users are happy. SLIs and SLOs give you a framework to measure what actually matters and make informed decisions about reliability vs. feature velocity.
Let’s get into it.
What is SRE anyway?
Site Reliability Engineering is a discipline that applies software engineering practices to operations problems. Google popularized the concept, but the core idea is simple: treat your infrastructure and operational processes as code, measure what matters, and use error budgets to balance reliability with the speed of shipping new features.
The key components are:
- SLIs (Service Level Indicators): Metrics that measure the quality of your service from the user’s perspective
- SLOs (Service Level Objectives): Targets you set for your SLIs (e.g., “99.9% of requests should succeed”)
- Error Budgets: The acceptable amount of unreliability (100% - SLO target)
- SLAs (Service Level Agreements): Business contracts based on SLOs (we won’t focus on these here)
Understanding SLIs
An SLI is a carefully defined quantitative measure of some aspect of the level of service provided. The most common SLIs are:
- Availability: The proportion of requests that succeed
- Latency: The proportion of requests that are faster than a threshold
- Quality: The proportion of responses that are not degraded
The important thing here is the “proportion” part. SLIs are expressed as ratios:
SLI = good events / total events
For example, for an HTTP service:
# Availability SLI
availability = (total_requests - 5xx_errors) / total_requests
# Latency SLI
latency = requests_faster_than_300ms / total_requests
This is much more useful than raw metrics because it directly reflects user experience. A spike in errors that lasts 5 seconds is very different from one that lasts 5 minutes, and the ratio captures that difference over a time window.
Understanding SLOs
An SLO is the target value for an SLI over a specific time window. For example:
- “99.9% of HTTP requests should return a non-error response over a 30-day rolling window”
- “99% of requests should complete in less than 300ms over a 30-day rolling window”
The SLO gives you an error budget. If your SLO is 99.9%, your error budget is 0.1%. Over 30 days, that means you can afford roughly 43 minutes of total downtime. This is incredibly powerful because it turns reliability into a measurable resource you can spend. Want to do a risky deployment? Check your error budget first.
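To make the arithmetic concrete, here is a small standalone sketch (plain Rust, not part of any of the tooling we deploy later) that turns an SLO target and a window into an allowed downtime budget:
// error_budget.rs -- back-of-the-envelope math: how much "bad" time an SLO allows.
fn allowed_downtime_minutes(slo_percent: f64, window_days: f64) -> f64 {
    let error_budget = 1.0 - slo_percent / 100.0; // e.g. 99.9% -> 0.001
    error_budget * window_days * 24.0 * 60.0
}

fn main() {
    // 99.9% over 30 days -> ~43.2 minutes of full downtime (or the equivalent in failed requests)
    println!("{:.1} minutes", allowed_downtime_minutes(99.9, 30.0));
    // 99% over 30 days -> ~432 minutes (about 7.2 hours)
    println!("{:.1} minutes", allowed_downtime_minutes(99.0, 30.0));
}
Strictly speaking the budget is a count of failed requests rather than wall-clock minutes, but the time view is a useful mental model.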
Putting SLIs into code with Prometheus
Now let’s get practical. The most common way to implement SLIs is with Prometheus metrics. If you are running workloads in Kubernetes, you probably already have Prometheus or a compatible system collecting metrics.
For a typical web service, you want to expose a histogram that tracks request duration and status:
# If your application uses Prometheus client, expose something like:
# histogram: http_request_duration_seconds (with labels: method, path, status)
# counter: http_requests_total (with labels: method, path, status)
# For our Phoenix/Elixir app, we rely on phoenix_telemetry and peep to expose these.
# But the concept applies to any language.
With those metrics in Prometheus, you can define recording rules that calculate the SLI ratios. Here is an example of Prometheus recording rules for an HTTP availability SLI:
# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sli-availability
  namespace: monitoring
spec:
  groups:
    - name: sli.availability
      interval: 30s
      rules:
        # Total requests rate over 5m window
        - record: sli:http_requests:rate5m
          expr: sum(rate(http_requests_total[5m]))
        # Error requests rate over 5m window (5xx responses)
        - record: sli:http_errors:rate5m
          expr: sum(rate(http_requests_total{status=~"5.."}[5m]))
        # Availability SLI (ratio of successful requests)
        - record: sli:availability:ratio_rate5m
          expr: |
            1 - (sli:http_errors:rate5m / sli:http_requests:rate5m)
And for a latency SLI:
# prometheus-rules-latency.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: sli-latency
  namespace: monitoring
spec:
  groups:
    - name: sli.latency
      interval: 30s
      rules:
        # Requests faster than 300ms
        - record: sli:http_request_duration:rate5m
          expr: sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
        # All requests
        - record: sli:http_request_duration_total:rate5m
          expr: sum(rate(http_request_duration_seconds_count[5m]))
        # Latency SLI
        - record: sli:latency:ratio_rate5m
          expr: |
            sli:http_request_duration:rate5m / sli:http_request_duration_total:rate5m
These recording rules pre-compute the SLI ratios so you can use them in alerting and dashboards without running expensive queries every time.
SLOs as code with Sloth
Writing Prometheus recording rules and alert rules by hand for every SLO gets tedious fast. That’s where Sloth comes in. Sloth is a tool that generates all the Prometheus rules you need from a simple SLO definition.
Here is an SLO definition for our service:
# slos/tr-web.yaml
apiVersion: sloth.slok.dev/v1
kind: PrometheusServiceLevel
metadata:
  name: tr-web
  namespace: default
spec:
  service: "tr-web"
  labels:
    team: "platform"
  slos:
    - name: "requests-availability"
      objective: 99.9
      description: "99.9% of HTTP requests should succeed"
      sli:
        events:
          error_query: sum(rate(http_requests_total{status=~"5..",service="tr-web"}[{{.window}}]))
          total_query: sum(rate(http_requests_total{service="tr-web"}[{{.window}}]))
      alerting:
        name: TrWebHighErrorRate
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on tr-web"
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning
    - name: "requests-latency"
      objective: 99.0
      description: "99% of requests should be faster than 300ms"
      sli:
        events:
          error_query: |
            sum(rate(http_request_duration_seconds_count{service="tr-web"}[{{.window}}]))
            -
            sum(rate(http_request_duration_seconds_bucket{le="0.3",service="tr-web"}[{{.window}}]))
          total_query: sum(rate(http_request_duration_seconds_count{service="tr-web"}[{{.window}}]))
      alerting:
        name: TrWebHighLatency
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High latency on tr-web"
        page_alert:
          labels:
            severity: critical
        ticket_alert:
          labels:
            severity: warning
Then you generate the Prometheus rules:
sloth generate -i slos/tr-web.yaml -o prometheus-rules/tr-web-slo.yaml
Sloth generates multi-window, multi-burn-rate alerts following the Google SRE book recommendations. You get fast-burn alerts (something is very wrong right now) and slow-burn alerts (you are consuming error budget faster than expected). This is a massive improvement over manually crafting alert thresholds.
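If "burn rate" sounds abstract: it is simply the ratio between the error rate you are observing and the error rate your SLO allows. The sketch below (plain Rust, using the standard thresholds from the SRE workbook rather than Sloth's exact internal expressions) shows how a burn rate sustained over an alert window maps to budget consumption:
// burn_rate.rs -- the math behind multi-window, multi-burn-rate alerts.
// burn_rate = observed error ratio / allowed error ratio (1 - SLO).
// A burn rate of 1.0 exhausts the budget exactly at the end of the SLO window.
fn budget_consumed(burn_rate: f64, alert_window_hours: f64, slo_window_days: f64) -> f64 {
    burn_rate * alert_window_hours / (slo_window_days * 24.0)
}

fn main() {
    // Fast-burn page: burn rate 14.4 sustained for 1h consumes ~2% of a 30-day budget.
    println!("{:.1}% of budget", budget_consumed(14.4, 1.0, 30.0) * 100.0);
    // Slow-burn ticket: burn rate 1.0 sustained for 3 days consumes ~10% of the budget.
    println!("{:.1}% of budget", budget_consumed(1.0, 72.0, 30.0) * 100.0);
}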
Deploying SLOs with ArgoCD
Now that we have our SLO definitions and generated Prometheus rules as YAML files, we can deploy them the GitOps way using ArgoCD. If you read my previous article about GitOps, this will feel familiar.
The idea is simple: store your SLO definitions and generated rules in a Git repository, and let ArgoCD sync them to your cluster.
Here is the repository structure:
slo-configs/
├── slos/
│   ├── tr-web.yaml            # Sloth SLO definitions
│   └── api-gateway.yaml
├── generated/
│   ├── tr-web-slo.yaml        # Generated PrometheusRule resources
│   └── api-gateway-slo.yaml
├── dashboards/
│   ├── tr-web-slo.json        # Grafana dashboard JSON
│   └── api-gateway-slo.json
└── argocd/
    └── application.yaml       # ArgoCD Application manifest
The ArgoCD Application manifest:
# argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: slo-configs
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/kainlite/slo-configs
    targetRevision: HEAD
    path: generated
  destination:
    server: https://kubernetes.default.svc
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
With this setup, every time you update an SLO definition, regenerate the rules, and push to Git, ArgoCD automatically applies the changes to your cluster. No manual kubectl commands, no forgetting to apply that one file you changed last week.
You can also set up a CI step to automatically regenerate the Prometheus rules when the SLO definitions change:
# .github/workflows/generate-slos.yaml
name: Generate SLO Rules
on:
  push:
    paths:
      - 'slos/**'
jobs:
  generate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Sloth
        run: |
          curl -L https://github.com/slok/sloth/releases/latest/download/sloth-linux-amd64 -o sloth
          chmod +x sloth
      - name: Generate rules
        run: |
          for slo in slos/*.yaml; do
            name=$(basename "$slo" .yaml)
            ./sloth generate -i "$slo" -o "generated/${name}-slo.yaml"
          done
      - name: Commit and push
        run: |
          git config user.name "github-actions"
          git config user.email "[email protected]"
          git add generated/
          git diff --staged --quiet || git commit -m "chore: regenerate SLO rules"
          git push
Now you have a fully automated pipeline: edit an SLO definition, push, CI generates the rules, ArgoCD deploys them. Beautiful.
MCP servers for SRE automation
This is where things get really interesting. Model Context Protocol (MCP) servers allow you to give AI assistants like Claude access to your infrastructure tools. Imagine being able to ask “what’s my current error budget for tr-web?” and getting an actual answer from your live Prometheus data.
An MCP server is essentially an API that exposes tools an AI can call. You can build one that wraps your Prometheus and Kubernetes APIs:
// mcp-sre-server/src/main.rs
// A simplified example of an MCP server for SRE queries.
// Note: `mcp_server`, `Tool`, and helpers like `prometheus_query` are illustrative,
// not a specific published crate.
use mcp_server::{Server, Tool, ToolResult};

#[derive(Tool)]
#[tool(name = "query_error_budget", description = "Query remaining error budget for a service")]
struct QueryErrorBudget {
    service: String,
    slo_name: String,
    objective: f64, // SLO target as a percentage, e.g. 99.9
}

impl QueryErrorBudget {
    async fn execute(&self) -> ToolResult {
        // Remaining budget = 1 - (observed error ratio / allowed error ratio)
        let query = format!(
            r#"1 - (1 - sli:availability:ratio_rate30d{{service="{}"}}) / (1 - {} / 100)"#,
            self.service, self.objective
        );
        let result = prometheus_query(&query).await?;
        ToolResult::text(format!(
            "Error budget for {}/{}: {:.2}% remaining",
            self.service, self.slo_name, result * 100.0
        ))
    }
}
#[derive(Tool)]
#[tool(name = "list_slo_violations", description = "List SLOs that are currently burning too fast")]
struct ListSloViolations;

impl ListSloViolations {
    async fn execute(&self) -> ToolResult {
        let query = r#"ALERTS{alertname=~".*SLO.*", alertstate="firing"}"#;
        let alerts = prometheus_query(query).await?;
        ToolResult::text(format!("Active SLO violations:\n{}", alerts))
    }
}

#[derive(Tool)]
#[tool(name = "get_deployment_risk", description = "Assess deployment risk based on current error budget")]
struct GetDeploymentRisk {
    service: String,
}

impl GetDeploymentRisk {
    async fn execute(&self) -> ToolResult {
        let budget = get_error_budget(&self.service).await?;
        let recent_deploys = get_recent_deploys(&self.service).await?;
        let risk = match budget {
            b if b > 0.5 => "LOW - plenty of error budget remaining",
            b if b > 0.2 => "MEDIUM - error budget is getting low",
            b if b > 0.0 => "HIGH - very little error budget left",
            _ => "CRITICAL - error budget exhausted, consider freezing deploys",
        };
        ToolResult::text(format!(
            "Deployment risk for {}: {}\nBudget remaining: {:.1}%\nRecent deploys: {}",
            self.service, risk, budget * 100.0, recent_deploys
        ))
    }
}
With this MCP server running, you can configure Claude Code or any MCP-compatible client to connect to it. Then you get natural language access to your SRE data:
- “What’s the error budget for tr-web?” → Queries Prometheus, returns remaining budget
- “Is it safe to deploy right now?” → Checks error budget + recent incidents
- “Which SLOs are at risk this week?” → Lists SLOs with high burn rates
- “Show me the latency trend for the last 24h” → Queries Prometheus and summarizes
You can also build MCP tools that integrate with ArgoCD:
// These tools shell out to the argocd CLI; they assume `use tokio::process::Command;`
// and an `ArgoApp` struct derived with serde that mirrors `argocd app get -o json`.
#[derive(Tool)]
#[tool(name = "argocd_sync_status", description = "Check ArgoCD sync status for SLO configs")]
struct ArgoCDSyncStatus;

impl ArgoCDSyncStatus {
    async fn execute(&self) -> ToolResult {
        let output = Command::new("argocd")
            .args(["app", "get", "slo-configs", "-o", "json"])
            .output()
            .await?;
        let app: ArgoApp = serde_json::from_slice(&output.stdout)?;
        ToolResult::text(format!(
            "SLO configs sync status: {}\nHealth: {}\nSynced revision: {}",
            app.status.sync.status,
            app.status.health.status,
            app.status.sync.compared_to.revision
        ))
    }
}

#[derive(Tool)]
#[tool(name = "rollback_deployment", description = "Rollback a service deployment via ArgoCD")]
struct RollbackDeployment {
    service: String,
    history_id: Option<String>, // from `argocd app history <app>`; omit to roll back to the previous version
}

impl RollbackDeployment {
    async fn execute(&self) -> ToolResult {
        // This would be gated behind confirmation in a real setup
        let mut args = vec!["app".to_string(), "rollback".to_string(), self.service.clone()];
        if let Some(id) = &self.history_id {
            args.push(id.clone());
        }
        let _output = Command::new("argocd").args(&args).output().await?;
        ToolResult::text(format!("Rollback initiated for {}", self.service))
    }
}
The MCP server config in your Claude Code settings would look something like:
{
  "mcpServers": {
    "sre-tools": {
      "command": "mcp-sre-server",
      "args": ["--prometheus-url", "http://prometheus:9090", "--argocd-url", "https://argocd.example.com"],
      "env": {
        "ARGOCD_AUTH_TOKEN": "your-token-here"
      }
    }
  }
}
Automations that tie it all together
The real power comes when you combine SLOs, ArgoCD, and MCP servers into automated workflows. Here are some patterns that work well in practice:
1. Automated deployment gates
Use error budgets as deployment gates. If the error budget is below a threshold, block deployments automatically:
# In your CI pipeline
- name: Check error budget
  run: |
    BUDGET=$(curl -s "http://prometheus:9090/api/v1/query?query=error_budget_remaining{service='tr-web'}" \
      | jq -r '.data.result[0].value[1]')
    if (( $(echo "$BUDGET < 0.1" | bc -l) )); then
      echo "Error budget below 10%, blocking deployment"
      exit 1
    fi
2. Automated incident creation
When an SLO is breached, automatically create an issue or incident:
# alertmanager-config.yaml
receivers:
  - name: slo-breach
    webhook_configs:
      - url: http://incident-bot:8080/create
        send_resolved: true
route:
  routes:
    - match:
        severity: critical
        type: slo_breach
      receiver: slo-breach
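The incident-bot endpoint above can be anything that understands the Alertmanager webhook payload. As a rough sketch (assuming axum 0.7 and tokio; the actual incident creation is left as a println!), a minimal receiver could look like this:
// incident-bot: minimal Alertmanager webhook receiver (sketch).
use axum::{routing::post, Json, Router};
use serde::Deserialize;
use std::collections::HashMap;

#[derive(Deserialize)]
struct Alert {
    status: String,
    labels: HashMap<String, String>,
    annotations: HashMap<String, String>,
}

#[derive(Deserialize)]
struct WebhookPayload {
    status: String,
    alerts: Vec<Alert>,
}

async fn create_incident(Json(payload): Json<WebhookPayload>) {
    println!("received alert group with status {}", payload.status);
    for alert in &payload.alerts {
        let name = alert.labels.get("alertname").map(String::as_str).unwrap_or("unknown");
        let summary = alert.annotations.get("summary").map(String::as_str).unwrap_or("");
        // Here you would call your ticketing/incident tooling (Jira, PagerDuty, GitHub issues, ...).
        println!("[{}] SLO breach: {} - {}", alert.status, name, summary);
    }
}

#[tokio::main]
async fn main() {
    let app = Router::new().route("/create", post(create_incident));
    let listener = tokio::net::TcpListener::bind("0.0.0.0:8080").await.unwrap();
    axum::serve(listener, app).await.unwrap();
}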
3. Weekly SLO reports
Automate weekly SLO reporting to keep the team informed:
# A CronJob that queries Prometheus and sends a summary to Slack
apiVersion: batch/v1
kind: CronJob
metadata:
  name: slo-weekly-report
  namespace: monitoring
spec:
  schedule: "0 9 * * 1" # Every Monday at 9am
  jobTemplate:
    spec:
      template:
        spec:
          containers:
            - name: reporter
              image: kainlite/slo-reporter:latest
              env:
                - name: PROMETHEUS_URL
                  value: "http://prometheus:9090"
                - name: SLACK_WEBHOOK
                  valueFrom:
                    secretKeyRef:
                      name: slack-webhook
                      key: url
          restartPolicy: Never
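The kainlite/slo-reporter image is just a placeholder for whatever does the reporting. A stripped-down version of such a reporter (an assumed implementation, using reqwest with the json feature against the Prometheus HTTP API plus a Slack incoming webhook) might look like:
// slo-reporter: query Prometheus, post a summary to Slack (sketch, assuming reqwest + tokio + serde_json).
use serde_json::{json, Value};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let prometheus = std::env::var("PROMETHEUS_URL")?;
    let slack_webhook = std::env::var("SLACK_WEBHOOK")?;
    let client = reqwest::Client::new();

    // 30-day availability for tr-web, built on the recording rule defined earlier.
    // avg_over_time of the 5m ratios is an unweighted approximation, good enough for a report.
    let resp: Value = client
        .get(format!("{}/api/v1/query", prometheus))
        .query(&[("query", "avg_over_time(sli:availability:ratio_rate5m[30d])")])
        .send()
        .await?
        .json()
        .await?;

    let sli: f64 = resp["data"]["result"][0]["value"][1]
        .as_str()
        .unwrap_or("0")
        .parse()?;

    let message = format!(
        "Weekly SLO report for tr-web: 30d availability {:.3}% (objective 99.9%)",
        sli * 100.0
    );

    // Slack incoming webhooks accept a simple {"text": "..."} payload.
    client.post(slack_webhook).json(&json!({ "text": message })).send().await?;

    Ok(())
}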
4. Error budget-based feature freeze
This is one of the most powerful SRE patterns. When error budget is exhausted, the team should shift focus from features to reliability work:
- Budget > 50%: Ship features freely
- Budget 20-50%: Be cautious with risky changes
- Budget 5-20%: Focus on reliability improvements
- Budget < 5%: Feature freeze, all hands on reliability
You can automate this by having your MCP server update a status page or Slack channel with the current budget level, so everyone on the team knows where things stand without having to check dashboards.
Putting it all together
Here is a summary of what we built:
- SLIs as Prometheus metrics: Recording rules that calculate availability and latency ratios
- SLOs with Sloth: Declarative SLO definitions that generate multi-window, multi-burn-rate alerts
- GitOps with ArgoCD: SLO configs stored in Git, automatically synced to the cluster
- MCP servers: Natural language interface to query error budgets, check deployment risk, and manage ArgoCD
- Automations: Deployment gates, incident creation, weekly reports, and error budget policies
The beauty of this approach is that each piece is simple on its own, but together they create a system where reliability is measurable, automated, and part of the team’s daily workflow rather than an afterthought.
Closing notes
SRE does not have to be complicated. Start with one SLI for your most important service, set a reasonable SLO, and build from there. The tooling we covered (Prometheus, Sloth, ArgoCD, MCP servers) is all open source and battle-tested.
The key takeaway is this: measure what matters to your users, set targets, and let automation handle the rest. Your future self during the next on-call rotation will thank you.
Hope you found this useful and enjoyed reading it, until next time!
Errata
If you spot any error or have any suggestion, please send me a message so it gets fixed.
Also, you can check the source code and changes in the sources here