SRE: SLIs, SLOs, and Automations That Actually Help
We will explore how to define SLIs and SLOs as code, deploy them with ArgoCD, and use MCP servers to automate SRE workflows...
SRE: Incident Management, On-Call, and Postmortems as Code
We will explore how to build an effective incident management workflow, set up on-call rotations that don't burn people out, write runbooks as code, and run blameless postmortems...
SRE: Observability Deep Dive: Traces, Logs, and Metrics
We will explore the three pillars of observability, how to instrument your applications with OpenTelemetry, build useful dashboards in Grafana, and set up log aggregation that actually helps during incidents...
SRE: Chaos Engineering, Breaking Things on Purpose
We will explore chaos engineering in Kubernetes using Litmus and Chaos Mesh, how to plan and run game days, and why breaking things on purpose is the best way to build reliable systems...