DevOps from Zero to Hero: Observability in Kubernetes

2026-06-02 | Gabriel Garrido | 23 min read

Introduction

Welcome to article fifteen of the DevOps from Zero to Hero series. In the previous articles we deployed our TypeScript API to Kubernetes and packaged it with Helm. Everything is running, the pods are green, and life is good. But then someone asks: "Is the API actually healthy? How do we know if response times are getting worse? What happened at 3am when users started complaining?"

Without observability, you are flying blind. You deployed your app, but you have no idea what is happening inside it. Observability gives you the ability to understand the internal state of your system by examining the data it produces. It is the difference between "something is broken" and "the /orders endpoint is returning 500 errors because the database connection pool is exhausted."

In this article we will cover the three pillars of observability (logs, metrics, and traces), set up Prometheus and Grafana on EKS using Helm, build a basic dashboard, instrument our TypeScript API with structured logging and a metrics endpoint, configure a simple alert, and walk through the observability workflow you will use during real incidents. This is a beginner-friendly introduction. If you want to go deeper into topics like SLO-based alerting, Loki for log aggregation, or advanced OpenTelemetry patterns, check out the SRE Observability Deep Dive from the SRE series.

Let's get into it.

The three pillars of observability

Observability is built on three types of telemetry data. Each one answers a different question, and you need all three to debug production issues effectively.

Logs: Discrete events that tell you what happened. "Request abc123 failed with a 500 error at 14:32:05." Logs give you the richest context because they can include arbitrary details like request bodies, stack traces, and user IDs.

Metrics: Numerical measurements over time. "The API handled 150 requests per second with a p99 latency of 200ms." Metrics are cheap to store, fast to query, and perfect for dashboards and alerts.

Traces: The path a request takes through your system. "This request hit the API gateway, then the orders service, then the database, and the slow part was the database query." Traces are essential when you have multiple services talking to each other.

Think of it this way: metrics tell you something is wrong, traces tell you where in the system it is wrong, and logs tell you why it is wrong. Here is the flow:

# The observability workflow during an incident:
#
# 1. ALERT (from metrics): "Error rate > 5% for the last 5 minutes"
#    -> You know SOMETHING is wrong
#
# 2. DASHBOARD (metrics): Check Grafana, see /orders endpoint has high error rate
#    -> You know WHAT is wrong
#
# 3. TRACES: Find failing requests, see they all fail at the database call
#    -> You know WHERE it is wrong
#
# 4. LOGS: Check the database service logs: "ERROR: too many connections"
#    -> You know WHY it is wrong

We will cover each pillar in detail, starting with logs because they are the most familiar.

Logs: structured logging

If you have ever used console.log("something broke") in production, you know the problem. When you have thousands of log lines flowing through your system, finding the relevant one is like searching for a needle in a haystack. Unstructured logs (plain text strings) are hard to search, hard to filter, and hard to aggregate.

Structured logging solves this by writing logs as JSON objects with consistent fields. Instead of:

[2026-06-02 14:32:05] ERROR: Failed to process order 12345 for user user@example.com

You write:

{
  "timestamp": "2026-06-02T14:32:05.123Z",
  "level": "error",
  "message": "Failed to process order",
  "orderId": "12345",
  "userId": "user@example.com",
  "service": "orders-api",
  "traceId": "abc123def456",
  "duration_ms": 1523
}

Now you can search for all errors related to a specific user, a specific order, or a specific trace. You can count how many errors happened per service. You can correlate logs with traces using the traceId field. This is the power of structured logging.
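To make that concrete, here is a minimal sketch of the kind of filtering and counting a log backend can do once every entry is JSON. The LogEntry shape mirrors the example above; the sample entries, the user IDs, and the helper names (parseLogs, errorsForUser, countErrorsByService) are assumptions made up for illustration, not part of any library.

```typescript
// Illustrative sketch: an in-memory stand-in for a log query engine.
interface LogEntry {
  timestamp: string;
  level: "debug" | "info" | "warn" | "error";
  message: string;
  service: string;
  userId?: string;
  traceId?: string;
}

// Parse newline-delimited JSON, skipping blank lines.
function parseLogs(raw: string): LogEntry[] {
  return raw
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as LogEntry);
}

// "Show me all errors for this user."
function errorsForUser(entries: LogEntry[], userId: string): LogEntry[] {
  return entries.filter((e) => e.level === "error" && e.userId === userId);
}

// "How many errors happened per service?"
function countErrorsByService(entries: LogEntry[]): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const e of entries) {
    if (e.level !== "error") continue;
    counts[e.service] = (counts[e.service] ?? 0) + 1;
  }
  return counts;
}

// Hypothetical sample data, serialized the way a JSON logger would emit it.
const sampleLogs: string = [
  { timestamp: "2026-06-02T14:32:05.123Z", level: "error", message: "Failed to process order", service: "orders-api", userId: "u-1", traceId: "abc123def456" },
  { timestamp: "2026-06-02T14:32:06.001Z", level: "info", message: "Order listed", service: "orders-api", userId: "u-1" },
  { timestamp: "2026-06-02T14:32:07.450Z", level: "error", message: "Failed to process order", service: "orders-api", userId: "u-2" },
  { timestamp: "2026-06-02T14:32:08.902Z", level: "error", message: "Card declined", service: "payments-api", userId: "u-1" },
]
  .map((entry) => JSON.stringify(entry))
  .join("\n");

const parsedEntries = parseLogs(sampleLogs);
const userErrors = errorsForUser(parsedEntries, "u-1"); // the two error entries for u-1
const errorsPerService = countErrorsByService(parsedEntries); // orders-api: 2, payments-api: 1
```

None of these questions can be answered reliably against free-form text lines without fragile string matching, which is the practical argument for structured logs.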

Log levels define the severity of a log entry. Use them consistently:

error: Something failed and needs attention. A request returned a 500, a database query timed out, an external API is unreachable.

warn: Something unexpected happened but the system handled it. A retry succeeded, a cache miss occurred, a deprecated endpoint was called.

info: Normal operations worth recording. A request was processed successfully, a user logged in, a background job completed.

debug: Detailed information useful during development. Request payloads, SQL queries, internal state. Disable this in production unless you are actively debugging.
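Under the hood, a log level is just a threshold comparison. The sketch below uses the numeric severities that pino assigns to these four levels (debug 20, info 30, warn 40, error 50); shouldLog is a hypothetical helper written for illustration, not a pino function.

```typescript
// Numeric severities used by pino for these levels.
const LEVEL_VALUES = { debug: 20, info: 30, warn: 40, error: 50 } as const;
type Level = keyof typeof LEVEL_VALUES;

// Hypothetical helper: an entry is written only if its severity is at or
// above the configured threshold.
function shouldLog(entryLevel: Level, threshold: Level): boolean {
  return LEVEL_VALUES[entryLevel] >= LEVEL_VALUES[threshold];
}

const allLevels: Level[] = ["debug", "info", "warn", "error"];

// With the threshold at "info", debug entries are dropped.
const writtenAtInfo = allLevels.filter((l) => shouldLog(l, "info")); // info, warn, error

// With the threshold at "error", only errors get through.
const writtenAtError = allLevels.filter((l) => shouldLog(l, "error")); // error
```

This is why setting the level to info in production silences debug output without touching any of the logging calls in your code.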

Let's add structured logging to our TypeScript API using pino, one of the fastest JSON loggers for Node.js:

# Install pino and the pretty-printer for local development
npm install pino pino-http
npm install -D pino-pretty

// src/logger.ts
import pino from "pino";

const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  // In production, output raw JSON. Locally, use pino-pretty for readability.
  transport:
    process.env.NODE_ENV !== "production"
      ? { target: "pino-pretty", options: { colorize: true } }
      : undefined,
  // Add default fields to every log entry
  base: {
    service: "task-api",
    version: process.env.APP_VERSION || "unknown",
  },
});

export default logger;

// src/app.ts
import express from "express";
import pinoHttp from "pino-http";
import logger from "./logger";

const app = express();

// Automatically log every HTTP request with method, URL, status, and duration
app.use(pinoHttp({ logger }));

app.get("/tasks", async (req, res) => {
  try {
    const tasks = await db.query("SELECT * FROM tasks");
    // Info-level log with structured context
    logger.info({ taskCount: tasks.length }, "Tasks retrieved successfully");
    res.json(tasks);
  } catch (error) {
    // Error-level log with the error object and request context
    logger.error(
      { err: error, path: req.path, method: req.method },
      "Failed to retrieve tasks"
    );
    res.status(500).json({ error: "Internal server error" });
  }
});

With pino-http, every request automatically gets a log entry like this:

{
  "level": 30,
  "time": 1748870525123,
  "service": "task-api",
  "req": { "method": "GET", "url": "/tasks" },
  "res": { "statusCode": 200 },
  "responseTime": 45,
  "msg": "request completed"
}

This is exactly the kind of data you can search and filter in a log aggregation system like Loki, Elasticsearch, or CloudWatch Logs. You can query things like "show me all requests where responseTime > 1000" or "show me all error-level logs from the task-api service in the last hour."

Metrics: counting what matters

While logs tell you about individual events, metrics tell you about the overall behavior of your system over time. Metrics are numerical measurements collected at regular intervals.

There are three core metric types you need to know (Prometheus also has a fourth, the summary, which we will not use here):

Counter: A value that only goes up. Examples: total number of HTTP requests, total number of errors, total bytes transferred. You usually care about the rate of change (requests per second) rather than the raw value.

Gauge: A value that can go up and down. Examples: current CPU usage, memory usage, number of active connections, queue depth. Gauges represent the current state of something.

Histogram: Measures the distribution of values. Examples: request duration, response size. Histograms let you answer questions like "what is the 99th percentile latency?" which is far more useful than the average.
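The differences between the three types are easiest to see in code. This is a simplified sketch of their semantics only; the class names (SketchCounter, SketchGauge, SketchHistogram) are invented for this example and are not the prom-client API that appears later in the article.

```typescript
// Illustrative sketches of the three metric types (not a real client library).

class SketchCounter {
  private value = 0;

  inc(amount: number = 1): void {
    if (amount < 0) throw new Error("a counter can only increase");
    this.value += amount;
  }

  get(): number {
    return this.value;
  }
}

class SketchGauge {
  private value = 0;

  set(v: number): void { this.value = v; }
  inc(amount: number = 1): void { this.value += amount; }
  dec(amount: number = 1): void { this.value -= amount; }
  get(): number { return this.value; }
}

class SketchHistogram {
  private buckets: { le: number; count: number }[];
  private sum = 0;
  private count = 0;

  constructor(upperBounds: number[]) {
    // Always include a final +Inf bucket that catches every observation.
    this.buckets = [...upperBounds, Infinity].map((le) => ({ le, count: 0 }));
  }

  observe(v: number): void {
    this.sum += v;
    this.count += 1;
    // Buckets are cumulative: one observation increments every bucket
    // whose upper bound ("le") is greater than or equal to the value.
    for (const b of this.buckets) {
      if (v <= b.le) b.count += 1;
    }
  }

  snapshot(): { buckets: { le: number; count: number }[]; sum: number; count: number } {
    return { buckets: this.buckets.map((b) => ({ ...b })), sum: this.sum, count: this.count };
  }
}

const requestCounter = new SketchCounter();
requestCounter.inc();
requestCounter.inc(); // value is now 2 and can never go back down

const openConnections = new SketchGauge();
openConnections.set(5);
openConnections.dec(); // value is now 4

const latencySeconds = new SketchHistogram([0.05, 0.1, 0.25]);
for (const seconds of [0.03, 0.07, 0.2, 3]) latencySeconds.observe(seconds);
// Cumulative buckets: le=0.05 -> 1, le=0.1 -> 2, le=0.25 -> 3, le=+Inf -> 4
```

The cumulative bucket layout is what lets a metrics backend estimate percentiles later without storing every individual observation.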

Prometheus is the standard metrics system in the Kubernetes ecosystem. It works with a pull model: instead of your application pushing metrics to a server, Prometheus scrapes your application's metrics endpoint at regular intervals (usually every 15 or 30 seconds).

Here is how the flow works:

Your App (/metrics endpoint)
  |
  v
Prometheus (scrapes every 15s, stores time-series data, evaluates alert rules)
  |
  +--> Grafana (queries Prometheus, renders dashboards)
  |
  +--> Alertmanager (receives alerts from Prometheus, sends notifications)

Let's add a /metrics endpoint to our TypeScript API using the prom-client library:

npm install prom-client

// src/metrics.ts
import client from "prom-client";

// Create a registry to hold all metrics
const register = new client.Registry();

// Add default Node.js metrics (CPU, memory, event loop lag, etc.)
client.collectDefaultMetrics({ register });

// Custom counter: total HTTP requests, labeled by method, path, and status
export const httpRequestsTotal = new client.Counter({
  name: "http_requests_total",
  help: "Total number of HTTP requests",
  labelNames: ["method", "path", "status"] as const,
  registers: [register],
});

// Custom histogram: request duration in seconds
export const httpRequestDuration = new client.Histogram({
  name: "http_request_duration_seconds",
  help: "Duration of HTTP requests in seconds",
  labelNames: ["method", "path", "status"] as const,
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});

// Custom gauge: number of active database connections
export const dbActiveConnections = new client.Gauge({
  name: "db_active_connections",
  help: "Number of active database connections",
  registers: [register],
});

export { register };

// src/middleware/metrics.ts
import { Request, Response, NextFunction } from "express";
import { httpRequestsTotal, httpRequestDuration } from "../metrics";

export function metricsMiddleware(
  req: Request,
  res: Response,
  next: NextFunction
) {
  const start = Date.now();

  res.on("finish", () => {
    const duration = (Date.now() - start) / 1000;
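    // Prefer the route template (e.g. "/tasks/:id") over the raw URL so the
    // "path" label stays low-cardinality; req.path is only a fallback for
    // requests that matched no route.
    const path = req.route?.path || req.path;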
    const labels = {
      method: req.method,
      path: path,
      status: res.statusCode.toString(),
    };

    httpRequestsTotal.inc(labels);
    httpRequestDuration.observe(labels, duration);
  });

  next();
}
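One thing to watch in the middleware above: when no route matches, the raw URL ends up in the path label, and every distinct label value becomes its own time series in Prometheus. A possible mitigation, sketched here under the assumption that your IDs are numeric or UUIDs, is to collapse ID-like segments before using the path as a label. The helper name normalizePathLabel is hypothetical.

```typescript
// Hypothetical fallback: collapse ID-like path segments into a placeholder so
// unmatched URLs cannot create an unbounded number of "path" label values.
const NUMERIC_SEGMENT = /^\d+$/;
const UUID_SEGMENT = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;

function normalizePathLabel(rawPath: string): string {
  return rawPath
    .split("/")
    .map((segment) =>
      NUMERIC_SEGMENT.test(segment) || UUID_SEGMENT.test(segment) ? ":id" : segment
    )
    .join("/");
}

const labelForNumericId = normalizePathLabel("/tasks/123"); // "/tasks/:id"
const labelForUuid = normalizePathLabel(
  "/tasks/550e8400-e29b-41d4-a716-446655440000/comments"
); // "/tasks/:id/comments"
const labelForPlainPath = normalizePathLabel("/tasks"); // "/tasks"
```

In the middleware, this would replace the fallback: req.route?.path || normalizePathLabel(req.path). It is a heuristic, not a guarantee, so a fixed label such as "unmatched" is a stricter alternative.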

// src/app.ts - add the metrics endpoint and middleware
import { register } from "./metrics";
import { metricsMiddleware } from "./middleware/metrics";

// Apply metrics middleware to all routes
app.use(metricsMiddleware);

// Expose metrics for Prometheus to scrape
app.get("/metrics", async (_req, res) => {
  res.set("Content-Type", register.contentType);
  res.end(await register.metrics());
});

When Prometheus scrapes /metrics, it gets output like this:

# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",path="/tasks",status="200"} 1523
http_requests_total{method="POST",path="/tasks",status="201"} 47
http_requests_total{method="GET",path="/tasks",status="500"} 3

# HELP http_request_duration_seconds Duration of HTTP requests in seconds
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{method="GET",path="/tasks",status="200",le="0.05"} 1200
http_request_duration_seconds_bucket{method="GET",path="/tasks",status="200",le="0.1"} 1450
http_request_duration_seconds_bucket{method="GET",path="/tasks",status="200",le="0.25"} 1510
http_request_duration_seconds_bucket{method="GET",path="/tasks",status="200",le="+Inf"} 1523

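For Prometheus to discover this endpoint in Kubernetes, a common convention is to add annotations to your pod or service. Prometheus does not read these annotations by itself: they only work when a scrape job is configured to honor them. The kube-prometheus-stack installed in the next section discovers targets through ServiceMonitor and PodMonitor resources, so with that chart you need either a ServiceMonitor for your service or an annotation-based job in prometheus.prometheusSpec.additionalScrapeConfigs. The annotations themselves look like this: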

# In your Helm chart's deployment template or values
metadata:
  annotations:
    prometheus.io/scrape: "true"
    prometheus.io/port: "3000"
    prometheus.io/path: "/metrics"

Installing Prometheus and Grafana on EKS

The easiest way to get Prometheus and Grafana running on Kubernetes is the kube-prometheus-stack Helm chart. This single chart installs Prometheus, Grafana, Alertmanager, node-exporter (for host metrics), kube-state-metrics (for Kubernetes object metrics), and a bunch of pre-configured dashboards and alerting rules.

# Add the Prometheus community Helm repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Create a namespace for monitoring
kubectl create namespace monitoring

# Install the kube-prometheus-stack
helm install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set grafana.adminPassword=your-secure-password \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=20Gi

That is it. A single Helm command and you have a full monitoring stack. Let's verify everything is running:

# Check all pods in the monitoring namespace
kubectl get pods -n monitoring

# Expected output:
# NAME                                                     READY   STATUS    RESTARTS   AGE
# alertmanager-monitoring-kube-prometheus-alertmanager-0    2/2     Running   0          2m
# monitoring-grafana-6c4f8d5b7-x2k4f                      3/3     Running   0          2m
# monitoring-kube-prometheus-operator-7d9f5b8c9-abc12      1/1     Running   0          2m
# monitoring-kube-state-metrics-5f8d9b7c6-def34            1/1     Running   0          2m
# monitoring-prometheus-node-exporter-ghij5                1/1     Running   0          2m
# prometheus-monitoring-kube-prometheus-prometheus-0        2/2     Running   0          2m

To access Grafana locally, use port-forwarding:

# Forward Grafana to localhost:3001
kubectl port-forward svc/monitoring-grafana 3001:80 -n monitoring

# Open http://localhost:3001 in your browser
# Login: admin / your-secure-password

For production, you would expose Grafana through an Ingress with TLS. Here is a quick values file for a production-like setup:

# monitoring-values.yaml
grafana:
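  # Helm does not expand ${VAR} placeholders in values files, so read the
  # admin credentials from an existing Secret instead of inlining them.
  # "grafana-admin" is a placeholder name for a Secret you create yourself.
  admin:
    existingSecret: grafana-admin
    userKey: admin-user
    passwordKey: admin-password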
  ingress:
    enabled: true
    ingressClassName: alb
    hosts:
      - grafana.yourdomain.com
    tls:
      - secretName: grafana-tls
        hosts:
          - grafana.yourdomain.com

prometheus:
  prometheusSpec:
    retention: 15d
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 50Gi
    # Select ServiceMonitors and PodMonitors from any Helm release, not only
    # the ones labeled with this chart's release name
    podMonitorSelectorNilUsesHelmValues: false
    serviceMonitorSelectorNilUsesHelmValues: false

alertmanager:
  alertmanagerSpec:
    storage:
      volumeClaimTemplate:
        spec:
          storageClassName: gp3
          resources:
            requests:
              storage: 5Gi

# Install with the production values
helm upgrade --install monitoring prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  -f monitoring-values.yaml

PromQL basics: querying your metrics

PromQL is the query language for Prometheus. It looks strange at first, but you only need to learn a handful of patterns to cover most use cases.

Instant vector - select the current value of a metric:

# All HTTP requests from the task-api
http_requests_total{service="task-api"}

# Only 500 errors
http_requests_total{service="task-api", status="500"}

Rate - the most important function. Calculates the per-second rate of increase for counters over a time window:

# Requests per second over the last 5 minutes
rate(http_requests_total[5m])

# Error rate (500s only) per second
rate(http_requests_total{status="500"}[5m])
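To build intuition for what rate() computes, here is a simplified sketch. It adds up the increases between consecutive samples, treats any drop in the counter as a restart, and divides by the elapsed time. This is an approximation for illustration: the real function also extrapolates to the edges of the window, which this sketch leaves out. The Sample type and simpleRate name are assumptions of this example.

```typescript
// One scraped counter sample: timestamp in seconds and the counter value.
interface Sample {
  t: number;
  v: number;
}

// Simplified per-second rate over a series of samples.
function simpleRate(samples: Sample[]): number {
  let first: Sample | undefined;
  let prev: Sample | undefined;
  let increase = 0;

  for (const s of samples) {
    if (first === undefined) first = s;
    if (prev !== undefined) {
      // A counter only goes up, so a lower value means the process restarted
      // and the counter began again from zero.
      increase += s.v >= prev.v ? s.v - prev.v : s.v;
    }
    prev = s;
  }

  // At least two samples with different timestamps are needed.
  if (first === undefined || prev === undefined || prev.t <= first.t) return NaN;
  return increase / (prev.t - first.t);
}

// Hypothetical scrapes every 15 seconds, with a restart before t=45.
const scrapes: Sample[] = [
  { t: 0, v: 100 },
  { t: 15, v: 130 },
  { t: 30, v: 160 },
  { t: 45, v: 10 }, // counter reset
  { t: 60, v: 40 },
];

// Increases: 30 + 30 + 10 + 30 = 100 over 60 seconds.
const requestsPerSecond = simpleRate(scrapes);
```

The reset handling is the reason you should wrap counters in rate() rather than graphing their raw values: a pod restart would otherwise show up as a cliff in the graph.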

Aggregation - combine multiple time series:

# Total requests per second across all instances
sum(rate(http_requests_total[5m]))

# Requests per second grouped by status code
sum by (status) (rate(http_requests_total[5m]))

# Error percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

Histogram quantiles - calculate percentiles:

# p99 latency (99th percentile)
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# p50 latency (median)
histogram_quantile(0.50, rate(http_request_duration_seconds_bucket[5m]))

# p99 latency per endpoint
histogram_quantile(0.99, sum by (path, le) (rate(http_request_duration_seconds_bucket[5m])))
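It helps to know that histogram_quantile does not see individual request durations. It finds the bucket the requested rank falls into and interpolates linearly inside it. The sketch below reproduces that arithmetic on the raw bucket counts from the /metrics excerpt shown earlier. It is a simplification: the real function operates on per-second bucket rates and handles more edge cases, and the Bucket type and histogramQuantile name are assumptions of this example.

```typescript
// A cumulative bucket: "count" observations were less than or equal to "le".
interface Bucket {
  le: number;
  count: number;
}

// Simplified quantile estimate; assumes positive bucket bounds and a final
// +Inf bucket holding the total count.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => (a.le < b.le ? -1 : a.le > b.le ? 1 : 0));
  const last = sorted[sorted.length - 1];
  if (last === undefined || last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  let lowerBound = 0;
  let lowerCount = 0;

  for (const b of sorted) {
    if (b.count >= rank) {
      // If the rank lands in the +Inf bucket, the best available answer is
      // the upper bound of the highest finite bucket.
      if (b.le === Infinity) return lowerBound;
      const inBucket = b.count - lowerCount;
      if (inBucket === 0) return lowerBound;
      // Linear interpolation between the bucket's lower and upper bound.
      return lowerBound + (b.le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = b.le;
    lowerCount = b.count;
  }
  return NaN;
}

// Bucket counts taken from the /metrics excerpt for GET /tasks.
const taskBuckets: Bucket[] = [
  { le: 0.05, count: 1200 },
  { le: 0.1, count: 1450 },
  { le: 0.25, count: 1510 },
  { le: Infinity, count: 1523 },
];

// Rank 0.99 * 1523 = 1507.77 falls in the (0.1, 0.25] bucket:
// 0.1 + 0.15 * (1507.77 - 1450) / 60, roughly 0.244 seconds.
const p99Seconds = histogramQuantile(0.99, taskBuckets);
```

Because the result is interpolated, its accuracy depends on the bucket boundaries you chose. If most requests land in one wide bucket, the estimate can be far from the true percentile.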

Here are some queries you will use all the time:

# CPU usage by pod (percentage of one core; container!="" skips the pod-level aggregate series)
sum by (pod) (rate(container_cpu_usage_seconds_total{namespace="task-api", container!=""}[5m])) * 100

# Memory usage by pod (MiB)
sum by (pod) (container_memory_working_set_bytes{namespace="task-api", container!=""}) / 1024 / 1024

# Pod restarts (a restart usually means something crashed)
increase(kube_pod_container_status_restarts_total{namespace="task-api"}[1h])

# Available replicas vs desired replicas (are all pods healthy?)
kube_deployment_status_replicas_available{namespace="task-api"}
/
kube_deployment_spec_replicas{namespace="task-api"}

Building a Grafana dashboard

Grafana comes with hundreds of pre-built dashboards you can import. For Kubernetes, the kube-prometheus-stack already includes dashboards for node metrics, pod metrics, and cluster overview. But you will also want a custom dashboard for your application.

Importing a community dashboard:

1. Open Grafana and go to Dashboards > Import.
2. Enter a dashboard ID from grafana.com/dashboards. For example, dashboard 315 is a popular Kubernetes cluster monitoring dashboard.
3. Select your Prometheus data source and click Import.

That gives you a ready-made dashboard in seconds. Now let's build a custom one for our API.

Creating a custom dashboard:

1. Go to Dashboards > New Dashboard > Add visualization.
2. Select your Prometheus data source.
3. For the first panel, enter this PromQL query:

sum by (status) (rate(http_requests_total{service="task-api"}[5m]))

4. Set the panel title to "Request Rate by Status Code".
5. Choose the "Time series" visualization type.
6. Under Legend, set it to {{status}} so each line is labeled with its status code.

Add more panels for the metrics that matter most:

Request rate: sum(rate(http_requests_total{service="task-api"}[5m])) as a stat panel showing total RPS.

Error rate percentage: The error percentage query from earlier, displayed as a gauge with thresholds (green < 1%, yellow < 5%, red >= 5%).

p99 latency: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="task-api"}[5m]))) as a time series chart.

Active database connections: db_active_connections{service="task-api"} as a gauge.

Pod CPU and memory: The container queries from the previous section.

A good dashboard follows the USE method (Utilization, Saturation, Errors) or the RED method (Rate, Errors, Duration). For an API, the RED method is the most practical:

RED Dashboard Layout:
+---------------------+-------------------+--------------------+
| Request Rate (RPS)  | Error Rate (%)    | p99 Latency (ms)   |
| [stat panel]        | [gauge panel]     | [stat panel]       |
+---------------------+-------------------+--------------------+
| Request Rate by Status Code (time series)                    |
+--------------------------------------------------------------+
| Latency Distribution: p50, p90, p99 (time series)            |
+--------------------------------------------------------------+
| Error Log Stream (if using Loki)                             |
+--------------------------------------------------------------+

Once you are happy with the dashboard, save it and note the JSON model. You can export it and store it in your Git repository so it can be provisioned automatically. The kube-prometheus-stack supports dashboard provisioning through ConfigMaps:

# grafana-dashboard-configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: task-api-dashboard
  namespace: monitoring
  labels:
    grafana_dashboard: "1"
data:
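  # The sidecar loads this file as-is, so it must be the dashboard JSON model
  # itself (as exported from Grafana), without a "dashboard" wrapper object.
  task-api.json: |
    {
      "title": "Task API",
      "panels": [ ... ]
    }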

Traces: following a request across services

Logs tell you what happened in a single service. Traces tell you what happened across multiple services for a single request. Every trace is made up of spans, and each span represents a unit of work: an HTTP handler, a database query, an external API call.

Here is what a trace looks like:

Trace ID: abc123def456
|
|-- Span: API Gateway (200ms)
|   |-- Span: Authentication middleware (2ms)
|   |-- Span: Orders Service HTTP call (180ms)
|       |-- Span: Database query: SELECT * FROM orders (150ms)  <-- the bottleneck!
|       |-- Span: Cache write (3ms)
|
Total duration: 200ms

Without tracing, you would see that the API Gateway took 200ms but you would have no idea that the bottleneck was a slow database query inside the Orders Service. With tracing, you can see the exact breakdown.
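A trace viewer finds the bottleneck by looking at self time: a span's duration minus the time spent in its children. The sketch below applies that idea to the trace above, taking the gateway span as the full 200ms. It assumes children run one after another rather than in parallel, and the TraceSpan type and function names are invented for this example.

```typescript
// A span with its total duration and the spans it directly caused.
interface TraceSpan {
  label: string;
  durationMs: number;
  children: TraceSpan[];
}

// Time spent in the span itself, excluding its children.
// Assumes children run sequentially, which is not always true in practice.
function selfTimeMs(span: TraceSpan): number {
  let childTotal = 0;
  for (const child of span.children) childTotal += child.durationMs;
  return span.durationMs - childTotal;
}

// Walk the tree and return the span with the largest self time.
function findBottleneck(root: TraceSpan): { label: string; selfMs: number } {
  let worst = { label: root.label, selfMs: selfTimeMs(root) };
  for (const child of root.children) {
    const candidate = findBottleneck(child);
    if (candidate.selfMs > worst.selfMs) worst = candidate;
  }
  return worst;
}

const exampleTrace: TraceSpan = {
  label: "API Gateway",
  durationMs: 200,
  children: [
    { label: "Authentication middleware", durationMs: 2, children: [] },
    {
      label: "Orders Service HTTP call",
      durationMs: 180,
      children: [
        { label: "Database query", durationMs: 150, children: [] },
        { label: "Cache write", durationMs: 3, children: [] },
      ],
    },
  ],
};

// Self times: gateway 18ms, orders call 27ms, database query 150ms.
const bottleneck = findBottleneck(exampleTrace);
```

The gateway has the longest total duration, but almost all of it is inherited from the database query two levels down, which is exactly what total durations alone would hide.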

OpenTelemetry (OTel) is the standard for instrumenting applications with traces (and metrics and logs). It provides SDKs for every major language and a vendor-neutral way to export telemetry data. Let's add basic tracing to our TypeScript API:

# Install OpenTelemetry packages
npm install @opentelemetry/api \
  @opentelemetry/sdk-node \
  @opentelemetry/auto-instrumentations-node \
  @opentelemetry/exporter-trace-otlp-http \
  @opentelemetry/resources \
  @opentelemetry/semantic-conventions

// src/tracing.ts - must be imported before anything else
import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-http";
// resourceFromAttributes is the SDK 2.x API; on SDK 1.x use `new Resource({...})`
import { resourceFromAttributes } from "@opentelemetry/resources";
import {
  ATTR_SERVICE_NAME,
  ATTR_SERVICE_VERSION,
} from "@opentelemetry/semantic-conventions";

const sdk = new NodeSDK({
  resource: resourceFromAttributes({
    [ATTR_SERVICE_NAME]: "task-api",
    [ATTR_SERVICE_VERSION]: process.env.APP_VERSION || "0.1.0",
  }),
  traceExporter: new OTLPTraceExporter({
    // Send traces to an OTel Collector or Jaeger. The `url` option is used
    // verbatim, so it must be the full traces endpoint including /v1/traces.
    url:
      process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT ||
      "http://otel-collector:4318/v1/traces",
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      // Auto-instrument Express, HTTP, and database clients
      "@opentelemetry/instrumentation-express": { enabled: true },
      "@opentelemetry/instrumentation-http": { enabled: true },
      "@opentelemetry/instrumentation-pg": { enabled: true },
    }),
  ],
});

sdk.start();
console.log("OpenTelemetry tracing initialized");

// Graceful shutdown: flush pending spans, then exit even if flushing fails
process.on("SIGTERM", () => {
  sdk
    .shutdown()
    .catch((err) => console.error("Error shutting down tracing", err))
    .finally(() => process.exit(0));
});

// src/index.ts - import tracing FIRST
import "./tracing";
import app from "./app";

const port = process.env.PORT || 3000;
app.listen(port, () => {
  console.log(`Server running on port ${port}`);
});

With auto-instrumentation, every incoming HTTP request, outgoing HTTP call, and database query automatically gets a span. The SDK propagates the trace context through HTTP headers (traceparent), so when service A calls service B, both services' spans are linked under the same trace ID.
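The traceparent header follows the W3C Trace Context format: version, trace ID, parent span ID, and flags, separated by dashes. The SDK reads and writes this header for you, so the sketch below is only meant to show what travels between services. The TraceParent type and both function names are assumptions of this example, and the span ID passed to childTraceparent is a made-up value.

```typescript
// The parsed pieces of a W3C traceparent header.
interface TraceParent {
  version: string;
  traceId: string; // 32 hex characters, shared by every span in the trace
  parentSpanId: string; // 16 hex characters, the caller's span
  sampled: boolean; // lowest bit of the flags byte
}

const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;
const ALL_ZEROS = /^0+$/;

function parseTraceparent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (match === null) return null;

  const version = match[1];
  const traceId = match[2];
  const parentSpanId = match[3];
  const flags = match[4];
  if (!version || !traceId || !parentSpanId || !flags) return null;

  // "ff" is a forbidden version, and all-zero IDs are invalid.
  if (version === "ff") return null;
  if (ALL_ZEROS.test(traceId) || ALL_ZEROS.test(parentSpanId)) return null;

  return { version, traceId, parentSpanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Build the header service A sends to service B: same trace ID, but the
// parent is now A's own span.
function childTraceparent(incoming: TraceParent, ownSpanId: string): string {
  return ["00", incoming.traceId, ownSpanId, incoming.sampled ? "01" : "00"].join("-");
}

const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const parsedHeader = parseTraceparent(incomingHeader);
const outgoingHeader =
  parsedHeader === null ? null : childTraceparent(parsedHeader, "b7ad6b7169203331");
// outgoingHeader: "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01"
```

The trace ID never changes along the way, which is what allows a backend to stitch spans from different services into one trace.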

For custom spans when you need more detail:

// src/services/orders.ts
// Note: `db` and `paymentService` are assumed to be imported from elsewhere
// in the application; they are not defined in this snippet.
import { trace, SpanStatusCode } from "@opentelemetry/api";

const tracer = trace.getTracer("task-api");

export async function processOrder(orderId: string) {
  // Create a custom span for this operation
  return tracer.startActiveSpan("processOrder", async (span) => {
    try {
      span.setAttribute("order.id", orderId);

      // Each sub-operation can have its own span
      const order = await tracer.startActiveSpan(
        "fetchOrder",
        async (fetchSpan) => {
          try {
            return await db.query("SELECT * FROM orders WHERE id = $1", [
              orderId,
            ]);
          } finally {
            // End the span even if the query throws
            fetchSpan.end();
          }
        }
      );

      await tracer.startActiveSpan(
        "validatePayment",
        async (paymentSpan) => {
          try {
            await paymentService.validate(order.paymentId);
          } finally {
            paymentSpan.end();
          }
        }
      );

      span.setAttribute("order.status", "processed");
      return order;
    } catch (error) {
      span.recordException(error as Error);
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: (error as Error).message,
      });
      throw error;
    } finally {
      // Runs on both the success and the error path
      span.end();
    }
  });
}

To view traces, you need a trace backend. For development, Jaeger is the easiest to set up:

# Run Jaeger locally with Docker
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# Open http://localhost:16686 to view traces

In a Kubernetes cluster, you can deploy Jaeger alongside the OpenTelemetry Collector using the Jaeger Operator or a Helm chart. The kube-prometheus-stack does not include tracing out of the box, but Grafana can connect to Jaeger as a data source and display traces alongside your metrics dashboards.

The observability workflow in practice

Let's walk through a realistic scenario to see how all three pillars work together.

Scenario: Users report that creating tasks is slow.

Step 1: Check the dashboard. Open your Grafana RED dashboard. You notice that the p99 latency for POST /tasks has jumped from 100ms to 3 seconds in the last 30 minutes. The error rate is still low, so requests are succeeding but they are slow.

Step 2: Narrow down with metrics. Add a PromQL query to check if the problem is specific to one pod or all pods:

histogram_quantile(0.99,
  sum by (pod, le) (
    rate(http_request_duration_seconds_bucket{path="/tasks", method="POST"}[5m])
  )
)

All pods show the same slow latency, so the issue is not a single unhealthy pod.

Step 3: Find a slow trace. Go to Jaeger (or Grafana Tempo) and search for traces where the operation is POST /tasks and the duration is greater than 2 seconds. You find several traces and open one. The trace shows:

POST /tasks (3.1s)
  |-- Express middleware (2ms)
  |-- insertTask (3.05s)
      |-- pg.query: INSERT INTO tasks... (3.04s)  <-- the problem

The database INSERT is taking 3 seconds. That is abnormal.

Step 4: Check the logs. Search your logs for database-related warnings and errors in the last 30 minutes:

{
  "level": "warn",
  "message": "Slow query detected",
  "query": "INSERT INTO tasks...",
  "duration_ms": 3041,
  "service": "task-api",
  "connection_pool_active": 19,
  "connection_pool_max": 20
}

The connection pool is almost full. You check further and find that a background job that runs every 30 minutes is holding connections open longer than expected. You fix the background job, and latency returns to normal.

This is the observability workflow: alert or symptom, dashboard, trace, logs, root cause. Each pillar narrowed the problem until you found the answer.

Alerting basics

Dashboards are useful for investigation, but you need alerts to know when something is wrong before your users tell you. Prometheus supports alerting rules that evaluate PromQL expressions and fire alerts when conditions are met.

Here is a PrometheusRule resource for a simple alert:

# alert-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: task-api-alerts
  namespace: monitoring
  labels:
    release: monitoring  # Must match the kube-prometheus-stack release name
spec:
  groups:
    - name: task-api
      rules:
        # Alert when error rate exceeds 5% for 5 minutes
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{service="task-api", status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{service="task-api"}[5m]))
            > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High error rate on task-api"
            description: >
              The task-api error rate is {{ $value | humanizePercentage }}
              over the last 5 minutes.

        # Alert when p99 latency exceeds 1 second for 10 minutes
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99,
              sum by (le) (rate(http_request_duration_seconds_bucket{service="task-api"}[5m]))
            ) > 1
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "High p99 latency on task-api"
            description: >
              The task-api p99 latency is {{ $value | humanizeDuration }}
              over the last 5 minutes.

        # Alert when a pod has restarted more than 3 times in an hour
        - alert: PodCrashLooping
          expr: |
            increase(kube_pod_container_status_restarts_total{
              namespace="task-api"
            }[1h]) > 3
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod crash-looping in task-api namespace"
            description: >
              Pod {{ $labels.pod }} has restarted {{ $value }} times
              in the last hour.

Apply the rule and Prometheus picks it up automatically:

kubectl apply -f alert-rules.yaml
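The for field in each rule is what keeps short spikes from paging you. An alert whose expression becomes true first enters a pending state and only fires once the expression has stayed true for the whole duration. Here is a minimal sketch of that state machine, assuming one evaluation per minute; the type and function names are invented for this example.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// One rule evaluation: when it ran (in seconds) and whether the
// alert expression returned a result.
interface RuleEvaluation {
  t: number;
  conditionTrue: boolean;
}

// Replay a series of evaluations and return the alert state after each one.
function alertStates(evaluations: RuleEvaluation[], forSeconds: number): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | undefined;

  for (const e of evaluations) {
    if (!e.conditionTrue) {
      // Any evaluation where the condition is false resets the timer.
      activeSince = undefined;
      states.push("inactive");
      continue;
    }
    if (activeSince === undefined) activeSince = e.t;
    states.push(e.t - activeSince >= forSeconds ? "firing" : "pending");
  }
  return states;
}

// Hypothetical timeline: a brief recovery at t=120 resets the 5-minute timer.
const timeline: RuleEvaluation[] = [
  { t: 0, conditionTrue: true },
  { t: 60, conditionTrue: true },
  { t: 120, conditionTrue: false },
  { t: 180, conditionTrue: true },
  { t: 240, conditionTrue: true },
  { t: 300, conditionTrue: true },
  { t: 360, conditionTrue: true },
  { t: 420, conditionTrue: true },
  { t: 480, conditionTrue: true },
];

// With for: 5m the alert fires only at t=480, five minutes after t=180.
const statesOverTime = alertStates(timeline, 300);
```

A longer for value means fewer false alarms but a slower page, so pick it based on how long your users can tolerate the problem.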

Alertmanager receives alerts from Prometheus and routes them to the right destination: Slack, PagerDuty, email, or a webhook. The kube-prometheus-stack includes Alertmanager. Here is a basic configuration that sends alerts to a Slack channel:

# In your monitoring-values.yaml, add Alertmanager configuration
alertmanager:
  config:
    global:
      slack_api_url: "https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK"
    route:
      receiver: "slack-notifications"
      group_by: ["alertname", "namespace"]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: "slack-notifications"
        slack_configs:
          - channel: "#alerts"
            send_resolved: true
            title: '{{ .GroupLabels.alertname }}'
            text: >-
              {{ range .Alerts }}
              *{{ .Annotations.summary }}*
              {{ .Annotations.description }}
              {{ end }}

The key settings to understand:

group_by: Groups related alerts together so you get one notification instead of fifty when something goes wrong.

group_wait: How long to wait before sending the first notification after a group is created. Gives time for related alerts to arrive and get grouped.

group_interval: How long to wait before sending a notification about new alerts that were added to a group that has already been notified.

repeat_interval: How often to re-send an unresolved alert. You do not want to get paged every 30 seconds for the same issue.

send_resolved: Sends a notification when the alert clears. Nice to know when the problem is fixed without checking manually.
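Grouping is the setting that most directly cuts notification noise. Conceptually, alerts that share the same values for the group_by labels end up in one group, and each group produces one notification. A minimal sketch of that idea follows; the FiringAlert type, the groupAlerts function, and the sample alerts are assumptions of this example, not Alertmanager's implementation.

```typescript
// A firing alert with its labels, as Prometheus would send it.
interface FiringAlert {
  labels: Record<string, string>;
  summary: string;
}

// Bucket alerts by the values of the group_by labels.
function groupAlerts(alerts: FiringAlert[], groupBy: string[]): Record<string, FiringAlert[]> {
  const groups: Record<string, FiringAlert[]> = {};
  for (const alert of alerts) {
    const key = groupBy
      .map((labelName) => `${labelName}=${alert.labels[labelName] ?? ""}`)
      .join(",");
    const group = groups[key] ?? [];
    group.push(alert);
    groups[key] = group;
  }
  return groups;
}

// Hypothetical incident: three pods crash-loop while the error rate is high.
const firingAlerts: FiringAlert[] = [
  { labels: { alertname: "PodCrashLooping", namespace: "task-api", pod: "task-api-1" }, summary: "Pod crash-looping" },
  { labels: { alertname: "PodCrashLooping", namespace: "task-api", pod: "task-api-2" }, summary: "Pod crash-looping" },
  { labels: { alertname: "PodCrashLooping", namespace: "task-api", pod: "task-api-3" }, summary: "Pod crash-looping" },
  { labels: { alertname: "HighErrorRate", namespace: "task-api" }, summary: "High error rate on task-api" },
];

// Four alerts collapse into two groups, so two notifications instead of four.
const notificationGroups = groupAlerts(firingAlerts, ["alertname", "namespace"]);
```

Grouping by fewer labels gives fewer, larger notifications. Grouping by more labels, such as pod, gives one message per pod, which is rarely what you want during an incident.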

Connecting the dots: logs, metrics, and traces together

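The real power of observability comes when you connect all three pillars. The key is the trace ID. When a request enters your system, it gets a unique trace ID. If you include that trace ID in your logs, you can jump from a log entry to the corresponding trace, or from an alert to the exact logs that explain what happened. Do not add trace IDs as metric labels, though: every unique label value creates a new time series, and per-request values would overwhelm Prometheus. To link metrics to traces, Prometheus supports exemplars, which attach a sample trace ID to a metric without adding a label.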

Here is how to add the trace ID to your structured logs:

// src/middleware/traceContext.ts
import { trace, context } from "@opentelemetry/api";
import { Request, Response, NextFunction } from "express";
import logger from "../logger";

export function traceContextMiddleware(
  req: Request,
  _res: Response,
  next: NextFunction
) {
  const span = trace.getSpan(context.active());
  if (span) {
    const spanContext = span.spanContext();
    // Attach the trace ID to the request logger. Handlers must log through
    // req.log (not the global logger) for the IDs to appear in their entries.
    req.log = logger.child({
      traceId: spanContext.traceId,
      spanId: spanContext.spanId,
    });
  }
  next();
}

Now every log entry from a request includes the trace ID:

{
  "level": "error",
  "message": "Failed to process order",
  "traceId": "abc123def456789",
  "spanId": "def456789abc123",
  "orderId": "12345",
  "service": "task-api"
}

In Grafana, you can configure a derived field on the Loki data source that turns the traceId in each log line into a link to your trace backend (Jaeger or Tempo). Click on a log entry and jump directly to the trace. This is the single most useful feature for debugging production issues.
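
If you are curious where the trace ID comes from in the first place, it usually travels between services in the W3C traceparent HTTP header, and the OpenTelemetry SDK reads it for you. The sketch below parses that header by hand just to show what is inside. It only handles version 00 of the format, and the function and type names are hypothetical:

```typescript
// Fields we want on every log line of a traced request.
interface TraceLogFields {
  traceId: string;
  spanId: string;
  sampled: boolean;
}

// W3C Trace Context, version 00: 00-<32 hex trace id>-<16 hex span id>-<2 hex flags>
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceLogFields | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return null;
  const [, traceId = "", spanId = "", flags = "00"] = match;
  // The spec treats all-zero trace IDs and span IDs as invalid.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  // The lowest bit of the flags byte says whether the trace is sampled.
  const sampled = (parseInt(flags, 16) & 1) === 1;
  return { traceId, spanId, sampled };
}

// Example header value taken from the W3C Trace Context specification.
const incomingHeader = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const traceFields = parseTraceparent(incomingHeader);

// Merge the trace fields into a structured log line, like logger.child() does.
console.log(
  JSON.stringify({ level: "error", message: "Failed to process order", ...(traceFields ?? {}) })
);
```

In a real service you would not write this parser yourself. The point is that the same traceId value shows up in the header, in the span, and in the log line, and that shared value is what lets Grafana link them.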

What to observe: a starter checklist#

When you are just getting started, it is easy to get overwhelmed by the number of things you could measure. Here is a practical starting point:

For every API endpoint: Request rate, error rate, and latency (the RED method). These three metrics cover most problems.

For your infrastructure: CPU usage, memory usage, disk usage, and network I/O per pod. The kube-prometheus-stack gives you these for free.

For your database: Active connections, query duration, and connection pool utilization. Database bottlenecks are among the most common sources of application performance issues.

For your application health: Pod restarts, deployment replica status, and container readiness. These tell you if Kubernetes is struggling to keep your app running.

Start with these and add more metrics as you encounter specific problems. Do not try to measure everything on day one.
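
To make the RED method from the checklist concrete, here is a minimal sketch that computes the three numbers from raw request samples. In practice Prometheus does this for you from counters and histograms, so treat this only as an explanation of what the numbers mean. The sample data and all the names are hypothetical:

```typescript
interface RequestSample {
  statusCode: number;
  durationMs: number;
}

interface RedSummary {
  ratePerSecond: number; // R: requests per second
  errorRatio: number;    // E: share of requests that failed with a 5xx
  p95Ms: number;         // D: 95th percentile duration
}

// Nearest-rank percentile over a sorted copy of the values.
function percentile(values: number[], p: number): number {
  if (values.length === 0) return 0;
  const sorted = values.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p * sorted.length) / 100));
  return sorted[rank - 1] ?? 0;
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const errors = samples.filter((s) => s.statusCode >= 500).length;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,
    p95Ms: percentile(samples.map((s) => s.durationMs), 95),
  };
}

// Hypothetical 60-second window for one endpoint:
// 120 requests, every 20th one fails, durations from 50ms to 169ms.
const windowSamples: RequestSample[] = Array.from({ length: 120 }, (_, i) => ({
  statusCode: i % 20 === 0 ? 500 : 200,
  durationMs: 50 + i,
}));

const redSummary = summarizeRed(windowSamples, 60);
console.log(redSummary);
// prints { ratePerSecond: 2, errorRatio: 0.05, p95Ms: 163 }
```

Notice that the average duration would hide the slow tail here, which is why latency is usually tracked as a percentile.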

Advanced topics#

We covered the essentials in this article, but observability goes much deeper. Here are topics worth exploring once you are comfortable with the basics:

SLO-based alerting: Instead of alerting on raw thresholds ("latency > 1s"), define Service Level Objectives and alert on error budget burn rate. This avoids noisy alerts and focuses on what matters to users.

Log aggregation with Loki: Loki is the logging equivalent of Prometheus. It indexes only the log metadata (labels), not the log content itself, and stores the content compressed, which makes it much cheaper to run than Elasticsearch for Kubernetes logging.

Distributed tracing at scale with Tempo: Grafana Tempo is a trace backend designed to work seamlessly with Grafana, Loki, and Prometheus. It supports trace-to-log and trace-to-metric correlation out of the box.

Trace-based testing: Use traces to verify that your services communicate correctly in integration tests. Tools like Tracetest let you write assertions against trace data.

Custom metrics for business logic: Track things like orders processed, revenue per minute, or user signups. These business metrics are often more valuable than technical metrics.

For a comprehensive deep dive into all of these topics, check out the SRE Observability Deep Dive. It covers OpenTelemetry instrumentation patterns, Loki setup, Grafana Tempo, SLO-based alerting with Pyrra, and production-grade observability architectures.
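
As a small taste of SLO-based alerting, the burn rate mentioned above is just the observed error ratio divided by the error budget. Here is a hedged sketch of the arithmetic, assuming a 99.9% availability SLO over a 30-day window. The numbers and function names are illustrative, not taken from a specific tool:

```typescript
// Error budget: the fraction of requests allowed to fail under the SLO.
function errorBudget(sloTarget: number): number {
  return 1 - sloTarget;
}

// Burn rate 1 means the budget runs out exactly at the end of the SLO window.
// Burn rate 10 means it runs out ten times faster.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / errorBudget(sloTarget);
}

// Fraction of the total budget consumed if a burn rate is sustained for some hours.
function budgetConsumed(rate: number, hours: number, windowDays: number): number {
  return (rate * hours) / (windowDays * 24);
}

// Assumed scenario: 99.9% SLO, and 1.44% of requests failing over the last hour.
const sloTarget = 0.999;
const observedErrorRatio = 0.0144;

const currentBurnRate = burnRate(observedErrorRatio, sloTarget);
const consumedInOneHour = budgetConsumed(currentBurnRate, 1, 30);

console.log(`burn rate: ${currentBurnRate.toFixed(1)}`);
console.log(`budget used in 1h: ${(consumedInOneHour * 100).toFixed(1)}%`);
```

With these numbers the burn rate is about 14.4, which means one hour of this error level eats roughly 2% of the whole month's budget. Alerting on that is far more meaningful than alerting on a fixed "error rate > 1%" threshold.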

Closing notes#

Observability is not optional. Once your application is running in production, you need to know what it is doing, how it is performing, and when something goes wrong. The three pillars (logs, metrics, and traces) give you complementary views into your system's behavior.

In this article we covered what observability is and why it matters, the three pillars and when to use each one, structured logging with pino, Prometheus metrics with prom-client, installing Prometheus and Grafana with the kube-prometheus-stack, basic PromQL queries for common scenarios, building Grafana dashboards, distributed tracing with OpenTelemetry, alerting with PrometheusRule and Alertmanager, and the observability workflow for debugging production issues.

The key takeaway is that observability is a workflow, not a tool. You do not just install Prometheus and call it done. You instrument your application, build dashboards that answer real questions, set up alerts that notify you before your users do, and practice the alert-dashboard-trace-log flow until it becomes second nature.

In the next article we will cover CI/CD pipelines for Kubernetes, bringing together everything we have built so far into an automated deployment workflow.

I hope you found this useful and enjoyed reading it. Until next time!

Errata#

If you spot any errors or have any suggestions, please send me a message so they get fixed.

You can also check the source code and change history here
