01

Quick Reference

Key signals, essential tools, and monitoring methodologies at a glance.

Key Signals at a Glance — The Four Golden Signals

Latency
P95 / P99
Time taken to serve a request. Track the latency of successful and failed requests separately — fast-failing errors can mask slow successes.
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
Traffic
req/s
Demand on your system measured in requests per second, transactions, or sessions.
sum(rate(http_requests_total[5m]))
Errors
% rate
Rate of requests that fail, either explicitly (5xx) or implicitly (wrong content, slow responses).
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
Saturation
% util
How full your service is. CPU, memory, disk I/O, network bandwidth utilization.
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

Essential Tools Quick Reference

Metrics Prometheus

Open-source metrics collection and alerting toolkit. Pull-based model with powerful PromQL query language.

# Check targets status
curl localhost:9090/api/v1/targets

# Instant query
curl 'localhost:9090/api/v1/query?query=up'

# Range query (last 1h, 15s step)
curl 'localhost:9090/api/v1/query_range?query=rate(http_requests_total[5m])&start=2024-01-01T00:00:00Z&end=2024-01-01T01:00:00Z&step=15s'
Visualization Grafana

Open-source analytics and visualization platform. Dashboards for Prometheus, Loki, Tempo, and 100+ data sources.

# Start Grafana (Docker)
docker run -d -p 3000:3000 \
  --name grafana \
  grafana/grafana-oss

# Provision dashboards via API
curl -X POST -H "Content-Type: application/json" \
  -d @dashboard.json \
  http://admin:admin@localhost:3000/api/dashboards/db
Logs Loki

Log aggregation system inspired by Prometheus. Indexes labels, not content. Cost-efficient at scale.

# LogQL — query logs by labels
{job="api-server"} |= "error"

# Rate of log lines with errors
rate({job="api-server"} |= "error" [5m])

# Extract and aggregate fields
{job="api"} | json | status >= 500
  | line_format "{{.method}} {{.path}}"
Tracing Tempo

Distributed tracing backend. Accepts Jaeger, Zipkin, and OpenTelemetry formats. Minimal indexing, object storage.

# Search traces by service name
curl 'localhost:3200/api/search?tags=service.name%3Dapi-server'

# Get trace by ID
curl 'localhost:3200/api/traces/2f70c82d4a2c7e8a'

# TraceQL query
{ span.http.status_code >= 500 }
  | select(span.http.url)
Standard OpenTelemetry

Vendor-neutral instrumentation standard. APIs, SDKs, and the Collector for metrics, logs, and traces.

# Install OTel Python SDK
pip install opentelemetry-api \
  opentelemetry-sdk \
  opentelemetry-exporter-otlp

# Run the OTel Collector
docker run -p 4317:4317 -p 4318:4318 \
  -v ./otel-config.yaml:/etc/otelcol/config.yaml \
  otel/opentelemetry-collector-contrib
Alerting Alertmanager

Handles deduplication, grouping, routing, silencing, and inhibition of alerts from Prometheus and other sources.

# Check active alerts
curl localhost:9093/api/v2/alerts

# Silence an alert (JSON body)
curl -X POST localhost:9093/api/v2/silences \
  -H 'Content-Type: application/json' \
  -d '{"matchers":[{"name":"alertname",
  "value":"HighLatency","isRegex":false}],
  "startsAt":"2024-01-01T00:00:00Z",
  "endsAt":"2024-01-01T06:00:00Z",
  "createdBy":"oncall","comment":"deploy"}'

Monitoring Methodology Comparison

Method Focus Best For Key Metrics Origin
Golden Signals System health SRE monitoring Latency, Traffic, Errors, Saturation Google SRE Book
RED Method User experience Services / APIs Rate, Errors, Duration Tom Wilkie (Grafana)
USE Method Resource health Infrastructure / Hardware Utilization, Saturation, Errors Brendan Gregg
LETS Holistic view Combined approach Latency, Errors, Traffic, Saturation Community evolution
Which method should I use?

Use RED for microservices and user-facing APIs. Use USE for infrastructure (CPU, disks, network). Use Golden Signals as a universal starting point that covers both. Most teams combine approaches: RED for services, USE for the machines running them.

02

The Three Pillars

Metrics, logs, and traces form the foundation of observability. Each pillar provides a different lens into system behavior.

Metrics Aggregated

Numeric measurements collected at regular intervals and aggregated over time. Compact, fast to query, ideal for dashboards and alerting.

  • Data model: Time-series (timestamp + value + labels)
  • Storage cost: Low (fixed bytes per sample)
  • Query speed: Very fast (pre-aggregated)
  • Best for: Trends, alerting, dashboards
  • Tools: Prometheus, Datadog, CloudWatch, InfluxDB

Logs Discrete

Timestamped text records of discrete events. Rich context per event, variable size. Essential for debugging and audit trails.

  • Data model: Timestamped text (structured or unstructured)
  • Storage cost: High (unbounded per event)
  • Query speed: Moderate (full-text search)
  • Best for: Debugging, audit trails, event investigation
  • Tools: Loki, ELK Stack, CloudWatch Logs, Splunk

Traces Distributed

End-to-end request paths through distributed systems. Shows the causal chain of operations across service boundaries.

  • Data model: Trace → Spans (parent-child tree)
  • Storage cost: Medium (sampled, per-request)
  • Query speed: Fast by ID, slower by search
  • Best for: Latency analysis, bottleneck detection
  • Tools: Jaeger, Tempo, Zipkin, AWS X-Ray

Metrics — Numeric Signals Over Time

Metrics are the workhorses of monitoring. Each data point is a numeric value with a timestamp and a set of key-value labels. Because they are fixed-size and pre-aggregated, they are extremely storage-efficient and fast to query.

Time-series structure: A metric is identified by its name and label set. Each unique combination of name + labels is a separate time series.

# Prometheus exposition format
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET", handler="/api/users", status="200"} 14523
http_requests_total{method="POST", handler="/api/users", status="201"} 892
http_requests_total{method="GET", handler="/api/users", status="500"} 37

# Each unique label combination = one time series
# High cardinality labels (user_id, request_id) are dangerous
# Stick to bounded label values (method, status, handler)

Example metrics for a web service:

  • http_requests_total — Total request count (counter)
  • http_request_duration_seconds — Request latency (histogram)
  • http_requests_in_flight — Current active requests (gauge)
  • process_cpu_seconds_total — CPU time consumed (counter)
  • process_resident_memory_bytes — Memory usage (gauge)

Logs — Discrete Event Records

Logs capture what happened at a specific moment in time. They range from simple unstructured text lines to richly structured JSON objects. The key distinction is between structured (machine-parseable) and unstructured (human-readable) formats.

Structured Logging (JSON)

{
  "timestamp": "2024-03-15T14:23:45.123Z",
  "level": "ERROR",
  "service": "payment-api",
  "trace_id": "abc123def456",
  "span_id": "789ghi012",
  "message": "Payment processing failed",
  "error": {
    "type": "PaymentGatewayTimeout",
    "message": "Upstream timeout after 30s",
    "stack": "PaymentGatewayTimeout: at processPayment (payment.js:142)"
  },
  "context": {
    "user_id": "usr_8x7k2m",
    "order_id": "ord_3f9a1c",
    "amount_cents": 4999,
    "currency": "USD",
    "gateway": "stripe",
    "retry_count": 2
  }
}

Unstructured Logging (Plain Text)

# Traditional syslog / application logs
2024-03-15 14:23:45 ERROR [payment-api] Payment processing failed for order ord_3f9a1c: upstream timeout after 30s (retry 2/3)
2024-03-15 14:23:45 WARN  [payment-api] Circuit breaker for stripe gateway at 80% threshold
2024-03-15 14:23:46 INFO  [payment-api] Falling back to backup payment gateway for user usr_8x7k2m
Always prefer structured logging

Structured logs (JSON) are dramatically easier to query, filter, and aggregate. They enable LogQL/Splunk field extraction without regex parsing. The cost of structured logging is slightly larger payloads, but the operational benefits far outweigh the storage cost.

Traces — Request Paths Through Distributed Systems

A trace represents the complete journey of a single request through a distributed system. It consists of multiple spans, each representing a unit of work. Spans have parent-child relationships that form a tree structure, showing the causal chain of operations.

Trace ID: 7a3f2b1c4d5e6f89

api-gateway |====================| 240ms
auth-svc    |====|                 35ms
user-svc    |===========|          85ms
user-db     |=======|              62ms
cache       |=|                     8ms
order-svc   |==========|           90ms
order-db    |=====|                45ms
payment     |====|                 30ms

Key trace concepts:

  • Trace: A collection of spans sharing a single trace ID, representing one end-to-end request
  • Span: A single operation within a trace (e.g., an HTTP call, a DB query, a function invocation)
  • Context propagation: Trace/span IDs are passed via HTTP headers (traceparent, b3) across service boundaries
  • Sampling: Most systems sample traces (1-10%) to manage storage costs while retaining statistical significance
  • Span attributes: Key-value metadata on each span (http.method, db.statement, error)
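Context propagation is easy to see in code. Below is a minimal sketch of parsing the W3C traceparent header (the header layout — version, trace ID, parent span ID, flags — comes from the W3C Trace Context spec; the helper name is ours):

```python
import re

# W3C traceparent: version-traceid-parentid-flags, all lowercase hex
TRACEPARENT_RE = re.compile(
    r"^(?P<version>[0-9a-f]{2})"
    r"-(?P<trace_id>[0-9a-f]{32})"
    r"-(?P<parent_id>[0-9a-f]{16})"
    r"-(?P<flags>[0-9a-f]{2})$"
)

def parse_traceparent(header):
    """Split a traceparent header into its four fields, or return None."""
    match = TRACEPARENT_RE.match(header.strip())
    return match.groupdict() if match else None

parsed = parse_traceparent(
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
)
# parsed["trace_id"] → "4bf92f3577b34da6a3ce929d0e0e4736"
```

A downstream service would log parsed["trace_id"] with every entry and forward the header unchanged on outgoing calls.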

When to Use Which Pillar

Decision guide for choosing the right signal
  • "Is the system healthy right now?" → Metrics (dashboards, alerts)
  • "What happened at 3:42 AM?" → Logs (event investigation)
  • "Why is this specific request slow?" → Traces (latency breakdown)
  • "What is the trend over the past week?" → Metrics (time-series queries)
  • "Which service is the bottleneck?" → Traces (span analysis)
  • "What changed in the config?" → Logs (audit trail)
  • "How many users hit this error?" → Metrics (counters) + Logs (context)

Pillar Comparison

Attribute Metrics Logs Traces
Data Model Numeric time-series (name + labels + value) Timestamped text records (structured/unstructured) Span trees with parent-child relationships
Storage Cost Low ~1-2 bytes/sample High Unbounded per event Medium ~1-5 KB/span (sampled)
Query Speed Very fast (pre-aggregated TSDB) Moderate (full-text index/search) Fast by trace ID, slower by attribute search
Cardinality Bounded (labels must be low-cardinality) Unbounded (any key-value pair) Per-request (sampled to control volume)
Best Use Case Alerting, trends, capacity planning Debugging, audit, event forensics Latency analysis, dependency mapping
Retention Weeks to years (downsampled) Days to months Days to weeks
03

Metrics & Time Series

Deep dive into Prometheus metric types, PromQL queries, recording rules, and architecture patterns.

Metric Types

Prometheus defines four core metric types. Choosing the right type is critical for accurate instrumentation and meaningful queries.

Counter Monotonic

A value that only goes up (and resets to zero on process restart). Use rate() to get the per-second rate of increase.

When to use: Total requests served, total errors, total bytes transferred. Anything that accumulates over time.

Python Instrumentation

from prometheus_client import Counter

# Define a counter with labels
http_requests_total = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'handler', 'status']
)

# Increment in your request handler
@app.route('/api/users', methods=['GET'])
def get_users():
    try:
        users = db.query_users()
        http_requests_total.labels(
            method='GET',
            handler='/api/users',
            status='200'
        ).inc()
        return jsonify(users), 200
    except Exception as e:
        http_requests_total.labels(
            method='GET',
            handler='/api/users',
            status='500'
        ).inc()
        raise

PromQL Queries for Counters

# Per-second request rate over last 5 minutes
rate(http_requests_total[5m])

# Per-second rate, per handler
sum by (handler) (rate(http_requests_total[5m]))

# Total requests in the last hour
increase(http_requests_total[1h])

# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100
Never use a raw counter value in alerts

Counter values are monotonically increasing and reset on restart. Always wrap counters with rate(), irate(), or increase(). Plotting a raw counter produces a meaningless ever-climbing line.

Gauge Point-in-Time

A value that can go up and down. Represents a current snapshot: temperature, queue depth, active connections.

When to use: Current memory usage, active connections, queue length, temperature, number of goroutines. Anything with a "current value."

Python Instrumentation

from prometheus_client import Gauge

# Simple gauge
requests_in_flight = Gauge(
    'http_requests_in_flight',
    'Number of HTTP requests currently being processed',
    ['handler']
)

# Use as context manager for automatic inc/dec
@app.route('/api/process')
def process():
    with requests_in_flight.labels(handler='/api/process').track_inprogress():
        result = expensive_computation()
        return jsonify(result)

# Set to an absolute value (e.g., from a sensor)
queue_depth = Gauge('job_queue_depth', 'Current job queue depth')
queue_depth.set(len(pending_jobs))

# Use set_function for callback-based collection
cpu_usage = Gauge('process_cpu_percent', 'CPU usage percentage')
cpu_usage.set_function(lambda: psutil.cpu_percent())

PromQL Queries for Gauges

# Current value
http_requests_in_flight

# Average over the last 5 minutes
avg_over_time(http_requests_in_flight[5m])

# Max value in the last hour
max_over_time(http_requests_in_flight[1h])

# Rate of change (useful for detecting sudden spikes)
deriv(http_requests_in_flight[5m])

# Predict value 4 hours from now (linear regression)
predict_linear(node_filesystem_avail_bytes[6h], 4*3600)

Histogram Distribution

Samples observations and counts them in configurable buckets. Enables percentile calculations server-side.

When to use: Request latency, response sizes, batch job durations. Anything where you need percentiles (P50, P95, P99) or distribution analysis.

Exposition Format (What Prometheus Scrapes)

# HELP http_request_duration_seconds HTTP request latency
# TYPE http_request_duration_seconds histogram
http_request_duration_seconds_bucket{handler="/api",le="0.005"}   3240
http_request_duration_seconds_bucket{handler="/api",le="0.01"}    4521
http_request_duration_seconds_bucket{handler="/api",le="0.025"}   6892
http_request_duration_seconds_bucket{handler="/api",le="0.05"}    8341
http_request_duration_seconds_bucket{handler="/api",le="0.1"}     9102
http_request_duration_seconds_bucket{handler="/api",le="0.25"}    9498
http_request_duration_seconds_bucket{handler="/api",le="0.5"}     9612
http_request_duration_seconds_bucket{handler="/api",le="1"}       9645
http_request_duration_seconds_bucket{handler="/api",le="+Inf"}    9650
http_request_duration_seconds_sum{handler="/api"}                  298.45
http_request_duration_seconds_count{handler="/api"}                9650

Python Instrumentation

from prometheus_client import Histogram

# Default buckets: .005, .01, .025, .05, .075, .1, .25, .5, .75, 1, 2.5, 5, 7.5, 10
http_duration = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency in seconds',
    ['method', 'handler'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]
)

# Observe with decorator
@http_duration.labels(method='GET', handler='/api/users').time()
def get_users():
    return db.query_users()

# Or manually observe
start = time.time()
result = process_request()
http_duration.labels(method='POST', handler='/api/orders').observe(time.time() - start)

PromQL for Percentiles

# P95 latency across all handlers
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# P99 latency per handler
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, handler)
)

# P50 (median) latency
histogram_quantile(0.50,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# Average latency (from sum and count)
rate(http_request_duration_seconds_sum[5m])
/
rate(http_request_duration_seconds_count[5m])

# Apdex score (satisfied < 0.25s, tolerating < 1s)
(
  sum(rate(http_request_duration_seconds_bucket{le="0.25"}[5m]))
  +
  sum(rate(http_request_duration_seconds_bucket{le="1"}[5m]))
) / 2
/
sum(rate(http_request_duration_seconds_count[5m]))

Summary Client-side

Calculates quantiles on the client side. Pre-configured percentiles cannot be aggregated across instances.

When to use: Rarely. Prefer histograms in almost all cases. Summaries are useful only when you need exact quantiles for a single instance and cannot accept the bucket approximation error.
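For completeness, Summary instrumentation in Python mirrors the other types. Note a further limitation: the official Python client's Summary tracks only count and sum — it does not compute quantiles at all (the Go and Java clients do).

```python
from prometheus_client import Summary

# The Python client's Summary exposes only _count and _sum samples;
# client-side quantile estimation is not implemented here.
request_latency = Summary(
    'request_processing_seconds',
    'Time spent processing a request',
)

@request_latency.time()
def handle_request():
    pass  # ... do the actual work ...

handle_request()
```

In practice this makes the Python Summary equivalent to a histogram with no buckets — one more reason to prefer histograms.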

Histogram vs Summary Comparison

Feature Histogram Summary
Quantile calculation Server-side (PromQL) Client-side (pre-computed)
Aggregation across instances Yes (aggregate buckets, then quantile) No (cannot average percentiles)
Accuracy Approximated by bucket boundaries Exact for configured quantiles
Configuration Bucket boundaries (can change later) Quantile targets (fixed at instrumentation time)
Cost on client Low (simple counter increments) Higher (streaming quantile estimation)
Recommendation Preferred Use only if aggregation is unnecessary

PromQL Essentials

PromQL (Prometheus Query Language) is a functional query language for selecting and aggregating time-series data. Understanding rate() vs irate() vs increase() is the single most important PromQL skill.

rate() vs irate() vs increase()

Function Behavior Best For Example
rate(v[d]) Average per-second rate over range d Alerting, recording rules, dashboards rate(http_requests_total[5m])
irate(v[d]) Instantaneous per-second rate (last two samples) Volatile, fast-moving counters on dashboards irate(http_requests_total[5m])
increase(v[d]) Total increase over range d (= rate * seconds) "How many X happened in the last hour?" increase(http_requests_total[1h])
Best practice: range window = 4x scrape interval

If your scrape interval is 15s, use at least [1m] ranges for rate(). The rule of thumb is 4x the scrape interval to ensure you always have at least two data points in the range, even after a failed scrape. For 15s intervals: rate(metric[1m]). For 30s intervals: rate(metric[2m]).
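The difference between the three functions is easiest to see with concrete sample values (hypothetical numbers, ignoring Prometheus's window-boundary extrapolation):

```
# http_requests_total scraped every 15s:
#   t=0s: 100   t=15s: 115   t=30s: 130   t=45s: 145   t=60s: 160
#
# rate(http_requests_total[1m])     = (160 - 100) / 60 = 1.0 req/s  (averaged over the window)
# irate(http_requests_total[1m])    = (160 - 145) / 15 = 1.0 req/s  (last two samples only)
# increase(http_requests_total[1m]) =  160 - 100       = 60 requests
```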

Common Query Patterns

# --- Error Rate Percentage ---
# 5xx error rate as a percentage of all requests
sum(rate(http_requests_total{status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m]))
* 100

# --- P95 Latency ---
histogram_quantile(0.95,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)

# --- Request Rate by Status Code ---
sum by (status) (rate(http_requests_total[5m]))

# --- Top 5 Endpoints by Request Rate ---
topk(5, sum by (handler) (rate(http_requests_total[5m])))

# --- Memory Usage Percentage ---
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# --- CPU Usage Per Core ---
1 - avg by (cpu) (rate(node_cpu_seconds_total{mode="idle"}[5m]))

# --- Disk Full Prediction ---
# True if "/" is on track to run out of space within 24 hours
predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[6h], 24*3600) < 0

# --- Absent Alert (no series exists for the job) ---
# Note: absent() fires when the series is missing entirely (job removed
# or never scraped); use up{job="api-server"} == 0 for a down target
absent(up{job="api-server"})

# --- Multi-dimensional aggregation ---
# Average latency by service AND method
avg by (service, method) (
  rate(http_request_duration_seconds_sum[5m])
  /
  rate(http_request_duration_seconds_count[5m])
)

Recording Rules

Recording rules pre-compute expensive PromQL expressions and save the results as new time series. This improves dashboard load times and enables composable alerting.

# prometheus-rules.yaml
groups:
  - name: http_recording_rules
    interval: 30s
    rules:
      # Pre-compute per-handler request rate
      - record: job:http_requests:rate5m
        expr: sum by (job, handler) (rate(http_requests_total[5m]))

      # Pre-compute error rate percentage
      - record: job:http_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))

      # Pre-compute P95 latency per handler
      - record: job:http_duration_seconds:p95_5m
        expr: |
          histogram_quantile(0.95,
            sum by (job, handler, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

      # Pre-compute P99 latency (global)
      - record: job:http_duration_seconds:p99_5m
        expr: |
          histogram_quantile(0.99,
            sum by (job, le) (
              rate(http_request_duration_seconds_bucket[5m])
            )
          )

  - name: node_recording_rules
    rules:
      # CPU usage by instance
      - record: instance:node_cpu:ratio_rate5m
        expr: |
          1 - avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          )

      # Memory usage by instance
      - record: instance:node_memory:usage_ratio
        expr: |
          1 - (
            node_memory_MemAvailable_bytes
            /
            node_memory_MemTotal_bytes
          )

Prometheus Architecture

Prometheus uses a pull-based model: it actively scrapes HTTP endpoints that expose metrics in the exposition format. This is fundamentally different from push-based systems like StatsD or Datadog Agent.

# Prometheus Architecture Overview
#
# +------------------+     scrape      +------------------+
# |   Application    | <-------------- |    Prometheus    |
# |  /metrics        |    (HTTP GET)   |    Server        |
# |  endpoint        |                 |                  |
# +------------------+                 |  +------------+  |
#                                      |  |   TSDB     |  |
# +------------------+     scrape      |  | (storage)  |  |
# |   Node Exporter  | <-------------- |  +------------+  |
# |  /metrics        |                 |                  |
# +------------------+                 |  +------------+  |
#                                      |  |  PromQL    |  |
# +------------------+     scrape      |  |  Engine    |  |
# |   cAdvisor       | <-------------- |  +------------+  |
# |  /metrics        |                 |                  |
# +------------------+                 +-------+----------+
#                                              |
#                      +------------------+    |  evaluate rules
#                      |  Alertmanager    | <--+
#                      |  (routing,       |    |  query
#                      |   dedup,         |    v
#                      |   silencing)     |  +------------------+
#                      +------------------+  |    Grafana       |
#                              |             |  (dashboards,    |
#                              v             |   visualization) |
#                      +------------------+  +------------------+
#                      | PagerDuty/Slack  |
#                      | Email/Webhook    |
#                      +------------------+
Prometheus configuration: scrape targets

Targets are configured in prometheus.yml under scrape_configs. Use service discovery (Kubernetes, Consul, EC2) for dynamic environments instead of static target lists.

# prometheus.yml — minimal configuration
global:
  scrape_interval: 15s        # Default scrape frequency
  evaluation_interval: 15s    # Rule evaluation frequency
  scrape_timeout: 10s         # Per-scrape timeout

rule_files:
  - "rules/*.yaml"            # Recording & alerting rules

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]

scrape_configs:
  # Prometheus scrapes itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Application metrics
  - job_name: "api-server"
    metrics_path: /metrics
    scrape_interval: 10s
    static_configs:
      - targets: ["api-server:8080"]
        labels:
          env: "production"
          team: "platform"

  # Node exporter (infrastructure)
  - job_name: "node"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"

  # Kubernetes service discovery
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Cardinality is the #1 Prometheus performance killer

Every unique combination of metric name + label values creates a separate time series. A metric with labels {method, handler, status, instance} across 100 instances with 50 handlers and 5 status codes produces 100 × 50 × 5 = 25,000 time series from one metric — before even multiplying by method. Avoid high-cardinality labels like user_id, request_id, or email. Monitor total series count with: prometheus_tsdb_head_series.
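To find which metrics are responsible for series growth, one commonly used ad-hoc query (expensive — run it interactively, not on a dashboard):

```
# Top 10 metric names by live series count
topk(10, count by (__name__)({__name__=~".+"}))
```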

04

Logging

Log levels, structured formats, correlation IDs, aggregation tools, and cost-efficient logging strategies for production systems.

Log Levels

Log levels provide a severity hierarchy that allows you to filter and route messages based on importance. Every logging framework supports them, and consistent usage across services is critical for effective debugging and alerting.

Level Severity Description When to Use
DEBUG Lowest Detailed diagnostic information for developers Variable values, function entry/exit, SQL queries, cache hits/misses
INFO Normal General operational events confirming things work as expected Server started, request completed, job scheduled, config loaded
WARN Elevated Potentially harmful situation that does not prevent operation Deprecated API call, retry attempt, approaching resource limit, fallback used
ERROR High An error event that might still allow the application to continue Failed request, database connection error, external API failure, unhandled exception
FATAL Critical Severe error that will likely cause the application to terminate Cannot bind port, out of memory, required config missing, data corruption detected
Production log level guidance

In production, set the minimum log level to INFO. Use DEBUG only for targeted troubleshooting, ideally controlled per-service via a config flag or environment variable. Never leave DEBUG logging enabled permanently in production — it generates massive volumes and can expose sensitive data.
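A minimal sketch of that pattern with Python's stdlib logging (the LOG_LEVEL variable name is just a common convention, not a standard):

```python
import logging
import os

# Minimum level comes from the environment; default to INFO in production.
# force=True re-applies the config even if logging was already initialised.
level_name = os.getenv("LOG_LEVEL", "INFO").upper()
logging.basicConfig(
    level=getattr(logging, level_name, logging.INFO),
    force=True,
)

logger = logging.getLogger("payment-api")
logger.debug("cache miss for user usr_8x7k2m")  # filtered out at INFO
logger.info("server started on :8080")          # emitted
```

Restarting with LOG_LEVEL=DEBUG then turns on the diagnostic output for that one service without a code change.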

Structured Logging

Structured logging outputs log entries as machine-parseable data (typically JSON) rather than free-form text. This is the foundation of modern observability — structured logs enable field-level filtering, aggregation, and correlation without fragile regex parsing.

Unstructured vs Structured

# Unstructured — human-readable, machine-hostile
2024-03-15 14:23:45 ERROR Payment failed for user usr_8x7k2m order ord_3f9a1c: timeout after 30s

# Structured — machine-readable, queryable, correlatable
{"timestamp":"2024-03-15T14:23:45.123Z","level":"ERROR","service":"payment-api",
 "trace_id":"abc123def456","message":"Payment failed","user_id":"usr_8x7k2m",
 "order_id":"ord_3f9a1c","error":"timeout","duration_ms":30000}

Why structured logging matters:

  • Queryable: Filter by any field (level=ERROR AND service=payment-api)
  • Aggregatable: Count errors by type, service, or endpoint
  • Correlatable: Join logs across services using trace_id
  • Parseable: No regex required — tools like Loki, Elasticsearch, and Splunk parse JSON natively

Python — structlog

import structlog

# Configure structlog with JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
)

logger = structlog.get_logger()

# Basic structured log
logger.info("request_completed",
    method="GET",
    path="/api/users",
    status=200,
    duration_ms=45,
    user_id="usr_8x7k2m"
)
# Output: {"event":"request_completed","level":"info",
#   "timestamp":"2024-03-15T14:23:45.123Z","method":"GET",
#   "path":"/api/users","status":200,"duration_ms":45,
#   "user_id":"usr_8x7k2m"}

# Bind context that persists across log calls
logger = logger.bind(service="payment-api", env="production")
logger.error("payment_failed",
    order_id="ord_3f9a1c",
    error="gateway_timeout",
    retry_count=2
)

Go — log/slog (stdlib)

package main

import (
    "log/slog"
    "os"
)

func main() {
    // JSON handler for structured output
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))
    slog.SetDefault(logger)

    // Structured log with typed attributes
    slog.Info("request completed",
        slog.String("method", "GET"),
        slog.String("path", "/api/users"),
        slog.Int("status", 200),
        slog.Duration("duration", elapsed),
        slog.String("trace_id", traceID),
    )

    // Group related attributes
    slog.Error("payment failed",
        slog.Group("order",
            slog.String("id", "ord_3f9a1c"),
            slog.Float64("amount", 49.99),
        ),
        slog.String("error", "gateway_timeout"),
    )
}

Node.js — pino

const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }),
  },
  // Redact sensitive fields
  redact: ['req.headers.authorization', 'body.password'],
});

// Structured log entry
logger.info({
  method: 'GET',
  path: '/api/users',
  status: 200,
  duration_ms: 45,
  trace_id: 'abc123def456',
}, 'request completed');

// Child logger with bound context
const reqLogger = logger.child({
  service: 'payment-api',
  request_id: 'req_xyz789',
});

reqLogger.error({
  order_id: 'ord_3f9a1c',
  error: 'gateway_timeout',
  retry_count: 2,
}, 'payment failed');

Key Fields Every Log Should Have

Field Type Purpose Example
timestamp ISO 8601 When the event occurred (UTC) 2024-03-15T14:23:45.123Z
level string Severity of the event info, error, warn
service string Which service emitted the log payment-api
trace_id string Correlation with distributed traces abc123def456
message string Human-readable event description Payment processing failed
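A tiny stdlib helper that stamps those five baseline fields on every entry (the function name and field shape are ours, matching the table above):

```python
import json
from datetime import datetime, timezone

def base_record(level, service, trace_id, message, **extra):
    """Assemble the baseline fields every log entry should carry."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),  # ISO 8601, UTC
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "message": message,
    }
    record.update(extra)  # any additional context fields
    return json.dumps(record)

line = base_record("error", "payment-api", "abc123def456",
                   "Payment processing failed", order_id="ord_3f9a1c")
```

Real services would route this through a logging library rather than print JSON by hand, but the field contract is the part that matters.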

Correlation IDs

A correlation ID (also called a request ID) is a unique identifier that follows a request across every service it touches. It is the single most important field for debugging distributed systems — it lets you find every log entry, span, and metric related to one user action.

How it works:

  • The first service (API gateway or load balancer) generates a unique ID
  • The ID is passed via HTTP headers to every downstream service
  • Every service includes the ID in all log entries and spans
  • Common header names: X-Correlation-ID, X-Request-ID, traceparent (W3C)

Middleware Example (Python / Flask)

import uuid
from flask import Flask, request, g

app = Flask(__name__)

@app.before_request
def set_correlation_id():
    """Extract or generate correlation ID for every request."""
    # Check incoming headers (may come from upstream service)
    correlation_id = (
        request.headers.get('X-Correlation-ID') or
        request.headers.get('X-Request-ID') or
        str(uuid.uuid4())
    )
    g.correlation_id = correlation_id

@app.after_request
def add_correlation_header(response):
    """Include correlation ID in response headers."""
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

# When calling downstream services, propagate the ID
import requests as http_client

def call_downstream(url, data):
    return http_client.post(url, json=data, headers={
        'X-Correlation-ID': g.correlation_id,
        'Content-Type': 'application/json',
    })

Request Flow with Correlation ID

# Correlation ID propagation across 3 services
#
# correlation_id = "req-7a3f-2b1c-4d5e"
#
# +----------+          +------------+          +-------------+
# |  Client  |  ------> | API Gateway|  ------> | Order Svc   |
# |          |          |            |          |             |
# |          |          | Generates: |          | Receives:   |
# |          |          | X-Corr-ID  |          | X-Corr-ID   |
# +----------+          +-----+------+          +------+------+
#                             |                        |
#                             |                        v
#                             |                 +-------------+
#                             |                 | Payment Svc |
#                             |                 |             |
#                             |                 | Receives:   |
#                             |                 | X-Corr-ID   |
#                             |                 +-------------+
#
# All 3 services log with the same correlation_id:
#
# [API Gateway]  {"correlation_id":"req-7a3f-2b1c-4d5e","msg":"request received"}
# [Order Svc]    {"correlation_id":"req-7a3f-2b1c-4d5e","msg":"order created"}
# [Payment Svc]  {"correlation_id":"req-7a3f-2b1c-4d5e","msg":"payment processed"}
#
# Query all logs for this request:
#   {job=~".+"} | json | correlation_id = "req-7a3f-2b1c-4d5e"

Log Aggregation Tools

Label-based Grafana Loki

Like Prometheus, but for logs. Indexes labels only, not log content, making it dramatically cheaper to operate at scale. Uses LogQL for queries.

# Key characteristics
# - Label-based indexing (not full-text)
# - LogQL query language
# - Object storage backend (S3, GCS)
# - 10-100x cheaper than Elasticsearch
# - Native Grafana integration

# Storage: ~$0.02/GB/month (S3)
# vs Elasticsearch: ~$0.50/GB/month
Full-text ELK Stack

Elasticsearch + Logstash + Kibana. Full-text search and analytics engine. Powerful but resource-hungry. Best when you need complex text queries.

# Components
# Elasticsearch — search & storage
# Logstash — ingestion & transform
# Kibana — visualization & dashboards
# Beats — lightweight log shippers

# Strengths: full-text search,
#   complex aggregations, mature
# Weakness: high memory/storage cost
Forwarding Fluentd / Fluent Bit

Unified log forwarding and processing layer. Collects from multiple sources, transforms, and routes to any destination. Fluent Bit is the lightweight variant.

# Fluentd — full-featured, Ruby
# Fluent Bit — lightweight, C
#
# Input plugins: tail, syslog,
#   docker, kubernetes
# Output plugins: Loki, ES, S3,
#   Kafka, Datadog, CloudWatch
#
# Use Fluent Bit as a DaemonSet
# in Kubernetes for log collection

LogQL Examples

LogQL is Grafana Loki's query language. It uses the same label matching syntax as PromQL, combined with pipeline operators for filtering, parsing, and aggregating log content.

# --- Basic label matching ---
# All logs from the api job
{job="api"}

# Multiple label matchers
{job="api", env="production", level="error"}

# Regex label matching
{job=~"api|payment", namespace="prod"}

# --- Filter by content ---
# Lines containing "error"
{job="api"} |= "error"

# Lines NOT containing "healthcheck"
{job="api"} != "healthcheck"

# Regex filter
{job="api"} |~ "status=(4|5)\\d{2}"

# --- JSON parsing ---
# Parse JSON and filter by field
{job="api"} | json | status >= 500

# Extract specific fields
{job="api"} | json | line_format "{{.method}} {{.path}} {{.status}}"

# Filter parsed fields
{job="api"} | json | duration_ms > 1000 | level = "error"

# --- Rate queries (metric from logs) ---
# Error rate per second over 5 minutes
rate({job="api"} |= "error" [5m])

# Errors per minute by service
sum by (service) (
  rate({job=~".+"} | json | level = "error" [5m])
) * 60

# --- Top errors ---
# Top 10 error messages by frequency
topk(10,
  sum by (error) (
    count_over_time({job="api"} | json | level = "error" [1h])
  )
)

# --- Bytes rate (log volume) ---
# Bytes per second by job
bytes_rate({job="api"} [5m])

# --- Quantile from logs ---
# P95 duration from structured logs
quantile_over_time(0.95,
  {job="api"} | json | unwrap duration_ms [5m]
) by (handler)

Log Sampling & Cost Control

Logging everything in production can cost more than your infrastructure

A single service logging 1,000 req/s at 1 KB/log generates 86 GB/day. Across 20 services, that is 1.7 TB/day before ingestion overhead. At $0.50/GB for Elasticsearch, that is $850/day just for log storage. Sampling and filtering are not optional — they are economic necessities.
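The arithmetic above is easy to reproduce. A quick sketch, using the same assumptions as the text (1,000 req/s, 1 KB per log line, 20 services, $0.50/GB for Elasticsearch storage):

```python
# Back-of-envelope log volume and cost, matching the figures in the text.
req_per_sec = 1_000
bytes_per_log = 1_000          # 1 KB per structured log line
services = 20
price_per_gb = 0.50            # Elasticsearch-class storage

gb_per_day_one_service = req_per_sec * bytes_per_log * 86_400 / 1e9
tb_per_day_fleet = gb_per_day_one_service * services / 1_000
cost_per_day = gb_per_day_one_service * services * price_per_gb

print(f"{gb_per_day_one_service:.0f} GB/day per service")  # 86 GB/day
print(f"{tb_per_day_fleet:.1f} TB/day across the fleet")   # 1.7 TB/day
print(f"${cost_per_day:.0f}/day for storage")              # $864/day
```

The per-day cost lands in the same ballpark as the $850 quoted above; the exact figure depends on whether you count 1 KB as 1,000 or 1,024 bytes and on ingestion overhead.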

Cost control strategies:

  • Sample DEBUG logs: Log 1-10% of DEBUG messages using a token bucket or probabilistic sampler
  • Always keep ERROR and FATAL: Never sample error-level logs — these are your lifeline during incidents
  • Rate-limit per source: Cap log throughput per service/pod to prevent noisy neighbor problems
  • Drop health check logs: Filter out high-frequency, low-value logs like /healthz and /readyz
  • Use log levels aggressively: Set production minimum to INFO, enable DEBUG per-service as needed
  • Tiered retention: Keep hot logs for 7 days, warm for 30 days, cold (S3) for 90 days
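Dropping health-check noise (the fourth strategy above) can also happen at the application layer. A minimal sketch using the stdlib `logging.Filter`, assuming access-log records carry the request path in the message:

```python
import logging

# Paths assumed to be low-value, high-frequency health probes
HEALTH_PATHS = ("/healthz", "/readyz", "/livez")

class DropHealthChecks(logging.Filter):
    """Reject access-log records for health-check endpoints."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        # Keep the record unless it mentions a health-check path
        return not any(path in msg for path in HEALTH_PATHS)

logger = logging.getLogger("access")
logger.addFilter(DropHealthChecks())

logger.warning("GET /healthz 200")     # dropped by the filter
logger.warning("GET /api/orders 500")  # kept
```

In practice the same filtering is usually done in the shipper (Fluent Bit grep filter, promtail pipeline stages) so the application does not pay serialization cost for logs that will be discarded anyway.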

Token Bucket Sampling Pattern

# Token bucket rate limiter for log sampling
# Allows burst logging but caps sustained rate

import time
import threading

class LogSampler:
    """Token bucket sampler: allows `rate` logs/sec with
    burst capacity of `bucket_size`."""

    def __init__(self, rate=10, bucket_size=50):
        self.rate = rate            # tokens per second
        self.bucket_size = bucket_size
        self.tokens = bucket_size   # start full
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def should_log(self, level: str) -> bool:
        # Always log ERROR and above
        if level in ("ERROR", "FATAL", "CRITICAL"):
            return True

        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.bucket_size,
                self.tokens + elapsed * self.rate
            )
            self.last_refill = now

            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage
sampler = LogSampler(rate=10, bucket_size=50)

if sampler.should_log("DEBUG"):
    logger.debug("cache_miss", key=cache_key)

# ERROR always passes through
if sampler.should_log("ERROR"):
    logger.error("payment_failed", order_id=order_id)
05

Distributed Tracing

Trace structure, span anatomy, instrumentation patterns, sampling strategies, and tracing backend comparisons.

Core Concepts

Distributed tracing captures the end-to-end path of a request as it flows through multiple services. Each trace tells the story of one user action — which services were involved, how long each took, and where failures occurred.

  • Trace: The complete end-to-end request path, identified by a unique trace_id. A trace is a directed acyclic graph of spans.
  • Span: A single unit of work within a trace — an HTTP call, a database query, a function invocation. Each span has a start time, duration, and status.
  • Context Propagation: The mechanism by which trace context (trace_id, span_id) passes between services, typically via HTTP headers like traceparent (W3C) or b3 (Zipkin).
  • Baggage: Key-value pairs that travel with the trace context across all service boundaries. Useful for propagating user IDs, tenant IDs, or feature flags without modifying service interfaces.
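The traceparent header mentioned above has a simple fixed format: version-trace_id-span_id-flags. A dependency-free sketch of generating and parsing it, useful for understanding what OpenTelemetry propagators do under the hood (this is an illustration, not the SDK's actual implementation):

```python
import secrets

def make_traceparent(sampled: bool = True) -> str:
    """Build a W3C traceparent header: version-trace_id-span_id-flags."""
    trace_id = secrets.token_hex(16)   # 128-bit -> 32 hex chars
    span_id = secrets.token_hex(8)     # 64-bit -> 16 hex chars
    flags = "01" if sampled else "00"  # bit 0 = sampled
    return f"00-{trace_id}-{span_id}-{flags}"

def parse_traceparent(header: str) -> dict:
    """Split a traceparent header into its four fields."""
    version, trace_id, span_id, flags = header.split("-")
    return {
        "version": version,
        "trace_id": trace_id,
        "span_id": span_id,
        "sampled": flags == "01",
    }

header = make_traceparent()
print(header)  # e.g. 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
```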

Trace Structure

Trace ID: abc123

API Gateway         |============================|  120ms
  Auth Service      |===|                            15ms
  Order Service          |=====================|     90ms
    Database Query       |========|                  25ms
  Payment Service              |============|        50ms
    Stripe API                  |=========|          40ms

Each span is a child of the span that initiated it. The root span (API Gateway) encompasses the entire request. Child spans show the work breakdown, revealing where time is actually spent.

Span Anatomy

Every span carries a set of standard attributes that describe the work it represents. Understanding these fields is essential for effective trace analysis and custom instrumentation.

Attribute      | Type             | Description                                  | Example
trace_id       | string (128-bit) | Unique identifier for the entire trace       | 4bf92f3577b34da6a3ce929d0e0e4736
span_id        | string (64-bit)  | Unique identifier for this span              | 00f067aa0ba902b7
parent_span_id | string (64-bit)  | ID of the parent span (empty for root)       | a3ce929d0e0e4736
operation_name | string           | Human-readable name of the operation         | HTTP GET /api/orders
start_time     | timestamp        | When the span started (nanosecond precision) | 2024-03-15T14:23:45.123456789Z
duration       | nanoseconds      | How long the operation took                  | 45200000 (45.2ms)
status         | enum             | OK, ERROR, or UNSET                          | ERROR
attributes     | key-value map    | Custom metadata on the span                  | http.method=GET, http.status_code=200

Example Span (JSON)

{
  "traceId": "4bf92f3577b34da6a3ce929d0e0e4736",
  "spanId": "00f067aa0ba902b7",
  "parentSpanId": "a3ce929d0e0e4736",
  "operationName": "HTTP GET /api/orders/ord_3f9a1c",
  "startTime": "2024-03-15T14:23:45.123456789Z",
  "duration": 45200000,
  "status": { "code": "OK" },
  "attributes": {
    "http.method": "GET",
    "http.url": "/api/orders/ord_3f9a1c",
    "http.status_code": 200,
    "http.response_content_length": 1284,
    "service.name": "order-service",
    "service.version": "1.4.2",
    "deployment.environment": "production"
  },
  "events": [
    {
      "name": "cache_miss",
      "timestamp": "2024-03-15T14:23:45.125000000Z",
      "attributes": { "cache.key": "order:ord_3f9a1c" }
    }
  ],
  "links": []
}

Instrumentation

Instrumentation is the process of adding tracing code to your application. OpenTelemetry provides two approaches: auto-instrumentation (zero-code, uses agents/hooks) and manual instrumentation (explicit span creation in your code).

Auto vs Manual Instrumentation

Aspect            | Auto-Instrumentation                     | Manual Instrumentation
Setup effort      | Low (install agent/package)              | Medium (add code to each operation)
Coverage          | HTTP, gRPC, database clients, messaging  | Any code path you choose
Custom attributes | Limited (generic HTTP/DB attributes)     | Full control (business-specific data)
Maintenance       | Automatic updates with library versions  | Must maintain with code changes
Best for          | Quick start, framework-level visibility  | Business logic, custom operations

Recommendation: use both. Auto-instrumentation as the base, manual spans for business-critical paths.

Python — OpenTelemetry Manual Spans

from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Setup
provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        # insecure=True: plain (non-TLS) gRPC to a local collector
        OTLPSpanExporter(endpoint="localhost:4317", insecure=True)
    )
)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer("order-service")

def process_order(order_id: str, user_id: str):
    # Create a span for this operation
    with tracer.start_as_current_span("process-order") as span:
        span.set_attribute("order.id", order_id)
        span.set_attribute("user.id", user_id)

        # Child span for database lookup
        with tracer.start_as_current_span("db-lookup") as db_span:
            db_span.set_attribute("db.system", "postgresql")
            db_span.set_attribute("db.statement", "SELECT * FROM orders WHERE id = $1")
            order = db.fetch_order(order_id)

        # Child span for payment
        with tracer.start_as_current_span("charge-payment") as pay_span:
            pay_span.set_attribute("payment.gateway", "stripe")
            pay_span.set_attribute("payment.amount", order.total)
            try:
                result = payment_gateway.charge(order)
                pay_span.set_attribute("payment.status", "success")
            except PaymentError as e:
                pay_span.set_status(trace.StatusCode.ERROR, str(e))
                pay_span.record_exception(e)
                raise

        span.set_attribute("order.status", "completed")
        return order

Context Propagation Between Services

# Service A — outgoing HTTP call (inject context)
from opentelemetry.propagate import inject
import requests

def call_payment_service(order):
    headers = {}
    inject(headers)  # Injects traceparent header
    # headers now contains:
    # {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}

    response = requests.post(
        "http://payment-service/charge",
        json=order.to_dict(),
        headers=headers,
    )
    return response.json()


# Service B — incoming HTTP request (extract context)
from opentelemetry.propagate import extract
from flask import request

@app.route("/charge", methods=["POST"])
def handle_charge():
    # Extract trace context from incoming headers
    context = extract(request.headers)

    # Start a new span linked to the parent trace
    with tracer.start_as_current_span(
        "handle-charge",
        context=context,
    ) as span:
        span.set_attribute("payment.method", "credit_card")
        # This span is now a child of the span in Service A
        result = process_charge(request.json)
        return jsonify(result)

Sampling Strategies

Tracing every request in a high-traffic system is prohibitively expensive. Sampling reduces volume while preserving the traces that matter most — errors, slow requests, and important business transactions.

  • Head-based sampling: The decision to sample is made at the very start of a trace (at the root span). Simple, low overhead, but may miss interesting traces because the decision is made before the outcome is known.
  • Tail-based sampling: The decision is made after the trace is complete, at the collector level. This allows keeping all error traces and slow traces, but requires buffering complete traces in memory, increasing collector overhead.
  • Combined approach: Head-sample at a base rate (e.g., 10%), then tail-sample at the collector to ensure all errors and slow traces are retained.
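For head-based sampling, the decision is typically derived deterministically from the trace ID, so every service in the trace reaches the same answer without coordination. A minimal sketch of that idea (an illustration of the principle, not the exact OpenTelemetry TraceIdRatioBased sampler):

```python
import hashlib

def head_sample(trace_id: str, rate: float = 0.10) -> bool:
    """Deterministic head sampling: hash the trace ID into [0, 1)
    and keep the trace if it falls below the sampling rate.
    The same trace ID yields the same decision everywhere."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate

# Same trace ID -> same decision on every service
tid = "4bf92f3577b34da6a3ce929d0e0e4736"
assert head_sample(tid) == head_sample(tid)

# Over many traces, roughly `rate` of them are kept
kept = sum(head_sample(f"trace-{i}") for i in range(10_000))
print(f"kept {kept}/10000")  # typically close to 1000
```

The weakness the text describes is visible here: the decision depends only on the trace ID, so an error that happens later in the request cannot rescue an unsampled trace. That is exactly what tail sampling at the collector fixes.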

Sampling Strategy Comparison

Strategy   | Decision Point                    | Overhead      | Error Coverage   | Best For
Head-based | Trace start (root span)           | Low           | Partial (random) | High-traffic, cost-sensitive systems
Tail-based | After trace completes (collector) | High (memory) | Full             | When every error trace must be captured
Combined   | Head first, tail at collector     | Medium        | Full             | Best balance of cost and coverage
Rule-based | Per-service rules                 | Medium        | Configurable     | Different rates for different endpoints

OTel Collector Tail Sampling Configuration

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  tail_sampling:
    decision_wait: 10s           # Wait time for complete trace
    num_traces: 100000           # Max traces in memory
    expected_new_traces_per_sec: 1000
    policies:
      # Always keep error traces
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]

      # Always keep slow traces (> 1 second)
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000

      # Sample 10% of remaining traces
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10

      # Always trace specific endpoints
      - name: critical-endpoints
        type: string_attribute
        string_attribute:
          key: http.url
          values: ["/api/checkout", "/api/payment"]

  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [tail_sampling, batch]
      exporters: [otlp]

Tracing Tools

CNCF Jaeger

End-to-end distributed tracing platform originally built by Uber. CNCF graduated project. Rich UI with dependency graphs and trace comparison.

# Run Jaeger all-in-one (dev)
docker run -d --name jaeger \
  -p 16686:16686 \
  -p 4317:4317 \
  -p 4318:4318 \
  jaegertracing/all-in-one:latest

# UI: http://localhost:16686
# Accepts: OTLP, Jaeger, Zipkin
Grafana Grafana Tempo

High-scale trace backend with minimal indexing. Stores traces in object storage (S3/GCS). Integrates natively with Grafana, Loki, and Mimir.

# Key advantages
# - Label-based, no full-text index
# - Object storage = low cost
# - TraceQL query language
# - Exemplar links from metrics
# - Logs-to-traces integration

# TraceQL example
{ span.http.status_code >= 500
  && resource.service.name = "api" }
Lightweight Zipkin

One of the original distributed tracing systems, inspired by Google Dapper. Lightweight, simple setup, good for smaller deployments.

# Run Zipkin (dev)
docker run -d -p 9411:9411 \
  openzipkin/zipkin

# UI: http://localhost:9411
# Accepts: Zipkin B3, OTLP
# Storage: in-memory, MySQL,
#   Cassandra, Elasticsearch

Tracing Backend Comparison

Feature             | Jaeger                                    | Grafana Tempo                   | Zipkin
Query Language      | UI-based search                           | TraceQL                         | UI-based search
Storage             | Elasticsearch, Cassandra, Kafka           | Object storage (S3, GCS, Azure) | In-memory, MySQL, Cassandra, ES
Scale               | Large                                     | Very large                      | Small-Medium
Cost at scale       | Medium (requires ES/Cassandra)            | Low (object storage)            | Low (simple storage)
Grafana integration | Plugin                                    | Native                          | Plugin
Best for            | Teams wanting rich UI and dependency maps | Grafana stack users at scale    | Small teams wanting simplicity
06

Alerting & On-Call

Prometheus alert rules, Alertmanager routing, on-call best practices, runbook templates, and common anti-patterns to avoid.

Prometheus Alert Rules

Alert rules are PromQL expressions evaluated at regular intervals. When an expression returns results and the condition persists for the for duration, the alert fires and is sent to Alertmanager.

# alerting-rules.yaml
groups:
  - name: service_alerts
    interval: 30s
    rules:

      # --- High Latency ---
      # P95 latency exceeds 500ms for 5 minutes
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job)
          ) > 0.5
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "High P95 latency on {{ $labels.job }}"
          description: "P95 latency is {{ $value | humanizeDuration }} (threshold: 500ms)"
          runbook: "https://wiki.internal/runbooks/high-latency"

      # --- High Error Rate ---
      # Error rate exceeds 1% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum by (job) (rate(http_requests_total[5m]))
          * 100 > 1
        for: 5m
        labels:
          severity: critical
          team: platform
        annotations:
          summary: "High error rate on {{ $labels.job }}"
          description: "Error rate is {{ $value | printf \"%.2f\" }}% (threshold: 1%)"
          runbook: "https://wiki.internal/runbooks/high-error-rate"

      # --- High CPU ---
      # CPU usage above 80% for 10 minutes
      - alert: HighCPU
        expr: |
          100 - (avg by (instance) (
            rate(node_cpu_seconds_total{mode="idle"}[5m])
          ) * 100) > 80
        for: 10m
        labels:
          severity: warning
          team: infrastructure
        annotations:
          summary: "High CPU on {{ $labels.instance }}"
          description: "CPU usage is {{ $value | printf \"%.1f\" }}%"
          runbook: "https://wiki.internal/runbooks/high-cpu"

      # --- Disk Will Fill ---
      # Linear prediction shows disk full within 4 hours
      - alert: DiskWillFill
        expr: |
          predict_linear(
            node_filesystem_avail_bytes{mountpoint="/"}[6h], 4 * 3600
          ) < 0
        for: 30m
        labels:
          severity: critical
          team: infrastructure
        annotations:
          summary: "Disk on {{ $labels.instance }} predicted to fill within 4h"
          description: "Predicted available in 4h: {{ $value | humanize1024 }}B"
          runbook: "https://wiki.internal/runbooks/disk-full"

      # --- Traffic Drop ---
      # 50% decrease in traffic compared to same time yesterday
      - alert: TrafficDrop
        expr: |
          sum(rate(http_requests_total[5m]))
          /
          sum(rate(http_requests_total[5m] offset 1d))
          < 0.5
        for: 15m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "Significant traffic drop detected"
          description: "Current traffic is {{ $value | humanizePercentage }} of the same time yesterday"
          runbook: "https://wiki.internal/runbooks/traffic-drop"

Alertmanager Configuration

Alertmanager handles alert deduplication, grouping, routing, silencing, and inhibition. Its routing tree determines which alerts go to which receivers based on label matching.

# alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.internal:587'
  smtp_from: 'alerts@company.com'
  slack_api_url: 'https://hooks.slack.com/services/T00/B00/xxxxx'

# Inhibition — suppress less severe alerts when critical is firing
inhibit_rules:
  - source_matchers:
      - severity = critical
    target_matchers:
      - severity = warning
    equal: ['job', 'instance']

# Routing tree — match alerts to receivers
route:
  receiver: default-email          # Fallback receiver
  group_by: ['alertname', 'job']   # Group related alerts
  group_wait: 30s                  # Wait before first notification
  group_interval: 5m               # Wait between grouped notifications
  repeat_interval: 4h              # Re-notify after this interval

  routes:
    # Critical alerts → PagerDuty (immediate)
    - matchers:
        - severity = critical
      receiver: pagerduty-critical
      group_wait: 10s
      repeat_interval: 1h
      continue: false

    # Warning alerts → Slack channel
    - matchers:
        - severity = warning
      receiver: slack-warnings
      group_wait: 1m
      repeat_interval: 4h

    # Infrastructure team alerts
    - matchers:
        - team = infrastructure
      receiver: slack-infra
      routes:
        - matchers:
            - severity = critical
          receiver: pagerduty-infra

# Receivers
receivers:
  - name: default-email
    email_configs:
      - to: 'oncall@company.com'
        send_resolved: true

  - name: pagerduty-critical
    pagerduty_configs:
      - routing_key: 'YOUR_PAGERDUTY_KEY'
        severity: critical
        description: '{{ .CommonAnnotations.summary }}'
        details:
          description: '{{ .CommonAnnotations.description }}'
          runbook: '{{ .CommonAnnotations.runbook }}'

  - name: slack-warnings
    slack_configs:
      - channel: '#alerts-warning'
        send_resolved: true
        title: '{{ .CommonLabels.alertname }}'
        text: '{{ .CommonAnnotations.description }}'
        actions:
          - type: button
            text: 'Runbook'
            url: '{{ .CommonAnnotations.runbook }}'
          - type: button
            text: 'Silence'
            url: '{{ .ExternalURL }}/#/silences/new?filter=%7B'

  - name: slack-infra
    slack_configs:
      - channel: '#alerts-infra'
        send_resolved: true

  - name: pagerduty-infra
    pagerduty_configs:
      - routing_key: 'YOUR_INFRA_PD_KEY'
Routing tree evaluation

Alertmanager evaluates routes top-to-bottom. The first matching route wins, unless continue: true is set, which allows an alert to match multiple routes. Use group_by to aggregate related alerts into a single notification — grouping by [alertname, job] means all instances of the same alert for the same job arrive as one message.
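The first-match-wins behavior, with continue as the escape hatch, can be sketched in a few lines. This is a simplified model of the matching logic, not Alertmanager's actual implementation, and the route/receiver names are illustrative:

```python
def route_alert(alert: dict, routes: list, fallback: str) -> list:
    """Walk child routes top-to-bottom. The first route whose matchers
    all agree with the alert's labels wins; evaluation only proceeds
    past a matching route if it sets continue=True."""
    receivers = []
    for route in routes:
        if all(alert.get(k) == v for k, v in route["matchers"].items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                return receivers
    return receivers or [fallback]

routes = [
    {"matchers": {"severity": "critical"}, "receiver": "pagerduty-critical"},
    {"matchers": {"severity": "warning"}, "receiver": "slack-warnings"},
    {"matchers": {"team": "infrastructure"}, "receiver": "slack-infra"},
]

print(route_alert({"severity": "critical", "team": "infrastructure"},
                  routes, "default-email"))
# prints ['pagerduty-critical'] because the team route is never reached
```

Note how a critical infrastructure alert never reaches the team route: the severity route matched first and did not set continue. Ordering your routing tree is as important as writing the matchers.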

On-Call Best Practices

Escalation Policies

Level | Responder                              | Timeout    | Action
L1    | Primary on-call engineer               | 5 minutes  | Acknowledge and begin investigation
L2    | Secondary on-call engineer             | 10 minutes | Escalated if L1 does not acknowledge
L3    | Engineering manager / team lead        | 15 minutes | Escalated if L2 does not acknowledge
L4    | VP of Engineering / incident commander | 30 minutes | Full incident response mobilized

Alert Fatigue Reduction — 6 Strategies

  1. Set appropriate thresholds with for clause: A 30-second CPU spike is noise. Require the condition to persist for 5-10 minutes before alerting. The for clause eliminates transient spikes and flapping.
  2. Use inhibition rules: When a database is down, suppress all downstream service alerts. Inhibition rules in Alertmanager prevent alert storms from a single root cause.
  3. Aggregate related alerts: Group by alertname and job so 50 pod restarts arrive as one notification, not 50 separate pages.
  4. Regular alert review: Schedule monthly alert audits. Delete alerts that never fire, tune thresholds on alerts that fire too often, and improve runbooks for alerts that take too long to resolve.
  5. Every alert must be actionable: If the response to an alert is "ignore it" or "wait and see," delete the alert or change it to a warning. Every page should require immediate human action.
  6. Progressive severity: Start with a Slack warning. If the condition worsens or persists, escalate to a page. Not everything needs to wake someone at 3 AM.
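Progressive severity (strategy 6) usually means two rules over the same signal with different thresholds and durations. A sketch in the same style as the alert rules earlier in this section; the thresholds are illustrative and `job:http_error_rate:ratio5m` is an assumed recording rule, not a built-in metric:

```yaml
# Same signal, two severities: warn early in Slack, page only when
# the condition is clearly worse and has persisted longer.
- alert: ErrorRateElevated
  expr: job:http_error_rate:ratio5m * 100 > 1
  for: 5m
  labels:
    severity: warning        # routed to Slack
- alert: ErrorRateCritical
  expr: job:http_error_rate:ratio5m * 100 > 5
  for: 10m
  labels:
    severity: critical       # routed to PagerDuty
```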

Runbook Template

Every alert should link to a runbook. A good runbook lets any on-call engineer — even one unfamiliar with the service — diagnose and resolve the issue. Keep runbooks in version control alongside alert rules.

# ============================================
# RUNBOOK: HighErrorRate
# ============================================
# Alert:     HighErrorRate
# Severity:  critical
# Team:      platform
# Updated:   2024-03-15
# ============================================

## Summary
The 5xx error rate for the API service has exceeded 1%
for more than 5 minutes. This alert indicates a significant
number of user-facing requests are failing.

## Impact
- Users are experiencing errors on API calls
- Affected endpoints may be returning 500/502/503 errors
- Business transactions (orders, payments) may be failing

## Investigation Steps

### 1. Check the error rate dashboard
  Open: https://grafana.internal/d/api-errors

### 2. Identify which endpoints are failing
  Query:
    sum by (handler, status) (
      rate(http_requests_total{status=~"5.."}[5m])
    )

### 3. Check recent deployments
  kubectl rollout history deployment/api-server -n production

### 4. Check downstream dependencies
  # Database
  kubectl exec -it postgres-0 -- pg_isready
  # Redis
  redis-cli -h redis.internal ping
  # External APIs
  curl -s https://api.stripe.com/v1/health

### 5. Check application logs
  {job="api-server"} | json | level = "error" | line_format
    "{{.timestamp}} {{.error}} {{.path}}"

### 6. Check resource utilization
  # CPU, memory, disk on API pods
  kubectl top pods -n production -l app=api-server

## Resolution Steps

### If caused by a bad deployment:
  kubectl rollout undo deployment/api-server -n production

### If caused by database issues:
  # Check connection pool
  # Check slow queries
  # Check disk space on DB server

### If caused by downstream service failure:
  # Enable circuit breaker / fallback
  # Scale up affected service
  # Contact responsible team

## Escalation Path
1. Primary on-call (platform team)
2. Service owner: @jane-doe
3. Database team: #db-oncall (if DB-related)
4. Infrastructure: #infra-oncall (if resource-related)

## Related Alerts
- HighLatency (often fires alongside this alert)
- DatabaseConnectionPoolExhausted
- DownstreamServiceTimeout

Alert Anti-Patterns

Common alerting mistakes that cause on-call burnout
  • Alerting on raw thresholds instead of SLOs: Alert on "error budget burn rate > 2x" instead of individual error-count thresholds. SLO-based alerts directly reflect user impact and are more meaningful.
  • No for clause (flapping alerts): Without a for duration, alerts fire on momentary spikes and resolve immediately, creating a stream of fire-then-resolve notifications that erode trust.
  • Missing runbook links: An alert without a runbook forces the on-call engineer to reverse-engineer the problem from scratch at 3 AM. Every alert annotation must include a runbook URL.
  • Alerting on things that auto-recover: If a pod restarts and Kubernetes reschedules it within 30 seconds, paging a human adds no value. Alert only on conditions that persist beyond automatic recovery.
  • Too many critical alerts (alert fatigue): When everything is critical, nothing is. Reserve critical/page severity for issues that directly impact users or revenue. Use warning/ticket severity for everything else.
  • Alerting on causes instead of effects: "CPU is high" is a cause. "Request latency exceeds SLO" is the effect users feel. Alert on the effect; investigate the cause in the runbook.
  • Copy-pasting thresholds: Every service has different baselines. A 500ms P95 might be normal for a report generator but catastrophic for an API gateway. Set thresholds based on actual SLOs, not arbitrary numbers.
The ideal alert checklist
  • Has a meaningful for duration (no flapping)
  • Includes runbook URL in annotations
  • Has clear summary and description with template variables
  • Severity matches impact (critical = user-facing, warning = degraded)
  • Requires immediate human action (not "wait and see")
  • Is reviewed and tuned at least quarterly
  • Has been tested with amtool check-config and dry-run routing
07

Dashboards & Visualization

Dashboard design principles, essential dashboard types, Grafana panel types, template variables, and dashboard-as-code workflows.

Dashboard Design Principles

A dashboard is not a collection of random graphs. It is a visual argument about the health of a system. Every panel should answer a question, and every dashboard should tell a story. The goal is a 5-second glance that tells you whether things are fine or need attention.

  • Z-pattern scanning: Users scan from top-left → top-right → bottom-left → bottom-right. Place the most critical panels (error rates, SLO status) in the top-left quadrant. Place detail panels and drill-downs in the lower sections.
  • Visual hierarchy — larger panels = more important: A full-width time-series panel for request rate commands attention. A small stat panel for uptime is a supporting detail. Size encodes priority.
  • 5–7 panels per dashboard for quick comprehension: A dashboard with 30 panels requires scrolling and context-switching. Break large dashboards into linked sub-dashboards with drill-down links.
  • Consistent color coding: Red means something is broken. Green means healthy. Yellow means warning or degraded. Never deviate from this convention — it leverages pre-attentive processing so viewers can assess status before conscious thought.

Match Visualization Type to Data Type

Data Type        | Best Visualization | Example
Trends over time | Time series        | CPU usage, request rate, error count over 24h
Current value    | Stat / Gauge       | Uptime %, current error rate, active connections
Distribution     | Heatmap            | Latency percentiles, request size distribution
Comparison       | Bar chart          | Service-to-service traffic, resource usage by pod
Detail           | Table              | Active incidents, top endpoints by latency, error log entries

Essential Dashboard Types

Every team needs at least four dashboards covering different perspectives of system health. Each maps to a monitoring methodology and answers different questions.

Services RED Dashboard

For each service: Request rate (throughput), Error rate as a percentage of total requests, and Duration as P50/P95/P99 latency. The RED method monitors the user experience of your service.

# Request Rate (per service)
sum(rate(http_requests_total{
  job="$service"
}[5m])) by (handler)

# Error Rate %
sum(rate(http_requests_total{
  job="$service",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{
  job="$service"}[5m])) * 100

# P95 Latency
histogram_quantile(0.95,
  sum(rate(
    http_request_duration_seconds_bucket{
      job="$service"
    }[5m]
  )) by (le)
)
Infrastructure USE Dashboard

For infrastructure resources: Utilization (how busy is it), Saturation (how overloaded is it), and Errors (how often does it fail). Covers CPU, memory, disk, and network.

# CPU Utilization %
100 - (avg by (instance) (
  rate(node_cpu_seconds_total{
    mode="idle"
  }[5m])
) * 100)

# Memory Saturation (swap usage)
node_memory_SwapTotal_bytes
  - node_memory_SwapFree_bytes

# Disk Saturation (weighted I/O queue time)
rate(node_disk_io_time_weighted_seconds_total[5m])
Reliability SLO Dashboard

Error budget remaining over 30-day window, burn rate alerts, availability plotted over time. The SLO dashboard answers: “Are we meeting our promises to users?”

# Error Budget Remaining (%)
(1 - (
  sum(rate(http_requests_total{
    status=~"5.."}[30d]))
  /
  sum(rate(http_requests_total[30d]))
) / (1 - 0.999)) * 100

# Burn Rate (1h window)
sum(rate(http_requests_total{
  status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
/
(1 - 0.999)
Overview Golden Signals

High-level system health at a glance. Combines latency, traffic, errors, and saturation into a single view. This is the dashboard you put on the team’s wall TV.

# Latency (P95 across all services)
histogram_quantile(0.95,
  sum(rate(
    http_request_duration_seconds_bucket[5m]
  )) by (le))

# Traffic (total req/s)
sum(rate(http_requests_total[5m]))

# Error Rate (global)
sum(rate(http_requests_total{
  status=~"5.."}[5m]))
/
sum(rate(http_requests_total[5m])) * 100

# Saturation (avg CPU across fleet)
avg(100 - (avg by (instance) (
  rate(node_cpu_seconds_total{
    mode="idle"}[5m])) * 100))

Grafana Panel Types

Grafana ships with a rich set of visualization panels. Choosing the right panel type for your data is critical — a heatmap for latency distribution conveys more insight than a time series of average latency.

Trend Time Series

The workhorse panel. Shows values changing over time with lines, bars, or points. Supports multiple series, thresholds, and annotations. Use for CPU usage, request rate, error trends, and any metric that varies over time.

Value Stat

Single big number with optional sparkline. Supports color thresholds (green/yellow/red) that change the background based on value ranges. Use for uptime percentage, current error rate, request count, or any single KPI.

Range Gauge

Circular or bar gauge showing current value within a min/max range. Color thresholds indicate healthy, warning, and critical zones. Use for CPU utilization, memory usage, disk capacity, and any bounded metric.

Distribution Heatmap

Two-dimensional visualization using color intensity to represent value density. Time on x-axis, bucket on y-axis, color = count. Use for latency distribution, request size patterns, and histogram data from Prometheus.

Detail Table

Multi-column tabular display with sorting, filtering, and cell-level formatting. Supports links, thresholds, and unit formatting. Use for top-N endpoints, active incidents, pod status, or any multi-dimensional data.

Text Logs

Integrated log viewing panel connected to Loki or Elasticsearch. Supports live tailing, field extraction, and log-to-trace links via trace IDs. Embeds log context directly alongside metric dashboards.

Template Variables

Template variables turn one dashboard into many. Instead of creating separate dashboards for each service, namespace, or environment, use dropdown variables that dynamically filter all panels. This eliminates dashboard sprawl and keeps your Grafana instance manageable.

Common variables every dashboard should include:

  • $namespace — Kubernetes namespace (production, staging, dev)
  • $service — Service or job name, populated from label values
  • $interval — PromQL range interval that adapts to the dashboard time range
  • $instance — Specific instance for drill-down views

Grafana Variable Definition (JSON)

{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(kube_pod_info, namespace)",
        "refresh": 2,
        "includeAll": true,
        "multi": true,
        "current": {
          "text": "production",
          "value": "production"
        }
      },
      {
        "name": "service",
        "type": "query",
        "datasource": "Prometheus",
        "query": "label_values(http_requests_total{namespace=\"$namespace\"}, job)",
        "refresh": 2,
        "includeAll": false,
        "multi": false
      },
      {
        "name": "interval",
        "type": "interval",
        "query": "1m,5m,15m,1h",
        "auto": true,
        "auto_min": "1m",
        "auto_count": 30
      }
    ]
  }
}
Variable refresh settings

Set "refresh": 2 (on time range change) for variables that depend on the selected time window. Use "refresh": 1 (on dashboard load) for stable labels like namespace or cluster names. Avoid "refresh": 0 (never) in production — stale variable values cause confusing empty panels.

Dashboard as Code

Manually building dashboards in the Grafana UI is fine for prototyping, but production dashboards must be version-controlled and reproducible. Dashboard-as-code tools generate Grafana JSON from a higher-level language, enabling code review, templating, and CI/CD deployment.

Grafonnet (Jsonnet) Example

local grafonnet = import 'grafonnet-latest/main.libsonnet';
local dashboard = grafonnet.dashboard;
local panel = grafonnet.panel;
local prometheus = grafonnet.query.prometheus;

local requestRatePanel =
  panel.timeSeries.new('Request Rate')
  + panel.timeSeries.queryOptions.withTargets([
      prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{job="$service"}[$interval])) by (handler)'
      )
      + prometheus.withLegendFormat('{{ handler }}'),
    ])
  + panel.timeSeries.standardOptions.withUnit('reqps')
  + panel.timeSeries.fieldConfig.defaults.custom.withFillOpacity(10);

local errorRatePanel =
  panel.stat.new('Error Rate')
  + panel.stat.queryOptions.withTargets([
      prometheus.new(
        'Prometheus',
        'sum(rate(http_requests_total{job="$service",status=~"5.."}[$interval])) / sum(rate(http_requests_total{job="$service"}[$interval])) * 100'
      ),
    ])
  + panel.stat.standardOptions.withUnit('percent')
  + panel.stat.options.withGraphMode('area');

grafonnet.dashboard.new('Service Overview')
+ dashboard.withUid('svc-overview')
+ dashboard.withTags(['generated', 'service'])
+ dashboard.withRefresh('30s')
+ dashboard.withPanels([
    requestRatePanel + panel.timeSeries.gridPos.withW(16) + panel.timeSeries.gridPos.withH(8),
    errorRatePanel + panel.stat.gridPos.withW(8) + panel.stat.gridPos.withH(8) + panel.stat.gridPos.withX(16),
  ])
Version-control your dashboards

Manual dashboard changes drift and get lost. Treat dashboards like infrastructure — define them in code, store them in Git, and deploy them through CI/CD. Use Grafana’s provisioning feature or the API to push dashboard JSON from your repository. When a dashboard is modified in the UI, your CI pipeline should detect the drift and reconcile it.

08

OpenTelemetry

The vendor-neutral observability standard. Architecture, Collector configuration, auto-instrumentation, manual instrumentation, and essential environment variables.

Architecture Overview

OpenTelemetry (OTel) is a CNCF incubating project that provides a single set of APIs, SDKs, and tools to instrument, generate, collect, and export telemetry data (metrics, logs, and traces). It is the merger of OpenCensus and OpenTracing and is now the second-most-active CNCF project after Kubernetes.

Application
├── OTel SDK (API + SDK)
│   ├── Traces  → Span Processor → Exporter
│   ├── Metrics → Metric Reader → Exporter
│   └── Logs    → Log Processor → Exporter
└── OTel Collector
    ├── Receivers  (OTLP, Prometheus, Jaeger)
    ├── Processors (Batch, Filter, Attributes)
    └── Exporters  (OTLP, Prometheus, Jaeger, Loki)

The key architectural decision is whether to export telemetry directly from the SDK to a backend, or route it through the OTel Collector. In production, always use the Collector — it decouples your application from your backend choice, enables batching, retry, and filtering, and lets you switch backends without redeploying applications.

OTLP Protocol

Transport     | Default Port | Use Case                                                    | Notes
--------------|--------------|-------------------------------------------------------------|----------------------------------------------------------
gRPC          | 4317         | Service-to-collector, high throughput                       | Binary protobuf, bidirectional streaming, lower overhead
HTTP/protobuf | 4318         | Browser, serverless, environments where gRPC is unavailable | Protobuf over HTTP POST, easier to proxy and load-balance
HTTP/JSON     | 4318         | Debugging, manual testing with curl                         | Human-readable but larger payload, slower serialization

Collector Configuration

The OTel Collector is the central hub that receives, processes, and exports telemetry data. Its configuration is a single YAML file with four top-level sections: receivers, processors, exporters, and service (which wires them into pipelines).

# otel-collector-config.yaml
# ============================================
# Production-ready OTel Collector configuration
# ============================================

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'otel-collector'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8888']

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  resource:
    attributes:
      - key: environment
        value: production
        action: insert
      - key: collector.version
        value: "0.96.0"
        action: insert

exporters:
  otlp:
    endpoint: tempo:4317
    tls:
      insecure: true
  prometheus:
    endpoint: "0.0.0.0:8889"
    namespace: otel
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
    labels:
      attributes:
        severity: ""
        service.name: ""

service:
  telemetry:
    logs:
      level: info
    metrics:
      address: 0.0.0.0:8888
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, resource, batch]
      exporters: [otlp]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
Processor ordering matters

Processors execute in the order listed in the pipeline. Always put memory_limiter first — it protects the Collector from OOM kills by applying backpressure. Put batch after filtering processors so you batch only the data you intend to export.

Auto-Instrumentation

Auto-instrumentation attaches agents or hooks at runtime to capture telemetry from popular frameworks and libraries without any code changes. It is the fastest way to get traces, metrics, and logs flowing from an existing application.

JVM Java

The Java agent attaches via -javaagent and instruments Spring Boot, gRPC, JDBC, Kafka, and 100+ libraries automatically.

# Download the agent
curl -L -o opentelemetry-javaagent.jar \
  https://github.com/open-telemetry/opentelemetry-java-instrumentation/releases/latest/download/opentelemetry-javaagent.jar

# Run with auto-instrumentation
java -javaagent:opentelemetry-javaagent.jar \
  -Dotel.service.name=order-service \
  -Dotel.exporter.otlp.endpoint=http://collector:4317 \
  -jar app.jar
Interpreted Python

The Python distro auto-detects Flask, Django, FastAPI, requests, psycopg2, and more. Zero code changes required.

# Install the distro + auto-instrumentors
pip install opentelemetry-distro
opentelemetry-bootstrap -a install

# Run with auto-instrumentation
OTEL_SERVICE_NAME=order-service \
OTEL_EXPORTER_OTLP_ENDPOINT=http://collector:4318 \
opentelemetry-instrument python app.py
Runtime Node.js

Requires a setup file loaded with --require. Auto-instruments Express, Fastify, pg, mysql2, Redis, and HTTP modules.

// tracing.js — load before app
const { NodeSDK } = require('@opentelemetry/sdk-node');
const { getNodeAutoInstrumentations } =
  require('@opentelemetry/auto-instrumentations-node');
const { OTLPTraceExporter } =
  require('@opentelemetry/exporter-trace-otlp-http');

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter(),
  instrumentations: [
    getNodeAutoInstrumentations()
  ],
});
sdk.start();

// Run: node --require ./tracing.js app.js
Compiled Go

Go has no runtime agent mechanism. Instrumentation must be added manually using OTel SDK wrappers for net/http, gRPC, database/sql, etc.

// Manual instrumentation required
// No auto-instrument agent for Go

// Use instrumentation libraries:
// go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp
// go.opentelemetry.io/contrib/instrumentation/google.golang.org/grpc/otelgrpc

import "go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"

handler := otelhttp.NewHandler(mux, "server")

Manual Instrumentation

Auto-instrumentation captures framework-level telemetry, but business logic requires manual spans and custom metrics. Manual instrumentation adds the domain context that makes traces and metrics meaningful — order IDs, user types, payment amounts, feature flags.

Python — Traces with Custom Attributes and Metrics

from opentelemetry import trace, metrics

# Create a tracer and meter for this service
tracer = trace.get_tracer("order-service")
meter = metrics.get_meter("order-service")

# Define custom metrics
order_counter = meter.create_counter(
    "orders_processed",
    description="Total orders processed",
    unit="1"
)
order_duration = meter.create_histogram(
    "order_duration_ms",
    description="Time to process an order",
    unit="ms"
)

def process_order(order):
    with tracer.start_as_current_span("process-order") as span:
        # Add business context as span attributes
        span.set_attribute("order.id", order.id)
        span.set_attribute("order.total", order.total)
        span.set_attribute("order.item_count", len(order.items))
        span.set_attribute("customer.tier", order.customer.tier)

        try:
            result = execute_order(order)
            order_counter.add(1, {"status": "success", "tier": order.customer.tier})
            return result

        except Exception as e:
            # Record the exception on the span
            span.record_exception(e)
            span.set_status(trace.Status(trace.StatusCode.ERROR, str(e)))
            order_counter.add(1, {"status": "error", "tier": order.customer.tier})
            raise

Adding Child Spans for Sub-Operations

def execute_order(order):
    with tracer.start_as_current_span("validate-inventory") as span:
        span.set_attribute("warehouse.id", order.warehouse_id)
        check_inventory(order.items)

    with tracer.start_as_current_span("charge-payment") as span:
        span.set_attribute("payment.method", order.payment_method)
        span.set_attribute("payment.amount", order.total)
        charge_result = process_payment(order)
        span.set_attribute("payment.transaction_id", charge_result.txn_id)

    with tracer.start_as_current_span("send-confirmation") as span:
        span.set_attribute("notification.channel", "email")
        send_confirmation_email(order)

    return OrderResult(status="completed")

Environment Variables

OTel SDKs are configured primarily through environment variables, making it easy to change behavior without code changes. These variables work across all language SDKs, providing a consistent configuration interface.

Variable                    | Purpose                                                                                                   | Example
----------------------------|-----------------------------------------------------------------------------------------------------------|---------------------------
OTEL_SERVICE_NAME           | Service identifier — the most important attribute. Appears in all telemetry and is used for service maps. | "order-service"
OTEL_TRACES_EXPORTER        | Trace export format. otlp for Collector, console for debugging, none to disable.                          | "otlp"
OTEL_METRICS_EXPORTER       | Metrics export format. Same options as traces.                                                            | "otlp"
OTEL_EXPORTER_OTLP_ENDPOINT | Collector URL. For gRPC use port 4317, for HTTP use 4318.                                                 | "http://localhost:4318"
OTEL_TRACES_SAMPLER         | Sampling strategy. always_on, always_off, traceidratio, or parentbased_traceidratio.                      | "parentbased_traceidratio"
OTEL_TRACES_SAMPLER_ARG     | Sample rate argument. 0.1 means 10% of traces are sampled.                                                | "0.1"
OTEL_RESOURCE_ATTRIBUTES    | Comma-separated key=value pairs added to all telemetry as resource attributes.                            | "env=prod,version=1.2.3"

Kubernetes Deployment with OTel Environment Variables

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  template:
    spec:
      containers:
        - name: order-service
          image: order-service:1.2.3
          env:
            - name: OTEL_SERVICE_NAME
              value: "order-service"
            - name: OTEL_EXPORTER_OTLP_ENDPOINT
              value: "http://otel-collector:4318"
            - name: OTEL_TRACES_SAMPLER
              value: "parentbased_traceidratio"
            - name: OTEL_TRACES_SAMPLER_ARG
              value: "0.1"
            - name: OTEL_RESOURCE_ATTRIBUTES
              value: "env=production,version=1.2.3,k8s.namespace=$(K8S_NAMESPACE)"
            - name: K8S_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
Sampling in production

At scale, tracing 100% of requests is prohibitively expensive. Use parentbased_traceidratio with a rate of 0.1 (10%) as a starting point. The parentbased prefix ensures child spans inherit the parent’s sampling decision, keeping traces complete. For critical paths (checkout, payment), use the Collector’s tail-sampling processor to always capture errors and high-latency traces regardless of the head-sampling rate.
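To build intuition for how ratio sampling stays consistent across services, here is a small Python sketch of a deterministic trace-ID ratio sampler: keep the trace when part of its trace ID, read as an integer, falls below the ratio. This mirrors the idea behind traceidratio; the exact bytes compared vary by SDK, so treat it as illustrative rather than the spec algorithm.

```python
def trace_id_ratio_sample(trace_id_hex: str, ratio: float) -> bool:
    """Keep the trace when the last 8 bytes of the 128-bit trace ID,
    read as an unsigned integer, fall below ratio * 2**64."""
    tail = int(trace_id_hex[-16:], 16)
    return tail < ratio * 2**64

# The same trace ID always yields the same decision, so every service
# in the call path agrees without coordination.
trace_id_ratio_sample("4bf92f3577b34da6a3ce929d0e0e4736", 0.1)  # False
```

Because the decision is a pure function of the trace ID, no state needs to be shared between services; parentbased simply short-circuits this check for child spans by trusting the parent's decision.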

09

Infrastructure Monitoring

Node Exporter for Linux systems, container monitoring with cAdvisor and kube-state-metrics, cloud provider integrations, and a complete Docker Compose monitoring stack.

Node Exporter (Linux Systems)

The Prometheus Node Exporter exposes hardware and OS-level metrics from Linux hosts. It is the standard way to collect CPU, memory, disk, network, and filesystem metrics from bare-metal servers, VMs, and Kubernetes nodes.

Installation & Setup

# Download and run Node Exporter
curl -LO https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar xvfz node_exporter-*.tar.gz
cd node_exporter-*
./node_exporter --web.listen-address=":9100"

# Or run via Docker
docker run -d --name node-exporter \
  --net="host" \
  --pid="host" \
  -v "/:/host:ro,rslave" \
  prom/node-exporter:latest \
  --path.rootfs=/host

# Prometheus scrape config
# Add to prometheus.yml:
scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
    scrape_interval: 15s

Key Metrics

Metric                           | Type    | Description
---------------------------------|---------|---------------------------------------------------------------------------------
node_cpu_seconds_total           | Counter | CPU time spent in each mode (user, system, idle, iowait, etc.)
node_memory_MemAvailable_bytes   | Gauge   | Memory available for allocation without swapping
node_filesystem_avail_bytes      | Gauge   | Available disk space on each mounted filesystem
node_network_receive_bytes_total | Counter | Total bytes received on each network interface
node_load1                       | Gauge   | 1-minute load average — number of processes in runnable or uninterruptible state

Essential PromQL Queries

# CPU Usage % (across all cores)
100 - (avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100)

# Memory Usage %
(1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100

# Disk Usage %
(1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
  / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) * 100

# Network Throughput (bytes/sec received)
rate(node_network_receive_bytes_total{device!="lo"}[5m])

# Network Throughput (bytes/sec transmitted)
rate(node_network_transmit_bytes_total{device!="lo"}[5m])

# Disk I/O Utilization %
rate(node_disk_io_time_seconds_total[5m]) * 100

# Predict disk full in 24 hours (linear regression)
predict_linear(node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}[6h], 24*3600) < 0

Container Monitoring

Container environments add a layer of abstraction between your application and the host. cAdvisor (Container Advisor) collects resource usage and performance data from running containers. In Kubernetes, kube-state-metrics exposes cluster-level state as Prometheus metrics.

cAdvisor Metrics (Docker / containerd)

# Container CPU Usage %
sum(rate(container_cpu_usage_seconds_total{
  name!="",container!="POD"
}[5m])) by (name) * 100

# Container Memory Usage
container_memory_working_set_bytes{
  name!="",container!="POD"
}

# Container Memory Usage % (vs limit)
container_memory_working_set_bytes{container!="POD"}
/
container_spec_memory_limit_bytes{container!="POD"} * 100

# Container Network I/O (received bytes/sec)
rate(container_network_receive_bytes_total[5m])

# Container Filesystem Usage
container_fs_usage_bytes{container!="POD"}
/
container_fs_limit_bytes{container!="POD"} * 100

Kubernetes Metrics (kube-state-metrics)

# Pods not in Running state
kube_pod_status_phase{phase!="Running",phase!="Succeeded"} == 1

# Container restarts (last hour)
increase(kube_pod_container_status_restarts_total[1h]) > 3

# Deployment replicas not ready
kube_deployment_status_replicas_available
  != kube_deployment_spec_replicas

# Node conditions (DiskPressure, MemoryPressure, PIDPressure)
kube_node_status_condition{condition!="Ready",status="true"} == 1

# Resource requests vs limits vs actual usage
# (helps identify over/under-provisioned pods)
sum by (namespace, pod) (
  kube_pod_container_resource_requests{resource="memory"}
)

# Pending pods (scheduling failures)
kube_pod_status_phase{phase="Pending"} == 1

Cloud Provider Monitoring

Each major cloud provider has its own monitoring stack. The modern approach is to use OpenTelemetry as a unified collection layer and export to provider-native backends or your own Prometheus/Grafana stack.

AWS CloudWatch

Native AWS metrics for all services (EC2, RDS, Lambda, ECS, etc.). Custom metrics via PutMetricData API. CloudWatch Agent collects OS-level metrics and logs. Supports OTLP export via AWS Distro for OpenTelemetry (ADOT).

# Install ADOT Collector (ECS)
# Add as a sidecar container:
{
  "name": "otel-collector",
  "image": "public.ecr.aws/aws-observability/aws-otel-collector",
  "environment": [{
    "name": "AOT_CONFIG_CONTENT",
    "value": "... collector config ..."
  }]
}

# CloudWatch Metrics Insight query
SELECT AVG(CPUUtilization)
FROM SCHEMA("AWS/EC2", InstanceId)
WHERE InstanceId = 'i-0123456789'
Azure Azure Monitor

Application Insights for APM (traces, metrics, logs). Log Analytics workspace with KQL query language. Native OpenTelemetry SDK support through Azure Monitor Exporter. Integrates with Grafana via Azure Monitor data source.

// Azure OpenTelemetry setup (C#)
using Azure.Monitor.OpenTelemetry.AspNetCore;

var builder = WebApplication.CreateBuilder(args);

builder.Services
  .AddOpenTelemetry()
  .UseAzureMonitor(options => {
    options.ConnectionString =
      "InstrumentationKey=...";
  });

// KQL query in Log Analytics
requests
| where resultCode >= 500
| summarize count() by bin(
    timestamp, 5m), cloud_RoleName
| render timechart
GCP Cloud Monitoring

Native OTLP support — GCP accepts OpenTelemetry data directly. Cloud Trace for distributed tracing. Cloud Logging with powerful filtering. Managed Prometheus (GMP) for PromQL-compatible metrics at scale.

# GCP Managed Prometheus
# Scrapes Prometheus metrics and stores
# in Cloud Monitoring backend.
# Use standard PromQL to query.

# Deploy GMP frontend
kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/prometheus-engine/main/manifests/setup.yaml

# MQL (Monitoring Query Language)
fetch gce_instance
| metric 'compute.googleapis.com/instance/cpu/utilization'
| group_by [resource.instance_id]
| align rate(5m)
| every 1m

Docker Compose Monitoring Stack

The following Docker Compose file sets up a complete monitoring stack for local development or small deployments. It includes all four pillars of the LGTM stack: Loki (logs), Grafana (visualization), Tempo (traces), and Mimir/Prometheus (metrics).

# docker-compose.monitoring.yaml
# ============================================
# Complete LGTM Monitoring Stack
# ============================================
version: '3.8'

services:
  # --- Metrics ---
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=15d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

  # --- Visualization ---
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true
      - GF_AUTH_ANONYMOUS_ORG_ROLE=Admin
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    depends_on:
      - prometheus
      - loki
      - tempo
    restart: unless-stopped

  # --- Logs ---
  loki:
    image: grafana/loki:latest
    container_name: loki
    ports:
      - "3100:3100"
    volumes:
      - loki_data:/loki
    command: -config.file=/etc/loki/local-config.yaml
    restart: unless-stopped

  # --- Traces ---
  tempo:
    image: grafana/tempo:latest
    container_name: tempo
    ports:
      - "3200:3200"     # Tempo API
      # OTLP ports are not published on the host; the Collector
      # forwards traces to tempo:4317 over the compose network
    command: [ "-config.file=/etc/tempo.yaml" ]
    volumes:
      - tempo_data:/var/tempo
    restart: unless-stopped

  # --- Collector ---
  otel-collector:
    image: otel/opentelemetry-collector-contrib:latest
    container_name: otel-collector
    ports:
      - "4317:4317"     # OTLP gRPC (if Tempo not using it)
      - "4318:4318"     # OTLP HTTP
      - "8889:8889"     # Prometheus exporter
      - "8888:8888"     # Collector metrics
    volumes:
      - ./otel-config.yaml:/etc/otel/config.yaml
    command: ["--config=/etc/otel/config.yaml"]
    depends_on:
      - prometheus
      - loki
      - tempo
    restart: unless-stopped

  # --- Host Metrics ---
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

volumes:
  prometheus_data:
  grafana_data:
  loki_data:
  tempo_data:
The complete LGTM stack

This stack gives you metrics (Prometheus), logs (Loki), traces (Tempo), and dashboards (Grafana) — the complete LGTM stack. Point your applications at the OTel Collector on ports 4317 (gRPC) or 4318 (HTTP), and all three signal types flow through a single endpoint. Grafana connects to all backends and provides cross-signal correlation: click a metric spike to see correlated logs and traces.

10

SLIs, SLOs & Error Budgets

Service Level Indicators, Objectives, and Agreements. Error budget math, burn rate alerting, and policies that balance reliability with development velocity.

Core Definitions

SLI (Service Level Indicator) is a quantitative measure of some aspect of the level of service being provided. SLIs are the metrics that matter most to your users. Common examples include availability ratio, latency percentiles, throughput, and error rates.

SLO (Service Level Objective) is a target value or range for an SLI. It defines the acceptable level of service. Examples: "99.9% of requests succeed over a 30-day rolling window" or "P95 latency < 300ms."

SLA (Service Level Agreement) is a business contract between a provider and a customer that specifies consequences (usually financial) when SLOs are not met. SLAs are legal commitments; SLOs are internal engineering targets.

SLOs should be stricter than SLAs

The SLO is your internal target; the SLA is your external promise. If your SLA guarantees 99.9% availability, your SLO should target 99.95% or higher. This gives your team a buffer to detect and fix problems before they breach the SLA and trigger financial penalties.

Error Budget is the allowed amount of unreliability. It is calculated as 100% - SLO%. If your SLO is 99.9%, your error budget is 0.1% — meaning you can afford 0.1% of requests to fail (or 0.1% of time to be unavailable) within the measurement window.

Error Budget Math

The following table shows how SLO targets translate into concrete error budgets and allowed downtime. Small differences in SLO percentages have dramatic effects on permitted unavailability.

SLO    | Error Budget | Monthly Downtime | Yearly Downtime
-------|--------------|------------------|----------------
99%    | 1%           | 7h 18m           | 3d 15h
99.5%  | 0.5%         | 3h 39m           | 1d 19h
99.9%  | 0.1%         | 43m 50s          | 8h 46m
99.95% | 0.05%        | 21m 55s          | 4h 23m
99.99% | 0.01%        | 4m 23s           | 52m 36s
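These downtime figures can be reproduced with a few lines of Python. A small sketch; the monthly and yearly figures assume an average month of about 30.44 days (365.25 / 12).

```python
def downtime_allowed_hours(slo_pct: float, window_hours: float) -> float:
    """Allowed downtime within a window for a given SLO percentage."""
    return window_hours * (100.0 - slo_pct) / 100.0

# 99.9% SLO over an average month and an average year:
monthly = downtime_allowed_hours(99.9, 30.44 * 24)   # ~0.73 h, i.e. 43m 50s
yearly = downtime_allowed_hours(99.9, 365.25 * 24)   # ~8.77 h, i.e. 8h 46m
```

Running the function across the SLO column regenerates the whole table, which is a useful sanity check when proposing a new SLO target.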

Burn Rate

Burn rate measures how fast you are consuming your error budget relative to the SLO window. A burn rate of 1 means you will exhaust your budget exactly at the end of the window. Higher burn rates indicate faster consumption and require more urgent response.

# Burn Rate Formula
# ============================================
Burn Rate = Actual Error Rate / Error Budget Rate

# Examples (assuming 30-day window, 99.9% SLO):
# Error Budget Rate = 0.1% = 0.001

Burn Rate 1    = sustainable (budget lasts full 30-day window)
Burn Rate 2    = budget exhausted in 15 days (half the window)
Burn Rate 10   = budget exhausted in 3 days
Burn Rate 14.4 = 2% of budget consumed in 1 hour
Burn Rate 36   = entire budget consumed in 20 hours
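Equivalently, time-to-exhaustion is the SLO window divided by the burn rate. A minimal sketch reproducing the examples above:

```python
def hours_to_exhaustion(burn_rate: float, window_hours: float = 30 * 24) -> float:
    """Time until the error budget is fully spent at a constant burn rate."""
    return window_hours / burn_rate

hours_to_exhaustion(1)     # 720.0 (budget lasts the full 30-day window)
hours_to_exhaustion(14.4)  # 50.0
hours_to_exhaustion(36)    # 20.0 (entire budget in 20 hours)
```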

Defining SLIs

SLIs should be measured as close to the user as possible. Use recording rules to pre-compute SLI ratios so that dashboards and alerts query pre-aggregated data rather than computing ratios on the fly.

Availability SLI

The availability SLI measures the proportion of successful requests. A request is "successful" if it does not return a server error (5xx status code).

# Availability SLI — PromQL Recording Rule
# ============================================
# Ratio of non-5xx requests to total requests
# over a 5-minute window

- record: sli:availability:ratio_5m
  expr: |
    sum(rate(http_requests_total{status!~"5.."}[5m]))
    /
    sum(rate(http_requests_total[5m]))

Latency SLI

The latency SLI measures the proportion of requests that complete within an acceptable threshold. For example, "what percentage of requests finish in under 300ms?"

# Latency SLI — Percentage Under Threshold
# ============================================
# Ratio of requests completing in < 300ms
# Uses histogram bucket boundaries

- record: sli:latency:ratio_5m
  expr: |
    sum(rate(http_request_duration_seconds_bucket{le="0.3"}[5m]))
    /
    sum(rate(http_request_duration_seconds_count[5m]))

P95 / P99 Recording Rules

# P95 Latency Recording Rule
- record: sli:latency:p95_5m
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )

# P99 Latency Recording Rule
- record: sli:latency:p99_5m
  expr: |
    histogram_quantile(0.99,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
    )

# Per-service P95 (useful for SLO dashboards)
- record: sli:latency:p95_by_service_5m
  expr: |
    histogram_quantile(0.95,
      sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service)
    )
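histogram_quantile estimates a percentile by finding the bucket where the target rank falls and linearly interpolating within it. A minimal Python sketch of that logic (it ignores PromQL's special-casing of the lowest bucket, so treat it as illustrative):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with float('inf'), i.e. the shape of _bucket series.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cum_count in buckets:
        if cum_count >= rank:
            if upper_bound == float("inf"):
                return lower_bound  # rank beyond the last finite bucket
            width = cum_count - lower_count
            frac = (rank - lower_count) / width if width else 0.0
            return lower_bound + frac * (upper_bound - lower_bound)
        lower_bound, lower_count = upper_bound, cum_count

# 100 requests: 50 under 100ms, 90 under 300ms, all under 1s
buckets = [(0.1, 50), (0.3, 90), (1.0, 100), (float("inf"), 100)]
p95 = histogram_quantile(0.95, buckets)  # 0.3 + 0.5 * 0.7 = 0.65
```

The estimate always lands inside a bucket, which is why bucket boundaries should straddle your SLO threshold: an explicit le="0.3" bucket makes a 300ms latency SLI exact instead of interpolated.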

Multi-Window Burn Rate Alerts

Multi-window, multi-burn-rate alerting is the gold standard for SLO-based alerting. It uses two windows per severity level: a long window to detect sustained problems and a short window to confirm the problem is ongoing (reducing false positives from brief spikes). The short window is always 1/12 of the long window.

Severity | Long Window | Short Window | Burn Rate | Budget Consumed | Action
Critical | 1 hour | 5 min | 14.4x | 2% in 1h | Page immediately
Warning | 6 hours | 30 min | 6x | 5% in 6h | Page on-call
Info | 3 days | 6 hours | 1x | 10% in 3d | Create ticket
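The window, burn rate, and budget-consumed columns are related by a single formula: budget consumed = burn rate × window ÷ SLO period. A quick sanity check in Python, assuming a 30-day rolling SLO period (3 days = 72 hours):

```python
SLO_PERIOD_HOURS = 30 * 24  # 30-day rolling SLO window = 720 hours

def budget_consumed(burn_rate, window_hours):
    """Fraction of the total error budget consumed when errors arrive at
    `burn_rate` times the budgeted rate for `window_hours`."""
    return burn_rate * window_hours / SLO_PERIOD_HOURS

print(f"{budget_consumed(14.4, 1):.0%}")   # critical: 2% in 1h
print(f"{budget_consumed(6, 6):.0%}")      # warning:  5% in 6h
print(f"{budget_consumed(1, 72):.0%}")     # info:    10% in 3d
```

This also explains where 14.4 comes from: it is simply the burn rate that consumes 2% of a 30-day budget in one hour (0.02 × 720 / 1 = 14.4).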

Recording Rules for Error Ratios

# Recording rules for error ratios at multiple windows
# ============================================
# These pre-compute error ratios so alert rules
# can reference them cheaply

groups:
  - name: slo-error-ratios
    rules:
      # 5-minute window
      - record: slo:error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m]))

      # 30-minute window
      - record: slo:error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[30m]))
          /
          sum(rate(http_requests_total[30m]))

      # 1-hour window
      - record: slo:error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[1h]))
          /
          sum(rate(http_requests_total[1h]))

      # 6-hour window
      - record: slo:error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[6h]))
          /
          sum(rate(http_requests_total[6h]))

      # 3-day window
      - record: slo:error_ratio:rate3d
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[3d]))
          /
          sum(rate(http_requests_total[3d]))

Complete Prometheus Alert Rules

# Multi-window burn rate alert rules
# ============================================
# SLO: 99.9% availability (error budget = 0.001)

groups:
  - name: slo-burn-rate-alerts
    rules:
      # CRITICAL — 14.4x burn rate
      # 2% of 30-day budget consumed in 1 hour
      - alert: SLOBurnRateCritical
        expr: |
          slo:error_ratio:rate1h > (14.4 * 0.001)
          and
          slo:error_ratio:rate5m > (14.4 * 0.001)
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "High burn rate on availability SLO"
          description: "Error budget burn rate is 14.4x. 2% of monthly budget consumed in the last hour."

      # WARNING — 6x burn rate
      # 5% of 30-day budget consumed in 6 hours
      - alert: SLOBurnRateWarning
        expr: |
          slo:error_ratio:rate6h > (6 * 0.001)
          and
          slo:error_ratio:rate30m > (6 * 0.001)
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Elevated burn rate on availability SLO"
          description: "Error budget burn rate is 6x. 5% of monthly budget consumed in the last 6 hours."

      # INFO — 1x burn rate (ticket)
      # 10% of 30-day budget consumed in 3 days
      - alert: SLOBurnRateInfo
        expr: |
          slo:error_ratio:rate3d > (1 * 0.001)
          and
          slo:error_ratio:rate6h > (1 * 0.001)
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "Slow burn on availability SLO"
          description: "Error budget burn rate is 1x. 10% of monthly budget consumed in the last 3 days."
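The `and` in each rule means both windows must exceed the threshold simultaneously before the alert fires. A toy evaluation of that logic in Python — the error ratios below are invented, and the threshold follows the rules above (burn rate × error budget):

```python
SLO = 0.999
ERROR_BUDGET = 1 - SLO  # 0.001

def should_page(long_window_ratio, short_window_ratio, burn_rate):
    """Fire only when BOTH windows burn faster than `burn_rate`x budget:
    the long window proves the problem is sustained, the short window
    proves it is still happening right now."""
    threshold = burn_rate * ERROR_BUDGET
    return long_window_ratio > threshold and short_window_ratio > threshold

# Sustained outage: both windows hot -> page
print(should_page(0.02, 0.03, 14.4))    # True
# Recovered spike: 1h ratio still elevated, 5m ratio clean -> no page
print(should_page(0.02, 0.0004, 14.4))  # False
```

The second case is the whole point of the short window: after a brief spike, the 1-hour ratio stays elevated for up to an hour, but the 5-minute ratio drops immediately, so responders are not paged for a problem that has already passed.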

Error Budget Policy

An error budget policy defines what actions the team takes based on how much error budget remains. It is the mechanism that translates SLO health into concrete engineering priorities.

Budget Remaining | Status | Action
> 20% | Healthy | Innovate freely. Ship features, experiment, take calculated risks with deployments.
0 – 20% | Caution | Slow down releases. Increase testing coverage. Add canary deployments. Review recent changes for reliability risk.
Exhausted | Frozen | Freeze all non-critical releases. Only P0 bug fixes and reliability improvements allowed. Focus entirely on restoring budget.
Single incident > 20% | Mandatory Review | Any single incident consuming more than 20% of the error budget triggers a mandatory postmortem regardless of remaining budget.
Error budgets bridge velocity and reliability

Error budgets are the bridge between development velocity and reliability. When the budget is healthy, developers ship fast. When the budget is low, the team shifts focus to reliability. This creates a self-regulating system where both sides — feature development and operational stability — get the attention they need, driven by objective data rather than subjective arguments.
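The "budget remaining" figure that drives this policy falls straight out of the SLO and the observed error ratio over the SLO period. A minimal sketch, with illustrative numbers:

```python
def budget_remaining(slo, observed_error_ratio):
    """Fraction of the error budget still unspent over the SLO window."""
    budget = 1 - slo                       # e.g. 0.001 for a 99.9% SLO
    spent = observed_error_ratio / budget  # fraction of budget consumed
    return max(0.0, 1.0 - spent)

# 99.9% SLO, 0.05% of requests failed so far this window:
remaining = budget_remaining(0.999, 0.0005)
print(f"{remaining:.0%} of error budget remaining")  # 50%
```

Mapping the result onto the policy table is then mechanical: above 20% remaining, ship freely; between 0 and 20%, slow down; at zero, freeze non-critical releases.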

11

Incident Response

Severity classification, incident lifecycle management, blameless postmortems, key reliability metrics, and communication best practices for when things go wrong.

Severity Levels

A consistent severity classification ensures that incidents are escalated appropriately and that response expectations are clear across the organization.

Level | Name | Impact | Response Time | Notification
SEV-1 | Critical | Complete outage, data loss risk, or security breach | < 15 min | Page entire team
SEV-2 | Major | Significant degradation affecting many users | < 30 min | Page on-call
SEV-3 | Minor | Limited impact, workaround available | < 4 hours | Slack / ticket
SEV-4 | Low | Cosmetic issue, minor inconvenience | Next business day | Ticket only

Incident Lifecycle

Every incident follows five phases. Clear ownership and defined activities at each phase prevent chaos and reduce resolution time.

Phase 1 Detect

Automated monitoring detects the anomaly. Alerts fire based on SLO burn rates, threshold breaches, or anomaly detection.

# Key activities:
- Alert fires (PagerDuty, OpsGenie)
- Automated diagnostics trigger
- On-call engineer is paged
- Incident channel is created

# Responsible: Monitoring system
# Goal: MTTD < 5 minutes
Phase 2 Triage

Assess severity, assign an Incident Commander (IC), and determine the scope of impact. Escalate if needed.

# Key activities:
- Assign severity (SEV-1 through SEV-4)
- Designate Incident Commander
- Identify affected systems/services
- Notify stakeholders
- Begin timeline documentation

# Responsible: On-call engineer / IC
# Goal: MTTA < 15 minutes
Phase 3 Mitigate

Stop the bleeding. Apply immediate fixes to restore service — rollback, failover, scale up, or apply hotfix. Root cause comes later.

# Key activities:
- Rollback recent deployments
- Failover to healthy replicas
- Scale up resources
- Apply config fixes / feature flags
- Communicate status updates

# Responsible: IC + responders
# Goal: Restore service ASAP
Phase 4 Resolve

Confirm the root cause, deploy a permanent fix, and verify that all metrics have returned to normal baselines.

# Key activities:
- Identify root cause
- Deploy permanent fix
- Verify metrics are nominal
- Confirm with affected users
- Close incident channel

# Responsible: IC + engineering team
# Goal: Full resolution
Phase 5 Learn

Conduct a blameless postmortem. Document what happened, what went well, what failed, and define action items to prevent recurrence.

# Key activities:
- Schedule postmortem (within 48h)
- Write incident report
- Identify action items with owners
- Share learnings organization-wide
- Update runbooks and alerts

# Responsible: IC + all responders
# Goal: Prevent recurrence

Incident Commander Responsibilities

The Incident Commander (IC) is the single point of coordination during an incident. They do not fix the problem — they manage the response.

# Incident Commander Checklist
# ============================================

1. DECLARE the incident and assign severity
2. OPEN a dedicated incident channel (#inc-YYYY-MM-DD-brief-title)
3. ASSEMBLE responders with relevant expertise
4. DELEGATE investigation to specific individuals
5. TRACK progress and maintain the timeline
6. COMMUNICATE status updates every 15-30 minutes
7. ESCALATE if resolution is not progressing
8. DECIDE when to declare mitigation / resolution
9. SCHEDULE the postmortem before closing the incident
10. HAND OFF if the incident outlasts your shift

Postmortem Process

Blameless postmortems are the most important tool for building a culture of continuous improvement. They transform failures into organizational learning.

Blameless Culture Principles

1 Focus on Systems

Analyze the systems, processes, and tools that allowed the failure to happen. Ask "what failed?" not "who failed?"

2 Assume Good Intentions

Everyone involved was doing their best with the information they had at the time. Hindsight bias is the enemy of learning.

3 Psychological Safety

People must feel safe reporting errors and near-misses. If blame is the outcome, people will hide problems instead of surfacing them.

4 Continuous Improvement

Every postmortem must produce concrete, assigned, and time-bound action items. Track completion and follow up.

Postmortem Template

# Incident Postmortem: [Title]
# ============================================

Date: YYYY-MM-DD
Severity: SEV-X
Duration: X hours Y minutes
Impact: N users affected, M% request degradation

## Timeline
- HH:MM - Alert triggered (source: Prometheus/PagerDuty)
- HH:MM - Incident Commander assigned (@name)
- HH:MM - Root cause identified
- HH:MM - Mitigation deployed (rollback/hotfix/scaling)
- HH:MM - Service fully restored
- HH:MM - Incident resolved and channel closed

## Root Cause
[Systems-focused analysis of what broke and why.
 Include contributing factors and the chain of events.]

## What Went Well
- Detection was fast (< 3 min MTTD)
- Runbook was accurate and up-to-date
- Communication was clear and timely

## What Didn't Go Well
- Rollback took too long (no automated rollback)
- Alert was too noisy, initially ignored
- Staging did not catch this failure mode

## Action Items
- [ ] Implement automated rollback (Owner: @name, Due: DATE)
- [ ] Update runbook for service X (Owner: @name, Due: DATE)
- [ ] Add integration test for failure mode (Owner: @name, Due: DATE)
- [ ] Tune alert threshold to reduce noise (Owner: @name, Due: DATE)

Key Metrics — MTTD, MTTA, MTTR, MTBF

These four metrics quantify your incident response capability. Track them over time to measure improvement and set targets for each severity level.

Metric | Full Name | Formula | Target
MTTD | Mean Time to Detect | Alert time - Incident start time | < 5 min
MTTA | Mean Time to Acknowledge | Ack time - Alert time | < 15 min
MTTR | Mean Time to Resolve | Resolution time - Detection time | < 1 hour (SEV-1)
MTBF | Mean Time Between Failures | Total uptime / Number of failures | Maximize

# Calculating incident response metrics
# ============================================

# MTTD — How fast do we detect problems?
MTTD = avg(alert_fired_timestamp - incident_start_timestamp)

# MTTA — How fast do we acknowledge alerts?
MTTA = avg(ack_timestamp - alert_fired_timestamp)

# MTTR — How fast do we resolve incidents?
MTTR = avg(resolved_timestamp - detected_timestamp)

# MTBF — How often do we have failures?
MTBF = total_uptime_hours / number_of_incidents

# Availability from MTTR and MTBF:
Availability = MTBF / (MTBF + MTTR)
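The pseudo-formulas above can be run directly against timestamped incident records. A minimal Python sketch with two invented incidents — the field names and the 30-day observation window are assumptions, not a standard schema:

```python
from datetime import datetime

# Illustrative incident records; fields mirror the formulas above.
incidents = [
    {"start": datetime(2024, 1, 5, 10, 0),  "alert": datetime(2024, 1, 5, 10, 3),
     "ack":   datetime(2024, 1, 5, 10, 10), "resolved": datetime(2024, 1, 5, 10, 45)},
    {"start": datetime(2024, 1, 20, 2, 0),  "alert": datetime(2024, 1, 20, 2, 5),
     "ack":   datetime(2024, 1, 20, 2, 25), "resolved": datetime(2024, 1, 20, 3, 0)},
]

def avg_minutes(deltas):
    deltas = list(deltas)
    return sum(d.total_seconds() for d in deltas) / len(deltas) / 60

mttd = avg_minutes(i["alert"] - i["start"] for i in incidents)     # detection lag
mtta = avg_minutes(i["ack"] - i["alert"] for i in incidents)       # acknowledge lag
mttr = avg_minutes(i["resolved"] - i["alert"] for i in incidents)  # detection -> resolution

# MTBF over a 30-day window (simplified: observation period / failure count)
mtbf_hours = (30 * 24) / len(incidents)
availability = mtbf_hours / (mtbf_hours + mttr / 60)

print(f"MTTD {mttd:.1f} min, MTTA {mtta:.1f} min, MTTR {mttr:.1f} min")
print(f"Availability ≈ {availability:.3%}")
```

In practice these timestamps come from your paging tool's API (PagerDuty, OpsGenie) rather than hand-entered records; the arithmetic is the same.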

Communication During Incidents

Effective communication during incidents reduces confusion, prevents duplicate effort, and keeps stakeholders informed. Establish clear communication channels and cadences before incidents happen.

Channel Dedicated Incident Channel

Create a dedicated Slack/Teams channel for each SEV-1 or SEV-2 incident. Name it consistently: #inc-YYYY-MM-DD-brief-title. Pin the incident summary at the top.

Cadence Regular Status Updates

SEV-1: Update every 15 minutes. SEV-2: Update every 30 minutes. Include current status, what is being tried, and expected next update time. Silence breeds anxiety.

Audience Internal vs. External

Internal: Full technical details in the incident channel. External: Simple, empathetic language on the status page. Never blame specific components publicly.

Status Page Public Updates

Update the status page within 5 minutes of confirming impact. Use three states: Investigating, Identified, Resolved. Tools: Statuspage.io, Cachet, Instatus.

# Status update template (internal)
# ============================================
# Post this in the incident channel at each update interval

**Incident Update — [HH:MM UTC]**
**Status:** Investigating / Mitigating / Monitoring
**Impact:** [Who/what is affected, % of users]
**Current actions:** [What we're doing right now]
**Next update:** [HH:MM UTC]
**IC:** @name

# Status update template (external / status page)
# ============================================
**[HH:MM UTC] — Investigating**
We are aware of issues affecting [service].
Our team is actively investigating.
We will provide an update within [30 minutes].

**[HH:MM UTC] — Identified**
The issue has been identified and a fix is being deployed.
Some users may experience [specific impact].

**[HH:MM UTC] — Resolved**
The issue has been resolved. All services are
operating normally. We will publish a postmortem
within 48 hours.
12

Tool Ecosystem

The Grafana LGTM stack, commercial platforms, cloud-native tools, comparison matrices, decision frameworks, and emerging trends in observability tooling.

The Grafana Stack (LGTM)

The Grafana LGTM stack is the most widely adopted open-source observability platform. Each component handles one signal type, and Grafana unifies them into a single pane of glass with cross-signal correlation.

Metrics Prometheus / Mimir

Pull-based metrics collection with the PromQL query language. Mimir provides horizontally scalable long-term storage for Prometheus metrics.

Visualization Grafana

Dashboarding and visualization for all signal types. 100+ data source plugins. Alerting, annotations, and team-based access control built in.

Logs Loki

Log aggregation inspired by Prometheus. Indexes only labels, not log content, which typically makes it an order of magnitude cheaper to run than full-text indexes like Elasticsearch for most workloads. LogQL query language.

Traces Tempo

Distributed tracing backend with minimal indexing. Accepts Jaeger, Zipkin, and OTLP formats. Object storage backend for cost efficiency.

Best for

Cost-sensitive organizations, Kubernetes-native environments, and teams with SRE expertise. The LGTM stack offers maximum customization and zero licensing cost, but requires significant operational investment to deploy, tune, and maintain at scale.

Commercial Platforms

Commercial observability platforms trade cost for convenience. They handle infrastructure, scaling, and maintenance, letting your team focus on instrumentation and analysis rather than platform operations.

SaaS Datadog

All-in-one SaaS observability platform with 750+ integrations. Excellent developer experience with auto-discovery, APM, log management, and infrastructure monitoring. Best-in-class UX but expensive at scale — costs grow linearly with hosts and log volume.

SaaS New Relic

Full-stack observability with a generous free tier (100 GB/month). Developer-friendly with strong APM, browser monitoring, and AI-powered anomaly detection. Popular in the mid-market for its balance of features and pricing.

Enterprise Splunk

Industry leader in log analytics with powerful search processing language (SPL). ML-driven anomaly detection, enterprise-grade security features, and compliance certifications. Strong in regulated industries and large enterprises.

SaaS Honeycomb

OpenTelemetry-native observability focused on high-cardinality tracing and debugging. BubbleUp feature automatically surfaces anomalous dimensions. Built for teams that prioritize deep debugging over broad dashboarding.

Cloud-Native Tools

Every major cloud provider offers native monitoring tools. These integrate deeply with the provider's services but create vendor lock-in and typically offer limited multi-cloud support.

AWS CloudWatch + X-Ray

Native AWS metrics, logs, and alarms with zero setup for AWS services. X-Ray provides distributed tracing but is being sunset in favor of OpenTelemetry-based solutions; teams should migrate to the OTel SDKs and the CloudWatch OTLP endpoint.

Azure Azure Monitor

Application Insights for APM, Log Analytics workspace for centralized logging, and Azure Monitor Metrics for infrastructure. Native integration with all Azure services and strong Kusto Query Language (KQL) for log analysis.

GCP Cloud Monitoring

Native OTLP ingestion endpoint, Managed Service for Prometheus (drop-in Prometheus replacement with no infrastructure management), and Cloud Trace for distributed tracing. Leading cloud provider for OpenTelemetry support.

Comparison Matrix

Use this matrix to compare observability platforms across the dimensions that matter most for your team and organization.

Feature | Grafana Stack | Datadog | CloudWatch | Elastic | Honeycomb
Type | Open Source | SaaS | Cloud-Native | Open / Commercial | SaaS
Cost | Free + ops | $$$$$ | $$$ | $$$ | $$$
Setup Time | Weeks | Days | Days | Weeks | Days
Best Signal | Metrics | All | Logs | Logs | Traces
Multi-cloud | Excellent | Excellent | AWS only | Excellent | Excellent
Customization | Maximum | Medium | Low | High | Low
Learning Curve | Steep | Low | Medium | Steep | Medium

Decision Framework

There is no single "best" observability platform. The right choice depends on your team's expertise, budget, infrastructure, and operational maturity.

Open Source Choose If...

Your budget is limited but you have strong SRE expertise. You need maximum control over data retention, sampling, and pipeline configuration. You run multi-cloud or hybrid infrastructure and cannot accept vendor lock-in. You need to comply with data residency requirements.

Commercial Choose If...

You need rapid implementation (days, not weeks). Your team has limited observability expertise and needs guided setup. You want enterprise support with SLAs. You value integrated AI/ML anomaly detection and need a single vendor for all signals.

Start with OpenTelemetry for instrumentation

Regardless of which backend you choose, instrument your applications using OpenTelemetry. It is vendor-neutral — you can switch backends later without re-instrumenting your code. OTel is the universal standard supported by every major observability vendor, and it decouples your instrumentation investment from your platform choice.

2026 Trends

The observability landscape is evolving rapidly. These are the key trends shaping the industry and the tools teams are adopting.

Standard OpenTelemetry as Universal Standard

OTel has become the de facto instrumentation standard. All major vendors now accept OTLP natively. Proprietary agents are being phased out in favor of OTel SDKs and the Collector.

Migration AWS X-Ray Sunset

AWS is migrating from X-Ray to OpenTelemetry-based tracing. Teams using X-Ray SDKs should migrate to OTel SDKs with the CloudWatch OTLP endpoint for future-proof instrumentation.

Kernel eBPF for Observability

eBPF enables kernel-level observability without modifying application code. Tools like Cilium, Pixie, and Grafana Beyla provide zero-instrumentation tracing, network monitoring, and security observability.

AI/ML AI-Driven Anomaly Detection

Machine learning models are increasingly used for automated anomaly detection, root cause analysis, and alert correlation. Platforms like Datadog, New Relic, and Dynatrace embed AI assistants that can explain incidents and suggest fixes.

GitOps Observability-as-Code

Dashboards, alerts, SLOs, and recording rules are managed as code in Git repositories. Tools like Grafana Terraform provider, Crossplane, and jsonnet enable version-controlled, reviewable observability configuration.

Cost Cost Optimization

As telemetry volumes grow, cost optimization is becoming critical. Teams are adopting adaptive sampling, metric aggregation, log filtering at the Collector level, and tiered storage to control observability spend without losing visibility.