Log Levels
Log levels provide a severity hierarchy that allows you to filter and route messages based on importance. Every logging framework supports them, and consistent usage across services is critical for effective debugging and alerting.
| Level | Severity | Description | When to Use |
|-------|----------|-------------|-------------|
| DEBUG | Lowest | Detailed diagnostic information for developers | Variable values, function entry/exit, SQL queries, cache hits/misses |
| INFO | Normal | General operational events confirming things work as expected | Server started, request completed, job scheduled, config loaded |
| WARN | Elevated | Potentially harmful situation that does not prevent operation | Deprecated API call, retry attempt, approaching resource limit, fallback used |
| ERROR | High | An error event that might still allow the application to continue | Failed request, database connection error, external API failure, unhandled exception |
| FATAL | Critical | Severe error that will likely cause the application to terminate | Cannot bind port, out of memory, required config missing, data corruption detected |
Production log level guidance
In production, set the minimum log level to INFO. Use DEBUG only for targeted troubleshooting, ideally controlled per-service via a config flag or environment variable. Never leave DEBUG logging enabled permanently in production — it generates massive volumes and can expose sensitive data.
Structured Logging
Structured logging outputs log entries as machine-parseable data (typically JSON) rather than free-form text. This is the foundation of modern observability — structured logs enable field-level filtering, aggregation, and correlation without fragile regex parsing.
Unstructured vs Structured
```text
# Unstructured — human-readable, machine-hostile
2024-03-15 14:23:45 ERROR Payment failed for user usr_8x7k2m order ord_3f9a1c: timeout after 30s

# Structured — machine-readable, queryable, correlatable
{"timestamp":"2024-03-15T14:23:45.123Z","level":"ERROR","service":"payment-api","trace_id":"abc123def456","message":"Payment failed","user_id":"usr_8x7k2m","order_id":"ord_3f9a1c","error":"timeout","duration_ms":30000}
```
Why structured logging matters:
- Queryable: Filter by any field (`level=ERROR AND service=payment-api`)
- Aggregatable: Count errors by type, service, or endpoint
- Correlatable: Join logs across services using `trace_id`
- Parseable: No regex required — tools like Loki, Elasticsearch, and Splunk parse JSON natively
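To make the "queryable without regex" point concrete, here is a minimal sketch that filters structured log lines by field using only the standard library (the log lines are illustrative):

```python
import json

log_lines = [
    '{"level":"ERROR","service":"payment-api","message":"Payment failed"}',
    '{"level":"INFO","service":"payment-api","message":"request completed"}',
    '{"level":"ERROR","service":"order-api","message":"order rejected"}',
]

# level=ERROR AND service=payment-api — field comparisons, no regex
matches = [
    entry for entry in map(json.loads, log_lines)
    if entry["level"] == "ERROR" and entry["service"] == "payment-api"
]
print(matches)  # the single matching entry
```

The same query against the unstructured format above would require a regex that breaks the moment someone rewords the message.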
Python — structlog
```python
import structlog

# Configure structlog with JSON output
structlog.configure(
    processors=[
        structlog.contextvars.merge_contextvars,
        structlog.processors.add_log_level,
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.StackInfoRenderer(),
        structlog.processors.format_exc_info,
        structlog.processors.JSONRenderer(),
    ],
)

logger = structlog.get_logger()

# Basic structured log
logger.info(
    "request_completed",
    method="GET",
    path="/api/users",
    status=200,
    duration_ms=45,
    user_id="usr_8x7k2m",
)
# Output: {"event":"request_completed","level":"info",
#          "timestamp":"2024-03-15T14:23:45.123Z","method":"GET",
#          "path":"/api/users","status":200,"duration_ms":45,
#          "user_id":"usr_8x7k2m"}

# Bind context that persists across log calls
logger = logger.bind(service="payment-api", env="production")
logger.error(
    "payment_failed",
    order_id="ord_3f9a1c",
    error="gateway_timeout",
    retry_count=2,
)
```
Go — log/slog (stdlib)
```go
package main

import (
	"log/slog"
	"os"
	"time"
)

func main() {
	// JSON handler for structured output
	logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
		Level: slog.LevelInfo,
	}))
	slog.SetDefault(logger)

	// Placeholder request context (in a real handler these come from the request)
	start := time.Now()
	traceID := "abc123def456"

	// Structured log with typed attributes
	slog.Info("request completed",
		slog.String("method", "GET"),
		slog.String("path", "/api/users"),
		slog.Int("status", 200),
		slog.Duration("duration", time.Since(start)),
		slog.String("trace_id", traceID),
	)

	// Group related attributes
	slog.Error("payment failed",
		slog.Group("order",
			slog.String("id", "ord_3f9a1c"),
			slog.Float64("amount", 49.99),
		),
		slog.String("error", "gateway_timeout"),
	)
}
```
Node.js — pino
```javascript
const pino = require('pino');

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  timestamp: pino.stdTimeFunctions.isoTime,
  formatters: {
    level: (label) => ({ level: label }),
  },
  // Redact sensitive fields
  redact: ['req.headers.authorization', 'body.password'],
});

// Structured log entry
logger.info({
  method: 'GET',
  path: '/api/users',
  status: 200,
  duration_ms: 45,
  trace_id: 'abc123def456',
}, 'request completed');

// Child logger with bound context
const reqLogger = logger.child({
  service: 'payment-api',
  request_id: 'req_xyz789',
});

reqLogger.error({
  order_id: 'ord_3f9a1c',
  error: 'gateway_timeout',
  retry_count: 2,
}, 'payment failed');
```
Key Fields Every Log Should Have
| Field | Type | Purpose | Example |
|-------|------|---------|---------|
| `timestamp` | ISO 8601 | When the event occurred (UTC) | `2024-03-15T14:23:45.123Z` |
| `level` | string | Severity of the event | `info`, `error`, `warn` |
| `service` | string | Which service emitted the log | `payment-api` |
| `trace_id` | string | Correlation with distributed traces | `abc123def456` |
| `message` | string | Human-readable event description | `Payment processing failed` |
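As an illustration of enforcing these fields (the helper name and signature are this document's invention, not a library API), a small function can stamp every record before it is emitted:

```python
import json
from datetime import datetime, timezone

def base_record(service: str, trace_id: str, level: str,
                message: str, **fields) -> str:
    """Build a JSON log line carrying the required fields, plus extras."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "level": level,
        "service": service,
        "trace_id": trace_id,
        "message": message,
        **fields,  # event-specific fields ride along
    }
    return json.dumps(record)

line = base_record("payment-api", "abc123def456", "error",
                   "Payment processing failed", order_id="ord_3f9a1c")
```

In practice the structured-logging libraries above do this for you via configured processors or bound context; the point is that the five base fields should never depend on an individual call site remembering them.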
Correlation IDs
A correlation ID (also called a request ID) is a unique identifier that follows a request across every service it touches. It is the single most important field for debugging distributed systems — it lets you find every log entry, span, and metric related to one user action.
How it works:
- The first service (API gateway or load balancer) generates a unique ID
- The ID is passed via HTTP headers to every downstream service
- Every service includes the ID in all log entries and spans
- Common header names: `X-Correlation-ID`, `X-Request-ID`, `traceparent` (W3C)
Middleware Example (Python / Flask)
```python
import uuid

from flask import Flask, request, g

app = Flask(__name__)

@app.before_request
def set_correlation_id():
    """Extract or generate a correlation ID for every request."""
    # Check incoming headers (the ID may come from an upstream service)
    correlation_id = (
        request.headers.get('X-Correlation-ID') or
        request.headers.get('X-Request-ID') or
        str(uuid.uuid4())
    )
    g.correlation_id = correlation_id

@app.after_request
def add_correlation_header(response):
    """Include the correlation ID in response headers."""
    response.headers['X-Correlation-ID'] = g.correlation_id
    return response

# When calling downstream services, propagate the ID
import requests as http_client

def call_downstream(url, data):
    return http_client.post(url, json=data, headers={
        'X-Correlation-ID': g.correlation_id,
        'Content-Type': 'application/json',
    })
```
Request Flow with Correlation ID
```text
# Correlation ID propagation across 3 services
#
# correlation_id = "req-7a3f-2b1c-4d5e"
#
# +----------+         +------------+         +-------------+
# |  Client  | ------> | API Gateway| ------> |  Order Svc  |
# |          |         |            |         |             |
# |          |         | Generates: |         | Receives:   |
# |          |         | X-Corr-ID  |         | X-Corr-ID   |
# +----------+         +-----+------+         +------+------+
#                            |                       |
#                            |                       v
#                            |                +-------------+
#                            |                | Payment Svc |
#                            |                |             |
#                            |                | Receives:   |
#                            |                | X-Corr-ID   |
#                            |                +-------------+
#
# All 3 services log with the same correlation_id:
#
# [API Gateway] {"correlation_id":"req-7a3f-2b1c-4d5e","msg":"request received"}
# [Order Svc]   {"correlation_id":"req-7a3f-2b1c-4d5e","msg":"order created"}
# [Payment Svc] {"correlation_id":"req-7a3f-2b1c-4d5e","msg":"payment processed"}
#
# Query all logs for this request:
# {job=~".+"} | json | correlation_id = "req-7a3f-2b1c-4d5e"
```
Log Aggregation Tools
Label-based
Grafana Loki
Like Prometheus, but for logs. Indexes labels only, not log content, making it dramatically cheaper to operate at scale. Uses LogQL for queries.
```text
# Key characteristics
# - Label-based indexing (not full-text)
# - LogQL query language
# - Object storage backend (S3, GCS)
# - 10-100x cheaper than Elasticsearch
# - Native Grafana integration
#
# Storage: ~$0.02/GB/month (S3)
# vs Elasticsearch: ~$0.50/GB/month
```
Full-text
ELK Stack
Elasticsearch + Logstash + Kibana. Full-text search and analytics engine. Powerful but resource-hungry. Best when you need complex text queries.
```text
# Components
# Elasticsearch — search & storage
# Logstash — ingestion & transform
# Kibana — visualization & dashboards
# Beats — lightweight log shippers
#
# Strengths: full-text search, complex aggregations, mature
# Weakness: high memory/storage cost
```
Forwarding
Fluentd / Fluent Bit
Unified log forwarding and processing layer. Collects from multiple sources, transforms, and routes to any destination. Fluent Bit is the lightweight variant.
```text
# Fluentd — full-featured, Ruby
# Fluent Bit — lightweight, C
#
# Input plugins: tail, syslog, docker, kubernetes
# Output plugins: Loki, ES, S3, Kafka, Datadog, CloudWatch
#
# Use Fluent Bit as a DaemonSet
# in Kubernetes for log collection
```
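As a rough sketch of the DaemonSet collection pattern in Fluent Bit's classic config format (paths, tag, host, and label values here are illustrative; verify plugin options against the Fluent Bit documentation for your version):

```ini
[SERVICE]
    Flush        1
    Log_Level    info

[INPUT]
    Name         tail
    Path         /var/log/containers/*.log
    Tag          kube.*

[OUTPUT]
    Name         loki
    Match        *
    Host         loki.monitoring.svc
    Port         3100
    Labels       job=fluent-bit
```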
LogQL Examples
LogQL is Grafana Loki's query language. It uses the same label matching syntax as PromQL, combined with pipeline operators for filtering, parsing, and aggregating log content.
```text
# --- Basic label matching ---
# All logs from the api job
{job="api"}

# Multiple label matchers
{job="api", env="production", level="error"}

# Regex label matching
{job=~"api|payment", namespace="prod"}

# --- Filter by content ---
# Lines containing "error"
{job="api"} |= "error"

# Lines NOT containing "healthcheck"
{job="api"} != "healthcheck"

# Regex filter
{job="api"} |~ "status=(4|5)\\d{2}"

# --- JSON parsing ---
# Parse JSON and filter by field
{job="api"} | json | status >= 500

# Extract specific fields
{job="api"} | json | line_format "{{.method}} {{.path}} {{.status}}"

# Filter parsed fields
{job="api"} | json | duration_ms > 1000 | level = "error"

# --- Rate queries (metric from logs) ---
# Error rate per second over 5 minutes
rate({job="api"} |= "error" [5m])

# Errors per minute by service
sum by (service) (
  rate({job=~".+"} | json | level = "error" [5m])
) * 60

# --- Top errors ---
# Top 10 error messages by frequency
topk(10,
  sum by (error) (
    count_over_time({job="api"} | json | level = "error" [1h])
  )
)

# --- Bytes rate (log volume) ---
# Bytes per second by job
bytes_rate({job="api"} [5m])

# --- Quantile from logs ---
# P95 duration from structured logs
quantile_over_time(0.95,
  {job="api"} | json | unwrap duration_ms [5m]
) by (handler)
```
Log Sampling & Cost Control
Logging everything in production can cost more than your infrastructure
A single service logging 1,000 req/s at 1 KB/log generates 86 GB/day. Across 20 services, that is 1.7 TB/day before ingestion overhead. At $0.50/GB for Elasticsearch, that is $850/day just for log storage. Sampling and filtering are not optional — they are economic necessities.
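The arithmetic above checks out directly (the $0.50/GB figure is the illustrative Elasticsearch price from this section, not a quote):

```python
# Log volume and cost arithmetic from the paragraph above
req_per_sec = 1_000
bytes_per_log = 1_000          # 1 KB per log line
seconds_per_day = 86_400

gb_per_day = req_per_sec * bytes_per_log * seconds_per_day / 1e9
print(f"{gb_per_day:.1f} GB/day per service")          # 86.4 GB/day

services = 20
tb_per_day = gb_per_day * services / 1_000
print(f"{tb_per_day:.2f} TB/day across {services} services")  # 1.73 TB/day

cost_per_gb = 0.50  # illustrative Elasticsearch storage price
daily_cost = gb_per_day * services * cost_per_gb
print(f"${daily_cost:.0f}/day in storage alone")
```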
Cost control strategies:
- Sample DEBUG logs: Log 1-10% of DEBUG messages using a token bucket or probabilistic sampler
- Always keep ERROR and FATAL: Never sample error-level logs — these are your lifeline during incidents
- Rate-limit per source: Cap log throughput per service/pod to prevent noisy neighbor problems
- Drop health check logs: Filter out high-frequency, low-value logs like `/healthz` and `/readyz`
- Use log levels aggressively: Set production minimum to INFO, enable DEBUG per-service as needed
- Tiered retention: Keep hot logs for 7 days, warm for 30 days, cold (S3) for 90 days
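The probabilistic variant of the first strategy fits in a few lines (the 5% rate is illustrative; the `rng` parameter exists only to make the sketch testable). A token bucket version, which additionally allows bursts, is shown in the next section.

```python
import random

DEBUG_SAMPLE_RATE = 0.05  # keep ~5% of DEBUG logs; tune per service

def should_log(level: str, rng=random.random) -> bool:
    """Probabilistic sampler: always keep errors, sample DEBUG."""
    if level in ("ERROR", "FATAL", "CRITICAL"):
        return True          # never sample away error-level logs
    if level == "DEBUG":
        return rng() < DEBUG_SAMPLE_RATE
    return True              # INFO and WARN pass through unsampled

if should_log("DEBUG"):
    print("cache_miss")      # emitted ~5% of the time
```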
Token Bucket Sampling Pattern
```python
# Token bucket rate limiter for log sampling
# Allows burst logging but caps sustained rate
import time
import threading

class LogSampler:
    """Token bucket sampler: allows `rate` logs/sec with
    burst capacity of `bucket_size`."""

    def __init__(self, rate=10, bucket_size=50):
        self.rate = rate                  # tokens per second
        self.bucket_size = bucket_size
        self.tokens = bucket_size         # start full
        self.last_refill = time.monotonic()
        self.lock = threading.Lock()

    def should_log(self, level: str) -> bool:
        # Always log ERROR and above
        if level in ("ERROR", "FATAL", "CRITICAL"):
            return True
        with self.lock:
            now = time.monotonic()
            elapsed = now - self.last_refill
            self.tokens = min(
                self.bucket_size,
                self.tokens + elapsed * self.rate
            )
            self.last_refill = now
            if self.tokens >= 1:
                self.tokens -= 1
                return True
            return False

# Usage
sampler = LogSampler(rate=10, bucket_size=50)

if sampler.should_log("DEBUG"):
    logger.debug("cache_miss", key=cache_key)

# ERROR always passes through
if sampler.should_log("ERROR"):
    logger.error("payment_failed", order_id=order_id)
```