Architecture Maps

Kubernetes Architecture

Interactive architecture map of Kubernetes internals — control plane, data plane, networking, storage, security, and extensibility patterns compiled from official documentation.

CNCF Graduated · v1.32 (Dec 2025) · Go / gRPC / etcd · Updated: Mar 2026
01

Cluster Overview

Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications. A cluster consists of a control plane (the brain) and worker nodes (the muscle), communicating through the API server.

5.6M+ Developers · 88% Container Adoption · 130+ Certified Distros · 4 mo Release Cadence
Kubernetes Cluster Architecture
graph TD
    subgraph CP["Control Plane"]
        API["kube-apiserver"]
        ETCD["etcd cluster"]
        SCHED["kube-scheduler"]
        CCM["controller-manager"]
    end

    subgraph N1["Worker Node 1"]
        K1["kubelet"]
        KP1["kube-proxy"]
        CR1["Container Runtime"]
        P1A["Pod A"]
        P1B["Pod B"]
    end

    subgraph N2["Worker Node 2"]
        K2["kubelet"]
        KP2["kube-proxy"]
        CR2["Container Runtime"]
        P2A["Pod C"]
        P2B["Pod D"]
    end

    API --- ETCD
    API --- SCHED
    API --- CCM
    K1 --> API
    K2 --> API
    KP1 --> API
    KP2 --> API
    K1 --> CR1
    K2 --> CR2

    style API fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style ETCD fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style SCHED fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style CCM fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style K1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style K2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style KP1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style KP2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style CR1 fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CR2 fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style P1A fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style P1B fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style P2A fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style P2B fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
                
02

Control Plane

The control plane maintains the desired state of the cluster, making all scheduling and orchestration decisions. In production, these components run across multiple nodes for high availability.

kube-apiserver

The front door to the cluster. Every interaction goes through its RESTful HTTP API. The only component that talks directly to etcd. Horizontally scalable behind a load balancer.

Control

etcd

Distributed key-value store serving as the cluster's single source of truth. Every Kubernetes object is persisted here. Uses Raft consensus for strong consistency.

Data

kube-scheduler

Watches for unscheduled Pods. Runs a two-phase algorithm: filtering eliminates infeasible nodes, scoring ranks remaining nodes by fitness. Pluggable scheduling framework.

Control

kube-controller-manager

Bundles dozens of independent control loops. Each controller watches a resource type via the API server and reconciles actual state toward desired state.

Control

cloud-controller-manager

Decouples cloud-provider-specific logic from core controllers. Manages cloud load balancers, node lifecycle, and routes. Implementations for AWS, Azure, GCP, and others.

Control optional

Key Controllers

Controller — Responsibility
ReplicaSet — Ensures the correct number of Pod replicas are running
Deployment — Manages rollouts and rollbacks for declarative updates
StatefulSet — Ordered deployment with stable network identities and persistent storage
Job / CronJob — Run-to-completion and scheduled workloads
Node — Monitors node health, marks unreachable nodes for eviction
EndpointSlice — Populates Service backend lists efficiently (100 endpoints per slice by default)
Namespace — Cleans up all resources when a namespace is deleted
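
The EndpointSlice controller's chunking behavior can be sketched in a few lines. This is an illustrative Python model, not controller source; `chunk_endpoints` is an invented name, and the 100-endpoint default mirrors the table entry above.

```python
# Hypothetical sketch of EndpointSlice chunking: the controller splits a
# Service's backend list into slices of at most `max_per_slice` endpoints
# (100 by default in Kubernetes).

def chunk_endpoints(endpoints, max_per_slice=100):
    """Split a flat endpoint list into EndpointSlice-sized groups."""
    return [endpoints[i:i + max_per_slice]
            for i in range(0, len(endpoints), max_per_slice)]

backends = [f"10.0.{i // 256}.{i % 256}:8080" for i in range(250)]
slices = chunk_endpoints(backends)
print(len(slices))      # 3 slices: 100 + 100 + 50
print(len(slices[-1]))  # 50
```

Smaller slices keep individual API objects bounded in size, so updating one backend touches one slice instead of one giant Endpoints object.
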
03

Data Plane (Node Components)

Each worker node runs three components that execute and network the actual workloads. The kubelet acts as the node agent, kube-proxy implements service routing, and the container runtime manages container lifecycles.

Worker Node Internals
graph TD
    subgraph Node["Worker Node"]
        KL["kubelet"]
        KP["kube-proxy"]

        subgraph Runtime["Container Runtime (CRI)"]
            CTD["containerd / CRI-O"]
            RUNC["runc (OCI)"]
        end

        subgraph Pods["Pod Namespace"]
            PA["Pod A<br/>containers + volumes"]
            PB["Pod B<br/>containers + volumes"]
        end

        subgraph Net["Network Stack"]
            IPT["iptables / IPVS"]
            CNI["CNI Plugin"]
        end
    end

    API["kube-apiserver"] --> KL
    KL --> CTD
    CTD --> RUNC
    RUNC --> PA
    RUNC --> PB
    KP --> IPT
    CNI --> PA
    CNI --> PB

    style API fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style KL fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style KP fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style CTD fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style RUNC fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style PA fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PB fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style IPT fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CNI fill:#1a5276,stroke:#5dade2,color:#f5f0e0

kubelet

Primary node agent. Receives PodSpecs, ensures containers are running via CRI. Handles volume mounts, probes, resource monitoring, and Pod eviction under pressure.

Runtime

kube-proxy

Implements the Service abstraction via iptables, IPVS, or nftables rules. Routes traffic from ClusterIP to backend Pods with load balancing.

Network

Container Runtime

Pulls images, creates containers, manages lifecycle. containerd is most widely used; CRI-O is common in OpenShift. Both use runc by default, or gVisor/Kata for isolation.

Runtime
kube-proxy modes

iptables (default) installs rules that select a backend at random. IPVS uses the Linux kernel's IP Virtual Server for higher performance and more load-balancing algorithms (round-robin, least connections, source hashing). nftables is the newer backend intended to replace iptables.
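
The iptables mode's uniform random selection is worth unpacking: kube-proxy emits a chain of rules where rule i matches with probability 1/(n-i) among the traffic that reaches it, which works out to exactly 1/n per backend. A toy Python model of that arithmetic (not real iptables rules):

```python
import random

# Sketch of the iptables-mode probability chain: with n backends, rule i
# fires with probability 1/(n-i), so each backend is chosen with overall
# probability 1/n. E.g. for n=3: 1/3, then 1/2 of the rest, then all of
# the rest -> 1/3 each.

def rule_probabilities(n):
    """Per-rule match probabilities for n backends."""
    return [1.0 / (n - i) for i in range(n)]

def pick_backend(backends, rng=random.random):
    for i, p in enumerate(rule_probabilities(len(backends))):
        if rng() < p:
            return backends[i]
    return backends[-1]  # unreachable: the last rule has probability 1.0

print(rule_probabilities(3))  # rule 0: 1/3, rule 1: 1/2, rule 2: 1
```
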

04

API Request Lifecycle

Every request to the cluster follows a strict pipeline through authentication, authorization, admission control, and persistence. Understanding this flow is key to debugging access issues and writing admission webhooks.

Request Pipeline
graph LR
    C["Client<br/>kubectl / controller"] --> TLS["TLS<br/>Termination"]
    TLS --> AuthN["Authentication<br/>certs, tokens, OIDC"]
    AuthN --> AuthZ["Authorization<br/>RBAC / Webhook"]
    AuthZ --> MUT["Mutating<br/>Admission"]
    MUT --> VAL["Schema<br/>Validation"]
    VAL --> VADM["Validating<br/>Admission"]
    VADM --> ETCD["Persist to<br/>etcd"]
    ETCD --> RESP["Response"]

    style C fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style TLS fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style AuthN fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style AuthZ fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style MUT fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style VAL fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style VADM fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style ETCD fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style RESP fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0

Pipeline Stages

  1. TLS Termination — all API traffic is encrypted; the API server validates client certificates
  2. Authentication — identifies the requester via client certs, bearer tokens, OIDC tokens, or ServiceAccount tokens
  3. Authorization — RBAC checks whether the authenticated identity can perform the requested verb on the resource
  4. Mutating Admission Webhooks — external HTTPS endpoints that can modify the object (inject sidecars, add labels, set defaults). Called serially.
  5. Object Schema Validation — validates the object against its OpenAPI schema
  6. Validating Admission Webhooks — accept/reject only, cannot modify. Called in parallel for speed.
  7. Persistence to etcd — the validated object is written to the key-value store
  8. Response — the API server returns the result to the client
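
The admission stages above can be modeled as a simple function chain. A hedged Python sketch, with invented mutator and validator names, showing why mutating hooks run serially (each sees the previous one's edits) while validating hooks can only accept or reject:

```python
# Illustrative model of the admission pipeline, not kube-apiserver code.

def inject_default_labels(obj):          # mutating: may modify the object
    obj.setdefault("labels", {}).setdefault("app", obj["name"])
    return obj

def require_labels(obj):                 # validating: accept/reject only
    return "app" in obj.get("labels", {})

def admit(obj, mutators, validators):
    for m in mutators:                   # called one after another
        obj = m(obj)
    if not all(v(obj) for v in validators):
        raise PermissionError("admission denied")
    return obj                           # would now be persisted to etcd

pod = admit({"name": "web"}, [inject_default_labels], [require_labels])
print(pod["labels"])  # {'app': 'web'}
```

Note the ordering dependency: the validator passes only because a mutator ran first and injected the label, which is exactly why validation happens after all mutation.
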
ValidatingAdmissionPolicy (K8s 1.30+)

GA since Kubernetes 1.30, this in-process validation uses CEL (Common Expression Language) expressions instead of external webhooks. Lower latency, no external dependency, and easier to audit.

05

Pod Lifecycle & Scheduling

The scheduler assigns Pods to nodes using a pluggable framework with filtering and scoring phases. Users influence placement through affinity rules, taints, tolerations, and resource constraints.

Scheduling Pipeline
graph TD
    NEW["New Pod Created<br/>nodeName empty"] --> WATCH["Scheduler Detects<br/>Unscheduled Pod"]

    subgraph Sched["Scheduling Cycle (serial)"]
        FILT["Filter<br/>eliminate infeasible nodes"]
        POST["PostFilter<br/>attempt preemption"]
        SCORE["Score & Rank<br/>surviving nodes"]
        RESERVE["Reserve<br/>claim resources"]
    end

    subgraph Bind["Binding Cycle (concurrent)"]
        PREBIND["PreBind<br/>mount volumes"]
        BINDN["Bind<br/>set nodeName in etcd"]
        POSTBIND["PostBind<br/>cleanup"]
    end

    WATCH --> FILT
    FILT -->|"no nodes pass"| POST
    FILT -->|"nodes pass"| SCORE
    POST --> SCORE
    SCORE --> RESERVE
    RESERVE --> PREBIND
    PREBIND --> BINDN
    BINDN --> POSTBIND

    style NEW fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style WATCH fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style FILT fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style POST fill:#c0392b,stroke:#e74c3c,color:#f5f0e0
    style SCORE fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style RESERVE fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PREBIND fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style BINDN fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style POSTBIND fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

Pod Phases

Phase — Description
Pending — Accepted by the cluster but not yet scheduled, or images not yet pulled
Running — Bound to a node with at least one container running
Succeeded — All containers terminated with exit code 0
Failed — All containers terminated, at least one with a non-zero exit code
Unknown — Pod status undetermined (usually a node communication failure)
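
The Succeeded/Failed distinction in the table comes down to container exit codes. A minimal illustrative sketch (not kubelet source):

```python
# Terminal phase for a Pod whose containers have ALL terminated:
# every exit code 0 -> Succeeded, any non-zero exit -> Failed.

def terminal_phase(exit_codes):
    return "Succeeded" if all(code == 0 for code in exit_codes) else "Failed"

print(terminal_phase([0, 0]))    # Succeeded
print(terminal_phase([0, 137]))  # Failed: one container exited non-zero
```
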

Scheduling Constraints

nodeSelector

Simple label matching to constrain Pods to nodes with specific labels.

Affinity / Anti-Affinity

Expressive rules for node and Pod placement: required (hard) or preferred (soft) constraints.

Taints & Tolerations

Nodes repel Pods unless the Pod explicitly tolerates the taint. Used for dedicated nodes, special hardware.

Topology Spread

Distribute Pods across failure domains (zones, nodes) with configurable skew constraints.

Priority & Preemption

Higher-priority Pods can evict lower-priority ones when cluster resources are scarce.

Resource Requests/Limits

CPU, memory, GPU, ephemeral storage. Requests guarantee minimum; limits cap maximum usage.
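
The filter-then-score flow that these constraints feed into can be sketched as a toy scheduler. The node shapes and the least-allocated scoring rule below are simplified assumptions; the real scheduling framework runs many plugins at each extension point:

```python
# Toy filter-and-score pass: nodes are dicts of free resources, scoring
# prefers the node left with the most headroom after placement.

def filter_nodes(nodes, requests):
    """Filter phase: keep only nodes that can satisfy every request."""
    return {name: free for name, free in nodes.items()
            if all(free.get(r, 0) >= amt for r, amt in requests.items())}

def score(free, requests):
    """Score phase: more remaining capacity -> higher score."""
    return sum(free[r] - amt for r, amt in requests.items())

def schedule(nodes, requests):
    feasible = filter_nodes(nodes, requests)
    if not feasible:
        return None  # would trigger PostFilter / preemption
    return max(feasible, key=lambda n: score(feasible[n], requests))

nodes = {"node-a": {"cpu": 2, "mem": 4}, "node-b": {"cpu": 8, "mem": 16}}
print(schedule(nodes, {"cpu": 4, "mem": 8}))  # node-b (node-a filtered out)
```
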

06

Networking & Services

Kubernetes enforces a flat network model: every Pod gets a unique cluster-wide IP, and Pods communicate without NAT. Services provide stable endpoints and load balancing atop this flat network.

Service Types & Traffic Flow
graph TD
    EXT["External Client"] --> LB["LoadBalancer<br/>cloud-provisioned"]
    LB --> NP["NodePort<br/>30000-32767"]
    NP --> CIP["ClusterIP<br/>virtual IP"]
    CIP --> EP["EndpointSlice"]
    EP --> PA["Pod A"]
    EP --> PB["Pod B"]
    EP --> PC["Pod C"]
    ING["Ingress Controller<br/>L7 HTTP routing"] --> CIP
    GW["Gateway API<br/>L4/L7 routing"] --> CIP

    style EXT fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style LB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style NP fill:#1a5276,stroke:#85c1e9,color:#f5f0e0
    style CIP fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style EP fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PA fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PB fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style ING fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style GW fill:#6c3483,stroke:#a569bd,color:#f5f0e0

Service Types

Type — Scope — Use Case
ClusterIP — Internal only — Default. Virtual IP reachable within the cluster for service-to-service communication.
NodePort — External via node IP — Opens a static port (30000-32767) on every node. Used for development or on-prem.
LoadBalancer — External via cloud LB — Provisions a cloud load balancer with its own external IP. Production standard.
ExternalName — DNS alias — Maps a Service to a DNS CNAME record, no proxying. Aliases external services.
Headless — Direct Pod IPs — clusterIP: None. DNS returns Pod IPs directly. Used for StatefulSets.

Gateway API & Service Mesh

Gateway API Resource Model
graph TD
    GC["GatewayClass<br/>infrastructure provider"] --> GW["Gateway<br/>proxy config, listeners, TLS"]
    GW --> HR["HTTPRoute"]
    GW --> TR["TCPRoute"]
    GW --> GR["GRPCRoute"]
    HR --> SVC1["Service A"]
    HR --> SVC2["Service B"]
    TR --> SVC3["Service C"]

    style GC fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style GW fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style HR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style TR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style GR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style SVC1 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SVC2 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SVC3 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

The Gateway API is the successor to Ingress, offering richer capabilities: TCP, UDP, gRPC, TLS passthrough, traffic splitting, and header-based routing. It separates concerns between cluster operators (GatewayClass, Gateway) and developers (Routes).
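
Path-based routing can be illustrated with a toy matcher. The rule shape below is invented for the sketch, not the real Gateway API schema; it shows the most-specific-match behavior HTTPRoute uses for path prefixes:

```python
# Toy HTTPRoute-style matcher: among rules whose path prefix matches,
# the longest (most specific) prefix wins.

def match_route(rules, path):
    """rules: list of (path_prefix, backend_service) tuples."""
    candidates = [(prefix, svc) for prefix, svc in rules
                  if path == prefix
                  or path.startswith(prefix.rstrip("/") + "/")]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(r[0]))[1]

rules = [("/", "frontend"), ("/api", "api-svc"), ("/api/v2", "api-v2-svc")]
print(match_route(rules, "/api/v2/users"))  # api-v2-svc
print(match_route(rules, "/about"))         # frontend
```
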

Service Mesh Options

Istio is the most widely adopted service mesh, now supporting ambient mode (GA since Istio 1.24) with per-node ztunnel proxies instead of sidecar injection. Cilium provides an eBPF-based mesh without sidecar proxies, using kernel-level packet processing for networking, security, and observability.

07

CRI / CNI / CSI Plugin Interfaces

Kubernetes uses three standardized plugin interfaces to decouple core logic from runtime, networking, and storage implementations. This extensibility model enables a rich ecosystem of providers.

Plugin Architecture
graph TD
    KL["kubelet"]

    subgraph CRI["Container Runtime Interface"]
        CRI_GW["CRI gRPC"]
        CTRD["containerd"]
        CRIO["CRI-O"]
    end

    subgraph CNI["Container Network Interface"]
        CNI_CFG["CNI Config<br/>/etc/cni/net.d/"]
        CAL["Calico"]
        CIL["Cilium"]
        FLAN["Flannel"]
    end

    subgraph CSI["Container Storage Interface"]
        CSI_CTRL["CSI Controller<br/>provision, attach"]
        CSI_NODE["CSI Node Plugin<br/>mount, unmount"]
    end

    KL --> CRI_GW
    CRI_GW --> CTRD
    CRI_GW --> CRIO
    KL --> CNI_CFG
    CNI_CFG --> CAL
    CNI_CFG --> CIL
    CNI_CFG --> FLAN
    KL --> CSI_NODE
    CSI_CTRL --> CSI_NODE

    style KL fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style CRI_GW fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CTRD fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CRIO fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CNI_CFG fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CAL fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CIL fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style FLAN fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CSI_CTRL fill:#8B6914,stroke:#d4bc7a,color:#f5f0e0
    style CSI_NODE fill:#8B6914,stroke:#d4bc7a,color:#f5f0e0

CRI (Container Runtime)

gRPC protocol defining RuntimeService (Pod/container lifecycle, exec, attach) and ImageService (pull, list, remove). Replaced the original Docker-specific code path.

Runtime

CNI (Networking)

Specification for Pod networking. Assigns Pod IPs, configures routes, sets up network namespaces. Kubelet reads config from /etc/cni/net.d/ and invokes plugins at Pod start/stop.

Network

CSI (Storage)

Standardized interface for storage vendors. CSI Controller (Deployment) handles provisioning and snapshots; CSI Node plugin (DaemonSet) handles mount/unmount on each node.

Storage

Popular CNI Plugins

Plugin — Approach — Network Policy
Calico — BGP-based routing, eBPF dataplane option — Yes
Cilium — eBPF-native networking and security, identity-based policy — Yes
Flannel — Simple VXLAN overlay — No
Weave Net — Encrypted mesh overlay — Yes
AWS VPC CNI — Assigns real VPC IPs to Pods — Yes
08

RBAC & Admission Control

Role-Based Access Control determines who can do what in the cluster. Admission controllers intercept API requests after auth, enforcing policies and mutating objects before persistence.

RBAC Resource Model
graph LR
    subgraph Subjects["Subjects"]
        U["User"]
        G["Group"]
        SA["ServiceAccount"]
    end

    subgraph Bindings["Bindings"]
        RB["RoleBinding<br/>namespace-scoped"]
        CRB["ClusterRoleBinding<br/>cluster-scoped"]
    end

    subgraph Roles["Roles"]
        R["Role<br/>namespace-scoped"]
        CR["ClusterRole<br/>cluster-scoped"]
    end

    subgraph Resources["Resources"]
        PODS["pods"]
        DEPLOY["deployments"]
        SVC["services"]
        SEC["secrets"]
    end

    U --> RB
    G --> CRB
    SA --> RB
    RB --> R
    RB --> CR
    CRB --> CR
    R --> PODS
    R --> DEPLOY
    CR --> SVC
    CR --> SEC

    style U fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style G fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style SA fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style RB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CRB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style R fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style CR fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style PODS fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style DEPLOY fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SVC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SEC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
Additive Permissions Only

RBAC permissions are purely additive — there are no deny rules. Rules specify apiGroups, resources, and verbs (get, list, watch, create, update, patch, delete). Users cannot grant permissions they do not already have (privilege escalation prevention).
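
Because permissions are purely additive, an access check reduces to "does any bound rule allow this triple". A minimal sketch with a rule shape simplified from rbac.authorization.k8s.io:

```python
# Additive RBAC check: access is granted if ANY rule allows the
# (apiGroup, resource, verb) combination; there is nothing to deny.

def allowed(rules, api_group, resource, verb):
    return any(
        (api_group in rule["apiGroups"] or "*" in rule["apiGroups"])
        and (resource in rule["resources"] or "*" in rule["resources"])
        and (verb in rule["verbs"] or "*" in rule["verbs"])
        for rule in rules
    )

pod_reader = [{"apiGroups": [""],           # "" = the core API group
               "resources": ["pods"],
               "verbs": ["get", "list", "watch"]}]

print(allowed(pod_reader, "", "pods", "list"))    # True
print(allowed(pod_reader, "", "pods", "delete"))  # False: never allowed,
                                                  # not explicitly denied
```
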

Built-in Admission Controllers

LimitRanger

Enforces default resource requests and limits for containers that do not specify them.

ResourceQuota

Enforces aggregate resource quotas per namespace (CPU, memory, object counts).

PodSecurity

Enforces Pod Security Standards at three levels: privileged, baseline, restricted.

NamespaceLifecycle

Prevents creation of new objects in namespaces that are being terminated.

ServiceAccount

Automatically mounts projected service account tokens into Pods.

DefaultStorageClass

Assigns the default StorageClass to PersistentVolumeClaims that do not specify one.

09

Operators & Custom Resources

The Operator pattern extends Kubernetes by encoding domain knowledge into custom controllers. CRDs extend the API with new resource types, and controllers reconcile actual state toward the desired state declared in those resources.

Operator Reconciliation Loop
graph TD
    CR["Custom Resource<br/>desired state (spec)"] --> WATCH["Controller<br/>watches API server"]
    WATCH --> COMPARE["Compare<br/>desired vs actual"]
    COMPARE -->|"drift detected"| ACT["Take Action<br/>create Pods, update config"]
    COMPARE -->|"in sync"| WAIT["Wait for Next<br/>Change Event"]
    ACT --> STATUS["Update Status<br/>subresource"]
    STATUS --> WATCH
    WAIT --> WATCH

    style CR fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style WATCH fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style COMPARE fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style ACT fill:#c0392b,stroke:#e74c3c,color:#f5f0e0
    style WAIT fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style STATUS fill:#6c3483,stroke:#a569bd,color:#f5f0e0

Operator Capability Levels

Level — Capability — Description
1 — Basic Install — Automated deployment of the application
2 — Seamless Upgrades — Patch and minor version upgrades
3 — Full Lifecycle — Backup, restore, failure recovery
4 — Deep Insights — Metrics, alerts, log processing
5 — Auto Pilot — Auto-scaling, auto-tuning, anomaly detection

Operator Frameworks

Kubebuilder

Official Kubernetes SIG project. Scaffold-based Go framework for building controllers and webhooks.

Go

Operator SDK

Red Hat project supporting Go, Ansible, and Helm. Includes OLM for operator lifecycle management.

Go / Ansible / Helm

kube-rs

High-performance Rust framework for building Kubernetes controllers and operators.

Rust

Kopf

Pythonic operator framework with decorators for event handlers and timers.

Python
Common Operators

Databases (PostgreSQL Operator, MySQL Operator), message queues (Strimzi for Kafka), monitoring (Prometheus Operator), certificates (cert-manager), GitOps (Argo CD, Flux). The controller is level-triggered and idempotent, making reconciliation self-healing.
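
The level-triggered, idempotent reconciliation just described can be sketched as a loop that compares desired vs actual state and nudges the cluster one step closer on every pass. The `cluster` dict is a stand-in for real API calls:

```python
# Level-triggered reconcile sketch: each pass looks at the CURRENT state
# (not the event that triggered it) and converges toward the desired state,
# so re-running after any crash or missed event is safe.

def reconcile(desired_replicas, cluster):
    actual = cluster["replicas"]
    if actual < desired_replicas:
        cluster["replicas"] += 1        # create one Pod per pass
    elif actual > desired_replicas:
        cluster["replicas"] -= 1        # delete one Pod per pass
    return cluster["replicas"] == desired_replicas  # in sync?

cluster = {"replicas": 0}
while not reconcile(3, cluster):        # converges from any starting state
    pass
print(cluster["replicas"])  # 3
```

Because each pass is idempotent and derives its actions from observed state, the loop self-heals: it scales up from 0 and scales down from an over-provisioned state with the same code path.
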

10

etcd & Raft Consensus

etcd is the distributed key-value store that serves as Kubernetes' single source of truth. Written in Go, it uses the Raft consensus algorithm for strong consistency across a 3- or 5-member cluster.

Raft Consensus - Log Replication
graph TD
    CLIENT["API Server<br/>(sole client)"] --> LEADER["etcd Leader"]
    LEADER --> LOG["Append to<br/>Leader Log"]
    LOG --> REP1["Replicate to<br/>Follower 1"]
    LOG --> REP2["Replicate to<br/>Follower 2"]
    REP1 --> QUORUM["Quorum<br/>Acknowledged"]
    REP2 --> QUORUM
    QUORUM --> COMMIT["Commit Entry"]
    COMMIT --> APPLY["Apply to<br/>State Machine"]
    APPLY --> RESP["Respond to<br/>API Server"]

    style CLIENT fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style LEADER fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style LOG fill:#1a5276,stroke:#85c1e9,color:#f5f0e0
    style REP1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style REP2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style QUORUM fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style COMMIT fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style APPLY fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style RESP fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

Strong Consistency

Linearizable reads and writes via Raft. All operations appear to occur atomically and in total order.

Watch Support

Clients subscribe to key change notifications. This powers all Kubernetes controllers' event-driven architecture.

MVCC

Multi-Version Concurrency Control maintains revision history, enabling consistent snapshots and watch resumption.

Lease System

TTL-based key expiration for leader election and distributed locking. Scheduler and controller-manager use leases.

Kubernetes-Specific Patterns

  1. Sole client — only kube-apiserver reads/writes to etcd directly; all other components go through the API
  2. Watch-based architecture — controllers establish long-lived watches via the API server, which maps to etcd watches. Event-driven, no polling.
  3. Resource versioning — every object carries a resourceVersion (mapped to etcd's mod_revision) for optimistic concurrency and watch resumption
  4. Key layout — /registry/<resource-type>/<namespace>/<name> for namespaced resources
  5. Compaction — old revisions are periodically compacted to reclaim storage; snapshots enable backup/restore
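
Pattern 3 (optimistic concurrency) can be sketched with a toy in-memory store: a write succeeds only if the caller presents the revision it last read, mirroring etcd's compare on mod_revision. Class and method names here are illustrative, not the etcd client API:

```python
# Toy compare-and-swap store in the style of resourceVersion handling:
# a stale writer must re-read and retry instead of clobbering a newer write.

class Conflict(Exception):
    pass

class Store:
    def __init__(self):
        self.data = {}      # key -> (value, revision)
        self.revision = 0   # monotonically increasing, like mod_revision

    def get(self, key):
        value, rv = self.data[key]
        return value, rv

    def put(self, key, value, expected_rv=None):
        if key in self.data and expected_rv != self.data[key][1]:
            raise Conflict("resourceVersion mismatch; re-read and retry")
        self.revision += 1
        self.data[key] = (value, self.revision)
        return self.revision

s = Store()
rv = s.put("/registry/pods/default/web", {"image": "nginx"})
value, rv = s.get("/registry/pods/default/web")
s.put("/registry/pods/default/web", {"image": "nginx:1.27"}, expected_rv=rv)
```
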
Performance Warning

etcd is sensitive to disk I/O latency. SSDs are strongly recommended for production deployments. Large clusters (1000+ nodes) may require tuning heartbeat interval, election timeout, and snapshot count. Quorum: (n/2) + 1 members must agree. A 3-member cluster tolerates 1 failure; 5 members tolerate 2.
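
The quorum arithmetic in the warning checks out in a couple of lines:

```python
# Quorum = floor(n/2) + 1; failure tolerance = n - quorum.
# This is why even-sized clusters add no tolerance over the next-smaller
# odd size, and odd member counts are recommended.

def quorum(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - quorum(n)

for n in (1, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3-member cluster: quorum 2, tolerates 1; 5-member: quorum 3, tolerates 2
```
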

11

High Availability Patterns

Production Kubernetes clusters use redundancy at both the control plane and data plane levels. HA requires careful distribution across failure domains and automated scaling mechanisms.

HA Control Plane Architecture
graph TD
    LB["Load Balancer"] --> API1["API Server 1"]
    LB --> API2["API Server 2"]
    LB --> API3["API Server 3"]

    API1 --> ETCD1["etcd 1"]
    API2 --> ETCD2["etcd 2"]
    API3 --> ETCD3["etcd 3"]

    SCHED["Scheduler<br/>(leader-elected)"] --> LB
    CCM["Controller Manager<br/>(leader-elected)"] --> LB

    ETCD1 <--> ETCD2
    ETCD2 <--> ETCD3
    ETCD1 <--> ETCD3

    style LB fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style API1 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style API2 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style API3 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style ETCD1 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style ETCD2 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style ETCD3 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style SCHED fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style CCM fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

Control Plane HA

API Server Replicas

Multiple instances behind a load balancer. Stateless and horizontally scalable. All instances are active.

etcd Cluster

3 or 5 members across failure domains. Uses Raft consensus. Odd numbers ensure clean quorum.

Leader Election

Scheduler and controller-manager use lease-based election via the API server. Only one active instance; others on standby.

Data Plane HA

Pod Disruption Budgets

Guarantee minimum available replicas during voluntary disruptions (node drain, upgrades).

Topology Spread

Distribute Pod replicas across availability zones and nodes to survive infrastructure failures.

Horizontal Pod Autoscaler

Automatically scales replica count based on CPU, memory, or custom metrics (Prometheus, Datadog).

Cluster Autoscaler

Adds or removes nodes based on pending Pods and node utilization. Integrates with cloud providers.
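
The HPA's core scaling rule, per the Kubernetes docs, is desiredReplicas = ceil(currentReplicas * currentMetric / desiredMetric), with a default 10% tolerance band. A simplified sketch (stabilization windows omitted, clamping reduced to min/max):

```python
import math

# Simplified HPA scaling decision. min_replicas/max_replicas defaults here
# are arbitrary for the sketch; tolerance 0.1 matches the HPA default.

def desired_replicas(current, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current                     # within tolerance: no change
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))

print(desired_replicas(4, 800, 500))  # 800m CPU vs 500m target -> 7
print(desired_replicas(4, 510, 500))  # within 10% tolerance -> stays at 4
```

The ceil and the tolerance band pull in opposite directions by design: rounding up avoids under-provisioning, while the dead zone avoids flapping on metric noise.
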

12

Acronym Reference

API — Application Programming Interface
BGP — Border Gateway Protocol
CEL — Common Expression Language
CNCF — Cloud Native Computing Foundation
CNI — Container Network Interface
CRD — Custom Resource Definition
CRI — Container Runtime Interface
CSI — Container Storage Interface
DNS — Domain Name System
eBPF — Extended Berkeley Packet Filter
gRPC — Google Remote Procedure Call
HA — High Availability
HPA — Horizontal Pod Autoscaler
IPVS — IP Virtual Server
K8s — Kubernetes (K + 8 letters + s)
L4/L7 — OSI Layer 4 (transport) / Layer 7 (application)
mTLS — Mutual Transport Layer Security
MVCC — Multi-Version Concurrency Control
NAT — Network Address Translation
OCI — Open Container Initiative
OIDC — OpenID Connect
OLM — Operator Lifecycle Manager
PDB — Pod Disruption Budget
PV — Persistent Volume
PVC — Persistent Volume Claim
RBAC — Role-Based Access Control
SIG — Special Interest Group
TLS — Transport Layer Security
TTL — Time To Live
VXLAN — Virtual Extensible LAN