Architecture Maps

Kubernetes Architecture

Interactive architecture map of Kubernetes internals — control plane, data plane, networking, storage, security, and extensibility patterns compiled from official documentation.

CNCF Graduated · v1.32 (Dec 2025) · Go / gRPC / etcd · Updated: Mar 2026
01

Cluster Overview

Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications. A cluster consists of a control plane (the brain) and worker nodes (the muscle), communicating through the API server.

5.6M+ Developers · 88% Container Adoption · 130+ Certified Distros · 4 mo Release Cadence
Kubernetes Cluster Architecture
graph TD
    subgraph CP["Control Plane"]
        API["kube-apiserver"]
        ETCD["etcd cluster"]
        SCHED["kube-scheduler"]
        CCM["controller-manager"]
    end

    subgraph N1["Worker Node 1"]
        K1["kubelet"]
        KP1["kube-proxy"]
        CR1["Container Runtime"]
        P1A["Pod A"]
        P1B["Pod B"]
    end

    subgraph N2["Worker Node 2"]
        K2["kubelet"]
        KP2["kube-proxy"]
        CR2["Container Runtime"]
        P2A["Pod C"]
        P2B["Pod D"]
    end

    API --- ETCD
    API --- SCHED
    API --- CCM
    K1 --> API
    K2 --> API
    KP1 --> API
    KP2 --> API
    K1 --> CR1
    K2 --> CR2

    style API fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style ETCD fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style SCHED fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style CCM fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style K1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style K2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style KP1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style KP2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style CR1 fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CR2 fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style P1A fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style P1B fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style P2A fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style P2B fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
                
02

Control Plane

The control plane maintains the desired state of the cluster, making all scheduling and orchestration decisions. In production, these components run across multiple nodes for high availability.

kube-apiserver

The front door to the cluster. Every interaction goes through its RESTful HTTP API. The only component that talks directly to etcd. Horizontally scalable behind a load balancer.

Control

etcd

Distributed key-value store serving as the cluster's single source of truth. Every Kubernetes object is persisted here. Uses Raft consensus for strong consistency.

Data

kube-scheduler

Watches for unscheduled Pods. Runs a two-phase algorithm: filtering eliminates infeasible nodes, scoring ranks remaining nodes by fitness. Pluggable scheduling framework.

Control

kube-controller-manager

Bundles dozens of independent control loops. Each controller watches a resource type via the API server and reconciles actual state toward desired state.

Control

cloud-controller-manager

Decouples cloud-provider-specific logic from core controllers. Manages cloud load balancers, node lifecycle, and routes. Implementations for AWS, Azure, GCP, and others.

Control optional

Key Controllers

Controller — Responsibility
ReplicaSet — Ensures the correct number of Pod replicas are running
Deployment — Manages rollouts and rollbacks for declarative updates
StatefulSet — Ordered deployment with stable network identities and persistent storage
Job / CronJob — Run-to-completion and scheduled workloads
Node — Monitors node health, marks unreachable nodes for eviction
EndpointSlice — Populates Service backend lists efficiently (100 endpoints per slice by default)
Namespace — Cleans up all resources when a namespace is deleted
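
The EndpointSlice controller's chunking behavior can be sketched in a few lines. This is an illustrative Python model, not controller source; `chunk_endpoints` is an invented name, and the 100-endpoint default mirrors the table entry above.

```python
# Hypothetical sketch of EndpointSlice chunking: the controller splits a
# Service's backend list into slices of at most `max_per_slice` endpoints
# (100 by default in Kubernetes).

def chunk_endpoints(endpoints, max_per_slice=100):
    """Split a flat endpoint list into EndpointSlice-sized groups."""
    return [endpoints[i:i + max_per_slice]
            for i in range(0, len(endpoints), max_per_slice)]

backends = [f"10.0.{i // 256}.{i % 256}:8080" for i in range(250)]
slices = chunk_endpoints(backends)
print(len(slices))      # 3 slices: 100 + 100 + 50
print(len(slices[-1]))  # 50
```

Smaller slices keep individual API objects bounded in size, so updating one backend touches one slice instead of one giant Endpoints object.
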
03

Data Plane (Node Components)

Each worker node runs three components that execute and network the actual workloads. The kubelet acts as the node agent, kube-proxy implements service routing, and the container runtime manages container lifecycles.

Worker Node Internals
graph TD
    subgraph Node["Worker Node"]
        KL["kubelet"]
        KP["kube-proxy"]

        subgraph Runtime["Container Runtime (CRI)"]
            CTD["containerd / CRI-O"]
            RUNC["runc (OCI)"]
        end

        subgraph Pods["Pod Namespace"]
            PA["Pod A<br/>containers + volumes"]
            PB["Pod B<br/>containers + volumes"]
        end

        subgraph Net["Network Stack"]
            IPT["iptables / IPVS"]
            CNI["CNI Plugin"]
        end
    end

    API["kube-apiserver"] --> KL
    KL --> CTD
    CTD --> RUNC
    RUNC --> PA
    RUNC --> PB
    KP --> IPT
    CNI --> PA
    CNI --> PB

    style API fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style KL fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style KP fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style CTD fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style RUNC fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style PA fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PB fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style IPT fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CNI fill:#1a5276,stroke:#5dade2,color:#f5f0e0

kubelet

Primary node agent. Receives PodSpecs, ensures containers are running via CRI. Handles volume mounts, probes, resource monitoring, and Pod eviction under pressure.

Runtime

kube-proxy

Implements the Service abstraction via iptables, IPVS, or nftables rules. Routes traffic from ClusterIP to backend Pods with load balancing.

Network

Container Runtime

Pulls images, creates containers, manages lifecycle. containerd is most widely used; CRI-O is common in OpenShift. Both use runc by default, or gVisor/Kata for isolation.

Runtime
kube-proxy modes

iptables (default) installs rules that select a backend at random. IPVS uses the Linux kernel's IP Virtual Server for higher performance and more load-balancing algorithms (round-robin, least connections, source hashing). nftables is the newer backend intended to replace iptables.
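
The iptables mode's uniform random selection is worth unpacking: kube-proxy emits a chain of rules where rule i matches with probability 1/(n-i) among the traffic that reaches it, which works out to exactly 1/n per backend. A toy Python model of that arithmetic (not real iptables rules):

```python
import random

# Sketch of the iptables-mode probability chain: with n backends, rule i
# fires with probability 1/(n-i), so each backend is chosen with overall
# probability 1/n. E.g. for n=3: 1/3, then 1/2 of the rest, then all of
# the rest -> 1/3 each.

def rule_probabilities(n):
    """Per-rule match probabilities for n backends."""
    return [1.0 / (n - i) for i in range(n)]

def pick_backend(backends, rng=random.random):
    for i, p in enumerate(rule_probabilities(len(backends))):
        if rng() < p:
            return backends[i]
    return backends[-1]  # unreachable: the last rule has probability 1.0

print(rule_probabilities(3))  # rule 0: 1/3, rule 1: 1/2, rule 2: 1
```
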

04

API Request Lifecycle

Every request to the cluster follows a strict pipeline through authentication, authorization, admission control, and persistence. Understanding this flow is key to debugging access issues and writing admission webhooks.

Request Pipeline
graph LR
    C["Client<br/>kubectl / controller"] --> TLS["TLS<br/>Termination"]
    TLS --> AuthN["Authentication<br/>certs, tokens, OIDC"]
    AuthN --> AuthZ["Authorization<br/>RBAC / Webhook"]
    AuthZ --> MUT["Mutating<br/>Admission"]
    MUT --> VAL["Schema<br/>Validation"]
    VAL --> VADM["Validating<br/>Admission"]
    VADM --> ETCD["Persist to<br/>etcd"]
    ETCD --> RESP["Response"]

    style C fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style TLS fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style AuthN fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style AuthZ fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style MUT fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style VAL fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style VADM fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style ETCD fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style RESP fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0

Pipeline Stages

  1. TLS Termination — all API traffic is encrypted; the API server validates client certificates
  2. Authentication — identifies the requester via client certs, bearer tokens, OIDC tokens, or ServiceAccount tokens
  3. Authorization — RBAC checks whether the authenticated identity can perform the requested verb on the resource
  4. Mutating Admission Webhooks — external HTTPS endpoints that can modify the object (inject sidecars, add labels, set defaults). Called serially.
  5. Object Schema Validation — validates the object against its OpenAPI schema
  6. Validating Admission Webhooks — accept/reject only, cannot modify. Called in parallel for speed.
  7. Persistence to etcd — the validated object is written to the key-value store
  8. Response — the API server returns the result to the client
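
The admission stages above can be modeled as a simple function chain. A hedged Python sketch, with invented mutator and validator names, showing why mutating hooks run serially (each sees the previous one's edits) while validating hooks can only accept or reject:

```python
# Illustrative model of the admission pipeline, not kube-apiserver code.

def inject_default_labels(obj):          # mutating: may modify the object
    obj.setdefault("labels", {}).setdefault("app", obj["name"])
    return obj

def require_labels(obj):                 # validating: accept/reject only
    return "app" in obj.get("labels", {})

def admit(obj, mutators, validators):
    for m in mutators:                   # called one after another
        obj = m(obj)
    if not all(v(obj) for v in validators):
        raise PermissionError("admission denied")
    return obj                           # would now be persisted to etcd

pod = admit({"name": "web"}, [inject_default_labels], [require_labels])
print(pod["labels"])  # {'app': 'web'}
```

Note the ordering dependency: the validator passes only because a mutator ran first and injected the label, which is exactly why validation happens after all mutation.
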
ValidatingAdmissionPolicy (K8s 1.30+)

GA since Kubernetes 1.30, this in-process validation uses CEL (Common Expression Language) expressions instead of external webhooks. Lower latency, no external dependency, and easier to audit.

05

Pod Lifecycle & Scheduling

The scheduler assigns Pods to nodes using a pluggable framework with filtering and scoring phases. Users influence placement through affinity rules, taints, tolerations, and resource constraints.

Scheduling Pipeline
graph TD
    NEW["New Pod Created<br/>nodeName empty"] --> WATCH["Scheduler Detects<br/>Unscheduled Pod"]

    subgraph Sched["Scheduling Cycle (serial)"]
        FILT["Filter<br/>eliminate infeasible nodes"]
        POST["PostFilter<br/>attempt preemption"]
        SCORE["Score & Rank<br/>surviving nodes"]
        RESERVE["Reserve<br/>claim resources"]
    end

    subgraph Bind["Binding Cycle (concurrent)"]
        PREBIND["PreBind<br/>mount volumes"]
        BINDN["Bind<br/>set nodeName in etcd"]
        POSTBIND["PostBind<br/>cleanup"]
    end

    WATCH --> FILT
    FILT -->|"no nodes pass"| POST
    FILT -->|"nodes pass"| SCORE
    POST --> SCORE
    SCORE --> RESERVE
    RESERVE --> PREBIND
    PREBIND --> BINDN
    BINDN --> POSTBIND

    style NEW fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style WATCH fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style FILT fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style POST fill:#c0392b,stroke:#e74c3c,color:#f5f0e0
    style SCORE fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style RESERVE fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PREBIND fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style BINDN fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style POSTBIND fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

Pod Phases

Phase — Description
Pending — Accepted by the cluster but not yet scheduled, or images not yet pulled
Running — Bound to a node with at least one container running
Succeeded — All containers terminated with exit code 0
Failed — All containers terminated, at least one with a non-zero exit code
Unknown — Pod status undetermined (usually a node communication failure)
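
The Succeeded/Failed distinction in the table comes down to container exit codes. A minimal illustrative sketch (not kubelet source):

```python
# Terminal phase for a Pod whose containers have ALL terminated:
# every exit code 0 -> Succeeded, any non-zero exit -> Failed.

def terminal_phase(exit_codes):
    return "Succeeded" if all(code == 0 for code in exit_codes) else "Failed"

print(terminal_phase([0, 0]))    # Succeeded
print(terminal_phase([0, 137]))  # Failed: one container exited non-zero
```
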

Scheduling Constraints

nodeSelector

Simple label matching to constrain Pods to nodes with specific labels.

Affinity / Anti-Affinity

Expressive rules for node and Pod placement: required (hard) or preferred (soft) constraints.

Taints & Tolerations

Nodes repel Pods unless the Pod explicitly tolerates the taint. Used for dedicated nodes, special hardware.

Topology Spread

Distribute Pods across failure domains (zones, nodes) with configurable skew constraints.

Priority & Preemption

Higher-priority Pods can evict lower-priority ones when cluster resources are scarce.

Resource Requests/Limits

CPU, memory, GPU, ephemeral storage. Requests guarantee minimum; limits cap maximum usage.
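
The filter-then-score flow that these constraints feed into can be sketched as a toy scheduler. The node shapes and the least-allocated scoring rule below are simplified assumptions; the real scheduling framework runs many plugins at each extension point:

```python
# Toy filter-and-score pass: nodes are dicts of free resources, scoring
# prefers the node left with the most headroom after placement.

def filter_nodes(nodes, requests):
    """Filter phase: keep only nodes that can satisfy every request."""
    return {name: free for name, free in nodes.items()
            if all(free.get(r, 0) >= amt for r, amt in requests.items())}

def score(free, requests):
    """Score phase: more remaining capacity -> higher score."""
    return sum(free[r] - amt for r, amt in requests.items())

def schedule(nodes, requests):
    feasible = filter_nodes(nodes, requests)
    if not feasible:
        return None  # would trigger PostFilter / preemption
    return max(feasible, key=lambda n: score(feasible[n], requests))

nodes = {"node-a": {"cpu": 2, "mem": 4}, "node-b": {"cpu": 8, "mem": 16}}
print(schedule(nodes, {"cpu": 4, "mem": 8}))  # node-b (node-a filtered out)
```
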

06

Networking & Services

Kubernetes enforces a flat network model: every Pod gets a unique cluster-wide IP, and Pods communicate without NAT. Services provide stable endpoints and load balancing atop this flat network.

Service Types & Traffic Flow
graph TD
    EXT["External Client"] --> LB["LoadBalancer<br/>cloud-provisioned"]
    LB --> NP["NodePort<br/>30000-32767"]
    NP --> CIP["ClusterIP<br/>virtual IP"]
    CIP --> EP["EndpointSlice"]
    EP --> PA["Pod A"]
    EP --> PB["Pod B"]
    EP --> PC["Pod C"]
    ING["Ingress Controller<br/>L7 HTTP routing"] --> CIP
    GW["Gateway API<br/>L4/L7 routing"] --> CIP

    style EXT fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style LB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style NP fill:#1a5276,stroke:#85c1e9,color:#f5f0e0
    style CIP fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style EP fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PA fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PB fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style PC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style ING fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style GW fill:#6c3483,stroke:#a569bd,color:#f5f0e0

Service Types

Type — Scope — Use Case
ClusterIP — Internal only — Default. Virtual IP reachable within the cluster for service-to-service communication.
NodePort — External via node IP — Opens a static port (30000-32767) on every node. Used for development or on-prem.
LoadBalancer — External via cloud LB — Provisions a cloud load balancer with its own external IP. Production standard.
ExternalName — DNS alias — Maps a Service to a DNS CNAME record, no proxying. Aliases external services.
Headless — Direct Pod IPs — clusterIP: None. DNS returns Pod IPs directly. Used for StatefulSets.

Gateway API & Service Mesh

Gateway API Resource Model
graph TD
    GC["GatewayClass<br/>infrastructure provider"] --> GW["Gateway<br/>proxy config, listeners, TLS"]
    GW --> HR["HTTPRoute"]
    GW --> TR["TCPRoute"]
    GW --> GR["GRPCRoute"]
    HR --> SVC1["Service A"]
    HR --> SVC2["Service B"]
    TR --> SVC3["Service C"]

    style GC fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style GW fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style HR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style TR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style GR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style SVC1 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SVC2 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SVC3 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

The Gateway API is the successor to Ingress, offering richer capabilities: TCP, UDP, gRPC, TLS passthrough, traffic splitting, and header-based routing. It separates concerns between cluster operators (GatewayClass, Gateway) and developers (Routes).
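
Path-based routing can be illustrated with a toy matcher. The rule shape below is invented for the sketch, not the real Gateway API schema; it shows the most-specific-match behavior HTTPRoute uses for path prefixes:

```python
# Toy HTTPRoute-style matcher: among rules whose path prefix matches,
# the longest (most specific) prefix wins.

def match_route(rules, path):
    """rules: list of (path_prefix, backend_service) tuples."""
    candidates = [(prefix, svc) for prefix, svc in rules
                  if path == prefix
                  or path.startswith(prefix.rstrip("/") + "/")]
    if not candidates:
        return None
    return max(candidates, key=lambda r: len(r[0]))[1]

rules = [("/", "frontend"), ("/api", "api-svc"), ("/api/v2", "api-v2-svc")]
print(match_route(rules, "/api/v2/users"))  # api-v2-svc
print(match_route(rules, "/about"))         # frontend
```
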

Service Mesh Options

Istio is the most widely adopted service mesh, now supporting ambient mode (GA since Istio 1.24) with per-node ztunnel proxies instead of sidecar injection. Cilium provides an eBPF-based mesh without sidecar proxies, using kernel-level packet processing for networking, security, and observability.

07

CRI / CNI / CSI Plugin Interfaces

Kubernetes uses three standardized plugin interfaces to decouple core logic from runtime, networking, and storage implementations. This extensibility model enables a rich ecosystem of providers.

Plugin Architecture
graph TD
    KL["kubelet"]

    subgraph CRI["Container Runtime Interface"]
        CRI_GW["CRI gRPC"]
        CTRD["containerd"]
        CRIO["CRI-O"]
    end

    subgraph CNI["Container Network Interface"]
        CNI_CFG["CNI Config<br/>/etc/cni/net.d/"]
        CAL["Calico"]
        CIL["Cilium"]
        FLAN["Flannel"]
    end

    subgraph CSI["Container Storage Interface"]
        CSI_CTRL["CSI Controller<br/>provision, attach"]
        CSI_NODE["CSI Node Plugin<br/>mount, unmount"]
    end

    KL --> CRI_GW
    CRI_GW --> CTRD
    CRI_GW --> CRIO
    KL --> CNI_CFG
    CNI_CFG --> CAL
    CNI_CFG --> CIL
    CNI_CFG --> FLAN
    KL --> CSI_NODE
    CSI_CTRL --> CSI_NODE

    style KL fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style CRI_GW fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CTRD fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CRIO fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style CNI_CFG fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CAL fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CIL fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style FLAN fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CSI_CTRL fill:#8B6914,stroke:#d4bc7a,color:#f5f0e0
    style CSI_NODE fill:#8B6914,stroke:#d4bc7a,color:#f5f0e0

CRI (Container Runtime)

gRPC protocol defining RuntimeService (Pod/container lifecycle, exec, attach) and ImageService (pull, list, remove). Replaced the original Docker-specific code path.

Runtime

CNI (Networking)

Specification for Pod networking. Assigns Pod IPs, configures routes, sets up network namespaces. Kubelet reads config from /etc/cni/net.d/ and invokes plugins at Pod start/stop.

Network

CSI (Storage)

Standardized interface for storage vendors. CSI Controller (Deployment) handles provisioning and snapshots; CSI Node plugin (DaemonSet) handles mount/unmount on each node.

Storage

Popular CNI Plugins

Plugin — Approach — Network Policy
Calico — BGP-based routing, eBPF dataplane option — Yes
Cilium — eBPF-native networking and security, identity-based policy — Yes
Flannel — Simple VXLAN overlay — No
Weave Net — Encrypted mesh overlay — Yes
AWS VPC CNI — Assigns real VPC IPs to Pods — Yes
08

RBAC & Admission Control

Role-Based Access Control determines who can do what in the cluster. Admission controllers intercept API requests after auth, enforcing policies and mutating objects before persistence.

RBAC Resource Model
graph LR
    subgraph Subjects["Subjects"]
        U["User"]
        G["Group"]
        SA["ServiceAccount"]
    end

    subgraph Bindings["Bindings"]
        RB["RoleBinding<br/>namespace-scoped"]
        CRB["ClusterRoleBinding<br/>cluster-scoped"]
    end

    subgraph Roles["Roles"]
        R["Role<br/>namespace-scoped"]
        CR["ClusterRole<br/>cluster-scoped"]
    end

    subgraph Resources["Resources"]
        PODS["pods"]
        DEPLOY["deployments"]
        SVC["services"]
        SEC["secrets"]
    end

    U --> RB
    G --> CRB
    SA --> RB
    RB --> R
    RB --> CR
    CRB --> CR
    R --> PODS
    R --> DEPLOY
    CR --> SVC
    CR --> SEC

    style U fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style G fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style SA fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style RB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style CRB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style R fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style CR fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style PODS fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style DEPLOY fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SVC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style SEC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
Additive Permissions Only

RBAC permissions are purely additive — there are no deny rules. Rules specify apiGroups, resources, and verbs (get, list, watch, create, update, patch, delete). Users cannot grant permissions they do not already have (privilege escalation prevention).
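
Because permissions are purely additive, an access check reduces to "does any bound rule allow this triple". A minimal sketch with a rule shape simplified from rbac.authorization.k8s.io:

```python
# Additive RBAC check: access is granted if ANY rule allows the
# (apiGroup, resource, verb) combination; there is nothing to deny.

def allowed(rules, api_group, resource, verb):
    return any(
        (api_group in rule["apiGroups"] or "*" in rule["apiGroups"])
        and (resource in rule["resources"] or "*" in rule["resources"])
        and (verb in rule["verbs"] or "*" in rule["verbs"])
        for rule in rules
    )

pod_reader = [{"apiGroups": [""],           # "" = the core API group
               "resources": ["pods"],
               "verbs": ["get", "list", "watch"]}]

print(allowed(pod_reader, "", "pods", "list"))    # True
print(allowed(pod_reader, "", "pods", "delete"))  # False: never allowed,
                                                  # not explicitly denied
```
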

Built-in Admission Controllers

LimitRanger

Enforces default resource requests and limits for containers that do not specify them.

ResourceQuota

Enforces aggregate resource quotas per namespace (CPU, memory, object counts).

PodSecurity

Enforces Pod Security Standards at three levels: privileged, baseline, restricted.

NamespaceLifecycle

Prevents creation of new objects in namespaces that are being terminated.

ServiceAccount

Automatically mounts projected service account tokens into Pods.

DefaultStorageClass

Assigns the default StorageClass to PersistentVolumeClaims that do not specify one.

09

Operators & Custom Resources

The Operator pattern extends Kubernetes by encoding domain knowledge into custom controllers. CRDs extend the API with new resource types, and controllers reconcile actual state toward the desired state declared in those resources.

Operator Reconciliation Loop
graph TD
    CR["Custom Resource<br/>desired state (spec)"] --> WATCH["Controller<br/>watches API server"]
    WATCH --> COMPARE["Compare<br/>desired vs actual"]
    COMPARE -->|"drift detected"| ACT["Take Action<br/>create Pods, update config"]
    COMPARE -->|"in sync"| WAIT["Wait for Next<br/>Change Event"]
    ACT --> STATUS["Update Status<br/>subresource"]
    STATUS --> WATCH
    WAIT --> WATCH

    style CR fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style WATCH fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style COMPARE fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style ACT fill:#c0392b,stroke:#e74c3c,color:#f5f0e0
    style WAIT fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style STATUS fill:#6c3483,stroke:#a569bd,color:#f5f0e0

Operator Capability Levels

Level — Capability — Description
1 — Basic Install — Automated deployment of the application
2 — Seamless Upgrades — Patch and minor version upgrades
3 — Full Lifecycle — Backup, restore, failure recovery
4 — Deep Insights — Metrics, alerts, log processing
5 — Auto Pilot — Auto-scaling, auto-tuning, anomaly detection

Operator Frameworks

Kubebuilder

Official Kubernetes SIG project. Scaffold-based Go framework for building controllers and webhooks.

Go

Operator SDK

Red Hat project supporting Go, Ansible, and Helm. Includes OLM for operator lifecycle management.

Go / Ansible / Helm

kube-rs

High-performance Rust framework for building Kubernetes controllers and operators.

Rust

Kopf

Pythonic operator framework with decorators for event handlers and timers.

Python
Common Operators

Databases (PostgreSQL Operator, MySQL Operator), message queues (Strimzi for Kafka), monitoring (Prometheus Operator), certificates (cert-manager), GitOps (Argo CD, Flux). The controller is level-triggered and idempotent, making reconciliation self-healing.
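
The level-triggered, idempotent reconciliation just described can be sketched as a loop that compares desired vs actual state and nudges the cluster one step closer on every pass. The `cluster` dict is a stand-in for real API calls:

```python
# Level-triggered reconcile sketch: each pass looks at the CURRENT state
# (not the event that triggered it) and converges toward the desired state,
# so re-running after any crash or missed event is safe.

def reconcile(desired_replicas, cluster):
    actual = cluster["replicas"]
    if actual < desired_replicas:
        cluster["replicas"] += 1        # create one Pod per pass
    elif actual > desired_replicas:
        cluster["replicas"] -= 1        # delete one Pod per pass
    return cluster["replicas"] == desired_replicas  # in sync?

cluster = {"replicas": 0}
while not reconcile(3, cluster):        # converges from any starting state
    pass
print(cluster["replicas"])  # 3
```

Because each pass is idempotent and derives its actions from observed state, the loop self-heals: it scales up from 0 and scales down from an over-provisioned state with the same code path.
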

10

etcd & Raft Consensus

etcd is the distributed key-value store that serves as Kubernetes' single source of truth. Written in Go, it uses the Raft consensus algorithm for strong consistency across a 3- or 5-member cluster.

Raft Consensus - Log Replication
graph TD
    CLIENT["API Server<br/>(sole client)"] --> LEADER["etcd Leader"]
    LEADER --> LOG["Append to<br/>Leader Log"]
    LOG --> REP1["Replicate to<br/>Follower 1"]
    LOG --> REP2["Replicate to<br/>Follower 2"]
    REP1 --> QUORUM["Quorum<br/>Acknowledged"]
    REP2 --> QUORUM
    QUORUM --> COMMIT["Commit Entry"]
    COMMIT --> APPLY["Apply to<br/>State Machine"]
    APPLY --> RESP["Respond to<br/>API Server"]

    style CLIENT fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style LEADER fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style LOG fill:#1a5276,stroke:#85c1e9,color:#f5f0e0
    style REP1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style REP2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style QUORUM fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style COMMIT fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style APPLY fill:#6c3483,stroke:#a569bd,color:#f5f0e0
    style RESP fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

Strong Consistency

Linearizable reads and writes via Raft. All operations appear to occur atomically and in total order.

Watch Support

Clients subscribe to key change notifications. This powers all Kubernetes controllers' event-driven architecture.

MVCC

Multi-Version Concurrency Control maintains revision history, enabling consistent snapshots and watch resumption.

Lease System

TTL-based key expiration for leader election and distributed locking. Scheduler and controller-manager use leases.

Kubernetes-Specific Patterns

  1. Sole client — only kube-apiserver reads/writes to etcd directly; all other components go through the API
  2. Watch-based architecture — controllers establish long-lived watches via the API server, which maps to etcd watches. Event-driven, no polling.
  3. Resource versioning — every object carries a resourceVersion (mapped to etcd's mod_revision) for optimistic concurrency and watch resumption
  4. Key layout — /registry/<resource-type>/<namespace>/<name> for namespaced resources
  5. Compaction — old revisions are periodically compacted to reclaim storage; snapshots enable backup/restore
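
Pattern 3 (optimistic concurrency) can be sketched with a toy in-memory store: a write succeeds only if the caller presents the revision it last read, mirroring etcd's compare on mod_revision. Class and method names here are illustrative, not the etcd client API:

```python
# Toy compare-and-swap store in the style of resourceVersion handling:
# a stale writer must re-read and retry instead of clobbering a newer write.

class Conflict(Exception):
    pass

class Store:
    def __init__(self):
        self.data = {}      # key -> (value, revision)
        self.revision = 0   # monotonically increasing, like mod_revision

    def get(self, key):
        value, rv = self.data[key]
        return value, rv

    def put(self, key, value, expected_rv=None):
        if key in self.data and expected_rv != self.data[key][1]:
            raise Conflict("resourceVersion mismatch; re-read and retry")
        self.revision += 1
        self.data[key] = (value, self.revision)
        return self.revision

s = Store()
rv = s.put("/registry/pods/default/web", {"image": "nginx"})
value, rv = s.get("/registry/pods/default/web")
s.put("/registry/pods/default/web", {"image": "nginx:1.27"}, expected_rv=rv)
```
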
Performance Warning

etcd is sensitive to disk I/O latency. SSDs are strongly recommended for production deployments. Large clusters (1000+ nodes) may require tuning heartbeat interval, election timeout, and snapshot count. Quorum: (n/2) + 1 members must agree. A 3-member cluster tolerates 1 failure; 5 members tolerate 2.
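
The quorum arithmetic in the warning checks out in a couple of lines:

```python
# Quorum = floor(n/2) + 1; failure tolerance = n - quorum.
# This is why even-sized clusters add no tolerance over the next-smaller
# odd size, and odd member counts are recommended.

def quorum(n):
    return n // 2 + 1

def tolerated_failures(n):
    return n - quorum(n)

for n in (1, 3, 5):
    print(n, quorum(n), tolerated_failures(n))
# 3-member cluster: quorum 2, tolerates 1; 5-member: quorum 3, tolerates 2
```
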

11

High Availability Patterns

Production Kubernetes clusters use redundancy at both the control plane and data plane levels. HA requires careful distribution across failure domains and automated scaling mechanisms.

HA Control Plane Architecture
graph TD
    LB["Load Balancer"] --> API1["API Server 1"]
    LB --> API2["API Server 2"]
    LB --> API3["API Server 3"]

    API1 --> ETCD1["etcd 1"]
    API2 --> ETCD2["etcd 2"]
    API3 --> ETCD3["etcd 3"]

    SCHED["Scheduler<br/>(leader-elected)"] --> LB
    CCM["Controller Manager<br/>(leader-elected)"] --> LB

    ETCD1 <--> ETCD2
    ETCD2 <--> ETCD3
    ETCD1 <--> ETCD3

    style LB fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
    style API1 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style API2 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style API3 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
    style ETCD1 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style ETCD2 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style ETCD3 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
    style SCHED fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
    style CCM fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0

Control Plane HA

API Server Replicas

Multiple instances behind a load balancer. Stateless and horizontally scalable. All instances are active.

etcd Cluster

3 or 5 members across failure domains. Uses Raft consensus. Odd numbers ensure clean quorum.

Leader Election

Scheduler and controller-manager use lease-based election via the API server. Only one active instance; others on standby.

Data Plane HA

Pod Disruption Budgets

Guarantee minimum available replicas during voluntary disruptions (node drain, upgrades).

Topology Spread

Distribute Pod replicas across availability zones and nodes to survive infrastructure failures.

Horizontal Pod Autoscaler

Automatically scales replica count based on CPU, memory, or custom metrics (Prometheus, Datadog).

Cluster Autoscaler

Adds or removes nodes based on pending Pods and node utilization. Integrates with cloud providers.
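
The HPA's core scaling rule, per the Kubernetes docs, is desiredReplicas = ceil(currentReplicas * currentMetric / desiredMetric), with a default 10% tolerance band. A simplified sketch (stabilization windows omitted, clamping reduced to min/max):

```python
import math

# Simplified HPA scaling decision. min_replicas/max_replicas defaults here
# are arbitrary for the sketch; tolerance 0.1 matches the HPA default.

def desired_replicas(current, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current                     # within tolerance: no change
    return max(min_replicas, min(max_replicas, math.ceil(current * ratio)))

print(desired_replicas(4, 800, 500))  # 800m CPU vs 500m target -> 7
print(desired_replicas(4, 510, 500))  # within 10% tolerance -> stays at 4
```

The ceil and the tolerance band pull in opposite directions by design: rounding up avoids under-provisioning, while the dead zone avoids flapping on metric noise.
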

12

Acronym Reference

API — Application Programming Interface
BGP — Border Gateway Protocol
CEL — Common Expression Language
CNCF — Cloud Native Computing Foundation
CNI — Container Network Interface
CRD — Custom Resource Definition
CRI — Container Runtime Interface
CSI — Container Storage Interface
DNS — Domain Name System
eBPF — Extended Berkeley Packet Filter
gRPC — Google Remote Procedure Call
HA — High Availability
HPA — Horizontal Pod Autoscaler
IPVS — IP Virtual Server
K8s — Kubernetes (K + 8 letters + s)
L4/L7 — OSI Layer 4 (transport) / Layer 7 (application)
mTLS — Mutual Transport Layer Security
MVCC — Multi-Version Concurrency Control
NAT — Network Address Translation
OCI — Open Container Initiative
OIDC — OpenID Connect
OLM — Operator Lifecycle Manager
PDB — Pod Disruption Budget
PV — Persistent Volume
PVC — Persistent Volume Claim
RBAC — Role-Based Access Control
SIG — Special Interest Group
TLS — Transport Layer Security
TTL — Time To Live
VXLAN — Virtual Extensible LAN