Interactive architecture map of Kubernetes internals — control plane, data plane, networking, storage, security, and extensibility patterns compiled from official documentation.
Kubernetes is an open-source container orchestration platform that automates deployment, scaling, and management of containerized applications. A cluster consists of a control plane (the brain) and worker nodes (the muscle), communicating through the API server.
graph TD
subgraph CP["Control Plane"]
API["kube-apiserver"]
ETCD["etcd cluster"]
SCHED["kube-scheduler"]
CCM["controller-manager"]
end
subgraph N1["Worker Node 1"]
K1["kubelet"]
KP1["kube-proxy"]
CR1["Container Runtime"]
P1A["Pod A"]
P1B["Pod B"]
end
subgraph N2["Worker Node 2"]
K2["kubelet"]
KP2["kube-proxy"]
CR2["Container Runtime"]
P2A["Pod C"]
P2B["Pod D"]
end
API --- ETCD
API --- SCHED
API --- CCM
K1 --> API
K2 --> API
KP1 --> API
KP2 --> API
K1 --> CR1
K2 --> CR2
style API fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style ETCD fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style SCHED fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style CCM fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style K1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style K2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style KP1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style KP2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style CR1 fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style CR2 fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style P1A fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style P1B fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style P2A fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style P2B fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
The control plane maintains the desired state of the cluster, making all scheduling and orchestration decisions. In production, these components run across multiple nodes for high availability.
**kube-apiserver:** The front door to the cluster. Every interaction goes through its RESTful HTTP API. It is the only component that talks directly to etcd, and it scales horizontally behind a load balancer.
**etcd:** Distributed key-value store serving as the cluster's single source of truth. Every Kubernetes object is persisted here. Uses Raft consensus for strong consistency.
**kube-scheduler:** Watches for unscheduled Pods. Runs a two-phase algorithm: filtering eliminates infeasible nodes, then scoring ranks the remaining nodes by fitness. Built on a pluggable scheduling framework.
**kube-controller-manager:** Bundles dozens of independent control loops. Each controller watches a resource type via the API server and reconciles actual state toward desired state.
**cloud-controller-manager:** Decouples cloud-provider-specific logic from the core controllers. Manages cloud load balancers, node lifecycle, and routes, with implementations for AWS, Azure, GCP, and others.
| Controller | Responsibility |
|---|---|
| ReplicaSet | Ensures the correct number of Pod replicas are running |
| Deployment | Manages rollouts and rollbacks for declarative updates |
| StatefulSet | Ordered deployment with stable network identities and persistent storage |
| Job / CronJob | Run-to-completion and scheduled workloads |
| Node | Monitors node health, marks unreachable nodes for eviction |
| EndpointSlice | Populates Service backend lists efficiently (100 endpoints per slice by default) |
| Namespace | Cleans up all resources when a namespace is deleted |
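The slicing behavior of the EndpointSlice controller can be illustrated with a minimal Python sketch. The 100-endpoint default is real; the function and data names here are invented for illustration and are not controller code:

```python
# Minimal sketch: split a Service's backend endpoints into fixed-size
# slices, mirroring the EndpointSlice controller's default cap of
# 100 endpoints per slice.
MAX_ENDPOINTS_PER_SLICE = 100  # controller default

def make_slices(endpoints, max_per_slice=MAX_ENDPOINTS_PER_SLICE):
    """Chunk a list of endpoint addresses into slice-sized groups."""
    return [
        endpoints[i:i + max_per_slice]
        for i in range(0, len(endpoints), max_per_slice)
    ]

pods = [f"10.0.{i // 256}.{i % 256}" for i in range(250)]
slices = make_slices(pods)
# 250 endpoints are split into slices of 100, 100, and 50
```

Keeping slices small bounds the size of the objects that must be rewritten when a single Pod churns, which is the efficiency win over the older monolithic Endpoints object.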
Each worker node runs three components that execute and network the actual workloads. The kubelet acts as the node agent, kube-proxy implements service routing, and the container runtime manages container lifecycles.
graph TD
subgraph Node["Worker Node"]
KL["kubelet"]
KP["kube-proxy"]
subgraph Runtime["Container Runtime (CRI)"]
CTD["containerd / CRI-O"]
RUNC["runc (OCI)"]
end
subgraph Pods["Pod Namespace"]
PA["Pod A
containers + volumes"]
PB["Pod B
containers + volumes"]
end
subgraph Net["Network Stack"]
IPT["iptables / IPVS"]
CNI["CNI Plugin"]
end
end
API["kube-apiserver"] --> KL
KL --> CTD
CTD --> RUNC
RUNC --> PA
RUNC --> PB
KP --> IPT
CNI --> PA
CNI --> PB
style API fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style KL fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style KP fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style CTD fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style RUNC fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style PA fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style PB fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style IPT fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style CNI fill:#1a5276,stroke:#5dade2,color:#f5f0e0
**kubelet:** The primary node agent. Receives PodSpecs and ensures their containers are running via the CRI. Handles volume mounts, probes, resource monitoring, and Pod eviction under pressure.
**kube-proxy:** Implements the Service abstraction via iptables, IPVS, or nftables rules, routing traffic from a ClusterIP to backend Pods with load balancing.
**Container runtime:** Pulls images, creates containers, and manages their lifecycle. containerd is the most widely used; CRI-O is common in OpenShift. Both use runc by default, with sandboxed runtimes such as gVisor or Kata Containers available for stronger isolation.
**Proxy modes:** iptables (the long-standing default) installs rules that select a backend at random. IPVS uses the Linux kernel's IP Virtual Server for higher performance and more balancing algorithms (round-robin, least connections, source hashing). nftables is the newer backend intended to succeed iptables.
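The two selection strategies can be modeled in a few lines of Python. These are simplified user-space stand-ins for kernel behavior, not the actual rules kube-proxy programs:

```python
import random

backends = ["10.0.1.5", "10.0.2.7", "10.0.3.9"]  # Pod IPs behind one ClusterIP

def iptables_pick(backends, rng=random):
    # iptables mode: each new connection lands on a random backend
    # (the real rules use probability matches to the same effect).
    return rng.choice(backends)

def ipvs_round_robin(backends):
    # IPVS round-robin: cycle through backends in order.
    i = 0
    while True:
        yield backends[i]
        i = (i + 1) % len(backends)

rr = ipvs_round_robin(backends)
picks = [next(rr) for _ in range(4)]
# round-robin wraps: 10.0.1.5, 10.0.2.7, 10.0.3.9, 10.0.1.5
```

Random selection needs no per-service state in the kernel; round-robin and the other IPVS algorithms trade a little state for more even distribution.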
Every request to the cluster follows a strict pipeline through authentication, authorization, admission control, and persistence. Understanding this flow is key to debugging access issues and writing admission webhooks.
graph LR
C["Client
kubectl / controller"] --> TLS["TLS
Termination"]
TLS --> AuthN["Authentication
certs, tokens, OIDC"]
AuthN --> AuthZ["Authorization
RBAC / Webhook"]
AuthZ --> MUT["Mutating
Admission"]
MUT --> VAL["Schema
Validation"]
VAL --> VADM["Validating
Admission"]
VADM --> ETCD["Persist to
etcd"]
ETCD --> RESP["Response"]
style C fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style TLS fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style AuthN fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style AuthZ fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style MUT fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style VAL fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style VADM fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style ETCD fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style RESP fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
**ValidatingAdmissionPolicy:** GA since Kubernetes 1.30, this in-process validation uses CEL (Common Expression Language) expressions instead of external webhooks: lower latency, no external dependency, and easier to audit.
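The pipeline above can be modeled as a chain of stages, each of which either transforms the request or rejects it. The stage contents below (the token check, the replica limit) are invented for illustration; only the ordering mirrors the real pipeline:

```python
# Toy model of the API request pipeline: authN -> authZ -> mutating
# admission -> validating admission, then persistence.

class Forbidden(Exception):
    pass

def authenticate(request):
    if request.get("token") != "valid-token":   # stand-in for certs/OIDC
        raise Forbidden("401: unknown user")
    return request

def authorize(request):
    if request["verb"] not in {"get", "create"}:  # stand-in for RBAC
        raise Forbidden("403: verb not allowed")
    return request

def mutate(request):
    # Mutating admission: inject a default, as a mutating webhook would.
    request["object"].setdefault("namespace", "default")
    return request

def validate(request):
    # Validating admission: reject policy violations, e.g. a CEL-style
    # rule like `object.replicas <= 5`.
    if request["object"].get("replicas", 1) > 5:
        raise Forbidden("replicas must be <= 5")
    return request

def handle(request):
    for stage in (authenticate, authorize, mutate, validate):
        request = stage(request)
    return request["object"]  # would be persisted to etcd here

obj = handle({"token": "valid-token", "verb": "create",
              "object": {"name": "web", "replicas": 3}})
```

The ordering matters: mutation runs before validation so that validators always see the final shape of the object.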
The scheduler assigns Pods to nodes using a pluggable framework with filtering and scoring phases. Users influence placement through affinity rules, taints, tolerations, and resource constraints.
graph TD
NEW["New Pod Created
nodeName empty"] --> WATCH["Scheduler Detects
Unscheduled Pod"]
subgraph Sched["Scheduling Cycle (serial)"]
FILT["Filter
eliminate infeasible nodes"]
POST["PostFilter
attempt preemption"]
SCORE["Score & Rank
surviving nodes"]
RESERVE["Reserve
claim resources"]
end
subgraph Bind["Binding Cycle (concurrent)"]
PREBIND["PreBind
mount volumes"]
BINDN["Bind
set nodeName in etcd"]
POSTBIND["PostBind
cleanup"]
end
WATCH --> FILT
FILT -->|"no nodes pass"| POST
FILT -->|"nodes pass"| SCORE
POST --> SCORE
SCORE --> RESERVE
RESERVE --> PREBIND
PREBIND --> BINDN
BINDN --> POSTBIND
style NEW fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style WATCH fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style FILT fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style POST fill:#c0392b,stroke:#e74c3c,color:#f5f0e0
style SCORE fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style RESERVE fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style PREBIND fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style BINDN fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style POSTBIND fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
| Phase | Description |
|---|---|
| Pending | Accepted by the cluster but not yet scheduled, or images not yet pulled |
| Running | Bound to a node with at least one container running |
| Succeeded | All containers terminated with exit code 0 |
| Failed | All containers terminated, at least one with non-zero exit |
| Unknown | Pod status undetermined (usually node communication failure) |
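The Succeeded/Failed/Running rows of the table reduce to a small classification over container exit codes. A simplified Python sketch (ignoring Pending, restarts, and init containers):

```python
def pod_phase(exit_codes):
    """Derive a Pod phase from container states.

    exit_codes: one entry per container; None while still running,
    an integer exit code once terminated. Simplified model only.
    """
    if any(code is None for code in exit_codes):
        return "Running"        # at least one container still running
    if all(code == 0 for code in exit_codes):
        return "Succeeded"      # all terminated cleanly
    return "Failed"             # all terminated, at least one non-zero
```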
**nodeSelector:** Simple label matching that constrains Pods to nodes carrying specific labels.
**Affinity and anti-affinity:** Expressive rules for node and Pod placement, expressed as required (hard) or preferred (soft) constraints.
**Taints and tolerations:** Nodes repel Pods unless a Pod explicitly tolerates the taint. Used for dedicated nodes and special hardware.
**Topology spread constraints:** Distribute Pods across failure domains (zones, nodes) with configurable skew limits.
**Priority and preemption:** Higher-priority Pods can evict lower-priority ones when cluster resources are scarce.
**Resource requests and limits:** CPU, memory, GPU, and ephemeral storage. Requests guarantee a minimum; limits cap maximum usage.
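A toy filter-and-score pass ties these mechanisms together. Resource-fit filtering, taint checking, and least-allocated scoring are real scheduler plugin behaviors, but the data shapes and numbers here are invented for illustration:

```python
nodes = {
    "node-a": {"cpu_free": 2.0, "mem_free": 4.0, "taints": []},
    "node-b": {"cpu_free": 0.5, "mem_free": 8.0, "taints": []},
    "node-c": {"cpu_free": 4.0, "mem_free": 16.0, "taints": ["gpu-only"]},
}
pod = {"cpu": 1.0, "mem": 2.0, "tolerations": []}

def filter_nodes(pod, nodes):
    """Filter phase: drop nodes that cannot run the Pod at all."""
    feasible = {}
    for name, n in nodes.items():
        fits = n["cpu_free"] >= pod["cpu"] and n["mem_free"] >= pod["mem"]
        tolerated = all(t in pod["tolerations"] for t in n["taints"])
        if fits and tolerated:
            feasible[name] = n
    return feasible

def score(pod, node):
    # Least-allocated style: prefer nodes with the most resources left over.
    return (node["cpu_free"] - pod["cpu"]) + (node["mem_free"] - pod["mem"])

feasible = filter_nodes(pod, nodes)
best = max(feasible, key=lambda name: score(pod, feasible[name]))
# node-b fails the CPU filter, node-c is repelled by its taint, so node-a wins
```

The real scheduler runs many such plugins and normalizes their scores, but the shape of the decision (binary filters, then a ranking) is the same.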
Kubernetes enforces a flat network model: every Pod gets a unique cluster-wide IP, and Pods communicate without NAT. Services provide stable endpoints and load balancing atop this flat network.
graph TD
EXT["External Client"] --> LB["LoadBalancer
cloud-provisioned"]
LB --> NP["NodePort
30000-32767"]
NP --> CIP["ClusterIP
virtual IP"]
CIP --> EP["EndpointSlice"]
EP --> PA["Pod A"]
EP --> PB["Pod B"]
EP --> PC["Pod C"]
ING["Ingress Controller
L7 HTTP routing"] --> CIP
GW["Gateway API
L4/L7 routing"] --> CIP
style EXT fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style LB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style NP fill:#1a5276,stroke:#85c1e9,color:#f5f0e0
style CIP fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style EP fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style PA fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style PB fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style PC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style ING fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style GW fill:#6c3483,stroke:#a569bd,color:#f5f0e0
| Type | Scope | Use Case |
|---|---|---|
| ClusterIP | Internal only | Default. Virtual IP reachable within the cluster for service-to-service communication. |
| NodePort | External via node IP | Opens a static port (30000-32767) on every node. Used for development or on-prem. |
| LoadBalancer | External via cloud LB | Provisions a cloud load balancer with its own external IP. Production standard. |
| ExternalName | DNS alias | Maps a Service to a DNS CNAME record, no proxying. Aliases external services. |
| Headless | Direct Pod IPs | Set `clusterIP: None`. DNS returns Pod IPs directly. Used for StatefulSets. |
graph TD
GC["GatewayClass
infrastructure provider"] --> GW["Gateway
proxy config, listeners, TLS"]
GW --> HR["HTTPRoute"]
GW --> TR["TCPRoute"]
GW --> GR["GRPCRoute"]
HR --> SVC1["Service A"]
HR --> SVC2["Service B"]
TR --> SVC3["Service C"]
style GC fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style GW fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style HR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style TR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style GR fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style SVC1 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style SVC2 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style SVC3 fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
The Gateway API is the successor to Ingress, offering richer capabilities: TCP, UDP, gRPC, TLS passthrough, traffic splitting, and header-based routing. It separates concerns between cluster operators (GatewayClass, Gateway) and developers (Routes).
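Two of those capabilities, header-based matching and weighted traffic splitting, can be sketched in Python. The route shape below is a loose stand-in for an HTTPRoute rule, not the actual API schema:

```python
import random

route = {
    # Header matcher: requests flagged as canary go straight to the canary.
    "matches": [{"header": ("x-canary", "true"), "backend": "svc-canary"}],
    # Weighted split: a 90/10 canary rollout for everyone else.
    "split": [("svc-stable", 90), ("svc-canary", 10)],
}

def pick_backend(route, headers, rng=random):
    for m in route["matches"]:
        name, value = m["header"]
        if headers.get(name) == value:
            return m["backend"]       # header match takes priority
    names, weights = zip(*route["split"])
    return rng.choices(names, weights=weights, k=1)[0]
```

With Ingress, this kind of logic lived in controller-specific annotations; the Gateway API makes it portable, typed configuration.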
Istio is the most widely adopted service mesh, now supporting ambient mode (GA since 1.24) with per-node ztunnel proxies instead of sidecar injection. Cilium provides eBPF-based mesh without sidecar proxies, using kernel-level packet processing for networking, security, and observability.
Kubernetes uses three standardized plugin interfaces to decouple core logic from runtime, networking, and storage implementations. This extensibility model enables a rich ecosystem of providers.
graph TD
KL["kubelet"]
subgraph CRI["Container Runtime Interface"]
CRI_GW["CRI gRPC"]
CTRD["containerd"]
CRIO["CRI-O"]
end
subgraph CNI["Container Network Interface"]
CNI_CFG["CNI Config
/etc/cni/net.d/"]
CAL["Calico"]
CIL["Cilium"]
FLAN["Flannel"]
end
subgraph CSI["Container Storage Interface"]
CSI_CTRL["CSI Controller
provision, attach"]
CSI_NODE["CSI Node Plugin
mount, unmount"]
end
KL --> CRI_GW
CRI_GW --> CTRD
CRI_GW --> CRIO
KL --> CNI_CFG
CNI_CFG --> CAL
CNI_CFG --> CIL
CNI_CFG --> FLAN
KL --> CSI_NODE
CSI_CTRL --> CSI_NODE
style KL fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style CRI_GW fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style CTRD fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style CRIO fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style CNI_CFG fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style CAL fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style CIL fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style FLAN fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style CSI_CTRL fill:#8B6914,stroke:#d4bc7a,color:#f5f0e0
style CSI_NODE fill:#8B6914,stroke:#d4bc7a,color:#f5f0e0
**CRI:** gRPC protocol defining RuntimeService (Pod/container lifecycle, exec, attach) and ImageService (pull, list, remove). Replaced the original Docker-specific code path (dockershim).
**CNI:** Specification for Pod networking. Assigns Pod IPs, configures routes, and sets up network namespaces. Plugin config lives in /etc/cni/net.d/, and the plugins are invoked when Pods start and stop.
**CSI:** Standardized interface for storage vendors. The CSI controller (a Deployment) handles provisioning and snapshots; the CSI node plugin (a DaemonSet) handles mount/unmount on each node.
| Plugin | Approach | Network Policy |
|---|---|---|
| Calico | BGP-based routing, eBPF dataplane option | Yes |
| Cilium | eBPF-native networking and security, identity-based policy | Yes |
| Flannel | Simple VXLAN overlay | No |
| Weave Net | Encrypted mesh overlay | Yes |
| AWS VPC CNI | Assigns real VPC IPs to Pods | Yes |
Role-Based Access Control determines who can do what in the cluster. Admission controllers intercept API requests after auth, enforcing policies and mutating objects before persistence.
graph LR
subgraph Subjects["Subjects"]
U["User"]
G["Group"]
SA["ServiceAccount"]
end
subgraph Bindings["Bindings"]
RB["RoleBinding
namespace-scoped"]
CRB["ClusterRoleBinding
cluster-scoped"]
end
subgraph Roles["Roles"]
R["Role
namespace-scoped"]
CR["ClusterRole
cluster-scoped"]
end
subgraph Resources["Resources"]
PODS["pods"]
DEPLOY["deployments"]
SVC["services"]
SEC["secrets"]
end
U --> RB
G --> CRB
SA --> RB
RB --> R
RB --> CR
CRB --> CR
R --> PODS
R --> DEPLOY
CR --> SVC
CR --> SEC
style U fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style G fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style SA fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style RB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style CRB fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style R fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style CR fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style PODS fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style DEPLOY fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style SVC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style SEC fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
RBAC permissions are purely additive — there are no deny rules. Rules specify apiGroups, resources, and verbs (get, list, watch, create, update, patch, delete). Users cannot grant permissions they do not already have (privilege escalation prevention).
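The additive evaluation described above is easy to state in code: access is granted if any bound rule matches, and there is nothing to subtract. The rule shape mirrors the fields named above (apiGroups, resources, verbs), but this is a conceptual sketch, not API server code:

```python
rules = [
    {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]},
    {"apiGroups": ["apps"], "resources": ["deployments"], "verbs": ["*"]},
]

def allowed(rules, api_group, resource, verb):
    """Additive RBAC check: any matching rule grants access."""
    for rule in rules:
        if (api_group in rule["apiGroups"]
                and resource in rule["resources"]
                and (verb in rule["verbs"] or "*" in rule["verbs"])):
            return True   # first matching rule grants access
    return False          # no rule matched: denied by default
```

Because there are no deny rules, auditing reduces to enumerating grants; you never have to reason about rule ordering or precedence.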
**LimitRanger:** Enforces default resource requests and limits for containers that do not specify them.
**ResourceQuota:** Enforces aggregate resource quotas per namespace (CPU, memory, object counts).
**PodSecurity:** Enforces the Pod Security Standards at three levels: privileged, baseline, restricted.
**NamespaceLifecycle:** Prevents creation of new objects in namespaces that are being terminated.
**ServiceAccount:** Automatically mounts projected service account tokens into Pods.
**DefaultStorageClass:** Assigns the default StorageClass to PersistentVolumeClaims that do not specify one.
The Operator pattern extends Kubernetes by encoding domain knowledge into custom controllers. CRDs extend the API with new resource types, and controllers reconcile actual state toward the desired state declared in those resources.
graph TD
CR["Custom Resource
desired state (spec)"] --> WATCH["Controller
watches API server"]
WATCH --> COMPARE["Compare
desired vs actual"]
COMPARE -->|"drift detected"| ACT["Take Action
create Pods, update config"]
COMPARE -->|"in sync"| WAIT["Wait for Next
Change Event"]
ACT --> STATUS["Update Status
subresource"]
STATUS --> WATCH
WAIT --> WATCH
style CR fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style WATCH fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style COMPARE fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style ACT fill:#c0392b,stroke:#e74c3c,color:#f5f0e0
style WAIT fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style STATUS fill:#6c3483,stroke:#a569bd,color:#f5f0e0
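The loop in the diagram can be sketched as two small functions: one that diffs desired against actual state, and one that applies the resulting actions. The replica-count example is invented for illustration; the point is the level-triggered, idempotent shape:

```python
def reconcile(desired, actual):
    """One reconcile pass: return the actions needed to close the gap."""
    actions = []
    if actual["replicas"] < desired["replicas"]:
        actions.append(("create_pods", desired["replicas"] - actual["replicas"]))
    elif actual["replicas"] > desired["replicas"]:
        actions.append(("delete_pods", actual["replicas"] - desired["replicas"]))
    return actions  # empty list means in sync: wait for the next event

def apply(actual, actions):
    for op, count in actions:
        actual["replicas"] += count if op == "create_pods" else -count
    return actual

desired, actual = {"replicas": 3}, {"replicas": 1}
actual = apply(actual, reconcile(desired, actual))
# a second pass finds nothing to do: reconciliation is idempotent
```

Because the controller always compares full desired state against full actual state (rather than replaying individual events), a missed event or a crash mid-action is repaired on the next pass.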
| Level | Capability | Description |
|---|---|---|
| 1 | Basic Install | Automated deployment of the application |
| 2 | Seamless Upgrades | Patch and minor version upgrades |
| 3 | Full Lifecycle | Backup, restore, failure recovery |
| 4 | Deep Insights | Metrics, alerts, log processing |
| 5 | Auto Pilot | Auto-scaling, auto-tuning, anomaly detection |
**Kubebuilder:** Official Kubernetes SIG project. Scaffold-based Go framework for building controllers and webhooks.
**Operator SDK:** Red Hat project supporting Go, Ansible, and Helm. Includes OLM for operator lifecycle management.
**kube-rs:** High-performance Rust framework for building Kubernetes controllers and operators.
**Kopf:** Pythonic operator framework with decorators for event handlers and timers.
Databases (PostgreSQL Operator, MySQL Operator), message queues (Strimzi for Kafka), monitoring (Prometheus Operator), certificates (cert-manager), GitOps (Argo CD, Flux). The controller is level-triggered and idempotent, making reconciliation self-healing.
etcd is the distributed key-value store that serves as Kubernetes' single source of truth. Written in Go, it uses the Raft consensus algorithm for strong consistency across a 3- or 5-member cluster.
graph TD
CLIENT["API Server
(sole client)"] --> LEADER["etcd Leader"]
LEADER --> LOG["Append to
Leader Log"]
LOG --> REP1["Replicate to
Follower 1"]
LOG --> REP2["Replicate to
Follower 2"]
REP1 --> QUORUM["Quorum
Acknowledged"]
REP2 --> QUORUM
QUORUM --> COMMIT["Commit Entry"]
COMMIT --> APPLY["Apply to
State Machine"]
APPLY --> RESP["Respond to
API Server"]
style CLIENT fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style LEADER fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style LOG fill:#1a5276,stroke:#85c1e9,color:#f5f0e0
style REP1 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style REP2 fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style QUORUM fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style COMMIT fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style APPLY fill:#6c3483,stroke:#a569bd,color:#f5f0e0
style RESP fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
**Strong consistency:** Linearizable reads and writes via Raft. All operations appear to occur atomically and in a total order.
**Watches:** Clients subscribe to key-change notifications. This powers the event-driven architecture of every Kubernetes controller.
**MVCC:** Multi-Version Concurrency Control maintains a revision history, enabling consistent snapshots and watch resumption.
**Leases:** TTL-based key expiration for leader election and distributed locking. The scheduler and controller-manager use leases.
etcd is sensitive to disk I/O latency, so SSDs are strongly recommended in production. Large clusters (1000+ nodes) may require tuning the heartbeat interval, election timeout, and snapshot count. Quorum requires floor(n/2) + 1 members: a 3-member cluster tolerates 1 failure; 5 members tolerate 2.
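The quorum arithmetic in that note is worth making concrete, since it explains why clusters use odd member counts:

```python
def quorum(members):
    # A write commits once a majority acknowledges it.
    return members // 2 + 1

def tolerated_failures(members):
    return members - quorum(members)

assert quorum(3) == 2 and tolerated_failures(3) == 1
assert quorum(5) == 3 and tolerated_failures(5) == 2
# Adding a 4th member raises the quorum without adding fault tolerance,
# which is why even-sized clusters are avoided:
assert tolerated_failures(4) == tolerated_failures(3)
```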
Production Kubernetes clusters use redundancy at both the control plane and data plane levels. HA requires careful distribution across failure domains and automated scaling mechanisms.
graph TD
LB["Load Balancer"] --> API1["API Server 1"]
LB --> API2["API Server 2"]
LB --> API3["API Server 3"]
API1 --> ETCD1["etcd 1"]
API2 --> ETCD2["etcd 2"]
API3 --> ETCD3["etcd 3"]
SCHED["Scheduler
(leader-elected)"] --> LB
CCM["Controller Manager
(leader-elected)"] --> LB
ETCD1 <--> ETCD2
ETCD2 <--> ETCD3
ETCD1 <--> ETCD3
style LB fill:#8b7d3c,stroke:#d4bc7a,color:#f5f0e0
style API1 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style API2 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style API3 fill:#2d5016,stroke:#6b9b3a,color:#f5f0e0
style ETCD1 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style ETCD2 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style ETCD3 fill:#1a5276,stroke:#5dade2,color:#f5f0e0
style SCHED fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
style CCM fill:#4a7c28,stroke:#6b9b3a,color:#f5f0e0
**API server replication:** Multiple instances behind a load balancer. Stateless and horizontally scalable; all instances are active.
**etcd cluster:** 3 or 5 members spread across failure domains, using Raft consensus. Odd member counts ensure a clean quorum.
**Leader election:** The scheduler and controller-manager use lease-based election via the API server. Only one instance is active; the others stand by.
**PodDisruptionBudgets:** Guarantee a minimum number of available replicas during voluntary disruptions (node drains, upgrades).
**Topology spread:** Distributes Pod replicas across availability zones and nodes to survive infrastructure failures.
**HorizontalPodAutoscaler:** Automatically scales replica counts based on CPU, memory, or custom metrics (Prometheus, Datadog).
**Cluster Autoscaler:** Adds or removes nodes based on pending Pods and node utilization. Integrates with cloud providers.
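The HorizontalPodAutoscaler's core scaling rule, as documented by Kubernetes, is a single formula: desiredReplicas = ceil(currentReplicas × currentMetric / targetMetric). A direct translation:

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """HPA scaling formula (ignoring stabilization windows and min/max bounds)."""
    return math.ceil(current_replicas * current_metric / target_metric)

# 4 replicas averaging 90% CPU against a 60% target scale up to 6:
assert desired_replicas(4, 90, 60) == 6
# Load well below target scales down:
assert desired_replicas(6, 30, 60) == 3
```

The real controller layers stabilization windows, tolerance bands, and min/max replica bounds on top of this, but every scaling decision starts from this ratio.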