A visual tour of the cloud-native microservices landscape that streams entertainment to more than 260 million subscribers — compiled from public talks, blog posts, and open-source repositories.
Netflix runs a cloud-native microservices architecture on Amazon Web Services across multiple regions; the full migration to AWS was completed in early 2016. The system comprises more than 1,000 loosely coupled microservices, split between a control plane (AWS) for all backend logic and a data plane (Open Connect CDN) for video delivery.
graph TD
subgraph Clients["Client Devices"]
TV["Smart TVs"]
MOB["Mobile Apps"]
WEB["Web Browsers"]
end
subgraph Control["Control Plane — AWS"]
ZUUL["Zuul API Gateway"]
SVC["Microservices
(1,000+)"]
RECS["Recommendation
Engine"]
DATA["Data Platform
(Kafka + Flink)"]
end
subgraph Deliver["Data Plane — Open Connect CDN"]
OCA["OCA Appliances
(inside ISP networks)"]
IXP["IXP Peering
Sites"]
end
Clients --> ZUUL
ZUUL --> SVC
SVC --> RECS
SVC --> DATA
SVC -->|"steering &
manifests"| Deliver
Clients -->|"video streams"| Deliver
style ZUUL fill:#5c4a32,stroke:#c4a87a,color:#e8e4dc
style SVC fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style RECS fill:#3a332c,stroke:#d8c8a8,color:#e8e4dc
style DATA fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style OCA fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
style IXP fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
style TV fill:#2a2520,stroke:#b0a898,color:#e8e4dc
style MOB fill:#2a2520,stroke:#b0a898,color:#e8e4dc
style WEB fill:#2a2520,stroke:#b0a898,color:#e8e4dc
| Store | Role |
|---|---|
| Cassandra | Primary distributed database for scale-out workloads: viewing history, user profiles, bookmarks. |
| EVCache | Custom caching layer (built on Memcached) storing session data, watch history, and recommendations; maintains three copies across AZs. |
| MySQL | Billing, account data, and other transactional workloads. |
| CockroachDB | Globally distributed transactional workloads requiring strong consistency. |
| S3 | Object storage for encoded video assets, logs, and analytics data. |
Zuul is the front door for all requests from devices and web applications to Netflix's backend: a JVM-based L7 application gateway with a filter-based architecture, rebuilt on Netty for asynchronous, non-blocking I/O in Zuul 2.
graph LR
REQ["Client
Request"] --> PRE["Pre-Routing Filters
(auth, rate limit)"]
PRE --> ROUTE["Routing Filters
(forward to origin)"]
ROUTE --> EUREKA["Eureka
Service Discovery"]
EUREKA --> RIBBON["Ribbon
Load Balancer"]
RIBBON --> ORIGIN["Origin
Microservice"]
ORIGIN --> POST["Post-Routing Filters
(metrics, headers)"]
POST --> RESP["Response"]
ROUTE --> HYSTRIX["Hystrix
Circuit Breaker"]
style REQ fill:#2a2520,stroke:#b0a898,color:#e8e4dc
style PRE fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style ROUTE fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style POST fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style EUREKA fill:#5c4a32,stroke:#c4a87a,color:#e8e4dc
style RIBBON fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style HYSTRIX fill:#3a332c,stroke:#b87878,color:#e8e4dc
style ORIGIN fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style RESP fill:#2a2520,stroke:#b0a898,color:#e8e4dc
| Component | Role |
|---|---|
| Zuul 2 | Re-architected on Netty for asynchronous, non-blocking I/O. Supports persistent connections (WebSockets, SSE) at scale; filters can be loaded dynamically at runtime. |
| Eureka | Client-server service discovery. Microservices self-register on startup; servers replicate registry state for high availability. |
| Hystrix | Circuit breaker: when the failure rate exceeds a threshold, requests are short-circuited to prevent cascading failures; provides fallback mechanisms. |
| Ribbon | Client-side load balancer that works with Eureka. Supports round-robin, weighted, and availability-filtering strategies. |
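The circuit-breaker behavior can be sketched in a few lines of Python. This is a minimal illustration of the pattern, not Hystrix's actual implementation; the failure threshold and the lack of a half-open recovery state are simplifications for the example:

```python
class CircuitBreaker:
    """Minimal circuit breaker: trip open after repeated failures, then short-circuit."""

    def __init__(self, failure_threshold=5, fallback=lambda: "fallback"):
        self.failure_threshold = failure_threshold
        self.fallback = fallback
        self.failures = 0
        self.open = False

    def call(self, fn):
        if self.open:
            # Short-circuit: don't touch the failing dependency at all.
            return self.fallback()
        try:
            result = fn()
            self.failures = 0  # any success resets the count
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.open = True  # trip the breaker
            return self.fallback()
```

A production implementation (Hystrix included) adds a half-open state that periodically lets one request through to probe whether the dependency has recovered.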
Netflix built its own CDN, Open Connect, which serves 100% of Netflix's video traffic. Specialized caching appliances (OCAs) are deployed inside ISP networks and at Internet Exchange Points, provided free of charge to qualifying ISPs.
graph TD
subgraph Ingest["1. Ingest"]
SRC["Source Media
from Studios"]
end
subgraph Encode["2. Encoding (AWS)"]
CHUNK["Split into Chunks"]
EC2["Parallel Encode
(100s of EC2)"]
VMAF["VMAF Quality
Measurement"]
CODEC["Multi-Codec Output
(H.264, HEVC, VP9, AV1)"]
end
subgraph Distribute["3. Distribution"]
S3["Amazon S3
(encoded assets)"]
FILL["Off-Peak Fill
to OCAs"]
end
subgraph Playback["4. Playback"]
STEER["Steering Service
(best OCA selection)"]
MANIFEST["Manifest
(stream options)"]
OCA["OCA Appliance
(100+ TB SSD)"]
CLIENT["Client Player"]
end
SRC --> CHUNK
CHUNK --> EC2
EC2 --> VMAF
VMAF --> CODEC
CODEC --> S3
S3 --> FILL
FILL --> OCA
STEER --> MANIFEST
MANIFEST --> CLIENT
CLIENT -->|"stream"| OCA
style SRC fill:#2a2520,stroke:#b0a898,color:#e8e4dc
style EC2 fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style VMAF fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style S3 fill:#3a332c,stroke:#8b7355,color:#e8e4dc
style OCA fill:#5c4a32,stroke:#6a8a6a,color:#e8e4dc
style STEER fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style CLIENT fill:#2a2520,stroke:#b0a898,color:#e8e4dc
Each title is analyzed for visual complexity and a custom encoding ladder is generated. This reduces bandwidth by up to 40% without sacrificing perceptual quality, measured by Netflix's VMAF metric. Roughly 95% of traffic is delivered via direct connections between OCAs and residential ISPs.
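The per-title idea can be sketched as a toy ladder selector. The complexity score, base bitrates, and scaling rule below are all invented for illustration; Netflix's real pipeline derives each ladder from per-title (and later per-shot) quality/bitrate analysis scored with VMAF:

```python
def encoding_ladder(complexity):
    """Pick bitrates (kbps) per resolution from a 0..1 content-complexity score.

    Simple animation (low complexity) needs far less bitrate than dense
    action footage to hit the same perceptual-quality target.
    """
    base = {"480p": 1750, "720p": 3000, "1080p": 5800, "4k": 16000}
    # Scale each rung: easy titles get steep savings, hard titles get headroom.
    factor = 0.5 + complexity  # 0.5x for trivial content up to 1.5x for complex
    return {res: round(kbps * factor) for res, kbps in base.items()}
```

Under these made-up numbers, a simple cartoon (complexity 0.1) gets roughly 60% of the fixed-ladder bitrates, while dense action (0.9) gets 140%.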
Netflix pioneered chaos engineering after a three-day outage in August 2008 caused by database corruption in their monolithic architecture. The core insight: "the best way to avoid failure is to fail constantly."
graph TD
subgraph Simian["Simian Army (Original)"]
CM["Chaos Monkey
(kill instances)"]
LM["Latency Monkey
(inject delays)"]
CONF["Conformity Monkey
(best practices)"]
DOC["Doctor Monkey
(health checks)"]
SEC["Security Monkey
(vuln detection)"]
end
subgraph Regional["Regional Chaos"]
CG["Chaos Gorilla
(AZ outage)"]
CK["Chaos Kong
(Region outage)"]
end
subgraph Modern["Modern Tooling"]
FIT["FIT
(Failure Injection)"]
SPIN["Spinnaker
(CD Platform)"]
end
CM --> SPIN
FIT --> ZUUL2["Zuul Edge
(applies failures)"]
style CM fill:#5c4a32,stroke:#b87878,color:#e8e4dc
style FIT fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style SPIN fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style ZUUL2 fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style CG fill:#3a332c,stroke:#b87878,color:#e8e4dc
style CK fill:#3a332c,stroke:#b87878,color:#e8e4dc
| Tool | Function | Status |
|---|---|---|
| Chaos Monkey | Randomly terminates production instances | Active (standalone) |
| Latency Monkey | Injects artificial delays in RESTful communication | Retired |
| Conformity Monkey | Shuts down non-conforming instances | Folded into Spinnaker |
| Doctor Monkey | Health-checks instances, removes unhealthy ones | Retired |
| Janitor Monkey | Cleans up unused resources | Replaced by Swabbie |
| Security Monkey | Detects security vulnerabilities and policy violations | Retired |
| Chaos Gorilla | Simulates outage of entire AWS Availability Zone | Retired |
| Chaos Kong | Simulates loss of entire AWS Region | Retired |
Introduced in October 2014 for more precise failure injection than Simian Army tools. FIT pushes failure metadata to Zuul, where edge filters apply injected failures to matching requests, enabling fine-grained chaos experiments per team.
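A FIT-style edge filter can be sketched as follows. The failure-spec shape, the matching rule, and the service names are invented for illustration; the real system pushes richer failure scenarios scoped to devices, accounts, or percentages of traffic:

```python
import random

# Failure "recipes" pushed from a FIT-like control plane to the edge.
# Each targets a downstream service for a fraction of matching requests.
FAILURES = [
    {"service": "ratings", "mode": "error", "fraction": 1.0},
]

def apply_failures(request, call_downstream):
    """Edge filter: inject configured failures before calling downstream."""
    for spec in FAILURES:
        if request["service"] == spec["service"] and random.random() < spec["fraction"]:
            if spec["mode"] == "error":
                raise RuntimeError(f"FIT: injected failure for {spec['service']}")
    return call_downstream(request)
```

The point of running this at the edge is precision: a team can break only its own dependency, for only its own test traffic, instead of killing whole instances.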
Netflix processes 2+ trillion events per day through its real-time data infrastructure. The Keystone Stream Processing Platform is the data backbone — a petabyte-scale real-time event streaming and processing system handling ~3 PB incoming and ~7 PB outgoing data daily.
graph LR
subgraph Producers["Event Producers"]
APP["Netflix App
(user actions)"]
SVC2["Microservices
(system events)"]
end
subgraph Kafka["Apache Kafka (100+ clusters)"]
FRONT["Fronting Kafka
(ingestion)"]
SEC2["Secondary Kafka
(derived topics)"]
end
subgraph Keystone["Keystone Services"]
MSG["Messaging Service
(produce & transport)"]
ROUTE["Routing Service
(to sinks)"]
end
subgraph Processing["Stream Processing"]
FLINK["Apache Flink
(20,000+ jobs)"]
SQL["Streaming SQL
(1,200+ processors)"]
end
subgraph Sinks["Data Sinks"]
S3B["Amazon S3"]
ES["Elasticsearch"]
CASS["Cassandra"]
end
APP --> MSG
SVC2 --> MSG
MSG --> FRONT
FRONT --> FLINK
FRONT --> ROUTE
ROUTE --> S3B
ROUTE --> ES
ROUTE --> SEC2
FLINK --> CASS
SQL --> FLINK
style FRONT fill:#5c4a32,stroke:#7090a8,color:#e8e4dc
style FLINK fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style MSG fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style ROUTE fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style SQL fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style S3B fill:#3a332c,stroke:#8b7355,color:#e8e4dc
style APP fill:#2a2520,stroke:#b0a898,color:#e8e4dc
The platform evolved through four generations:

1. Traditional Hadoop/Hive batch processing for offline analytics.
2. Real-time event routing with Kafka as the central message bus.
3. Stream Processing as a Service: managed Flink jobs with a self-service UI.
4. Federated ownership with a Streaming SQL abstraction over Flink; 1,200+ processors created in one year.
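The routing layer's core job, fanning events out from fronting topics to configured sinks, can be sketched like this. The route table shape and event-type names are invented; in Keystone each route is a managed job, not an in-process loop:

```python
from collections import defaultdict

# Declarative routes: which event types go to which sinks (illustrative only).
ROUTES = {
    "playback.start": ["s3", "elasticsearch"],
    "ui.impression": ["s3"],
}

def route_events(events):
    """Fan each event out to every sink its type is routed to."""
    sinks = defaultdict(list)
    for event in events:
        for sink in ROUTES.get(event["type"], []):
            sinks[sink].append(event)
    return dict(sinks)
```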
Netflix's recommendation system drives 75–80% of all viewing hours and saves an estimated $1 billion per year in subscriber retention. It operates across three computation modes: offline, nearline, and online.
graph TD
USER["User Actions
(plays, ratings, scrolls)"]
USER --> KAFKA3["Kafka Topics"]
subgraph Offline["Offline (Hours)"]
SPARK["Spark / Hadoop
(model training)"]
S3C["S3
(model artifacts)"]
end
subgraph Nearline["Nearline (Seconds)"]
MANHATTAN["Manhattan
(event processing)"]
CASS2["Cassandra
(intermediate results)"]
end
subgraph Online["Online (Milliseconds)"]
BLEND["Blending Service
(real-time context)"]
EVC["EVCache
(hot results)"]
end
KAFKA3 --> SPARK
KAFKA3 --> MANHATTAN
SPARK --> S3C
S3C --> EVC
MANHATTAN --> CASS2
CASS2 --> BLEND
EVC --> BLEND
BLEND --> HOMEPAGE["Personalized
Homepage"]
style USER fill:#2a2520,stroke:#b0a898,color:#e8e4dc
style KAFKA3 fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style SPARK fill:#3a332c,stroke:#d8c8a8,color:#e8e4dc
style MANHATTAN fill:#3a332c,stroke:#d8c8a8,color:#e8e4dc
style BLEND fill:#5c4a32,stroke:#d8c8a8,color:#e8e4dc
style EVC fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style CASS2 fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style HOMEPAGE fill:#3a332c,stroke:#d4a050,color:#e8e4dc
| Mode | Latency | Complexity | Use Case |
|---|---|---|---|
| Offline | Hours | Unlimited | Model training, batch feature engineering, large-scale matrix factorization |
| Nearline | Seconds–minutes | Moderate | Incremental model updates, event-driven re-ranking |
| Online | Milliseconds | Latency-limited | Live recommendation serving, blending precomputed results with real-time signals |
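The online blending step in the table can be sketched as combining a precomputed (offline) ranking with a cheap real-time signal. The boost weight and feature are invented for illustration; the real blending service weighs many contextual signals:

```python
def blend(precomputed, context):
    """Re-rank precomputed scores with a real-time session signal.

    precomputed: {title: offline_score} pulled from a hot cache.
    context: set of titles the user interacted with this session.
    """
    blended = {}
    for title, score in precomputed.items():
        boost = 0.3 if title in context else 0.0  # invented boost weight
        blended[title] = score + boost
    return sorted(blended, key=blended.get, reverse=True)
```

This split is the whole point of the three-tier design: expensive model scoring happens offline, while the milliseconds-budget online path only does cheap arithmetic over cached results.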
- Combines collaborative filtering, content-based filtering, deep neural networks, and graph-based models.
- Single models handle homepage ranking, search ordering, and notification personalization simultaneously.
- Explore/exploit strategies for artwork personalization — different users see different promotional images.
- Every algorithm change is tested on live traffic before full rollout across the subscriber base.
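The explore/exploit artwork strategy can be sketched as an epsilon-greedy bandit. This is a deliberately simplified stand-in (Netflix has described contextual bandits for artwork selection); the epsilon value and the play-through reward model here are invented:

```python
import random

def choose_artwork(stats, epsilon=0.1):
    """Epsilon-greedy: usually show the best-performing image, sometimes explore.

    stats: {artwork_id: (plays, impressions)} observed so far.
    """
    if random.random() < epsilon:
        return random.choice(list(stats))  # explore: gather data on alternatives
    # Exploit: highest observed play-through rate (Laplace-smoothed).
    return max(stats, key=lambda a: (stats[a][0] + 1) / (stats[a][1] + 2))
```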
Netflix built the Cosmos platform for media processing — combining microservices, asynchronous workflows, and serverless functions. Development started in 2018, reaching production in 2019 using a "strangler fig" migration pattern around the legacy Reloaded system.
graph TD
subgraph API["Optimus — API Layer"]
OPT["Maps external requests
to internal models"]
end
subgraph Workflow["Plato — Workflow Orchestration"]
PLATO["Forward-chaining
rule engine (DAGs)"]
end
subgraph Compute["Stratum — Serverless Compute"]
STRAT["Stateless functions
(encoding, QA)"]
end
TIMESTONE["Timestone
Priority Queue"]
OPT --> TIMESTONE
TIMESTONE --> PLATO
PLATO --> TIMESTONE
TIMESTONE --> STRAT
STRAT --> TIMESTONE
subgraph Output["Delivery"]
VMAF2["VMAF Scoring"]
CDN2["Open Connect
CDN"]
end
STRAT --> VMAF2
VMAF2 --> CDN2
style OPT fill:#3a332c,stroke:#b0a898,color:#e8e4dc
style PLATO fill:#3a332c,stroke:#b0a898,color:#e8e4dc
style STRAT fill:#3a332c,stroke:#b0a898,color:#e8e4dc
style TIMESTONE fill:#5c4a32,stroke:#c4a87a,color:#e8e4dc
style VMAF2 fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style CDN2 fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
| Subsystem | Role | Details |
|---|---|---|
| Optimus | API Layer | Entry point for all media processing requests; maps external to internal business models |
| Plato | Workflow Orchestration | Forward-chaining rule engine supporting DAG-based workflows lasting minutes to years |
| Stratum | Serverless Compute | Generates typed RPC clients; runs stateless compute-intensive functions on elastic EC2 |
| Timestone | Messaging | High-scale, low-latency priority queuing system connecting all three subsystems |
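Timestone itself is a large distributed system; as a sketch of its queueing semantics only, a priority queue with FIFO ordering within a priority level can be built on the standard-library heap (priorities and message contents below are invented):

```python
import heapq
import itertools

class PriorityQueue:
    """Lower number = higher priority; FIFO within the same priority."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves insertion order

    def put(self, priority, message):
        heapq.heappush(self._heap, (priority, next(self._seq), message))

    def get(self):
        priority, _, message = heapq.heappop(self._heap)
        return message
```

The priority dimension is what lets urgent work (say, encoding a launching title) jump ahead of long-running backfill jobs sharing the same compute pool.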
A Cosmos microservice for per-title video encoding optimization. Analyzes content complexity to generate custom encoding ladders, performing parallel chunk encoding across hundreds of EC2 instances with multi-codec support (H.264, H.265/HEVC, VP9, AV1).
Netflix's container platform Titus, built on Apache Mesos with Docker containers running on EC2, launches up to 500,000 containers and 200,000 clusters per day across tens of thousands of EC2 VMs in seven regionally isolated stacks.
graph TD
subgraph Scheduler["Titus Master"]
MASTER["Leader-Elected
Scheduler"]
ZK["Zookeeper
(leader election)"]
CASST["Cassandra
(persistence)"]
end
subgraph Agents["Titus Agents (EC2 VMs)"]
A1["Agent Pool 1"]
A2["Agent Pool 2"]
A3["Agent Pool N"]
end
subgraph Containers["Docker Containers"]
C1["Microservice A"]
C2["Microservice B"]
C3["Batch Job"]
end
MASTER --> ZK
MASTER --> CASST
MASTER -->|"placement"| Agents
Agents --> Containers
style MASTER fill:#5c4a32,stroke:#8b7355,color:#e8e4dc
style ZK fill:#3a332c,stroke:#8b7355,color:#e8e4dc
style CASST fill:#3a332c,stroke:#8b7355,color:#e8e4dc
style A1 fill:#3a332c,stroke:#b0a898,color:#e8e4dc
style A2 fill:#3a332c,stroke:#b0a898,color:#e8e4dc
style A3 fill:#3a332c,stroke:#b0a898,color:#e8e4dc
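The master's placement decision can be sketched as simple best-fit bin-packing. The single-resource capacity model below is invented for illustration; the real scheduler also considers memory, network, disk, availability-zone balance, and more:

```python
def place(task_cpu, agents):
    """Pick the agent with the least free CPU that still fits the task.

    Best-fit packing keeps large agents free for large tasks.
    agents: list of {"name": str, "free_cpu": int} records, mutated on placement.
    """
    candidates = [a for a in agents if a["free_cpu"] >= task_cpu]
    if not candidates:
        return None  # no capacity: the caller would scale up the agent pool
    best = min(candidates, key=lambda a: a["free_cpu"])
    best["free_cpu"] -= task_cpu
    return best["name"]
```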
Open-source multi-cloud CD platform built by Netflix. Pipelines are composed of Stages (decomposed into Tasks) that can run in parallel or serially.
graph LR
subgraph Spinnaker["Spinnaker CD"]
APP["Application"]
CLUST["Cluster"]
SG["Server Group
(load-balanced)"]
end
subgraph Triggers["Triggers"]
JENK["Jenkins"]
CRON["Cron"]
PIPE["Pipeline
Completion"]
end
subgraph Strategies["Deploy Strategies"]
BG["Blue/Green"]
HL["Highlander"]
CAN["Canary"]
end
Triggers --> APP
APP --> CLUST
CLUST --> SG
SG --> Strategies
style APP fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style CLUST fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style SG fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style BG fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
style CAN fill:#3a332c,stroke:#d4a050,color:#e8e4dc
Spinnaker supports deployment to AWS EC2, Kubernetes, GCE, GKE, Azure, Cloud Foundry, and Oracle Cloud. Netflix uses Blue/Green, Highlander (the previous server group is destroyed once the new one passes health checks), and Canary strategies with integrated canary analysis.
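Automated canary analysis compares metrics from the canary against a matched baseline group. A toy version follows; the metric, tolerance, and pass/fail rule are invented for illustration (the open-source implementation, Kayenta, uses proper statistical tests rather than a mean comparison):

```python
def canary_passes(baseline_errors, canary_errors, tolerance=0.1):
    """Fail the canary if its mean error rate exceeds baseline by > tolerance.

    Each argument is a list of per-minute error rates from matched
    instance groups running old (baseline) and new (canary) code.
    """
    baseline = sum(baseline_errors) / len(baseline_errors)
    canary = sum(canary_errors) / len(canary_errors)
    return canary <= baseline * (1 + tolerance)
```

Comparing the canary against a freshly deployed baseline of the *old* code, rather than the whole fleet, controls for effects of the deployment itself.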
Cross-cutting architectural patterns employed throughout Netflix's microservices ecosystem, drawn from their extensive open-source contributions and conference talks.
graph TD
subgraph Resilience["Resilience Patterns"]
CB["Circuit Breaker
(Hystrix)"]
BH["Bulkhead Isolation
(thread pools)"]
FB["Fallback
Mechanisms"]
end
subgraph Routing["Routing Patterns"]
GW["API Gateway
(Zuul)"]
SD["Service Discovery
(Eureka)"]
LB["Client-Side LB
(Ribbon)"]
end
subgraph Data["Data Patterns"]
ES2["Event Sourcing
(Kafka as log)"]
CQRS["CQRS
(read/write split)"]
CPDB["Control/Data
Plane Separation"]
end
subgraph Migration["Migration Patterns"]
SF["Strangler Fig
(Cosmos)"]
end
CB --> FB
GW --> SD
SD --> LB
style CB fill:#3a332c,stroke:#b87878,color:#e8e4dc
style GW fill:#3a332c,stroke:#d4a050,color:#e8e4dc
style SD fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
style ES2 fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style CPDB fill:#3a332c,stroke:#7090a8,color:#e8e4dc
style SF fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
| Layer | Technologies |
|---|---|
| Edge / Gateway | Zuul 2 (Netty), Ribbon, Eureka |
| Microservices | Java, Spring Boot, gRPC |
| Containers | Titus (Mesos + Docker on EC2) |
| Databases | Cassandra, CockroachDB, MySQL |
| Caching | EVCache (Memcached) |
| Messaging | Apache Kafka (100+ clusters) |
| Stream Processing | Apache Flink (20,000+ jobs) |
| Batch Processing | Apache Spark, Hadoop/Hive |
| CDN | Open Connect (custom OCAs) |
| CI/CD | Spinnaker, Jenkins |
| Chaos | Chaos Monkey, FIT |
| Media Processing | Cosmos (Optimus + Plato + Stratum) |
| ML / Recommendations | Manhattan, Hydra, collaborative filtering, deep learning |
| Monitoring | Atlas (metrics), Mantis (real-time ops) |