Architecture Maps

Netflix Architecture

A visual atlas of the cloud-native microservices landscape that streams entertainment to 260+ million subscribers — compiled from public talks, blog posts, and open-source repositories.

Public Sources Only · 1,000+ Microservices · AWS Multi-Region · Updated: Mar 2026
01 · Enterprise Overview

Netflix runs a cloud-native microservices architecture on Amazon Web Services across multiple regions. The full AWS migration completed in 2016. The system is composed of over 1,000 loosely coupled microservices, split between a control plane (AWS) for all backend logic and a data plane (Open Connect CDN) for video delivery.

1,000+ Microservices
260M+ Subscribers
500K Containers/Day
2T+ Events/Day
100+ Kafka Clusters
High-Level Architecture: Control Plane vs Data Plane
graph TD
    subgraph Clients["Client Devices"]
        TV["Smart TVs"]
        MOB["Mobile Apps"]
        WEB["Web Browsers"]
    end

    subgraph Control["Control Plane — AWS"]
        ZUUL["Zuul API Gateway"]
        SVC["Microservices<br/>(1,000+)"]
        RECS["Recommendation<br/>Engine"]
        DATA["Data Platform<br/>(Kafka + Flink)"]
    end

    subgraph Deliver["Data Plane — Open Connect CDN"]
        OCA["OCA Appliances<br/>(inside ISP networks)"]
        IXP["IXP Peering<br/>Sites"]
    end

    Clients --> ZUUL
    ZUUL --> SVC
    SVC --> RECS
    SVC --> DATA
    SVC -->|"steering &<br/>manifests"| Deliver
    Clients -->|"video streams"| Deliver

    style ZUUL fill:#5c4a32,stroke:#c4a87a,color:#e8e4dc
    style SVC fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style RECS fill:#3a332c,stroke:#d8c8a8,color:#e8e4dc
    style DATA fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style OCA fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
    style IXP fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
    style TV fill:#2a2520,stroke:#b0a898,color:#e8e4dc
    style MOB fill:#2a2520,stroke:#b0a898,color:#e8e4dc
    style WEB fill:#2a2520,stroke:#b0a898,color:#e8e4dc

Core Data Stores

Apache Cassandra

Primary distributed database for scale-out workloads: viewing history, user profiles, bookmarks.

Data

EVCache (Memcached)

Custom caching layer storing session data, watch history, and recommendations. Maintains 3 copies across AZs.

Cache

MySQL

Used for billing, account data, and transactional workloads.

Transactions

CockroachDB

Adopted for transactional workloads that require strong global consistency.

Global

Amazon S3

Object storage for encoded video assets, logs, and analytics data.

Storage
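The EVCache entry above notes that each value is kept in three copies, one per availability zone, so any zone can serve a local read. A minimal sketch of that zone-replication idea (the class, zone names, and API are illustrative, not Netflix's actual EVCache client):

```python
# Sketch of a zone-replicated cache: every write fans out to one
# replica per availability zone, so any zone can serve a local read.
class ZoneReplicatedCache:
    def __init__(self, zones):
        # one independent dict per AZ stands in for a memcached replica
        self.replicas = {zone: {} for zone in zones}

    def set(self, key, value):
        # fan the write out to all zones (one copy per AZ)
        for store in self.replicas.values():
            store[key] = value

    def get(self, key, local_zone):
        # read from the caller's own zone first for low latency
        value = self.replicas[local_zone].get(key)
        if value is not None:
            return value
        # fall back to any other zone that still holds the key
        for zone, store in self.replicas.items():
            if zone != local_zone and key in store:
                return store[key]
        return None

cache = ZoneReplicatedCache(["us-east-1a", "us-east-1b", "us-east-1c"])
cache.set("profile:42", {"plan": "premium"})
print(cache.get("profile:42", "us-east-1b"))  # served from the local replica
```

The cross-zone fallback is what lets reads survive the loss of a whole AZ's replica.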
02 · Edge & API Gateway

Zuul is the front door for all requests from devices and web applications to Netflix's backend. It is a JVM-based L7 application gateway with a filter-based architecture; Zuul 2 rebuilt it on Netty for asynchronous, non-blocking I/O.
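The filter model can be sketched in a few lines. This is a simplified illustration, not Zuul's actual API — real Zuul filters are JVM classes with a type, order, and shouldFilter/run methods, and the filter names and context fields below are invented:

```python
# Minimal sketch of a Zuul-style filter pipeline: pre filters run before
# routing, the route step calls the origin, post filters decorate the response.
def rate_limit(ctx):
    # pre filter: reject when the client is over its quota (toy rule)
    if ctx["requests_this_second"] > 100:
        ctx["response"] = {"status": 429}
        ctx["short_circuit"] = True

def route(ctx):
    # routing filter: forward to the origin service (stubbed out here)
    ctx["response"] = {"status": 200, "body": f"hello {ctx['user']}"}

def add_headers(ctx):
    # post filter: attach metrics/debug headers to the response
    ctx["response"]["headers"] = {"x-zuul": "sketch"}

def run_pipeline(ctx, pre, routing, post):
    for f in pre:
        f(ctx)
        if ctx.get("short_circuit"):
            return ctx["response"]   # e.g. auth or rate-limit rejection
    routing(ctx)
    for f in post:
        f(ctx)
    return ctx["response"]

resp = run_pipeline({"user": "alice", "requests_this_second": 3},
                    pre=[rate_limit], routing=route, post=[add_headers])
print(resp["status"])  # 200
```

Short-circuiting in a pre filter is what lets the edge reject bad traffic without ever touching an origin service.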

Zuul Filter Pipeline & Service Mesh
graph LR
    REQ["Client<br/>Request"] --> PRE["Pre-Routing Filters<br/>(auth, rate limit)"]
    PRE --> ROUTE["Routing Filters<br/>(forward to origin)"]
    ROUTE --> EUREKA["Eureka<br/>Service Discovery"]
    EUREKA --> RIBBON["Ribbon<br/>Load Balancer"]
    RIBBON --> ORIGIN["Origin<br/>Microservice"]
    ORIGIN --> POST["Post-Routing Filters<br/>(metrics, headers)"]
    POST --> RESP["Response"]
    ROUTE --> HYSTRIX["Hystrix<br/>Circuit Breaker"]

    style REQ fill:#2a2520,stroke:#b0a898,color:#e8e4dc
    style PRE fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style ROUTE fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style POST fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style EUREKA fill:#5c4a32,stroke:#c4a87a,color:#e8e4dc
    style RIBBON fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style HYSTRIX fill:#3a332c,stroke:#b87878,color:#e8e4dc
    style ORIGIN fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style RESP fill:#2a2520,stroke:#b0a898,color:#e8e4dc

Zuul 2

Re-architected on Netty for asynchronous, non-blocking I/O. Supports persistent connections (WebSockets, SSE) at scale. Filters can be loaded dynamically at runtime.

Gateway

Eureka

Server-client service discovery. Microservices self-register on startup; servers replicate state for high availability.

Discovery

Hystrix

Circuit breaker pattern: when failure rate exceeds threshold, requests are short-circuited to prevent cascading failures. Provides fallback mechanisms.

Resilience

Ribbon

Client-side load balancer that works with Eureka. Supports round-robin, weighted, and availability-filtering strategies.

Load Balancing
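The circuit-breaker behavior described in the Hystrix card above can be sketched as follows. This toy version counts consecutive failures only; real Hystrix uses a rolling statistical window and half-open probing to recover, both omitted here:

```python
# Toy Hystrix-style circuit breaker: after `threshold` consecutive
# failures the circuit opens and calls short-circuit to a fallback.
class CircuitBreaker:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    @property
    def open(self):
        return self.failures >= self.threshold

    def call(self, command, fallback):
        if self.open:
            return fallback()      # short-circuit: skip the origin entirely
        try:
            result = command()
            self.failures = 0      # a success resets the failure count
            return result
        except Exception:
            self.failures += 1
            return fallback()      # per-call fallback on failure

def flaky():
    raise TimeoutError("origin timed out")

breaker = CircuitBreaker(threshold=3)
for _ in range(4):
    answer = breaker.call(flaky, fallback=lambda: "cached recommendations")
print(answer, breaker.open)  # falls back, circuit now open
```

The key property is the short-circuit path: once open, the breaker stops sending load to a struggling service, which is what prevents cascading failure.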
03 · Open Connect CDN

Netflix built its own CDN, Open Connect, which serves 100% of Netflix's video traffic. Specialized caching appliances (OCAs) are deployed inside ISP networks and at Internet Exchange Points, and are provided free of charge to qualifying ISPs.

Content Delivery Pipeline
graph TD
    subgraph Ingest["1. Ingest"]
        SRC["Source Media<br/>from Studios"]
    end

    subgraph Encode["2. Encoding (AWS)"]
        CHUNK["Split into Chunks"]
        EC2["Parallel Encode<br/>(100s of EC2)"]
        VMAF["VMAF Quality<br/>Measurement"]
        CODEC["Multi-Codec Output<br/>(H.264, HEVC, VP9, AV1)"]
    end

    subgraph Distribute["3. Distribution"]
        S3["Amazon S3<br/>(encoded assets)"]
        FILL["Off-Peak Fill<br/>to OCAs"]
    end

    subgraph Playback["4. Playback"]
        STEER["Steering Service<br/>(best OCA selection)"]
        MANIFEST["Manifest<br/>(stream options)"]
        OCA["OCA Appliance<br/>(100+ TB SSD)"]
        CLIENT["Client Player"]
    end

    SRC --> CHUNK
    CHUNK --> EC2
    EC2 --> VMAF
    VMAF --> CODEC
    CODEC --> S3
    S3 --> FILL
    FILL --> OCA
    STEER --> MANIFEST
    MANIFEST --> CLIENT
    CLIENT -->|"stream"| OCA

    style SRC fill:#2a2520,stroke:#b0a898,color:#e8e4dc
    style EC2 fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style VMAF fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style S3 fill:#3a332c,stroke:#8b7355,color:#e8e4dc
    style OCA fill:#5c4a32,stroke:#6a8a6a,color:#e8e4dc
    style STEER fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style CLIENT fill:#2a2520,stroke:#b0a898,color:#e8e4dc
Per-Title Encoding

Each title is analyzed for visual complexity and a custom encoding ladder is generated. This reduces bandwidth by up to 40% without sacrificing perceptual quality, measured by Netflix's VMAF metric. Roughly 95% of traffic is delivered via direct connections between OCAs and residential ISPs.
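The per-title idea reduces to: measure a title's visual complexity, then scale the bitrate ladder accordingly. A deliberately simplified sketch — the complexity scale and bitrate formula are invented for illustration and are not Netflix's actual ladder math:

```python
# Sketch of per-title ladder selection: simpler titles get lower bitrates
# at each resolution rung. All numbers here are illustrative.
def encoding_ladder(complexity):
    """complexity in [0, 1]: 0 = flat animation, 1 = high-motion grain."""
    base = [(426, 240), (640, 360), (1280, 720), (1920, 1080)]
    ladder = []
    for width, height in base:
        # toy model: bitrate scales with pixel count and visual complexity
        kbps = int(width * height * (0.05 + 0.15 * complexity) / 100)
        ladder.append({"resolution": f"{width}x{height}", "kbps": kbps})
    return ladder

simple = encoding_ladder(0.1)    # e.g. a flat animated title
complex_ = encoding_ladder(0.9)  # e.g. grainy action footage
print(simple[-1]["kbps"], complex_[-1]["kbps"])  # simple title needs far fewer bits
```

A fixed ladder must be conservative enough for the hardest content; letting each title carry its own ladder is where the bandwidth savings come from.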

04 · Chaos Engineering

Netflix pioneered chaos engineering after a three-day outage in August 2008 caused by database corruption in their monolithic architecture. The core insight: "the best way to avoid failure is to fail constantly."

Chaos Engineering Toolchain
graph TD
    subgraph Simian["Simian Army (Original)"]
        CM["Chaos Monkey<br/>(kill instances)"]
        LM["Latency Monkey<br/>(inject delays)"]
        CONF["Conformity Monkey<br/>(best practices)"]
        DOC["Doctor Monkey<br/>(health checks)"]
        SEC["Security Monkey<br/>(vuln detection)"]
    end

    subgraph Regional["Regional Chaos"]
        CG["Chaos Gorilla<br/>(AZ outage)"]
        CK["Chaos Kong<br/>(Region outage)"]
    end

    subgraph Modern["Modern Tooling"]
        FIT["FIT<br/>(Failure Injection)"]
        SPIN["Spinnaker<br/>(CD Platform)"]
    end

    CM --> SPIN
    FIT --> ZUUL2["Zuul Edge<br/>(applies failures)"]

    style CM fill:#5c4a32,stroke:#b87878,color:#e8e4dc
    style FIT fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style SPIN fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style ZUUL2 fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style CG fill:#3a332c,stroke:#b87878,color:#e8e4dc
    style CK fill:#3a332c,stroke:#b87878,color:#e8e4dc

Simian Army Roster

Tool | Function | Status
Chaos Monkey | Randomly terminates production instances | Active (standalone)
Latency Monkey | Injects artificial delays in RESTful communication | Retired
Conformity Monkey | Shuts down non-conforming instances | Folded into Spinnaker
Doctor Monkey | Health-checks instances, removes unhealthy ones | Retired
Janitor Monkey | Cleans up unused resources | Replaced by Swabbie
Security Monkey | Detects security vulnerabilities and policy violations | Retired
Chaos Gorilla | Simulates outage of an entire AWS Availability Zone | Retired
Chaos Kong | Simulates loss of an entire AWS Region | Retired
Failure Injection Testing (FIT)

Introduced in October 2014 for more precise failure injection than Simian Army tools. FIT pushes failure metadata to Zuul, where edge filters apply injected failures to matching requests, enabling fine-grained chaos experiments per team.
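The request-matching mechanism can be sketched simply: scenarios declare match criteria, and the edge applies a failure only to requests that match. The field names and scenario shapes below are illustrative, not FIT's real schema:

```python
# Sketch of FIT-style failure injection at the edge: failure "scenarios"
# carry match criteria; an edge filter applies the failure to matching
# requests and lets everything else through untouched.
SCENARIOS = [
    {"service": "ratings", "device": "ios", "failure": "latency:2000ms"},
    {"service": "bookmarks", "device": "*", "failure": "error:503"},
]

def inject_failure(request):
    for s in SCENARIOS:
        if s["service"] == request["service"] and s["device"] in ("*", request["device"]):
            return s["failure"]   # edge filter applies this failure
    return None                   # request proceeds normally

print(inject_failure({"service": "ratings", "device": "ios"}))      # latency:2000ms
print(inject_failure({"service": "ratings", "device": "android"}))  # None
```

Scoping failures to match criteria is what makes the experiments fine-grained: a team can break one dependency for one device class without touching the rest of production traffic.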

05 · Data Platform

Netflix processes 2+ trillion events per day through its real-time data infrastructure. The Keystone Stream Processing Platform is the data backbone — a petabyte-scale real-time event streaming and processing system handling ~3 PB incoming and ~7 PB outgoing data daily.
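At its core, Keystone's routing layer consumes events from fronting topics and fans each event out to the sinks its route declares. A minimal sketch of that fan-out, with invented topic and sink names:

```python
# Sketch of Keystone-style routing: each topic declares its sinks, and
# the router delivers a copy of every event to each configured sink.
ROUTES = {
    "playback-events": ["s3", "elasticsearch"],
    "ui-impressions": ["s3"],
}
SINKS = {"s3": [], "elasticsearch": []}

def route_event(topic, event):
    for sink in ROUTES.get(topic, []):
        SINKS[sink].append(event)   # deliver a copy to each configured sink

route_event("playback-events", {"title": 123, "action": "play"})
route_event("ui-impressions", {"row": 2})
print(len(SINKS["s3"]), len(SINKS["elasticsearch"]))  # 2 1
```

Keeping routing declarative (topic-to-sinks configuration, not code) is what lets a platform team operate thousands of pipelines on behalf of producers.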

Keystone Pipeline Architecture
graph LR
    subgraph Producers["Event Producers"]
        APP["Netflix App<br/>(user actions)"]
        SVC2["Microservices<br/>(system events)"]
    end

    subgraph Kafka["Apache Kafka (100+ clusters)"]
        FRONT["Fronting Kafka<br/>(ingestion)"]
        SEC2["Secondary Kafka<br/>(derived topics)"]
    end

    subgraph Keystone["Keystone Services"]
        MSG["Messaging Service<br/>(produce & transport)"]
        ROUTE["Routing Service<br/>(to sinks)"]
    end

    subgraph Processing["Stream Processing"]
        FLINK["Apache Flink<br/>(20,000+ jobs)"]
        SQL["Streaming SQL<br/>(1,200+ processors)"]
    end

    subgraph Sinks["Data Sinks"]
        S3B["Amazon S3"]
        ES["Elasticsearch"]
        CASS["Cassandra"]
    end

    APP --> MSG
    SVC2 --> MSG
    MSG --> FRONT
    FRONT --> FLINK
    FRONT --> ROUTE
    ROUTE --> S3B
    ROUTE --> ES
    ROUTE --> SEC2
    FLINK --> CASS
    SQL --> FLINK

    style FRONT fill:#5c4a32,stroke:#7090a8,color:#e8e4dc
    style FLINK fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style MSG fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style ROUTE fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style SQL fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style S3B fill:#3a332c,stroke:#8b7355,color:#e8e4dc
    style APP fill:#2a2520,stroke:#b0a898,color:#e8e4dc

Four Innovation Phases

Phase 1: Batch ETL

Traditional Hadoop/Hive batch processing for offline analytics.

Historical

Phase 2: Keystone Pipeline

Real-time event routing with Kafka as the central message bus.

Real-time

Phase 3: SPaaS

Stream Processing as a Service: managed Flink jobs with self-service UI.

Managed

Phase 4: Data Mesh

Federated ownership with Streaming SQL abstraction over Flink. 1,200 processors created in one year.

Current
06 · Recommendation Engine

Netflix's recommendation system drives 75–80% of all viewing hours and saves an estimated $1 billion per year in subscriber retention. It operates across three computation modes: offline, nearline, and online.

Three-Mode Recommendation Flow
graph TD
    USER["User Actions<br/>(plays, ratings, scrolls)"]
    USER --> KAFKA3["Kafka Topics"]

    subgraph Offline["Offline (Hours)"]
        SPARK["Spark / Hadoop<br/>(model training)"]
        S3C["S3<br/>(model artifacts)"]
    end

    subgraph Nearline["Nearline (Seconds)"]
        MANHATTAN["Manhattan<br/>(event processing)"]
        CASS2["Cassandra<br/>(intermediate results)"]
    end

    subgraph Online["Online (Milliseconds)"]
        BLEND["Blending Service<br/>(real-time context)"]
        EVC["EVCache<br/>(hot results)"]
    end

    KAFKA3 --> SPARK
    KAFKA3 --> MANHATTAN
    SPARK --> S3C
    S3C --> EVC
    MANHATTAN --> CASS2
    CASS2 --> BLEND
    EVC --> BLEND
    BLEND --> HOMEPAGE["Personalized<br/>Homepage"]

    style USER fill:#2a2520,stroke:#b0a898,color:#e8e4dc
    style KAFKA3 fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style SPARK fill:#3a332c,stroke:#d8c8a8,color:#e8e4dc
    style MANHATTAN fill:#3a332c,stroke:#d8c8a8,color:#e8e4dc
    style BLEND fill:#5c4a32,stroke:#d8c8a8,color:#e8e4dc
    style EVC fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style CASS2 fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style HOMEPAGE fill:#3a332c,stroke:#d4a050,color:#e8e4dc
Mode | Latency | Complexity | Use Case
Offline | Hours | Unlimited | Model training, batch feature engineering, large-scale matrix factorization
Nearline | Seconds–minutes | Moderate | Incremental model updates, event-driven re-ranking
Online | Milliseconds | Latency-limited | Live recommendation serving, blending precomputed results with real-time signals
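The online mode blends precomputed scores with live context under a millisecond budget. A toy sketch of such a blender; the signals, weights, and title IDs are invented for illustration:

```python
# Sketch of online blending: offline scores (as served from a hot cache)
# are adjusted by real-time context, then re-ranked.
def blend(precomputed, context):
    blended = []
    for title, score in precomputed:          # precomputed offline scores
        if title in context["recently_watched"]:
            score *= 0.5                      # demote what was just seen
        if context["device"] == "tv" and title.startswith("film:"):
            score *= 1.2                      # toy rule: nudge films on TVs
        blended.append((title, score))
    return [t for t, _ in sorted(blended, key=lambda x: -x[1])]

row = blend([("film:heat", 0.9), ("series:dark", 0.8), ("film:okja", 0.6)],
            {"recently_watched": {"film:heat"}, "device": "tv"})
print(row)  # the just-watched film drops to the bottom
```

This split — heavy computation offline, cheap adjustments online — is how the system stays responsive while still reacting to what the user did seconds ago.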

ML Approaches

Ensemble Methods

Combines collaborative filtering, content-based filtering, deep neural networks, and graph-based models.

Multi-Task Learning ("Hydra")

Single models handle homepage ranking, search ordering, and notification personalization simultaneously.

Contextual Bandits

Explore/exploit strategies for artwork personalization — different users see different promotional images.

A/B Testing Platform

Every algorithm change is tested on live traffic before full rollout across the subscriber base.
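The contextual-bandit approach above can be illustrated with the simplest explore/exploit policy, epsilon-greedy over candidate artwork. Netflix's production system conditions on user context; this sketch drops the context to stay short, and the artwork names and click rates are invented:

```python
import random

# Toy explore/exploit loop for artwork selection: with probability epsilon
# show a random image (explore), otherwise show the best performer so far
# (exploit), and record whether the user played the title.
def pick_artwork(stats, epsilon=0.1, rng=random):
    if rng.random() < epsilon:
        return rng.choice(list(stats))   # explore
    return max(stats, key=lambda a: stats[a]["plays"] / max(stats[a]["shows"], 1))

def record(stats, arm, played):
    stats[arm]["shows"] += 1
    stats[arm]["plays"] += int(played)

stats = {img: {"shows": 0, "plays": 0} for img in ("close-up", "landscape", "cast")}
rng = random.Random(7)
for _ in range(1000):
    arm = pick_artwork(stats, rng=rng)
    # simulated users: "close-up" converts 30% of the time, others 10%
    played = rng.random() < (0.3 if arm == "close-up" else 0.1)
    record(stats, arm, played)
print(max(stats, key=lambda a: stats[a]["shows"]))
```

Over enough impressions, exploit traffic concentrates on the image with the best observed play-through rate, while the epsilon slice keeps testing the alternatives.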

07 · Cosmos Studio Platform

Netflix built the Cosmos platform for media processing — combining microservices, asynchronous workflows, and serverless functions. Development started in 2018, reaching production in 2019 using a "strangler fig" migration pattern around the legacy Reloaded system.

Cosmos Three-Subsystem Architecture
graph TD
    subgraph API["Optimus — API Layer"]
        OPT["Maps external requests<br/>to internal models"]
    end

    subgraph Workflow["Plato — Workflow Orchestration"]
        PLATO["Forward-chaining<br/>rule engine (DAGs)"]
    end

    subgraph Compute["Stratum — Serverless Compute"]
        STRAT["Stateless functions<br/>(encoding, QA)"]
    end

    TIMESTONE["Timestone<br/>Priority Queue"]

    OPT --> TIMESTONE
    TIMESTONE --> PLATO
    PLATO --> TIMESTONE
    TIMESTONE --> STRAT
    STRAT --> TIMESTONE

    subgraph Output["Delivery"]
        VMAF2["VMAF Scoring"]
        CDN2["Open Connect<br/>CDN"]
    end

    STRAT --> VMAF2
    VMAF2 --> CDN2

    style OPT fill:#3a332c,stroke:#b0a898,color:#e8e4dc
    style PLATO fill:#3a332c,stroke:#b0a898,color:#e8e4dc
    style STRAT fill:#3a332c,stroke:#b0a898,color:#e8e4dc
    style TIMESTONE fill:#5c4a32,stroke:#c4a87a,color:#e8e4dc
    style VMAF2 fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style CDN2 fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
Subsystem | Role | Details
Optimus | API Layer | Entry point for all media processing requests; maps external to internal business models
Plato | Workflow Orchestration | Forward-chaining rule engine supporting DAG-based workflows lasting minutes to years
Stratum | Serverless Compute | Generates typed RPC clients; runs stateless compute-intensive functions on elastic EC2
Timestone | Messaging | High-scale, low-latency priority queuing system connecting all three subsystems
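Timestone's role — prioritized hand-off of work between the subsystems — can be sketched with a plain heap-based queue. This shows only the queuing idea, not Timestone's distributed implementation; the job strings and priority values are invented:

```python
import heapq

# Sketch of a priority queue between media-processing subsystems:
# urgent work (a title launching tonight) jumps ahead of backfill jobs.
class PriorityQueue:
    def __init__(self):
        self._heap, self._seq = [], 0

    def push(self, priority, job):
        # lower number = more urgent; seq breaks ties so equal-priority
        # jobs stay first-in-first-out
        heapq.heappush(self._heap, (priority, self._seq, job))
        self._seq += 1

    def pop(self):
        return heapq.heappop(self._heap)[2]

q = PriorityQueue()
q.push(5, "backfill: re-encode old catalog title")
q.push(1, "urgent: encode tonight's launch")
q.push(5, "backfill: subtitle QC")
print(q.pop())  # urgent: encode tonight's launch
```

Priority ordering is what lets one elastic compute pool serve both launch-critical encodes and years-long catalog backfills without separate infrastructure.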
Video Encoding Service (VES)

A Cosmos microservice for per-title video encoding optimization. Analyzes content complexity to generate custom encoding ladders, performing parallel chunk encoding across hundreds of EC2 instances with multi-codec support (H.264, H.265/HEVC, VP9, AV1).

08 · Infrastructure & Containers

Netflix's container platform, Titus, is built on Apache Mesos and runs Docker containers on EC2. It launches up to 500,000 containers and 200,000 clusters per day across tens of thousands of EC2 VMs in seven regionally isolated stacks.
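The scheduler's basic job — placing each container on an agent with capacity — can be sketched with a greedy least-loaded policy. Titus's real scheduler also weighs bin-packing efficiency, zone balance, and GPU constraints, all omitted from this sketch; agent names and CPU counts are invented:

```python
# Toy container placement: pick the agent with the most free CPU that
# still fits the task (a greedy "least-loaded" policy).
def place(task_cpus, agents):
    candidates = [a for a in agents if a["free_cpus"] >= task_cpus]
    if not candidates:
        return None                       # no capacity: task stays queued
    chosen = max(candidates, key=lambda a: a["free_cpus"])
    chosen["free_cpus"] -= task_cpus      # reserve the capacity
    return chosen["id"]

agents = [{"id": "agent-1", "free_cpus": 4},
          {"id": "agent-2", "free_cpus": 16},
          {"id": "agent-3", "free_cpus": 8}]
print(place(6, agents))   # agent-2
print(place(20, agents))  # None
```

Least-loaded spreading favors headroom; a production scheduler instead bin-packs to drain whole VMs for elasticity, which is one reason real placement logic is much richer than this.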

Titus Container Platform
graph TD
    subgraph Scheduler["Titus Master"]
        MASTER["Leader-Elected<br/>Scheduler"]
        ZK["Zookeeper<br/>(leader election)"]
        CASST["Cassandra<br/>(persistence)"]
    end

    subgraph Agents["Titus Agents (EC2 VMs)"]
        A1["Agent Pool 1"]
        A2["Agent Pool 2"]
        A3["Agent Pool N"]
    end

    subgraph Containers["Docker Containers"]
        C1["Microservice A"]
        C2["Microservice B"]
        C3["Batch Job"]
    end

    MASTER --> ZK
    MASTER --> CASST
    MASTER -->|"placement"| Agents
    Agents --> Containers

    style MASTER fill:#5c4a32,stroke:#8b7355,color:#e8e4dc
    style ZK fill:#3a332c,stroke:#8b7355,color:#e8e4dc
    style CASST fill:#3a332c,stroke:#8b7355,color:#e8e4dc
    style A1 fill:#3a332c,stroke:#b0a898,color:#e8e4dc
    style A2 fill:#3a332c,stroke:#b0a898,color:#e8e4dc
    style A3 fill:#3a332c,stroke:#b0a898,color:#e8e4dc

Spinnaker — Continuous Delivery

Open-source multi-cloud CD platform built by Netflix. Pipelines are composed of Stages (decomposed into Tasks) that can run in parallel or serially.
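The Stage/Task decomposition can be sketched directly: stages run serially, while the tasks inside each stage fan out in parallel. The stage and task names below are invented, and real Spinnaker stages are far richer (expressions, conditions, manual judgments):

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of a Spinnaker-style pipeline: stages execute in order; the
# tasks within a stage run in parallel on a thread pool.
def run_pipeline(stages):
    log = []
    for stage, tasks in stages:                 # stages are serial
        with ThreadPoolExecutor() as pool:      # tasks fan out in parallel
            list(pool.map(lambda t: t(log), tasks))
        log.append(f"stage done: {stage}")
    return log

def task(name):
    # each task just records that it ran
    return lambda log: log.append(f"task: {name}")

log = run_pipeline([
    ("bake",   [task("bake-ami")]),
    ("deploy", [task("deploy-us-east"), task("deploy-eu-west")]),
])
print(log[-1])  # stage done: deploy
```

The serial-stages/parallel-tasks split gives ordered rollout gates while still letting independent work (e.g. deploys to two regions) proceed concurrently.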

Spinnaker Deployment Abstractions
graph LR
    subgraph Spinnaker["Spinnaker CD"]
        APP["Application"]
        CLUST["Cluster"]
        SG["Server Group<br/>(load-balanced)"]
    end

    subgraph Triggers["Triggers"]
        JENK["Jenkins"]
        CRON["Cron"]
        PIPE["Pipeline<br/>Completion"]
    end

    subgraph Strategies["Deploy Strategies"]
        BG["Blue/Green"]
        HL["Highlander"]
        CAN["Canary"]
    end

    Triggers --> APP
    APP --> CLUST
    CLUST --> SG
    SG --> Strategies

    style APP fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style CLUST fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style SG fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style BG fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc
    style CAN fill:#3a332c,stroke:#d4a050,color:#e8e4dc
Multi-Cloud Support

Spinnaker supports deployment to AWS EC2, Kubernetes, GCE, GKE, Azure, Cloud Foundry, and Oracle Cloud. Netflix uses blue/green, highlander, and canary deploy strategies with integrated canary analysis.
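Integrated canary analysis boils down to comparing the canary's metrics against a baseline and gating promotion on the result. A toy sketch of that gate — the 20% tolerance is an invented threshold, not the default of Spinnaker's canary tooling:

```python
# Sketch of an automated canary gate: promote only if the canary's error
# rate stays within a tolerance of the baseline's error rate.
def canary_verdict(baseline_errors, baseline_total,
                   canary_errors, canary_total, tolerance=1.2):
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    # tolerance=1.2 allows the canary up to 20% worse than baseline
    return "promote" if canary_rate <= baseline_rate * tolerance else "rollback"

print(canary_verdict(10, 10_000, 11, 10_000))  # promote
print(canary_verdict(10, 10_000, 30, 10_000))  # rollback
```

Comparing against a live baseline (rather than an absolute threshold) is the key design choice: it factors out ambient error-rate drift that affects old and new code alike.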

09 · Architectural Patterns

Cross-cutting architectural patterns employed throughout Netflix's microservices ecosystem, drawn from their extensive open-source contributions and conference talks.

Key Patterns & Their Implementations
graph TD
    subgraph Resilience["Resilience Patterns"]
        CB["Circuit Breaker<br/>(Hystrix)"]
        BH["Bulkhead Isolation<br/>(thread pools)"]
        FB["Fallback<br/>Mechanisms"]
    end

    subgraph Routing["Routing Patterns"]
        GW["API Gateway<br/>(Zuul)"]
        SD["Service Discovery<br/>(Eureka)"]
        LB["Client-Side LB<br/>(Ribbon)"]
    end

    subgraph Data["Data Patterns"]
        ES2["Event Sourcing<br/>(Kafka as log)"]
        CQRS["CQRS<br/>(read/write split)"]
        CPDB["Control/Data<br/>Plane Separation"]
    end

    subgraph Migration["Migration Patterns"]
        SF["Strangler Fig<br/>(Cosmos)"]
    end

    CB --> FB
    GW --> SD
    SD --> LB

    style CB fill:#3a332c,stroke:#b87878,color:#e8e4dc
    style GW fill:#3a332c,stroke:#d4a050,color:#e8e4dc
    style SD fill:#3a332c,stroke:#c4a87a,color:#e8e4dc
    style ES2 fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style CPDB fill:#3a332c,stroke:#7090a8,color:#e8e4dc
    style SF fill:#3a332c,stroke:#6a8a6a,color:#e8e4dc

Technology Stack Summary

Layer | Technologies
Edge / Gateway | Zuul 2 (Netty), Ribbon, Eureka
Microservices | Java, Spring Boot, gRPC
Containers | Titus (Mesos + Docker on EC2)
Databases | Cassandra, CockroachDB, MySQL
Caching | EVCache (Memcached)
Messaging | Apache Kafka (100+ clusters)
Stream Processing | Apache Flink (20,000+ jobs)
Batch Processing | Apache Spark, Hadoop/Hive
CDN | Open Connect (custom OCAs)
CI/CD | Spinnaker, Jenkins
Chaos | Chaos Monkey, FIT
Media Processing | Cosmos (Optimus + Plato + Stratum)
ML / Recommendations | Manhattan, Hydra, collaborative filtering, deep learning
Monitoring | Atlas (metrics), Mantis (real-time ops)
10 · Acronym Reference

AV1: AOMedia Video 1 (open codec)
AWS: Amazon Web Services
AZ: Availability Zone
CD: Continuous Delivery
CDN: Content Delivery Network
CQRS: Command Query Responsibility Segregation
DAG: Directed Acyclic Graph
EC2: Elastic Compute Cloud
ETL: Extract, Transform, Load
FIT: Failure Injection Testing
HEVC: High Efficiency Video Coding (H.265)
IXP: Internet Exchange Point
JVM: Java Virtual Machine
L7: OSI Layer 7 (Application Layer)
ML: Machine Learning
OCA: Open Connect Appliance
OSS: Open Source Software
RPC: Remote Procedure Call
S3: Simple Storage Service
SPaaS: Stream Processing as a Service
SSD: Solid State Drive
SSE: Server-Sent Events
VES: Video Encoding Service
VMAF: Video Multimethod Assessment Fusion