Interactive architecture map of PostgreSQL's internal subsystems — process model, memory management, query pipeline, MVCC, WAL, replication, and extensibility.
PostgreSQL is a multi-process, object-relational database that uses MVCC for concurrency and Write-Ahead Logging for durability. The postmaster supervises all child processes, allocates shared memory at startup, and forks one backend per client connection.
graph TB
subgraph Clients["Client Applications"]
C1["psql / App 1"]
C2["App 2"]
C3["App N"]
end
PM["Postmaster
Supervisor Daemon"]
subgraph Backends["Backend Processes"]
B1["Backend 1"]
B2["Backend 2"]
B3["Backend N"]
end
subgraph BG["Background Processes"]
CK["Checkpointer"]
BW["BG Writer"]
WW["WAL Writer"]
AV["Autovacuum"]
SC["Stats Collector"]
end
subgraph SM["Shared Memory"]
SB["Shared Buffer Pool"]
WB["WAL Buffers"]
CL["CLOG"]
LK["Lock Tables"]
end
subgraph Storage["Disk Storage"]
DF["Data Files
(base/)"]
WF["WAL Files
(pg_wal/)"]
XA["CLOG
(pg_xact/)"]
end
C1 & C2 & C3 --> PM
PM --> B1 & B2 & B3
B1 & B2 & B3 --> SM
BG --> SM
SM --> Storage
style PM fill:#7B4F9E,stroke:#9B6FBE,color:#F5F0E8
style SB fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style WB fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style CL fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style LK fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style DF fill:#2E2E4F,stroke:#E6A817,color:#E8E4F0
style WF fill:#2E2E4F,stroke:#E6A817,color:#E8E4F0
style XA fill:#2E2E4F,stroke:#E6A817,color:#E8E4F0
style B1 fill:#252540,stroke:#5B8FB9,color:#E8E4F0
style B2 fill:#252540,stroke:#5B8FB9,color:#E8E4F0
style B3 fill:#252540,stroke:#5B8FB9,color:#E8E4F0
style CK fill:#252540,stroke:#9B6FBE,color:#E8E4F0
style BW fill:#252540,stroke:#9B6FBE,color:#E8E4F0
style WW fill:#252540,stroke:#9B6FBE,color:#E8E4F0
style AV fill:#252540,stroke:#9B6FBE,color:#E8E4F0
style SC fill:#252540,stroke:#9B6FBE,color:#E8E4F0
style C1 fill:#1C1C35,stroke:#B0A8C0,color:#E8E4F0
style C2 fill:#1C1C35,stroke:#B0A8C0,color:#E8E4F0
style C3 fill:#1C1C35,stroke:#B0A8C0,color:#E8E4F0
PostgreSQL uses a multi-process model (not threads). The postmaster listens on port 5432, forks one backend per client connection, and supervises background maintenance processes. When a child crashes, the postmaster terminates all children and initiates crash recovery.
graph TD
PM["Postmaster
Parent process · ServerLoop"]
subgraph Maintenance["Background Maintenance"]
CK["Checkpointer
Flushes dirty pages"]
BW["BG Writer
Proactive page flush"]
WW["WAL Writer
Flushes WAL buffers"]
AV["Autovacuum Launcher
Spawns workers"]
SC["Stats Collector
Activity statistics"]
AR["Archiver
WAL archiving"]
IO["IO Workers
Async page reads"]
end
subgraph Repl["Replication"]
WS["WAL Sender
Streams WAL to standby"]
WR["WAL Receiver
(on standby only)"]
SU["Startup Process
WAL replay"]
end
subgraph Workers["Backend Workers"]
B1["Client Backend"]
B2["Client Backend"]
PQ["Parallel Query
Worker"]
BW2["Background
Worker (ext)"]
end
PM --> Maintenance
PM --> Repl
PM --> Workers
style PM fill:#7B4F9E,stroke:#9B6FBE,color:#F5F0E8
style CK fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style BW fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style WW fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style AV fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style SC fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style AR fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style IO fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style WS fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style WR fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style SU fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style B1 fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
style B2 fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
style PQ fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
style BW2 fill:#1C1C35,stroke:#D4839E,color:#E8E4F0
The postmaster supports three shutdown modes, each triggered by a different signal. Smart shutdown waits for clients to disconnect; fast shutdown terminates backends immediately and checkpoints; immediate shutdown exits ungracefully and requires crash recovery on restart.
| Mode | Signal | Behavior |
|---|---|---|
| Smart | SIGTERM | Wait for all clients to disconnect, then checkpoint and exit |
| Fast | SIGINT | Terminate backends immediately, perform final checkpoint |
| Immediate | SIGQUIT | Ungraceful exit; crash recovery required on restart |
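Assuming a Unix host, the mode-to-signal mapping can be expressed directly with Python's signal module; the send_shutdown helper is hypothetical, standing in for what `pg_ctl stop -m <mode>` does:

```python
import os
import signal

# Shutdown mode -> signal, per the table above.
SHUTDOWN_SIGNALS = {
    "smart": signal.SIGTERM,       # wait for clients, checkpoint, exit
    "fast": signal.SIGINT,         # kill backends, checkpoint, exit
    "immediate": signal.SIGQUIT,   # exit now; crash recovery on restart
}

def send_shutdown(postmaster_pid, mode="fast"):
    """Hypothetical pg_ctl-like wrapper: deliver the mode's signal."""
    os.kill(postmaster_pid, SHUTDOWN_SIGNALS[mode])
```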
Processes communicate through shared memory (buffer pool, lock tables, PGPROC array, CLOG) for data and Unix signals for control flow. The max_connections parameter (default 100) limits backend processes, and max_worker_processes (default 8) limits custom background workers.
At startup, the postmaster allocates a large shared memory segment containing the buffer pool, WAL buffers, CLOG cache, lock tables, and process tracking arrays. Each backend inherits access to this shared region.
graph LR
subgraph SharedMem["Shared Memory Region"]
SB["Shared Buffer Pool
shared_buffers (128 MB)"]
WB["WAL Buffers
wal_buffers (~3%)"]
CL["CLOG Buffers
Transaction states"]
PP["PGPROC Array
Per-backend tracking"]
LT["Lock Tables
Heavy/Light/Spin"]
end
subgraph Local["Per-Backend Memory"]
WM["work_mem (4 MB)
Sorts, hashes"]
MM["maintenance_work_mem
(64 MB) VACUUM, INDEX"]
TB["temp_buffers (8 MB)
Temp table cache"]
end
B["Backend Process"] --> SharedMem
B --> Local
style SB fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style WB fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style CL fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style PP fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style LT fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style WM fill:#252540,stroke:#D4839E,color:#E8E4F0
style MM fill:#252540,stroke:#D4839E,color:#E8E4F0
style TB fill:#252540,stroke:#D4839E,color:#E8E4F0
style B fill:#7B4F9E,stroke:#9B6FBE,color:#F5F0E8
All data is stored in 8 KB pages. The buffer pool maintains three parallel structures: buffer blocks (actual page data), buffer descriptors (64-byte metadata per slot tracking dirty/valid/pinned state), and a partitioned hash table mapping (tablespace, relation, fork, block) to slot index for O(1) lookup.
When a free buffer slot is needed, a clock hand sweeps through descriptors. If a page's usage count is greater than zero, it is decremented and skipped. If usage count is zero and the page is unpinned, it is selected as the eviction victim. Each page access increments usage count up to a cap of 5, approximating LRU without linked-list overhead.
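The sweep described above can be sketched in a few lines of Python — an illustrative model, not PostgreSQL's actual C implementation:

```python
class BufferDesc:
    def __init__(self):
        self.usage_count = 0
        self.pin_count = 0

def access(buf):
    buf.usage_count = min(buf.usage_count + 1, 5)  # cap at 5 approximates LRU

def clock_sweep(buffers, hand=0):
    """Advance the clock hand until an unpinned buffer with
    usage_count == 0 is found; return its slot index."""
    n = len(buffers)
    while True:
        buf = buffers[hand % n]
        if buf.pin_count == 0:
            if buf.usage_count == 0:
                return hand % n          # eviction victim
            buf.usage_count -= 1         # decrement and give it another lap
        hand += 1

pool = [BufferDesc() for _ in range(4)]
access(pool[0]); access(pool[0])   # slot 0 recently used twice
pool[1].pin_count = 1              # slot 1 pinned by a backend
victim = clock_sweep(pool)
print(victim)  # 2: first unpinned buffer whose usage count was already zero
```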
For large sequential scans, VACUUM, and bulk writes, PostgreSQL uses a small ring buffer (256 KB - 16 MB) instead of the main pool. This prevents large operations from evicting useful cached pages ("buffer pollution").
Every SQL statement flows through five stages: Parser, Analyzer, Rewriter, Planner, and Executor. PostgreSQL's source code describes this as "the most complicated subsystem."
graph LR
SQL["SQL Text"] --> P["Parser
Flex + Bison"]
P --> A["Analyzer
Semantic validation"]
A --> R["Rewriter
View expansion"]
R --> PL["Planner
Cost-based optimizer"]
PL --> E["Executor
Volcano / pull model"]
E --> RES["Result Tuples"]
CAT["System Catalogs
pg_class, pg_proc"]
STATS["Statistics
pg_statistic"]
CAT -.-> A
CAT -.-> R
STATS -.-> PL
style SQL fill:#252540,stroke:#B0A8C0,color:#E8E4F0
style P fill:#1C1C35,stroke:#7B4F9E,color:#E8E4F0
style A fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
style R fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style PL fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style E fill:#1C1C35,stroke:#D4839E,color:#E8E4F0
style RES fill:#252540,stroke:#B0A8C0,color:#E8E4F0
style CAT fill:#2E2E4F,stroke:#5B8FB9,color:#E8E4F0
style STATS fill:#2E2E4F,stroke:#E6A817,color:#E8E4F0
**Parser:** Uses Flex for tokenization and Bison for grammar rules (~15,000 lines in gram.y). Produces a raw parse tree of Node structures. Only checks syntax; it does not verify that tables exist.
**Analyzer:** Performs semantic analysis: looks up tables, columns, and functions in the system catalogs. Resolves names to OIDs, infers data types, and inserts implicit casts. Produces a Query struct.
**Rewriter:** Applies rule-based transformations from pg_rewrite. Expands views into their underlying queries. Can produce multiple query trees from one input (INSTEAD rules).
**Planner:** Cost-based optimizer that evaluates scan types (SeqScan, IndexScan, BitmapScan) and join algorithms (Nested Loop, Hash, Merge), switching to GEQO at 12+ joins. Outputs a PlannedStmt.
**Executor:** Implements the pull-based Volcano model: each plan node has ExecInit, ExecProcNode (returns one tuple), and ExecEnd. The top node pulls tuples recursively from its children.
DDL statements (CREATE, ALTER, DROP) bypass the planner/executor entirely and go through ProcessUtility, a separate dispatch function for utility commands.
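The pull model is easy to sketch with Python generators; the node classes here are illustrative stand-ins for the C plan-node structs, not real PostgreSQL types:

```python
class SeqScan:
    """Leaf node: emits stored tuples one at a time."""
    def __init__(self, rows):
        self.rows = rows
    def __iter__(self):
        yield from self.rows

class Filter:
    """Pulls from its child and passes through matching tuples."""
    def __init__(self, child, pred):
        self.child, self.pred = child, pred
    def __iter__(self):
        for t in self.child:
            if self.pred(t):
                yield t

class Limit:
    """Stops pulling after n tuples, so the child does no further work."""
    def __init__(self, child, n):
        self.child, self.n = child, n
    def __iter__(self):
        for i, t in enumerate(self.child):
            if i >= self.n:
                return
            yield t

# SELECT * FROM t WHERE x % 2 = 0 LIMIT 3, as a pull-based plan tree.
plan = Limit(Filter(SeqScan([(i,) for i in range(100)]),
                    lambda t: t[0] % 2 == 0), 3)
print(list(plan))   # [(0,), (2,), (4,)]
```

Because the top node pulls on demand, the Limit stops the whole pipeline early — the scan never touches rows beyond what the filter needed to satisfy three matches.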
PostgreSQL implements Snapshot Isolation via MVCC. Multiple physical versions of each row coexist. Readers never block writers. Each tuple carries header fields tracking which transaction created and deleted it.
graph LR
subgraph INSERT["INSERT (XID 100)"]
T1["Tuple v1
t_xmin=100
t_xmax=0"]
end
subgraph UPDATE["UPDATE (XID 200)"]
T2["Tuple v1
t_xmin=100
t_xmax=200"]
T3["Tuple v2
t_xmin=200
t_xmax=0"]
T2 -->|"t_ctid"| T3
end
subgraph DELETE["DELETE (XID 300)"]
T4["Tuple v2
t_xmin=200
t_xmax=300"]
end
T1 -->|"UPDATE"| T2
T3 -->|"DELETE"| T4
style T1 fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style T2 fill:#2E2E4F,stroke:#A63D40,color:#E8E4F0
style T3 fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style T4 fill:#2E2E4F,stroke:#A63D40,color:#E8E4F0
| Field | Purpose |
|---|---|
| t_xmin | Transaction ID that inserted this tuple version |
| t_xmax | Transaction ID that deleted/updated this version (0 if still live) |
| t_cid | Command ID within the transaction (distinguishes multiple operations) |
| t_ctid | Pointer to the next version (self-referencing if latest) |
**Read Committed:** The default level. Takes a new snapshot per statement, so non-repeatable reads are possible between statements in the same transaction.
**Repeatable Read:** Takes one snapshot at transaction start. Prevents dirty reads and non-repeatable reads; raises a serialization error when a concurrent transaction has already updated a row it tries to modify.
**Serializable:** True serializability via SSI (added in PG 9.1). Tracks rw-conflict dependencies and aborts transactions with dangerous orderings to prevent write skew.
A tuple is visible if its t_xmin committed before the snapshot was taken AND its t_xmax is 0, still in progress, aborted, or committed after the snapshot. The CLOG (pg_xact/) tracks transaction states with 2 bits per XID. Hint bits on tuple headers cache commit/abort status to avoid repeated CLOG lookups.
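A simplified model of that visibility rule, assuming a snapshot reduced to a single XID horizon and a set of known-committed XIDs (real snapshots also carry an in-progress XID array, and the server consults CLOG and hint bits):

```python
FROZEN_XID = 2   # FrozenTransactionId: always counts as committed in the past

def tuple_visible(t_xmin, t_xmax, snapshot_xmax, committed):
    # Inserter must have committed before the snapshot horizon.
    inserted_ok = (t_xmin == FROZEN_XID or
                   (t_xmin in committed and t_xmin < snapshot_xmax))
    if not inserted_ok:
        return False
    # Deleter must be absent, aborted/in-progress, or after the horizon.
    if t_xmax == 0:
        return True
    if t_xmax not in committed:        # aborted or still running
        return True
    return t_xmax >= snapshot_xmax     # committed after the snapshot

committed = {100, 200}
# Snapshot taken at XID 150: sees version 1 (inserted by 100), not the
# update by 200.
print(tuple_visible(100, 200, 150, committed))  # True: deleter after snapshot
print(tuple_visible(200, 0, 150, committed))    # False: inserter after snapshot
```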
The WAL guarantees durability: no dirty page is flushed to disk until the corresponding WAL record is on stable storage. WAL is stored in pg_wal/ as 16 MB segment files, each record identified by a 64-bit Log Sequence Number (LSN).
graph TD
TX["Transaction
INSERT / UPDATE / DELETE"]
XI["XLogInsert()
Append to WAL buffers"]
XF["XLogFlush()
Flush to pg_wal/"]
CM["COMMIT
Durable on disk"]
subgraph Recovery["Crash Recovery"]
SX["StartupXLOG()
Read pg_control"]
RP["Find Redo Point
(last checkpoint LSN)"]
RL["Replay WAL Records
(idempotent)"]
DN["Consistent State
Normal ops begin"]
end
TX --> XI
XI --> XF
XF --> CM
SX --> RP
RP --> RL
RL --> DN
style TX fill:#7B4F9E,stroke:#9B6FBE,color:#F5F0E8
style XI fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style XF fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style CM fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
style SX fill:#1C1C35,stroke:#A63D40,color:#E8E4F0
style RP fill:#1C1C35,stroke:#A63D40,color:#E8E4F0
style RL fill:#1C1C35,stroke:#A63D40,color:#E8E4F0
style DN fill:#1E7268,stroke:#2A9D8F,color:#E8E4F0
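LSNs display as two 32-bit hex halves joined by a slash (e.g. 0/16B3748). A sketch of parsing and byte arithmetic on them, assuming the default 16 MB segment size:

```python
SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size

def parse_lsn(text):
    """Convert 'hi/lo' hex notation to a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_diff(a, b):
    """Byte distance between two LSNs, e.g. replication lag."""
    return parse_lsn(a) - parse_lsn(b)

def segment_no(text):
    """Which 16 MB segment a given LSN falls into."""
    return parse_lsn(text) // SEG_SIZE

print(parse_lsn("0/0"))            # 0
print(lsn_diff("1/0", "0/0"))      # 4294967296: one full high-word step
print(segment_no("0/2000000"))     # 2
```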
A checkpoint flushes all dirty buffers from shared memory to disk and writes a checkpoint record to WAL. The redo point is the LSN where crash recovery would begin. Controlled by checkpoint_timeout (default 5 min) and max_wal_size (default 1 GB). The checkpointer spreads I/O using checkpoint_completion_target (default 0.9).
After each checkpoint, the first modification to any page writes the entire 8 KB image to WAL (not just the delta). This protects against torn pages during OS crashes. Subsequent modifications before the next checkpoint only write deltas. Controlled by full_page_writes (default on -- should never be disabled).
The wal_level setting controls how much is logged: minimal covers only crash recovery; replica (the default since PG 10) adds what archiving and physical replication need; logical adds the row-level information logical decoding needs.
PostgreSQL supports six built-in index access methods, each with different data structures and query capabilities. The B-tree is the default and handles the vast majority of use cases.
graph TD
Q["What query pattern?"]
Q -->|"=, <, >, BETWEEN"| BT["B-tree
(Lehman-Yao)"]
Q -->|"= only"| HA["Hash Index"]
Q -->|"Geometric, KNN"| GI["GiST
(Generalized Search)"]
Q -->|"Contains, FTS"| GIN["GIN
(Inverted Index)"]
Q -->|"Quadtree, Trie"| SP["SP-GiST
(Space-Partitioned)"]
Q -->|"Time-series range"| BR["BRIN
(Block Range)"]
style Q fill:#7B4F9E,stroke:#9B6FBE,color:#F5F0E8
style BT fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style HA fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
style GI fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style GIN fill:#1C1C35,stroke:#D4839E,color:#E8E4F0
style SP fill:#1C1C35,stroke:#A63D40,color:#E8E4F0
style BR fill:#1C1C35,stroke:#2D6A4F,color:#E8E4F0
**B-tree:** Lehman-Yao concurrent B-tree with right-link pointers. O(log n) lookups. Supports range queries, sorting, index-only scans, and deduplication (PG 13+).
**Hash:** Bucket-based index for equality-only comparisons. WAL-logged since PG 10. Faster than a B-tree for pure equality on high-cardinality columns, but cannot sort or serve range queries.
**GiST:** Balanced tree framework supporting arbitrary data types via operator classes. "Lossy" indexes with false positives filtered by the executor. Used for geometry, FTS, range types, and KNN.
**SP-GiST:** Framework for non-balanced, space-partitioning structures: quadtrees, k-d trees, radix tries. Best for data with natural clustering but uneven distribution (IPs, phone numbers).
**GIN:** Inverted index mapping element values to row sets. Entry tree (a B-tree of keys) plus posting lists/trees. Ideal for tsvector, arrays, JSONB containment, and trigram similarity.
**BRIN:** Stores min/max summaries for ranges of physical pages, often 1000x smaller than a B-tree. Only effective when values correlate with physical row order, as in append-only time-series tables.
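BRIN's pruning logic can be sketched as follows; the range size and data are illustrative, and real summaries live in index pages rather than a Python list:

```python
def build_brin(values, pages_per_range=2, rows_per_page=2):
    """One (min, max) summary per block range of physically ordered rows."""
    rows_per_range = pages_per_range * rows_per_page
    summaries = []
    for i in range(0, len(values), rows_per_range):
        chunk = values[i:i + rows_per_range]
        summaries.append((min(chunk), max(chunk)))
    return summaries

def ranges_to_scan(summaries, lo, hi):
    """Block-range indexes whose [min, max] overlaps the query [lo, hi];
    every other range is skipped without touching its heap pages."""
    return [i for i, (mn, mx) in enumerate(summaries)
            if mx >= lo and mn <= hi]

# Physically ordered (time-series-like) values: pruning is effective.
values = list(range(16))                  # 4 ranges of 4 rows each
idx = build_brin(values)
print(idx)                                # [(0, 3), (4, 7), (8, 11), (12, 15)]
print(ranges_to_scan(idx, 5, 6))          # [1]: only one range is scanned
```

If the same values were physically shuffled, every summary would span nearly the whole value domain and no range could be skipped — which is why BRIN depends on correlation with row order.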
MVCC never removes old tuple versions in-place. Dead tuples accumulate from DELETEs and UPDATEs and must be reclaimed by VACUUM. Without it, tables bloat and eventually suffer catastrophic XID wraparound.
graph LR
subgraph Phase1["Phase 1: Heap Scan"]
HS["Scan heap pages"]
VM["Check Visibility Map
Skip all-visible pages"]
DT["Collect dead TIDs
into array"]
end
subgraph Phase2["Phase 2: Index Vacuum"]
IS["Scan each index"]
RE["Remove entries
pointing to dead TIDs"]
end
subgraph Phase3["Phase 3: Heap Vacuum"]
FS["Mark dead slots
in Free Space Map"]
UV["Update Visibility Map
Set all-visible bits"]
end
Phase1 --> Phase2
Phase2 --> Phase3
style HS fill:#1C1C35,stroke:#7B4F9E,color:#E8E4F0
style VM fill:#1C1C35,stroke:#7B4F9E,color:#E8E4F0
style DT fill:#1C1C35,stroke:#7B4F9E,color:#E8E4F0
style IS fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style RE fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style FS fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style UV fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
The autovacuum launcher wakes every autovacuum_naptime (default 1 min) and checks pg_stat_all_tables for tables exceeding the dead tuple threshold: dead_tuples > threshold + scale_factor * live_tuples (defaults: 50 + 20%). Up to autovacuum_max_workers (default 3) workers run concurrently, throttled by cost-based delay to limit production impact.
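The trigger condition with the stock defaults (autovacuum_vacuum_threshold = 50, autovacuum_vacuum_scale_factor = 0.2):

```python
def needs_autovacuum(dead_tuples, live_tuples,
                     threshold=50, scale_factor=0.2):
    """True when the table crosses the dead-tuple trigger formula."""
    return dead_tuples > threshold + scale_factor * live_tuples

# For a table with 10,000 live tuples the limit is 50 + 2000 = 2050.
print(needs_autovacuum(2049, 10000))   # False: just under the limit
print(needs_autovacuum(2051, 10000))   # True: vacuum is scheduled
```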
XIDs are 32-bit (~4 billion). Without freezing, after ~2 billion transactions old XIDs would appear "in the future" and tuples would vanish. VACUUM freezes tuples by replacing t_xmin with FrozenTransactionId (XID 2). Key thresholds:
| Parameter | Default | Purpose |
|---|---|---|
| vacuum_freeze_min_age | 50M txns | Minimum tuple age before freezing |
| vacuum_freeze_table_age | 150M txns | Triggers aggressive full-table vacuum scan |
| autovacuum_freeze_max_age | 200M txns | Forces anti-wraparound vacuum regardless of dead tuples |
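The "in the future" effect comes from modular comparison: PostgreSQL orders two XIDs by the sign of their 32-bit difference, so roughly 2 billion XIDs lie in the past and 2 billion in the future of any given XID. A sketch of that comparison:

```python
def xid_precedes(a, b):
    """True if XID a is 'older than' b on the 32-bit circle
    (sign of the 32-bit signed difference, as in TransactionIdPrecedes)."""
    diff = (a - b) & 0xFFFFFFFF
    if diff >= 0x80000000:        # reinterpret as signed 32-bit
        diff -= 0x100000000
    return diff < 0

print(xid_precedes(100, 200))            # True: 100 is in the past of 200
print(xid_precedes(100, 100 + 2**31))    # True: just inside the horizon
# One step past the 2-billion horizon and the comparison flips: an
# unfrozen tuple would suddenly look like it was inserted "in the future".
print(xid_precedes(100, (100 + 2**31 + 1) % 2**32))  # False
```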
**VACUUM FULL:** Rewrites the entire table into a new file, compacting out all dead space. Requires an ACCESS EXCLUSIVE lock (blocks all reads and writes). Unlike regular VACUUM, which only marks space for internal reuse, it returns disk space to the OS. Use sparingly.
PostgreSQL supports two replication paradigms: physical (byte-level WAL streaming for exact cluster copies) and logical (row-level change streaming for selective table replication across versions).
graph LR
subgraph Primary["Primary Server"]
BE["Backend
Writes data"]
WAL1["pg_wal/
WAL segments"]
WS["WAL Sender"]
end
subgraph Standby["Standby Server"]
WR["WAL Receiver"]
WAL2["pg_wal/
WAL segments"]
SU["Startup Process
Replay WAL"]
RO["Read-Only
Queries (Hot Standby)"]
end
BE --> WAL1
WAL1 --> WS
WS -->|"XLogData messages"| WR
WR --> WAL2
WAL2 --> SU
SU --> RO
style BE fill:#7B4F9E,stroke:#9B6FBE,color:#F5F0E8
style WS fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style WR fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style WAL1 fill:#2E2E4F,stroke:#E6A817,color:#E8E4F0
style WAL2 fill:#2E2E4F,stroke:#E6A817,color:#E8E4F0
style SU fill:#1C1C35,stroke:#A63D40,color:#E8E4F0
style RO fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
graph LR
subgraph Pub["Publisher"]
WS2["WAL Sender"]
LD["Logical Decoder
pgoutput plugin"]
RB["Reorder Buffer
Reassemble TXs"]
end
subgraph Sub["Subscriber"]
AW["Apply Worker
Map + apply changes"]
TS["Table Sync Worker
Initial COPY"]
end
WS2 --> LD
LD --> RB
RB -->|"Streaming protocol"| AW
TS -->|"Initial sync"| AW
style WS2 fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style LD fill:#1C1C35,stroke:#7B4F9E,color:#E8E4F0
style RB fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style AW fill:#1C1C35,stroke:#5B8FB9,color:#E8E4F0
style TS fill:#1C1C35,stroke:#D4839E,color:#E8E4F0
| Aspect | Physical | Logical |
|---|---|---|
| Scope | Entire cluster (byte-for-byte copy) | Selected tables / columns |
| Cross-version | Same major version only | Cross-version supported |
| Subscriber writes | Read-only standby | Writable subscriber |
| WAL level | replica | logical |
| Setup | Simpler (pg_basebackup) | Publication + subscription |
Slots prevent WAL recycling for disconnected consumers. Physical slots track LSN position; logical slots also track the oldest XID needed for catalog lookups. Failover slots (PG 17+) can be synced to standbys for seamless promotion.
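A sketch of the retention rule, reusing the textual LSN form described earlier; oldest_segment_to_keep is a hypothetical helper, not a server function:

```python
SEG_SIZE = 16 * 1024 * 1024   # default WAL segment size

def parse_lsn(text):
    """Convert 'hi/lo' hex notation to a 64-bit byte position."""
    hi, lo = text.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def oldest_segment_to_keep(slot_restart_lsns):
    """Segments at or after this number must be retained: the server may
    only recycle WAL older than the minimum restart LSN across slots."""
    return min(parse_lsn(l) for l in slot_restart_lsns) // SEG_SIZE

slots = ["0/5000000", "0/3000000"]      # two consumers, one lagging
print(oldest_segment_to_keep(slots))    # 3: the lagging slot pins segment 3
```

This is also why an abandoned slot is dangerous: its restart LSN never advances, so pg_wal/ grows without bound until the slot is dropped.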
PostgreSQL's extension system allows adding new data types, functions, operators, index methods, and languages without modifying core code. The hook architecture provides ~30 interception points for monitoring, security, and query transformation.
graph TD
subgraph Hooks["Hook Categories"]
QH["Query Hooks
post_parse_analyze
planner_hook
join_search_hook"]
EH["Executor Hooks
ExecutorStart/Run/
Finish/End"]
SH["Security Hooks
ClientAuthentication
object_access_hook"]
IH["Init Hooks
shmem_startup
emit_log_hook"]
end
subgraph Stack["Hook Stacking"]
E1["Extension A
_PG_init()"]
E2["Extension B
_PG_init()"]
NULL["NULL (default)"]
end
E2 -->|"saved prev"| E1
E1 -->|"saved prev"| NULL
style QH fill:#1C1C35,stroke:#7B4F9E,color:#E8E4F0
style EH fill:#1C1C35,stroke:#E6A817,color:#E8E4F0
style SH fill:#1C1C35,stroke:#A63D40,color:#E8E4F0
style IH fill:#1C1C35,stroke:#2A9D8F,color:#E8E4F0
style E1 fill:#252540,stroke:#5B8FB9,color:#E8E4F0
style E2 fill:#252540,stroke:#D4839E,color:#E8E4F0
style NULL fill:#252540,stroke:#B0A8C0,color:#E8E4F0
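The save-previous-and-delegate pattern in the diagram, sketched in Python (the hook and function names are illustrative; real hooks are C function pointers assigned in each extension's _PG_init):

```python
calls = []
executor_run_hook = None          # NULL: the standard executor runs

def standard_executor_run(query):
    calls.append("standard")

def extension_a_init():
    """Loaded first: saves NULL, becomes the innermost wrapper."""
    global executor_run_hook
    prev = executor_run_hook

    def hook_a(query):
        calls.append("A")
        (prev or standard_executor_run)(query)   # delegate down the chain
    executor_run_hook = hook_a

def extension_b_init():
    """Loaded second: saves hook_a, wraps it."""
    global executor_run_hook
    prev = executor_run_hook

    def hook_b(query):
        calls.append("B")
        (prev or standard_executor_run)(query)
    executor_run_hook = hook_b

extension_a_init()
extension_b_init()
(executor_run_hook or standard_executor_run)("SELECT 1")
print(calls)                      # ['B', 'A', 'standard']
```

Load order therefore determines nesting: the most recently loaded extension sees the call first, and each wrapper must remember to invoke its saved predecessor or it silently disables everything beneath it.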
**pg_stat_statements:** Uses all four executor hooks to track query elapsed time, call counts, and I/O statistics across all backends.
**TimescaleDB:** Uses planner_hook for hypertable chunk rewriting, post_parse_analyze_hook for query transformation, and ProcessUtility_hook for DDL interception.
**pgAudit:** Uses executor and utility hooks for comprehensive audit logging of all database operations, including DDL and DML.
**auto_explain:** Uses executor hooks to automatically log execution plans for statements exceeding a configurable duration threshold.
Beyond hooks, PostgreSQL offers custom access methods (CREATE ACCESS METHOD), custom data types with operator classes, Foreign Data Wrappers for external data sources, custom background workers, custom WAL resource managers (PG 15+), and pluggable storage engines via the Table Access Method API (PG 12+).