Tech Guides
01

Quick Reference

The most common DuckDB operations at a glance. DuckDB is an in-process SQL OLAP database management system designed for analytical workloads. Think of it as "SQLite for analytics."

Essential Commands

Start CLI

duckdb

Open an in-memory database

Open / Create File

duckdb my_data.duckdb

Persistent database on disk

Read Parquet

SELECT * FROM 'data.parquet';

Query Parquet files directly

Read CSV

SELECT * FROM 'data.csv';

Auto-detect CSV schema

Read JSON

SELECT * FROM 'data.json';

Newline-delimited or array JSON

Export to Parquet

COPY tbl TO 'out.parquet';

Write query results to Parquet

Describe Table

DESCRIBE my_table;

Show column names and types

Show Tables

SHOW TABLES;

List all tables in current database

CLI Dot Commands

Command          Description
.help            Show all dot commands
.tables          List all tables
.schema          Show schema for all tables
.mode            Set output mode (box, csv, json, markdown, table, line)
.timer on        Show query execution time
.maxrows N       Set maximum rows displayed
.open file.db    Open a different database file
.quit            Exit the CLI
.changes on      Show number of changed rows after DML
.columns         Column-oriented output mode
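
For example, a typical interactive session might combine a few of these before running a query (the Parquet file name here is just a placeholder):

.mode markdown
.timer on
.maxrows 50
SELECT count(*) FROM 'events.parquet';
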
02

Installation & Setup

DuckDB is designed for minimal friction. No server process, no external dependencies, no configuration files. A single binary or library is all you need.

Platform Installation

macOS (Homebrew)

brew install duckdb

Latest stable release

Linux

curl https://install.duckdb.org | sh

Official install script; or download the binary from GitHub releases

Windows (winget)

winget install DuckDB.cli

Also available via Chocolatey

Python

pip install duckdb

Python 3.7+ supported

Node.js

npm install duckdb

Native addon via node-gyp

Rust

cargo add duckdb

Via duckdb-rs crate

Database Modes

In-Memory vs Persistent: By default DuckDB runs entirely in memory. Pass a filename to persist data to disk. You can also attach multiple databases simultaneously.
-- In-memory (data lost on exit)
duckdb

-- Persistent database file
duckdb analytics.duckdb

-- Read-only mode
duckdb -readonly analytics.duckdb

-- Attach additional databases
ATTACH 'other.duckdb' AS other_db;
SELECT * FROM other_db.main.my_table;
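
Attached databases can also be made the default or removed again; a minimal sketch, reusing the other_db alias from above:

-- Make the attached database the default for unqualified table names
USE other_db;

-- Detach it when finished
DETACH other_db;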

Configuration Settings

-- Show all settings
SELECT * FROM duckdb_settings();

-- Set memory limit
SET memory_limit = '4GB';

-- Set number of threads
SET threads = 4;

-- Enable progress bar for long queries
SET enable_progress_bar = true;

-- Set default sort order
SET default_order = 'DESC';

-- Preserve insertion order (default: true)
SET preserve_insertion_order = true;

Extensions

DuckDB supports a rich extension ecosystem. Many core extensions are auto-loaded on first use; the rest, and all community extensions, must be installed and loaded explicitly.

-- Install an extension
INSTALL httpfs;

-- Load an extension
LOAD httpfs;

-- Extensions that are not auto-loadable need an explicit INSTALL + LOAD
INSTALL spatial;
LOAD spatial;

-- List installed extensions
SELECT * FROM duckdb_extensions();

-- Update all extensions
UPDATE EXTENSIONS;

-- Install from the community repository (h3 shown as an example community extension)
INSTALL h3 FROM community;
Extension          Purpose                          Auto-load
httpfs             Read files over HTTP/S3          Yes
json               JSON reading/writing             Yes
parquet            Parquet support                  Yes
icu                Unicode collations & time zones  Yes
fts                Full-text search                 No
spatial            Geospatial operations            No
excel              Read/write Excel files           No
sqlite_scanner     Read from SQLite databases       No
postgres_scanner   Read from PostgreSQL             No
delta              Read Delta Lake tables           No
03

Data Import & Export

DuckDB can directly query files without importing them first. Parquet, CSV, JSON, and Excel files can all be read with a simple SELECT statement, or loaded into tables with COPY or CREATE TABLE AS.
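
As a quick sketch of the loading path (table and file names are placeholders), COPY ... FROM fills an existing table, while CREATE TABLE ... AS creates and loads in one statement:

-- Load a CSV into an existing table
CREATE TABLE events (id INTEGER, name VARCHAR);
COPY events FROM 'events.csv' (HEADER);

-- Create and load in one step
CREATE TABLE orders AS SELECT * FROM 'orders.parquet';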

Parquet Files

Parquet is DuckDB's sweet spot. Columnar format, compressed, schema-embedded. DuckDB pushes predicates down into Parquet files for efficient partial reads.
-- Query a single Parquet file
SELECT * FROM 'sales_2024.parquet';

-- Query multiple files with glob
SELECT * FROM 'sales_*.parquet';

-- Query a directory of Parquet files
SELECT * FROM read_parquet('data/**/*.parquet');

-- Hive-partitioned Parquet
SELECT * FROM read_parquet(
    's3://bucket/data/**/*.parquet',
    hive_partitioning = true
);

-- Get Parquet metadata without reading data
SELECT * FROM parquet_metadata('data.parquet');
SELECT * FROM parquet_schema('data.parquet');

-- Export to Parquet
COPY (
    SELECT * FROM sales WHERE year = 2024
) TO 'sales_2024.parquet' (FORMAT PARQUET);

-- Export with compression
COPY tbl TO 'out.parquet' (
    FORMAT PARQUET,
    COMPRESSION ZSTD,
    ROW_GROUP_SIZE 100000
);
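
To check that the pushdown is happening, EXPLAIN a filtered query; the plan typically lists the WHERE predicates directly on the Parquet scan operator (file and column names are placeholders):

-- Inspect the plan; the filters should appear on the PARQUET_SCAN node
EXPLAIN
SELECT * FROM 'sales_2024.parquet'
WHERE amount > 1000;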

CSV Files

-- Auto-detect everything (delimiter, types, headers)
SELECT * FROM 'data.csv';

-- Explicit options
SELECT * FROM read_csv(
    'data.csv',
    header = true,
    delim = '|',
    dateformat = '%m/%d/%Y',
    columns = {
        'id': 'INTEGER',
        'name': 'VARCHAR',
        'amount': 'DECIMAL(10,2)'
    }
);

-- Read multiple CSVs with glob
SELECT * FROM read_csv('logs_*.csv', union_by_name = true);

-- Include filename column
SELECT *, filename
FROM read_csv('data/*.csv', filename = true);

-- Export to CSV
COPY tbl TO 'output.csv' (HEADER, DELIMITER ',');

-- Export with gzip compression
COPY tbl TO 'output.csv.gz' (HEADER);

JSON Files

-- Newline-delimited JSON
SELECT * FROM 'data.ndjson';

-- JSON array format
SELECT * FROM read_json('data.json', format = 'array');

-- Explicit JSON schema
SELECT * FROM read_json(
    'events.json',
    columns = {
        'ts': 'TIMESTAMP',
        'event': 'VARCHAR',
        'payload': 'JSON'
    }
);

-- Extract nested JSON fields
SELECT
    data->'$.user.name' AS name,
    data->'$.user.age' AS age
FROM read_json('users.json');

-- Export to JSON
COPY tbl TO 'output.json';
COPY tbl TO 'output.ndjson' (FORMAT JSON, ARRAY false);

Other Formats

Excel Files

INSTALL spatial;
LOAD spatial;
SELECT * FROM st_read('data.xlsx');

Requires spatial or excel extension

SQLite Databases

INSTALL sqlite_scanner;
LOAD sqlite_scanner;
SELECT * FROM sqlite_scan('app.db', 'users');

Query SQLite files directly

PostgreSQL

INSTALL postgres_scanner;
LOAD postgres_scanner;
SELECT * FROM postgres_scan(
    'host=localhost dbname=prod', 'public', 'orders'
);

Query Postgres tables without ETL

CREATE TABLE AS

CREATE TABLE sales AS SELECT * FROM 'sales.parquet' WHERE amount > 100;

Import file data into a DuckDB table

COPY Statement Reference

Option              Values                     Description
FORMAT              CSV, PARQUET, JSON         Output format
HEADER              true/false                 Include column headers (CSV)
DELIMITER           Character                  Field separator (CSV)
COMPRESSION         GZIP, ZSTD, SNAPPY, LZ4    Compression algorithm
PARTITION_BY        Column list                Hive-style partitioning
PER_THREAD_OUTPUT   true/false                 One file per thread
ROW_GROUP_SIZE      Integer                    Parquet row group size
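
For example, PARTITION_BY writes one Hive-style directory per key combination; a short sketch, assuming a sales table with year and month columns:

-- Produces out/year=.../month=.../*.parquet
COPY sales TO 'out' (
    FORMAT PARQUET,
    PARTITION_BY (year, month)
);
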
04

SQL Extensions & Syntax

DuckDB extends standard SQL with numerous ergonomic features. These extensions reduce boilerplate, improve readability, and enable analytical patterns that are cumbersome in other databases.

SELECT Enhancements

-- EXCLUDE columns from SELECT *
SELECT * EXCLUDE (internal_id, created_at)
FROM users;

-- REPLACE columns in SELECT *
SELECT * REPLACE (upper(name) AS name)
FROM users;

-- COLUMNS expression — apply function to matching columns
SELECT min(COLUMNS('amount_.*'))
FROM invoices;

-- COLUMNS with lambda
SELECT COLUMNS(c -> c LIKE '%_id')
FROM orders;

-- GROUP BY ALL — auto-group by all non-aggregate columns
SELECT region, product, sum(amount)
FROM sales
GROUP BY ALL;

-- ORDER BY ALL — sort by every column
SELECT * FROM products
ORDER BY ALL;

FROM-first Syntax

FROM-first queries are a DuckDB innovation. Write FROM before SELECT for more natural data exploration. You can even omit SELECT entirely.
-- FROM-first: more natural for exploration
FROM sales
SELECT region, sum(amount)
GROUP BY region;

-- Implicit SELECT * (just FROM)
FROM sales;
-- equivalent to: SELECT * FROM sales;

-- FROM-first with WHERE
FROM sales
WHERE year = 2024
SELECT region, count(*);

-- FROM-first with LIMIT
FROM 'huge_file.parquet'
LIMIT 10;

List & Struct Types

-- Create a list (array)
SELECT [1, 2, 3] AS nums;

-- Create a struct (named tuple)
SELECT {'x': 1, 'y': 2} AS point;

-- Create a MAP
SELECT map(['a', 'b'], [1, 2]) AS m;

-- Access struct fields
SELECT point.x, point.y
FROM (SELECT {'x': 10, 'y': 20} AS point);

-- Unnest lists into rows
SELECT unnest(['a', 'b', 'c']) AS letter;

-- List comprehension
SELECT [x * 2 FOR x IN [1, 2, 3]] AS doubled;

-- Filter within list
SELECT [x FOR x IN [1, 2, 3, 4, 5] IF x > 3];

String Functions & Patterns

-- String slicing (Python-style)
SELECT 'DuckDB'[1:4];  -- 'Duck'

-- String similarity
SELECT jaro_winkler_similarity('duck', 'dock');

-- Regular expressions
SELECT regexp_extract(
    'Price: $42.50',
    '(\d+\.\d+)', 1
);  -- '42.50'

-- String aggregation
SELECT string_agg(name, ', ' ORDER BY name)
FROM users
GROUP BY department;

-- Format strings
SELECT format('{} has {} items', name, count)
FROM inventory;

Advanced SQL Clauses

-- QUALIFY — filter window function results (like HAVING for aggregates)
SELECT name, department, salary
FROM employees
QUALIFY row_number() OVER (
    PARTITION BY department
    ORDER BY salary DESC
) = 1;

-- SAMPLE — random sampling
SELECT * FROM big_table
USING SAMPLE 10 PERCENT;

SELECT * FROM big_table
USING SAMPLE 1000 ROWS;

-- PIVOT and UNPIVOT
PIVOT sales
ON quarter
USING sum(amount)
GROUP BY product;

UNPIVOT monthly_data
ON jan, feb, mar, apr
INTO NAME month VALUE revenue;

-- ASOF JOIN — join on nearest timestamp
SELECT t.*, q.price
FROM trades t
ASOF JOIN quotes q
    ON t.ticker = q.ticker
    AND t.ts >= q.ts;

Macros & User-Defined Functions

-- Scalar macro
CREATE MACRO add(a, b) AS a + b;
SELECT add(3, 4);  -- 7

-- Table macro
CREATE MACRO top_n(tbl, n) AS TABLE
    SELECT * FROM tbl LIMIT n;
SELECT * FROM top_n(sales, 5);

-- Macro with default parameters
CREATE MACRO pct(val, total := 100) AS
    round(val / total * 100, 2);

-- Recursive CTE with macro
CREATE MACRO fibonacci(n) AS TABLE
    WITH RECURSIVE fib(a, b, i) AS (
        VALUES (0, 1, 1)
        UNION ALL
        SELECT b, a + b, i + 1
        FROM fib WHERE i < n
    )
    SELECT a AS value FROM fib;

Enum & Custom Types

-- Create an ENUM type
CREATE TYPE mood AS ENUM ('happy', 'sad', 'neutral');

CREATE TABLE journal (
    entry_date DATE,
    feeling mood,
    notes VARCHAR
);

-- Enums use minimal storage (dictionary encoding)
INSERT INTO journal VALUES
    ('2024-01-15', 'happy', 'Great day!'),
    ('2024-01-16', 'neutral', 'Uneventful');
05

Window Functions & Analytics

Window functions are essential for analytical queries. DuckDB supports the full SQL window function specification with optimized execution for large-scale aggregations.

Ranking Functions

-- ROW_NUMBER: unique sequential rank
SELECT
    name, department, salary,
    row_number() OVER (
        PARTITION BY department
        ORDER BY salary DESC
    ) AS rank
FROM employees;

-- RANK: ties get same rank, gaps in sequence
SELECT
    product, revenue,
    rank() OVER (ORDER BY revenue DESC) AS rnk
FROM products;

-- DENSE_RANK: ties get same rank, no gaps
SELECT
    student, score,
    dense_rank() OVER (ORDER BY score DESC) AS drnk
FROM exam_results;

-- NTILE: divide into N equal groups
SELECT
    customer, total_spend,
    ntile(4) OVER (ORDER BY total_spend DESC) AS quartile
FROM customers;

Aggregate Window Functions

-- Running total
SELECT
    date, amount,
    sum(amount) OVER (
        ORDER BY date
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) AS running_total
FROM transactions;

-- Moving average (7-day)
SELECT
    date, value,
    avg(value) OVER (
        ORDER BY date
        ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
    ) AS moving_avg_7d
FROM metrics;

-- Percentage of total
SELECT
    category, revenue,
    round(
        revenue * 100.0 / sum(revenue) OVER (),
        2
    ) AS pct_of_total
FROM sales;

-- Min/Max within partition
SELECT
    employee, department, salary,
    salary - min(salary) OVER (PARTITION BY department) AS above_dept_min,
    max(salary) OVER (PARTITION BY department) - salary AS below_dept_max
FROM employees;

Value Functions

-- LAG / LEAD — access previous/next rows
SELECT
    date, price,
    lag(price, 1) OVER (ORDER BY date) AS prev_price,
    price - lag(price, 1) OVER (ORDER BY date) AS daily_change,
    lead(price, 1) OVER (ORDER BY date) AS next_price
FROM stock_prices;

-- FIRST_VALUE / LAST_VALUE
SELECT
    ts, event,
    first_value(event) OVER (
        PARTITION BY user_id
        ORDER BY ts
    ) AS first_event,
    last_value(event) OVER (
        PARTITION BY user_id
        ORDER BY ts
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS last_event
FROM events;

-- NTH_VALUE
SELECT
    nth_value(product, 3) OVER (
        ORDER BY revenue DESC
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
    ) AS third_best
FROM products;

QUALIFY Clause

QUALIFY is to window functions what HAVING is to GROUP BY. It filters rows after window computation, eliminating the need for subqueries.
-- Top earner per department (no subquery needed)
SELECT name, department, salary
FROM employees
QUALIFY row_number() OVER (
    PARTITION BY department ORDER BY salary DESC
) = 1;

-- De-duplicate: keep latest record per user
SELECT *
FROM user_events
QUALIFY row_number() OVER (
    PARTITION BY user_id ORDER BY ts DESC
) = 1;

-- Products with above-average revenue in their category
SELECT product, category, revenue
FROM products
QUALIFY revenue > avg(revenue) OVER (PARTITION BY category);

Named Windows

-- Define a window once, reuse multiple times
SELECT
    date, ticker, price,
    avg(price) OVER w AS avg_price,
    min(price) OVER w AS min_price,
    max(price) OVER w AS max_price,
    lag(price) OVER w AS prev_price
FROM stock_prices
WINDOW w AS (
    PARTITION BY ticker
    ORDER BY date
    ROWS BETWEEN 29 PRECEDING AND CURRENT ROW
);
06

Python Bindings

DuckDB's Python API is designed for seamless integration with the data science ecosystem. It can query Pandas DataFrames, Polars DataFrames, and Arrow tables directly without any data copying.

Basic Usage

import duckdb

# In-memory connection (default)
con = duckdb.connect()

# Persistent database
con = duckdb.connect('analytics.duckdb')

# Execute SQL
result = con.sql("SELECT 42 AS answer")
print(result.fetchall())  # [(42,)]

# Use the module-level API (implicit connection)
duckdb.sql("SELECT * FROM 'data.parquet' LIMIT 5").show()

# Execute and fetch as different formats
result = duckdb.sql("SELECT * FROM 'sales.csv'")
result.fetchdf()      # Pandas DataFrame
result.fetchnumpy()   # Dict of NumPy arrays
result.arrow()        # PyArrow Table
result.pl()           # Polars DataFrame

Query DataFrames Directly

Zero-copy integration. DuckDB can query Pandas and Polars DataFrames by name without importing them into a table. This uses Apache Arrow under the hood for zero-copy data access.
import pandas as pd
import duckdb

# Create a Pandas DataFrame
df = pd.DataFrame({
    'name': ['Alice', 'Bob', 'Charlie'],
    'score': [95, 82, 91]
})

# Query the DataFrame by variable name
duckdb.sql("SELECT * FROM df WHERE score > 85").show()

# Join a DataFrame with a Parquet file
duckdb.sql("""
    SELECT d.name, d.score, p.grade
    FROM df d
    JOIN 'grades.parquet' p ON d.name = p.name
""").show()

# Polars integration
import polars as pl
lf = pl.LazyFrame({'x': [1, 2, 3]})
duckdb.sql("SELECT sum(x) FROM lf").show()

Relational API

import duckdb

# Chain operations without SQL strings
con = duckdb.connect()

result = (
    con.read_parquet('sales.parquet')
       .filter('amount > 100')
       .aggregate('region, sum(amount) AS total')
       .order('total DESC')
       .limit(10)
       .fetchdf()
)

# Mix relational API with SQL
rel = con.read_csv('data.csv')
con.sql("SELECT * FROM rel WHERE age > 30").show()

Bulk Operations & Performance

# Insert from DataFrame (fast)
con.sql("CREATE TABLE users AS SELECT * FROM df")

# Append to existing table
con.sql("INSERT INTO users SELECT * FROM new_df")

# Parameterized queries
con.execute(
    "SELECT * FROM users WHERE age > ? AND city = ?",
    [25, 'NYC']
).fetchdf()

# Transaction control
con.begin()
con.sql("INSERT INTO orders VALUES (1, 'widget', 42.0)")
con.sql("UPDATE inventory SET qty = qty - 1 WHERE item = 'widget'")
con.commit()

# Register a Python function as UDF
from duckdb.typing import VARCHAR
con.create_function(
    'reverse_str',
    lambda s: s[::-1],
    [VARCHAR], VARCHAR
)
con.sql("SELECT reverse_str('DuckDB')").show()
07

Node.js Bindings

DuckDB's Node.js client provides both callback-style and Promise-based APIs. The native addon runs DuckDB in-process for maximum performance.

Setup & Connection

// Install
// npm install duckdb
// or: npm install @duckdb/node-api (newer API)

const duckdb = require('duckdb');

// In-memory database
const memDb = new duckdb.Database(':memory:');

// Persistent database
const db = new duckdb.Database('analytics.duckdb');

// Read-only
const roDb = new duckdb.Database(
    'analytics.duckdb',
    { access_mode: 'READ_ONLY' }
);

Queries & Results

// Callback-based API
db.all("SELECT * FROM 'data.parquet' LIMIT 10", (err, rows) => {
    if (err) throw err;
    console.log(rows);
});

// Parameterized query
db.all(
    "SELECT * FROM users WHERE age > ? AND city = ?",
    25, 'NYC',
    (err, rows) => { console.log(rows); }
);

// Streaming results (large datasets)
const con = db.connect();
con.each("SELECT * FROM big_table", (err, row) => {
    // Process one row at a time
    console.log(row);
});

// Prepared statements
const stmt = db.prepare("INSERT INTO events VALUES (?, ?, ?)");
stmt.run(1, 'click', new Date());
stmt.run(2, 'scroll', new Date());
stmt.finalize();

Modern Promise-based API

// Using @duckdb/node-api (recommended for new projects)
import { DuckDBInstance } from '@duckdb/node-api';

const instance = await DuckDBInstance.create('analytics.duckdb');
const con = await instance.connect();

// Run a query
const result = await con.run(
    "SELECT * FROM 'sales.parquet' WHERE amount > 1000"
);

// Get rows as objects
const rows = result.getRows();
console.log(rows);

// Run a query, read the full result, then access it chunk by chunk
const reader = await con.runAndReadAll(
    "SELECT * FROM big_table"
);
const chunks = reader.getChunks();

Integration with Express / Fastify

const express = require('express');
const duckdb = require('duckdb');

const app = express();
const db = new duckdb.Database('analytics.duckdb');

app.get('/api/sales', (req, res) => {
    const { region, year } = req.query;
    db.all(
        `SELECT region, sum(amount) as total
         FROM sales
         WHERE region = ? AND year = ?
         GROUP BY region`,
        region, parseInt(year),
        (err, rows) => {
            if (err) return res.status(500).json({ error: err.message });
            res.json(rows);
        }
    );
});
08

WASM & Browser Usage

DuckDB-WASM brings the full power of DuckDB to the browser. Run analytical SQL queries entirely client-side, directly on Parquet files, CSVs, or in-memory data. No server required.

Quick Start (CDN)

<!-- Load DuckDB-WASM from CDN -->
<script src="https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm/dist/duckdb-browser-blocking.js"></script>

<script type="module">
import * as duckdb from
    'https://cdn.jsdelivr.net/npm/@duckdb/duckdb-wasm/+esm';

// Select a bundle (browser picks best for platform)
const JSDELIVR_BUNDLES = duckdb.getJsDelivrBundles();
const bundle = await duckdb.selectBundle(JSDELIVR_BUNDLES);

// Instantiate
const worker = new Worker(bundle.mainWorker);
const logger = new duckdb.ConsoleLogger();
const db = new duckdb.AsyncDuckDB(logger, worker);
await db.instantiate(bundle.mainModule, bundle.pthreadWorker);

// Connect and query
const con = await db.connect();
const result = await con.query("SELECT 42 AS answer");
console.log(result.toArray());
</script>

Loading Data in the Browser

// Register a file from URL
await db.registerFileURL(
    'remote.parquet',
    'https://example.com/data.parquet',
    duckdb.DuckDBDataProtocol.HTTP,
    false
);
const result = await con.query(
    "SELECT * FROM 'remote.parquet'"
);

// Register a file from user upload (File/Blob)
const fileInput = document.querySelector('input[type=file]');
fileInput.addEventListener('change', async (e) => {
    const file = e.target.files[0];
    await db.registerFileHandle(
        file.name, file,
        duckdb.DuckDBDataProtocol.BROWSER_FILEREADER,
        true
    );
    const result = await con.query(
        `SELECT * FROM '${file.name}' LIMIT 100`
    );
    renderTable(result);
});

// Insert data from JavaScript arrays
await con.query(`
    CREATE TABLE metrics AS
    SELECT * FROM (VALUES
        ('cpu', 0.85, '2024-01-01'),
        ('mem', 0.62, '2024-01-01')
    ) AS t(name, value, date)
`);

WASM vs Native Performance

WASM runs at roughly 50-70% of native speed for most queries. It excels at lightweight analytics on files up to a few hundred MB. For multi-GB workloads, use the native CLI or Python bindings.

Best For

Interactive dashboards, data exploration, offline-capable apps, SQL playgrounds

Max Dataset Size

~2 GB practical limit in browsers (memory constraints)

Thread Support

Multi-threaded via SharedArrayBuffer (requires COOP/COEP headers)

Extensions

Subset of extensions available (parquet, json, httpfs via fetch API)

Observable / Jupyter Integration

// In Observable notebooks
db = DuckDBClient.of({
    flights: FileAttachment("flights.parquet")
});

// Then use SQL cells directly:
// SELECT carrier, count(*) FROM flights GROUP BY carrier

# In Jupyter with JupySQL
%load_ext sql
%sql duckdb:///:memory:
%%sql
SELECT * FROM 'https://example.com/data.parquet'
LIMIT 10;
09

Remote Data & httpfs

The httpfs extension enables DuckDB to read files directly from HTTP servers, S3-compatible storage, Google Cloud Storage, and Azure Blob Storage. No downloads required -- query remote data with SQL.

HTTP / HTTPS

-- Install and load httpfs
INSTALL httpfs;
LOAD httpfs;

-- Query a Parquet file over HTTP
SELECT count(*)
FROM 'https://example.com/data/sales.parquet';

-- Query a CSV from GitHub
SELECT *
FROM read_csv(
    'https://raw.githubusercontent.com/user/repo/main/data.csv'
)
LIMIT 10;

-- Remote Parquet with predicate pushdown
-- Only downloads the row groups that match
SELECT *
FROM 'https://example.com/huge.parquet'
WHERE year = 2024 AND region = 'US';

Amazon S3

-- Configure S3 credentials
SET s3_region = 'us-east-1';
SET s3_access_key_id = 'AKIAIOSFODNN7EXAMPLE';
SET s3_secret_access_key = 'wJalrXUtnFEMI/K7MDENG/bPxRfiCY';

-- Or use the credential chain (IAM role, env vars, ~/.aws/credentials)
CREATE SECRET (TYPE S3, PROVIDER CREDENTIAL_CHAIN);

-- Query S3 Parquet files
SELECT *
FROM 's3://my-bucket/data/sales/*.parquet';

-- Hive-partitioned data on S3
SELECT *
FROM read_parquet(
    's3://data-lake/events/**/*.parquet',
    hive_partitioning = true
)
WHERE year = 2024 AND month = 3;

-- Write results to S3
COPY (
    SELECT * FROM processed_data
) TO 's3://my-bucket/output/results.parquet';

Google Cloud Storage & Azure

-- Google Cloud Storage
SET s3_endpoint = 'storage.googleapis.com';
SET s3_access_key_id = 'your-key';
SET s3_secret_access_key = 'your-secret';

SELECT * FROM 's3://gcs-bucket/data.parquet';

-- Or use gs:// protocol directly
SELECT * FROM 'gs://my-bucket/data.parquet';

-- Azure Blob Storage (requires the azure extension)
INSTALL azure;
LOAD azure;
SET azure_storage_connection_string = 'DefaultEndpointsProtocol=...';

SELECT * FROM 'az://container/data.parquet';

Secrets Manager

DuckDB Secrets Manager provides a centralized way to manage credentials. Secrets persist across sessions when using a persistent database.
-- Create a secret for S3
CREATE SECRET my_s3_secret (
    TYPE S3,
    KEY_ID 'AKIAIOSFODNN7EXAMPLE',
    SECRET 'wJalrXUtnFEMI/K7MDENG/bPxRfiCY',
    REGION 'us-east-1'
);

-- Create a secret using credential chain
CREATE SECRET (
    TYPE S3,
    PROVIDER CREDENTIAL_CHAIN
);

-- List secrets (credentials are masked)
SELECT * FROM duckdb_secrets();

-- Drop a secret
DROP SECRET my_s3_secret;
10

DuckDB vs SQLite

DuckDB and SQLite share the same philosophy (embeddable, serverless, zero-config) but are optimized for fundamentally different workloads. SQLite excels at OLTP; DuckDB excels at OLAP.

Architecture Comparison

Aspect                DuckDB                                     SQLite
Storage Model         Columnar                                   Row-oriented
Optimized For         OLAP (analytical queries)                  OLTP (transactional queries)
Best At               Aggregations, scans, joins on large data   Single-row lookups, inserts, updates
Concurrency           Multiple readers, single writer (MVCC)     Multiple readers, single writer (WAL mode)
Type System           Strict types, rich type system             Flexible affinity system
Parallel Execution    Yes, intra-query parallelism               No, single-threaded execution
Vectorized Engine     Yes (processes data in batches)            No (tuple-at-a-time)
File Format Support   Parquet, CSV, JSON, Excel native           Own format only (others via extensions)
Window Functions      Full support + QUALIFY                     Basic support
Nested Types          Lists, structs, maps, unions               JSON text (via extension)
Database Size         Tested to hundreds of GB                   Tested to TB scale
Binary Size           ~40 MB                                     ~1 MB

Performance Characteristics

[Chart: relative performance of DuckDB vs SQLite on four workloads: aggregating 1M rows, full table scan, point lookup by primary key, and a single INSERT.]
Rule of thumb: If your query touches many rows and computes aggregates, DuckDB wins. If your query looks up or modifies individual rows, SQLite wins. Many projects use both -- DuckDB for analytics, SQLite for application state.

When to Choose DuckDB

Data Analysis

Ad-hoc queries on CSV/Parquet files without ETL pipelines

ETL Pipelines

Transform and aggregate data between systems

Data Science

SQL in Jupyter/Python as an alternative to pandas

Embedded Analytics

OLAP engine inside applications (dashboards, BI)

Log Analysis

Query JSON/CSV log files without infrastructure (see the sketch after this list)

Data Lake Queries

Query S3/GCS Parquet files directly via httpfs
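
As a taste of the log-analysis case above (paths and field names are hypothetical), one query can aggregate newline-delimited JSON logs straight from disk:

SELECT
    status,
    count(*) AS requests,
    avg(duration_ms) AS avg_ms
FROM read_json_auto('logs/*.ndjson')
GROUP BY ALL
ORDER BY requests DESC;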

Querying SQLite from DuckDB

-- You can use both together!
INSTALL sqlite_scanner;
LOAD sqlite_scanner;

-- Attach a SQLite database
ATTACH 'app.sqlite' AS sqlite_db (TYPE SQLITE);

-- Run analytical queries on SQLite data
SELECT
    region,
    count(*) AS orders,
    sum(total) AS revenue,
    avg(total) AS avg_order
FROM sqlite_db.orders
GROUP BY region
ORDER BY revenue DESC;

-- Join SQLite data with Parquet files
SELECT o.*, p.category
FROM sqlite_db.orders o
JOIN 'products.parquet' p
    ON o.product_id = p.id;