DuckDB

Writes in DuckDB-Iceberg

2025-11-28T00:00:00+00:00

Over the past several months, the DuckDB Labs team has been hard at work on the DuckDB-Iceberg extension, with full read support and initial write support released in v1.4.0. Today, we are happy to announce that delete and update support for Iceberg v2 tables is available in v1.4.2!

The Iceberg open table format has become extremely popular in the past two years, with many databases announcing support for the format, which was originally developed at Netflix. This past year, the DuckDB team has made Iceberg integration a priority, and today we are happy to announce another step in that direction. In this blog post, we describe the current feature set of DuckDB-Iceberg in DuckDB v1.4.2.

Getting Started

To experiment with the new DuckDB-Iceberg features, you will need to connect to your favorite Iceberg REST Catalog. There are many ways to connect to an Iceberg REST Catalog: please have a look at the Connecting to REST Catalogs page for catalogs like Apache Polaris or Lakekeeper, and at the Connecting to S3Tables page if you would like to connect to Amazon S3 Tables.

ATTACH 'warehouse_name' AS iceberg_catalog (
    TYPE iceberg
    -- add the other options required by your catalog here
);

Inserts, Deletes and Updates

Support for creating tables and inserting into tables was already added in DuckDB v1.4.0: you can use standard DuckDB SQL syntax to insert data into your Iceberg table.

CREATE TABLE iceberg_catalog.default.simple_table (
    col1 INTEGER,
    col2 VARCHAR
);
INSERT INTO iceberg_catalog.default.simple_table
    VALUES (1, 'hello'), (2, 'world'), (3, 'duckdb is great');

You can also use any DuckDB table scan function to insert data into an Iceberg table:

INSERT INTO iceberg_catalog.default.more_data
    SELECT * FROM read_parquet('path/to/parquet');

Starting with v1.4.2, the standard SQL syntax also works for deletes and updates:

DELETE FROM iceberg_catalog.default.simple_table WHERE col1 = 2;
UPDATE iceberg_catalog.default.simple_table SET col1 = col1 + 5 WHERE col1 = 1;
SELECT * FROM iceberg_catalog.default.simple_table;

┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     3 │ duckdb is great │
│     6 │ hello           │
└───────┴─────────────────┘

The Iceberg write support currently has two limitations:

The update support is limited to tables that are not partitioned and not sorted. Attempting to perform update, insert or delete operations on partitioned or sorted tables using DuckDB-Iceberg will result in an error.

DuckDB-Iceberg only writes positional deletes for DELETE and UPDATE statements. Copy-on-write functionality is not yet supported.
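To make the positional-delete idea more concrete, here is a minimal Python sketch of merge-on-read semantics. It is not DuckDB-Iceberg's implementation: the in-memory dictionaries and the file names `data-1.parquet` and `data-2.parquet` are hypothetical stand-ins for Iceberg data files and delete files. It replays the DELETE and UPDATE statements from the example above.

```python
# Hypothetical in-memory stand-ins for Iceberg data files and positional delete files.
data_files = {
    "data-1.parquet": [(1, "hello"), (2, "world"), (3, "duckdb is great")],
}
position_deletes = set()  # entries are (data file name, row position)


def scan():
    """Read every data file, skipping the positions listed in the delete set."""
    return [
        row
        for name, rows in data_files.items()
        for pos, row in enumerate(rows)
        if (name, pos) not in position_deletes
    ]


def delete_where(predicate):
    """Record a positional delete for every live row that matches the predicate."""
    for name, rows in data_files.items():
        for pos, row in enumerate(rows):
            if (name, pos) not in position_deletes and predicate(row):
                position_deletes.add((name, pos))


def update_where(predicate, change, new_file):
    """Merge-on-read update: delete the old rows, write changed copies to a new file."""
    changed = [change(row) for row in scan() if predicate(row)]
    delete_where(predicate)
    data_files[new_file] = changed


delete_where(lambda r: r[0] == 2)                        # DELETE ... WHERE col1 = 2
update_where(lambda r: r[0] == 1,                        # UPDATE ... WHERE col1 = 1
             lambda r: (r[0] + 5, r[1]), "data-2.parquet")
visible = sorted(scan())
print(visible)  # [(3, 'duckdb is great'), (6, 'hello')]
```

Note that the original data file is never rewritten: it still holds all three rows, and readers merge in the delete positions at scan time. Rewriting the file instead would be the copy-on-write approach.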

Functions for Table Properties

Currently, DuckDB-Iceberg only supports merge-on-read semantics. Within Iceberg Table Metadata, table properties can be used to describe what form of deletes or updates are allowed. DuckDB-Iceberg will respect write.update.mode and write.delete.mode table properties for updates and deletes. If a table has these properties and they are not merge-on-read, DuckDB will throw an error and the UPDATE or DELETE will not be committed. Version v1.4.2 introduces three new functions to add, remove, and view table properties for an Iceberg table:

set_iceberg_table_properties
iceberg_table_properties
remove_iceberg_table_properties

You can use them as follows:

-- to set table properties
CALL set_iceberg_table_properties(iceberg_catalog.default.simple_table, {
    'write.update.mode': 'merge-on-read',
    'write.file.size': '100000kb'
});
-- to read table properties
SELECT * FROM iceberg_table_properties(iceberg_catalog.default.simple_table);

┌───────────────────┬───────────────┐
│        key        │     value     │
│      varchar      │    varchar    │
├───────────────────┼───────────────┤
│ write.update.mode │ merge-on-read │
│ write.file.size   │ 100000kb      │
└───────────────────┴───────────────┘

-- to remove table properties
CALL remove_iceberg_table_properties(
    iceberg_catalog.default.simple_table,
    ['some.other.property']
);
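The mode check described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text; the function name `check_write_mode` and the exception types are assumptions, not part of DuckDB-Iceberg.

```python
def check_write_mode(table_properties: dict, operation: str) -> None:
    """Raise if the table requests a write mode other than merge-on-read.

    `operation` is "update" or "delete"; a missing property is accepted.
    """
    if operation not in ("update", "delete"):
        raise ValueError(f"unsupported operation: {operation}")
    mode = table_properties.get(f"write.{operation}.mode")
    if mode is not None and mode != "merge-on-read":
        raise RuntimeError(
            f"write.{operation}.mode is '{mode}', but only merge-on-read is supported"
        )


# Property set as in the example above: the update is allowed.
check_write_mode({"write.update.mode": "merge-on-read"}, "update")

# A table that asks for copy-on-write deletes is rejected before anything is committed.
try:
    check_write_mode({"write.delete.mode": "copy-on-write"}, "delete")
    rejected = False
except RuntimeError:
    rejected = True
```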

Iceberg Table Metadata

DuckDB-Iceberg also allows you to view the metadata of your Iceberg tables using the iceberg_metadata() and iceberg_snapshots() functions.

SELECT * FROM iceberg_metadata(iceberg_catalog.default.table_1);

┌──────────────────────┬──────────────────────┬──────────────────┬─────────┬──────────────────┬─────────────────────────────────────────────────────────────┬─────────────┬──────────────┐
│    manifest_path     │ manifest_sequence_…  │ manifest_content │ status  │     content      │                         file_path                           │ file_format │ record_count │
│       varchar        │        int64         │     varchar      │ varchar │     varchar      │                          varchar                            │   varchar   │    int64     │
├──────────────────────┼──────────────────────┼──────────────────┼─────────┼──────────────────┼─────────────────────────────────────────────────────────────┼─────────────┼──────────────┤
│ s3://warehouse/def…  │                    1 │ DATA             │ ADDED   │ EXISTING         │ s3:///simple_table/data/019a6ecc-9e9e-7…  │ parquet     │            3 │
│ s3://warehouse/def…  │                    2 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3:///simple_table/data/d65b1db8-9fa8-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DELETE           │ ADDED   │ POSITION_DELETES │ s3:///simple_table/data/8d1b92dc-5f6e-4…  │ parquet     │            1 │
│ s3://warehouse/def…  │                    3 │ DATA             │ ADDED   │ EXISTING         │ s3:///simple_table/data/019a6ecf-5261-7…  │ parquet     │            1 │
└──────────────────────┴──────────────────────┴──────────────────┴─────────┴──────────────────┴─────────────────────────────────────────────────────────────┴─────────────┴──────────────┘

SELECT * FROM iceberg_snapshots(iceberg_catalog.default.simple_table);

┌─────────────────┬─────────────────────┬─────────────────────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│ sequence_number │     snapshot_id     │      timestamp_ms       │                                                manifest_list                                                 │
│     uint64      │       uint64        │        timestamp        │                                                   varchar                                                    │
├─────────────────┼─────────────────────┼─────────────────────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│               1 │ 1790528822676766947 │ 2025-11-10 17:24:55.075 │ s3:///simple_table/data/snap-1790528822676766947-f09658c4-ca52-4305-943f-6a8073529fef.avro │
│               2 │ 6333537230056014119 │ 2025-11-10 17:27:35.602 │ s3:///simple_table/data/snap-6333537230056014119-316d09bc-549d-46bc-ae13-a9fab5cbf09b.avro │
│               3 │ 7452040077415501383 │ 2025-11-10 17:27:52.169 │ s3:///simple_table/data/snap-7452040077415501383-93dee94e-9ec1-45fa-aec2-13ef434e50eb.avro │
└─────────────────┴─────────────────────┴─────────────────────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Time Travel

Time travel is also possible via snapshot ids or timestamps using the AT (VERSION => ...) or AT (TIMESTAMP => ...) syntax.

-- via snapshot id
SELECT *
FROM iceberg_catalog.default.simple_table AT (
    VERSION => snapshot_id
);

┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘

-- via timestamp
SELECT *
FROM iceberg_catalog.default.simple_table AT (
    TIMESTAMP => '2025-11-10 17:27:45.602'
);

┌───────┬─────────────────┐
│ col1  │      col2       │
│ int32 │     varchar     │
├───────┼─────────────────┤
│     1 │ hello           │
│     3 │ duckdb is great │
└───────┴─────────────────┘
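The timestamp variant can be thought of as a lookup in the snapshot list. The Python sketch below assumes that the snapshot used is the most recent one committed at or before the requested timestamp, which is consistent with the result shown above but is not confirmed implementation detail. The snapshot ids and timestamps are copied from the iceberg_snapshots() output earlier in this post.

```python
from datetime import datetime

# (snapshot_id, commit timestamp), copied from the iceberg_snapshots() output.
snapshots = [
    (1790528822676766947, datetime(2025, 11, 10, 17, 24, 55, 75000)),
    (6333537230056014119, datetime(2025, 11, 10, 17, 27, 35, 602000)),
    (7452040077415501383, datetime(2025, 11, 10, 17, 27, 52, 169000)),
]


def snapshot_as_of(timestamp: datetime) -> int:
    """Return the id of the latest snapshot committed at or before `timestamp`."""
    candidates = [s for s in snapshots if s[1] <= timestamp]
    if not candidates:
        raise LookupError("no snapshot exists at or before the requested timestamp")
    return max(candidates, key=lambda s: s[1])[0]


# Same timestamp as in the query above: 2025-11-10 17:27:45.602
resolved = snapshot_as_of(datetime(2025, 11, 10, 17, 27, 45, 602000))
print(resolved)  # 6333537230056014119
```

Under this assumption, the query resolves to the second snapshot, taken after the DELETE but before the UPDATE, which is why the row (1, hello) is still visible.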

Viewing Requests to the Iceberg REST Catalog

You may also be curious as to what requests DuckDB is making to the Iceberg REST Catalog. To do so, enable HTTP logging, run your workload, then select from the HTTP logs.

CALL enable_logging('HTTP');
SELECT * FROM iceberg_catalog.default.simple_table;
SELECT request.type, request.url, response.status
FROM duckdb_logs_parsed('HTTP');

┌─────────┬──────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                                             url                          │       status       │
│ varchar │                                                                           varchar                        │      varchar       │
├─────────┼──────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https:///iceberg/v1//iceberg-testing/namespaces/default                     │ NULL               │
│ HEAD    │ https:///iceberg/v1//iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https:///iceberg/v1//iceberg-testing/namespaces/default/tables/simple_table │ NULL               │
│ GET     │ https:///data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-…                       │ OK_200             │
│ GET     │ https:///data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                             │ OK_200             │
│ GET     │ https:///data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                             │ OK_200             │
│ GET     │ https:///data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                             │ PartialContent_206 │
│ GET     │ https:///data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                             │ PartialContent_206 │
│ GET     │ https:///data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                             │ OK_200             │
│ GET     │ https:///data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parq…                       │ PartialContent_206 │
│ GET     │ https:///data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                             │ OK_200             │
│ GET     │ https:///data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parq…                       │ PartialContent_206 │
├─────────┴──────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                       3 columns │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Here we can see calls to the Iceberg REST Catalog, followed by calls to the storage endpoint. The first three calls to the Iceberg REST Catalog are to verify the schema still exists and to get the latest metadata.json of the DuckDB-Iceberg table. Next, it queries the manifest list, manifest files, and eventually the files with data and deletes. The data and delete files are stored locally in a cache to speed up subsequent reads.

Transactions

DuckDB is an ACID-compliant database that supports transactions. Work on DuckDB-Iceberg has been done with this in mind. Within a transaction, the following conditions will hold for Iceberg tables.

The first time a table is read in a transaction, its snapshot information is stored in the transaction and will remain consistent within that transaction.
Updates, inserts and deletes will only be committed to an Iceberg table when the transaction is committed (i.e., COMMIT).

The first condition is important for read performance. If you wish to do analytics on an Iceberg table and you do not need to get the latest version of the table every time, running your analytics in a transaction will prevent fetching the latest version for every query.

-- truncate the logs
CALL truncate_duckdb_logs();
CALL enable_logging('HTTP');
BEGIN;
-- first read gets latest snapshot information
SELECT * FROM iceberg_catalog.default.simple_table;
-- subsequent read reads from local cached data
SELECT * FROM iceberg_catalog.default.simple_table;
-- get logs
SELECT request.type, request.url, response.status
FROM duckdb_logs_parsed('HTTP');

┌─────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────┬────────────────────┐
│  type   │                                                  url                                                        │       status       │
│ varchar │                                                varchar                                                      │      varchar       │
├─────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────┼────────────────────┤
│ GET     │ https:///iceberg/v1//iceberg-testing/namespaces/default                        │ NULL               │
│ HEAD    │ https:///iceberg/v1//iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https:///iceberg/v1//iceberg-testing/namespaces/default/tables/simple_table    │ NULL               │
│ GET     │ https:///data/snap-5943683398986255948-c2217dde-6036-4e07-88f2-1…                         │ OK_200             │
│ GET     │ https:///data/f8c95b93-7b6b-4a24-8557-b98b553723d4-m0.avro                                │ OK_200             │
│ GET     │ https:///data/214a7988-da39-4dac-aa3a-4a73d3ead405-m0.avro                                │ OK_200             │
│ GET     │ https:///data/019a7244-c6e8-7bc9-9dd4-7249fcb04959.parquet                                │ PartialContent_206 │
│ GET     │ https:///data/019a7244-fcb5-7308-96ec-1c9e32509eab.parquet                                │ PartialContent_206 │
│ GET     │ https:///data/7f14bb06-f57a-42b4-ba7f-053a65152759-m0.avro                                │ OK_200             │
│ GET     │ https:///data/71f8b43d-51e7-40e7-be88-e8d869836ecd-deletes.parquet                        │ PartialContent_206 │
│ GET     │ https:///data/64f6c6e2-2f54-470e-b990-b201bc615042-m0.avro                                │ OK_200             │
│ GET     │ https:///data/4e54afed-6dd8-4ba0-88fb-16f972ac1d91-deletes.parquet                        │ PartialContent_206 │
├─────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────┴────────────────────┤
│ 12 rows                                                                                                                          3 columns │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘

Here we see all the same requests we saw in the previous section. However, now we are in a transaction, which means the second time we read from iceberg_catalog.default.simple_table, we do not need to query the REST Catalog for table updates. This means DuckDB-Iceberg performs no extra requests when reading a table a second time, significantly improving performance.
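The snapshot pinning behavior can be illustrated with a small Python model. The classes `FakeCatalog` and `Transaction` are hypothetical and only mimic the behavior described in this section: the first read in a transaction asks the catalog for the current snapshot, and later reads reuse it.

```python
class FakeCatalog:
    """Hypothetical stand-in for an Iceberg REST Catalog that counts lookups."""

    def __init__(self):
        self.current_snapshot = {"simple_table": 3}
        self.requests = 0

    def load_table(self, name):
        self.requests += 1
        return self.current_snapshot[name]


class Transaction:
    """Pins the snapshot of each table the first time the table is read."""

    def __init__(self, catalog):
        self.catalog = catalog
        self.pinned = {}

    def read(self, name):
        if name not in self.pinned:
            self.pinned[name] = self.catalog.load_table(name)
        return self.pinned[name]


catalog = FakeCatalog()
tx = Transaction(catalog)
first = tx.read("simple_table")
catalog.current_snapshot["simple_table"] = 4   # another writer commits meanwhile
second = tx.read("simple_table")               # still the pinned snapshot, no request
fresh = Transaction(catalog).read("simple_table")
```

In this model, the two reads inside `tx` cost a single catalog request and see the same snapshot, while a new transaction picks up the newer one.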

Conclusion and Future Work

With these features, DuckDB-Iceberg now has a strong base of support for Iceberg tables, which enables users to unlock the analytical powers of DuckDB on their Iceberg tables. There is still more work to come, and the Iceberg table specification has many more features the DuckDB team would like to support in DuckDB-Iceberg. If you feel any feature is a priority for your analytical workloads, please reach out to us in the DuckDB-Iceberg GitHub repository or get in touch with our engineers.

Below is a list of improvements planned for the near future (in no particular order):

Performance improvements
Updates / deletes / inserts to partitioned tables
Updates / deletes / inserts to sorted tables
Schema evolution
Support for Iceberg v3 tables, focusing on binary deletion vectors and row lineage tracking

Data-at-Rest Encryption in DuckDB

2025-11-19T00:00:00+00:00

If you would like to use encryption in DuckDB, we recommend using the latest stable version, v1.4.2. For more details, see the latest release blog post.

Many years ago, we read the excellent “Code Book” by Simon Singh. Did you know that Mary, Queen of Scots, used an encryption method harking back to Julius Caesar to encrypt her more saucy letters? But alas: the cipher was broken and the contents of the letters got her executed.

These days, strong encryption software and hardware is a commodity. Modern CPUs come with specialized cryptography instructions, and operating systems small and big contain mostly-robust cryptography software like OpenSSL.

Databases store arbitrary information, and it is clear that many, if not most, datasets of any value should not be plainly available to everyone. Even when data is stored on tightly controlled hardware like a cloud virtual machine, there have been many cases of files being lost through various privilege escalations. Unsurprisingly, compliance frameworks like the common SOC 2 “highly recommend” encrypting data when stored on storage media like hard drives.

However, database systems and encryption have a somewhat problematic track record. Even PostgreSQL, the self-proclaimed “The World's Most Advanced Open Source Relational Database”, has very limited options for data encryption. SQLite, the world’s “Most Widely Deployed and Used Database Engine”, does not support data encryption out of the box; its encryption extension is a $2000 add-on.

DuckDB has supported Parquet Modular Encryption for a while. This feature allows reading and writing Parquet files with encrypted columns. However, while Parquet files are great and reports of their impending death are greatly exaggerated, they cannot – for example – be updated in place, a pretty basic feature of a database management system.

Starting with DuckDB 1.4.0, DuckDB supports transparent data encryption of data-at-rest using industry-standard AES encryption.

DuckDB's encryption does not yet meet the official NIST requirements.

Some Basics of Encryption

There are many different ways to encrypt data, some more secure than others. In database systems and elsewhere, the standard is the Advanced Encryption Standard (AES), which is a block cipher algorithm standardized by US NIST. AES is a symmetric encryption algorithm, meaning that the same key is used for both encryption and decryption of data.

Because deterministic encryption reveals when two plaintexts are identical, most systems choose to support only randomized encryption, meaning that identical plaintexts will always yield different ciphertexts (if used correctly!). The most commonly used industry standard and recommended encryption algorithm is AES – Galois Counter Mode (AES-GCM). This is because on top of its ability to randomize encryption, it also authenticates data by calculating a tag to ensure data has not been tampered with.

DuckDB v1.4 supports encryption at rest using the AES-GCM-256 and AES-CTR-256 (counter mode) ciphers. AES-CTR is simpler and faster than AES-GCM, but less secure, since it does not authenticate the data by calculating a tag. The 256 refers to the size of the key in bits, meaning that DuckDB currently only supports 32-byte keys.

GCM and CTR both require three inputs: (1) a plaintext, (2) an initialization vector (IV) and (3) an encryption key. The plaintext is the data that a user wants to encrypt. An IV is a unique bytestream of usually 16 bytes that ensures that identical plaintexts get encrypted into different ciphertexts. A number used once (nonce) is a bytestream of usually 12 bytes that, together with a 4-byte counter, constructs the IV. Note that the IV needs to be unique for every encrypted block, but it does not necessarily have to be random. Reusing the same IV is problematic, since an attacker could XOR the two ciphertexts to obtain the XOR of the two plaintexts, which is often enough to recover both messages. The tag in AES-GCM is calculated after all blocks are encrypted, much like a checksum, but it adds an integrity check that securely authenticates the entire ciphertext.
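The following Python sketch illustrates why IV reuse is dangerous in counter-style modes. It is deliberately not AES: to stay within the standard library, the keystream is produced by SHA-256 over the key and IV, which is purely a stand-in. The 12-byte nonce plus 4-byte counter layout follows the description above; the function names and the big-endian counter encoding are assumptions made for illustration.

```python
import hashlib
import os


def make_iv(nonce: bytes, counter: int) -> bytes:
    """Build a 16-byte IV from a 12-byte nonce and a 4-byte block counter."""
    if len(nonce) != 12:
        raise ValueError("nonce must be 12 bytes")
    return nonce + counter.to_bytes(4, "big")


def toy_ctr(key: bytes, nonce: bytes, data: bytes) -> bytes:
    """Toy counter-mode cipher (NOT AES): XOR data with a per-block keystream."""
    out = bytearray()
    for i in range(0, len(data), 16):
        keystream = hashlib.sha256(key + make_iv(nonce, i // 16)).digest()[:16]
        out.extend(b ^ k for b, k in zip(data[i:i + 16], keystream))
    return bytes(out)


def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


key = os.urandom(32)
nonce = os.urandom(12)
p1 = b"attack at dawn, bring ducks"
p2 = b"retreat at dusk, hide geese"

# Reusing the nonce: the keystream cancels out and the plaintext XOR leaks.
c1 = toy_ctr(key, nonce, p1)
c2 = toy_ctr(key, nonce, p2)
leak = xor(c1, c2) == xor(p1, p2)

# Decryption is the same operation with the same key and nonce.
roundtrip = toy_ctr(key, nonce, c1)
```

With a fresh nonce per message the keystreams differ, so XORing the two ciphertexts no longer cancels them out.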

Implementation in DuckDB

Before diving deeper into how we actually implemented encryption in DuckDB, we’ll explain some things about the DuckDB file format.

DuckDB has one main database header which stores data that enables it to correctly load and verify a DuckDB database. At the start of each DuckDB main database header, the magic bytes (“DUCKDB”) are stored and read upon initialization to verify whether the file is a valid DuckDB database file. The magic bytes are followed by four 8-byte flags that can be set for different purposes.

When a database is encrypted in DuckDB, the main database header remains plaintext at all times, since the main header contains no sensitive data about the contents of the database file. Upon initializing an encrypted database, DuckDB sets the first bit in the first flag to indicate that the database is encrypted. After setting this bit, additional metadata is stored that is necessary for encryption. This metadata entails the (1) database identifier, (2) 8 bytes of additional metadata for e.g. the encryption cipher used, and (3) the encrypted canary.

The database identifier is used as a “salt”, and consists of 16 randomly generated bytes created upon initialization of each database. A salt is used to ensure uniqueness, i.e., it makes sure that identical input keys or passwords are transformed into different derived keys. The 8 bytes of metadata comprise the key derivation function (first byte), usage of additional authenticated data (second byte), the encryption cipher (third byte), and the key length (fifth byte). After the metadata, the main header stores the encrypted canary, which is used to check if the input key is correct.

Encryption Key Management

To encrypt data in DuckDB, you can use practically any plaintext or base64 encoded string, but we recommend using a secure 32-byte base64 key. Users themselves are responsible for key management and thus for using a secure key. Instead of directly using the plain key provided by the user, DuckDB always derives a more secure key by means of a key derivation function (KDF). The KDF is a function that reduces or extends the input key to a 32-byte secure key. Once the correctness of the input key has been verified by deriving the secure key and decrypting the canary, the derived key is managed in a secure encryption key cache. This cache manages encryption keys for the current DuckDB context and ensures that the derived encryption keys are never swapped to disk by locking its memory. To strengthen security even more, the original input keys are immediately wiped from memory once they have been transformed into secure derived keys.
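The interplay of salt, key derivation and canary can be sketched with the Python standard library. The text does not say which KDF DuckDB uses, so PBKDF2-HMAC-SHA256 is an assumption here, and the canary check uses an HMAC as a stand-in for the encrypted canary. The constant `CANARY` and all function names are hypothetical.

```python
import hashlib
import hmac
import os

CANARY = b"hypothetical-canary"  # stand-in for the known canary plaintext


def derive_key(user_key: str, salt: bytes) -> bytes:
    """Stretch or shrink an arbitrary input key to a 32-byte derived key."""
    return hashlib.pbkdf2_hmac("sha256", user_key.encode(), salt, 100_000, dklen=32)


def make_canary(derived_key: bytes) -> bytes:
    """Stand-in for encrypting the canary with the derived key."""
    return hmac.new(derived_key, CANARY, hashlib.sha256).digest()


def key_is_correct(user_key: str, salt: bytes, stored_canary: bytes) -> bool:
    """Derive the key again and compare canaries in constant time."""
    return hmac.compare_digest(make_canary(derive_key(user_key, salt)), stored_canary)


# Creating a database: a random 16-byte identifier doubles as the salt.
salt = os.urandom(16)
stored = make_canary(derive_key("asdf", salt))

ok = key_is_correct("asdf", salt, stored)
wrong = key_is_correct("qwer", salt, stored)

# The same input key yields a different derived key in another database.
other = derive_key("asdf", os.urandom(16))
```

The point of the canary is that a wrong key is detected right at attach time, before any block is decrypted.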

DuckDB Block Structure

After the main database header, DuckDB stores two 4KB database headers that contain more information about e.g. the block (header) size and the storage version used. While the main database header stays in plaintext, all remaining headers and blocks are encrypted when encryption is used.

Blocks in DuckDB are by default 256KB, but their size is configurable. At the start of each plaintext block there is an 8-byte block header, which stores an 8-byte checksum. The checksum is a simple calculation that is often used in database systems to check for any corrupted data.

For encrypted blocks, however, the block header consists of 40 bytes instead of just the 8 bytes for the checksum. The block header for encrypted blocks contains a 16-byte nonce/IV and, optionally, a 16-byte tag, depending on which encryption cipher is used. The nonce and tag are stored in plaintext, but the checksum is encrypted for better security. Note that the block header always needs to be 8-byte aligned to calculate the checksum.
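As a quick sanity check of the sizes above, the following Python sketch models the 40-byte encrypted block header. The field order (nonce, then tag, then encrypted checksum) is an assumption made for illustration; the text only gives the sizes. The function name is hypothetical as well.

```python
NONCE_SIZE = 16      # nonce/IV, stored in plaintext
TAG_SIZE = 16        # authentication tag, stored in plaintext (GCM)
CHECKSUM_SIZE = 8    # checksum, stored encrypted
ENCRYPTED_HEADER_SIZE = NONCE_SIZE + TAG_SIZE + CHECKSUM_SIZE  # 40 bytes


def split_encrypted_block(block: bytes):
    """Split a block into nonce, tag and ciphertext (encrypted checksum + payload).

    Assumed layout: [nonce | tag | encrypted checksum | encrypted payload].
    """
    if len(block) < ENCRYPTED_HEADER_SIZE:
        raise ValueError("block is shorter than the encrypted block header")
    nonce = block[:NONCE_SIZE]
    tag = block[NONCE_SIZE:NONCE_SIZE + TAG_SIZE]
    ciphertext = block[NONCE_SIZE + TAG_SIZE:]
    return nonce, tag, ciphertext


fake_block = bytes(range(48))  # 40-byte header followed by 8 payload bytes
nonce, tag, ciphertext = split_encrypted_block(fake_block)
```

Since 40 is a multiple of 8, the header keeps the 8-byte alignment mentioned above.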

Write-Ahead-Log Encryption

The write-ahead log (WAL) in database systems is a crash recovery mechanism that ensures durability. It is an append-only file that is used in scenarios where the database crashed or was abruptly closed before all changes were written to the main database file. The WAL makes sure that the changes made since the last checkpoint, which is a consistent snapshot of the database at a certain point in time, can be replayed. This means that when a checkpoint is triggered, which happens in DuckDB by either (1) closing the database or (2) reaching a certain threshold for storage, the contents of the WAL get written into the main database file.

In DuckDB, you can force the creation of a WAL by setting

PRAGMA disable_checkpoint_on_shutdown;
PRAGMA wal_autocheckpoint = '1TB';

This disables checkpointing when the database is closed, meaning that the WAL does not get merged into the main database file. In addition, setting wal_autocheckpoint to a high threshold avoids intermediate checkpoints, so the WAL persists. For example, we can create a persistent WAL file by first setting the above PRAGMAs, then attaching an encrypted database, and then creating a table into which we insert three rows.

ATTACH 'encrypted.db' AS enc (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'GCM'
);
CREATE TABLE enc.test (a INTEGER, b INTEGER);
INSERT INTO enc.test VALUES (11, 22), (13, 22), (12, 21);

If we now close the DuckDB process, we can see that there is a .wal file shown: encrypted.db.wal. But how is the WAL created internally?

Before new entries (inserts, updates, deletes) are written to the database, they are logged and appended to the WAL. Only after the logged entries are flushed to disk is a transaction considered committed. A plaintext WAL entry has the following structure:

Since the WAL is append-only, we encrypt a WAL entry per value. For AES-GCM this means that we append a nonce and a tag to each entry. The structure is as follows. When we serialize an encrypted entry to the encrypted WAL, we first store the length in plaintext, because we need to know how many bytes we should decrypt. The length is followed by a nonce, which in turn is followed by the encrypted checksum and the encrypted entry itself. After the entry, a 16-byte tag is stored for verification.
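The field order described above (length, nonce, encrypted checksum, encrypted entry, tag) can be sketched in Python as follows. Only the order and the 16-byte tag come from the text: the 8-byte little-endian length, the 12-byte nonce, the 8-byte checksum and the choice to let the length cover just the entry are assumptions, and the function names are hypothetical. The payload bytes are dummies, since no actual encryption is performed.

```python
import struct

LENGTH_SIZE = 8      # assumption: 8-byte little-endian length
NONCE_SIZE = 12      # assumption
CHECKSUM_SIZE = 8    # assumption, mirroring the 8-byte block checksum
TAG_SIZE = 16        # from the text


def serialize_entry(nonce, enc_checksum, enc_entry, tag) -> bytes:
    """Lay out one encrypted WAL entry: length | nonce | checksum | entry | tag."""
    assert len(nonce) == NONCE_SIZE and len(tag) == TAG_SIZE
    assert len(enc_checksum) == CHECKSUM_SIZE
    return struct.pack("<Q", len(enc_entry)) + nonce + enc_checksum + enc_entry + tag


def read_entries(wal: bytes):
    """Walk an append-only WAL; the plaintext length tells us where each entry ends."""
    entries, offset = [], 0
    while offset < len(wal):
        (length,) = struct.unpack_from("<Q", wal, offset)
        offset += LENGTH_SIZE
        nonce = wal[offset:offset + NONCE_SIZE]
        offset += NONCE_SIZE
        enc_checksum = wal[offset:offset + CHECKSUM_SIZE]
        offset += CHECKSUM_SIZE
        enc_entry = wal[offset:offset + length]
        offset += length
        tag = wal[offset:offset + TAG_SIZE]
        offset += TAG_SIZE
        entries.append((nonce, enc_checksum, enc_entry, tag))
    return entries


# Two dummy entries appended back to back, as in an append-only log.
wal = serialize_entry(b"n" * 12, b"c" * 8, b"first entry", b"t" * 16)
wal += serialize_entry(b"m" * 12, b"d" * 8, b"second, longer entry", b"u" * 16)
parsed = read_entries(wal)
```

Because every entry carries its own nonce and tag, entries can be appended and later verified one at a time without touching earlier parts of the file.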

Encrypting the WAL is triggered by default when an encryption key is given for any (un)encrypted database.

Temporary File Encryption

Temporary files are used to store intermediate data that is often necessary for large, out-of-core operations such as sorting, large joins and window functions. This data could contain sensitive information and can, in case of a crash, remain on disk. To protect this leftover data, DuckDB automatically encrypts temporary files too.

The Structure of Temporary Files

There are three different types of temporary files in DuckDB: (1) temporary files that have the same layout as a regular 256KB block, (2) compressed temporary files and (3) temporary files that exceed the standard 256KB block size. The first two are suffixed with .tmp, while the last is distinguished by the suffix .block. To keep track of the size of .block temporary files, they are always prefixed with their length. As opposed to regular database blocks, temporary files do not contain a checksum to check for data corruption, since the calculation of a checksum is somewhat expensive.

Encrypting Temporary Files

Temporary files are encrypted (1) automatically when you attach an encrypted database or (2) when you use the setting SET temp_file_encryption = true. In the latter case, the main database file is plaintext, but the temporary files will be encrypted. For the encryption of temporary files DuckDB internally generates temporary keys. This means that when the database crashes, the temporary keys are also lost. Temporary files cannot be decrypted in this case and are then essentially garbage.

To force DuckDB to produce temporary files, you can use a simple trick by just setting the memory limit low. This will create temporary files once the memory limit is exceeded. For example, we can create a new encrypted database, load this database with TPC-H data (SF 1), and then set the memory limit to 1 GB. If we then perform a large join, we force DuckDB to spill intermediate data to disk. For example:

INSTALL tpch;
LOAD tpch;
SET memory_limit = '1GB';
ATTACH 'tpch_encrypted.db' AS enc (
    ENCRYPTION_KEY 'asdf',
    ENCRYPTION_CIPHER 'GCM' -- or 'CTR'
);
USE enc;
CALL dbgen(sf = 1);

ALTER TABLE lineitem
    RENAME TO lineitem1;
CREATE TABLE lineitem2 AS
    FROM lineitem1;
CREATE OR REPLACE TABLE ans AS
    SELECT l1.* , l2.*
    FROM lineitem1 l1
    JOIN lineitem2 l2 USING (l_orderkey , l_linenumber);

This sequence of commands will result in encrypted temporary files being written to disk. Once the query completes or when the DuckDB shell is exited, the temporary files are automatically cleaned up. In case of a crash however, it may happen that temporary files will be left on disk and need to be cleaned up manually.

How to Use Encryption in DuckDB

In DuckDB, you can (1) encrypt an existing database, (2) initialize a new, empty encrypted database or (3) reencrypt a database. For example, let's create a new database, load this database with TPC-H data of scale factor 1 and then encrypt this database.

INSTALL tpch;
LOAD tpch;
ATTACH 'encrypted.duckdb' AS encrypted (ENCRYPTION_KEY 'asdf');
ATTACH 'unencrypted.duckdb' AS unencrypted;
USE unencrypted;
CALL dbgen(sf = 1);
COPY FROM DATABASE unencrypted TO encrypted;

There is no trivial way to prove that a database is encrypted, but correctly encrypted data should look like random noise and have high entropy. So, to check whether a database is actually encrypted, we can use tools to calculate the entropy or visualize the binary, such as ent and binocle.

When we use ent after executing the above chunk of SQL, i.e., ent encrypted.duckdb, this will result in an entropy of 7.99999 bits per byte. If we do the same for the plaintext (unencrypted) database, this results in 7.65876 bits per byte. Note that the plaintext database also has a high entropy, but this is due to compression.
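For readers without ent at hand, the bits-per-byte figure is the Shannon entropy of the byte distribution, which is easy to compute with the Python standard library. This is a minimal sketch of that formula, not a reimplementation of ent; the function name is an assumption.

```python
import math
import os
from collections import Counter


def entropy_bits_per_byte(data: bytes) -> float:
    """Shannon entropy of the byte histogram, between 0.0 and 8.0."""
    if not data:
        return 0.0
    total = len(data)
    counts = Counter(data)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


uniform = entropy_bits_per_byte(bytes(range(256)))        # every byte value once
constant = entropy_bits_per_byte(b"aaaa")                 # a single repeated byte
random_like = entropy_bits_per_byte(os.urandom(1 << 16))  # close to 8 bits per byte
```

To apply it to a database file, read the file in binary mode and pass its contents to the function, for example the encrypted.duckdb file created above.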

Let’s now visualize both the plaintext and encrypted data with binocle. For the visualization, we created both a plaintext DuckDB database with TPC-H data at scale factor 0.001 and an encrypted one.

Announcing DuckDB 1.4.2 LTS

2025-11-12T00:00:00+00:00

In this blog post, we highlight a few important fixes and convenience improvements in DuckDB v1.4.2, the second patch release in DuckDB's 1.4 LTS line. To see the complete list of updates, please consult the release notes on GitHub.

While this is a patch release, we are shipping some small features. In LTS releases, these can come in two forms:

We add small opt-in features such as accessing the profiler's output from the logger in this release. These features have been highly requested by the community, and we are confident that they will not cause any issues for people upgrading to the latest release. In fact, using them carefully can help detect and understand performance regressions.
Some of DuckDB's extensions that are marked as “experimental” are shipping full-fledged features. For example, this is how we have rolled out support for Iceberg deletes and updates. Extensions are opt-in by nature, so if you stick to core DuckDB and its stable extensions, changes in the experimental extensions will not affect the stability of your installation.

To install the new version, please visit the installation page. Note that it can take a few hours to days for some client libraries (e.g., R, Rust) to be released due to the extra changes and review rounds required.

Features and Improvements

Iceberg Improvements

Similarly to the v1.4.1 release blog post, we can start with some good news for our Iceberg users: DuckDB v1.4.2 ships a number of improvements for the iceberg extension. Insert, update, and delete statements are all supported now:

Relational Charades: Turning Movies into Tables

2025-10-27T00:00:00+00:00

“Your scientists were so preoccupied with whether they could,
they didn't stop to think if they should.”
Dr. Ian Malcolm, Jurassic Park (1993)

Here at team DuckDB, we love tables. Tables are a timeless elegant abstraction that precedes literature by about a thousand years. Relational tables specifically can represent any kind of information imaginable. But just because something can be done it is not always a great idea to do so. Can we build a rocket propelled by a nuclear chain reaction that irradiates the land it flies over? Yes. Should we? Probably not.

Disclaimer

Array-like data like images and videos are a textbook example of something that might not benefit from storing them in a database. While of course any binary data can be added to tables as BLOBs, there is not that much added value from it. Sure, it's harder to lose the image compared to the industry standard solution, storing a file name that points to the image. But there are not that many meaningful operations that can be done on BLOBs other than store and load. Without adding some overhyped AI tech, you can't even ask the database what the picture shows.

Array data also has its own world of highly specialized file formats and compression algorithms. Just think of the ubiquitous MPEG-4 standard to store movies. They are approximate (not exact, lossy) formats that are designed around human perception models, which is why they can avoid storing things people do not notice. They achieve impressive compression rates, with a two hour "Full" HD movie compressing to about 2 GB using MPEG-4.

Ignoring the Disclaimer

But what would it feel like to turn a movie into a table? (Very) deep down, a movie is just a series of fast-moving pictures ("frames"), typically at something around 25 frames per second. At that speed, our monkey-brain cannot distinguish the separate images any more and is fooled into thinking that we are watching smooth movement. Side note for the younger generations, a strip of pictures was the way we shipped around movies for more than 100 years.

So a series of pictures it is. Each picture can be further deconstructed into a two-dimensional array (a "matrix") of points, so-called "pixels". Every pixel in turn consists of three numbers, one each for the intensity of red, green and blue, or RGB for short. Note that we're ignoring the audio tracks in this post, but in principle it would work the exact same way, just with a different kind of intensity.

As an added complexity, the relational model (famously) does not require an absolute order of records. So all the various offsets have to be made explicit to not lose information. This of course greatly increases the size of our data set. We end up with a table that looks like this:

x	r	g	b
0	4	5	1
1	4	5	1
2	5	6	2
3	8	9	4
4	9	10	5
5	11	12	8
6	11	12	8
7	11	12	8
8	9	10	5
9	9	10	5

We have the time offset or frame number i, we have x and y for the pixel position in the frame, and r, g and b for the color components red, green and blue. Quite involved.

But now the movie is just a single table. If only we had a conventional and guaranteed total order of rows, we could in theory skip all columns except for r, g and b, because with a known resolution all other columns can be inferred. This is coincidentally also how actual movie data files are stored, ignoring compression. This is another reason why maybe relational tables are not the best place for a movie to live in, but if all you have is a hammer. We could also have used some more modern features of SQL and used nested fields (a LIST in DuckDB), but let's keep it to a table even System R could have dealt with. In addition, having explicit offsets does not require nebulous conventions or additional metadata to know in which axis order the array data was serialized.
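To make the "all other columns can be inferred" point concrete, here is a small sketch in plain Python. It assumes the pixels are stored frame by frame and then line by line (the same i, y, x order used later in this post), and it borrows the 720x392 resolution of the movie used below. The helper names to_offsets and to_row_number are made up for illustration.

```python
# Frame resolution, assumed to be known up front (720x392 for the movie below).
DIM_X, DIM_Y = 720, 392
ROWS_PER_FRAME = DIM_X * DIM_Y

def to_offsets(row_number):
    """Recover (i, y, x) from a position in a totally ordered pixel stream."""
    i, within_frame = divmod(row_number, ROWS_PER_FRAME)
    y, x = divmod(within_frame, DIM_X)
    return i, y, x

def to_row_number(i, y, x):
    """Inverse mapping: the position a pixel would have in the stream."""
    return (i * DIM_Y + y) * DIM_X + x

print(to_offsets(0))
# Second frame, second line, sixth pixel:
print(to_offsets(ROWS_PER_FRAME + DIM_X + 5))
```

With a guaranteed order, these three offsets are pure arithmetic. Without one, they have to be stored in every row, which is the price the table pays.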

Experiments

To investigate this daft idea further (for Science!), we convert the 1963 classic "Charade", a "romantic screwball comedy mystery film" starring Audrey Hepburn and Cary Grant, to a DuckDB table. This movie was picked because it is accidentally in the Public Domain because of a screw-up in the wording of the copyright notice (no, really). Because of this, you can actually freely download this movie from the Internet Archive.

Since we're just creating a table, we will use DuckDB's native storage format. Here is the complete code snippet we used to convert the movie. In fact, this code should actually be generic enough to convert anything that ffmpeg can read to a table. Just in case you would want to try this at home on your own movies.

import imageio
import duckdb

# setup movie reading
vid = imageio.get_reader("Charade-1963.mp4", "ffmpeg")
dim_x = vid.get_meta_data()['size'][0]
dim_y = vid.get_meta_data()['size'][1]
rows_per_frame = dim_y * dim_x

# setup a DuckDB database and table
con = duckdb.connect()
con.execute("ATTACH 'charade.duckdb' AS m (STORAGE_VERSION 'latest'); USE m;")
con.execute("CREATE TABLE movie (i BIGINT, y USMALLINT, x USMALLINT, r UTINYINT, g UTINYINT, b UTINYINT)")

# those offsets don't change between frames, so pre-compute them
con.execute("CREATE TEMPORARY TABLE y AS SELECT unnest(list_sort(repeat(range(?), ?))) y", [dim_y, dim_x])
con.execute("CREATE TEMPORARY TABLE x AS SELECT unnest(repeat(range(?), ?)) x", [dim_x, dim_y])

# loop over each frame in the movie and insert the pixel data
for i_idx, im in enumerate(vid):
    v = im.flatten()
    r = v[0:len(v):3]
    g = v[1:len(v):3]
    b = v[2:len(v):3]

    con.execute('''INSERT INTO movie 
        FROM repeat(?, ?) i -- frame offset 
        POSITIONAL JOIN   y -- temp table
        POSITIONAL JOIN   x -- temp table
        POSITIONAL JOIN   r -- numpy scan
        POSITIONAL JOIN   g -- numpy scan
        POSITIONAL JOIN   b -- numpy scan
        ''', [i_idx, rows_per_frame])

This script makes use of not just one, but (at least) two cool DuckDB features. First, we use so-called replacement scans to directly query the NumPy arrays r, g, and b. Note that those have not been created as tables in DuckDB nor registered in any way, but they are referenced by name in the INSERT. What happens here is that DuckDB inspects the Python context for the missing "tables" and finds objects with those names that it can read. The other cool feature is the POSITIONAL JOIN, which lets us stack multiple tables horizontally by position without running an actual (expensive) JOIN. This way, we assemble all the columns we need for a single frame in a bulk INSERT, which executes quite efficiently.
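What ends up in each row may be easier to see on a toy frame. The following plain-Python sketch mimics one loop iteration for a made-up 2x2 frame: it pulls the channels apart with the same stride-3 slicing and then lines the columns up by position. Here zip is only an analogy for what the POSITIONAL JOIN does, not how DuckDB implements it, and the pixel values and frame number are invented.

```python
# A hypothetical 2x2 frame, flattened the way im.flatten() would lay it out:
# row-major, with the three color values of each pixel next to each other.
dim_x, dim_y = 2, 2
v = [10, 11, 12,  20, 21, 22,   # line y=0: pixels x=0 and x=1
     30, 31, 32,  40, 41, 42]   # line y=1: pixels x=0 and x=1

# Stride-3 slices separate the channels again.
r, g, b = v[0::3], v[1::3], v[2::3]

# Offsets, in the same order as the temp tables y and x in the script.
ys = sorted(list(range(dim_y)) * dim_x)   # [0, 0, 1, 1]
xs = list(range(dim_x)) * dim_y           # [0, 1, 0, 1]
i_idx = 7                                 # arbitrary frame number
frame_col = [i_idx] * (dim_x * dim_y)

# Line the columns up by position: one tuple per pixel.
rows = list(zip(frame_col, ys, xs, r, g, b))
for row in rows:
    print(row)
```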

The movie file we have has a frame rate of 25 frames per second at a (DVD-ish) resolution of 720x392 pixels. The total runtime is 01:53:02.56 (hours:minutes:seconds), which comes down to 169 563 individual frames. Because we have a row for each pixel we end up with 169 563 × 720 × 392 rows, or 47 857 461 120. 47 billion rows! Finally Big Data! When stored as a DuckDB database however, the file size is "only" around 200 GB. Totally doable on a laptop!

DuckDB's lightweight compression performs quite well here, given that in a naive binary format we would have to store at least 15 bytes per row. If we multiply that with the row count (47 billion, remember) we would end up at around 700 GB in storage for this hypothetical naive format.

Of course, by turning the data into a relational table we add a bunch of previously implicit information due to the lack of ordering in relations. If we just stored the raw pixel bytes, for example as an implicitly ordered series of BMP (bitmap) files, we would end up with the same amount of bytes as the rows above times three, or 133 GB. Even including materializing all the offsets, the DuckDB file still manages to end up at a comparable size (200 GB). And of course, comparing the size of the table with the MPEG-4 version of the movie is not entirely fair because MPEG-4 is a lossy compression format. Databases can't just randomly decide to compromise on the numerical accuracy of the tables they store!
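The size estimates above can be re-derived with a few lines of arithmetic. The sketch below uses only the frame count, resolution and column types given in this post. One observation, offered tentatively: the raw pixel size comes out near 144 GB in decimal units and near 133.7 GiB in binary units, so the 133 figure above seems to match the binary reading.

```python
frames = 169_563
dim_x, dim_y = 720, 392

rows = frames * dim_x * dim_y

# Naive fixed-width row: BIGINT (8) + 2 x USMALLINT (2 each) + 3 x UTINYINT (1 each).
bytes_per_row = 8 + 2 + 2 + 1 + 1 + 1
naive_bytes = rows * bytes_per_row

# Raw pixels only: three color bytes per pixel, no offsets stored.
raw_bytes = rows * 3

print(f"rows:        {rows:,}")
print(f"naive table: {naive_bytes / 1e9:.1f} GB ({naive_bytes / 2**30:.1f} GiB)")
print(f"raw pixels:  {raw_bytes / 1e9:.1f} GB ({raw_bytes / 2**30:.1f} GiB)")
```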

To prove that the transformation is accurate, let's try to turn the table data for one random frame back into a human-consumable picture: we will retrieve the corresponding rows from DuckDB, and use some Python magic to turn them back into a PNG image file:

import duckdb
import numpy as np
import PIL.Image

frame = 48000

con = duckdb.connect('charade.duckdb', read_only=True)
dim_y, dim_x = con.execute("SELECT max(y) + 1 dim_y, max(x) + 1 dim_x FROM movie WHERE i=0").fetchone()

res = con.execute("SELECT r, g, b FROM movie WHERE i = ? ORDER BY y, x", [frame]).fetchnumpy()
v = np.zeros(dim_y * dim_x * 3, dtype=np.uint8)
v[0:len(v):3] = res['r']
v[1:len(v):3] = res['g']
v[2:len(v):3] = res['b']

img = PIL.Image.fromarray(v.reshape((dim_y, dim_x, 3)))
img.save('frame.png')

And voila, we can see a wonderful frame with Audrey and Cary appear. This trick can also be used to create a sequence of pictures and write them to an MPEG-4 file again using – for example – the moviepy library.

But now that we have a table, we can have some fun with it. Let's do some basic exploration first: we start with DESCRIBE, which basically tells us the schema. We knew this of course.

DESCRIBE movie;

column_name	column_type	null	key	default	extra
i	BIGINT	YES	NULL	NULL	NULL
y	USMALLINT	YES	NULL	NULL	NULL
x	USMALLINT	YES	NULL	NULL	NULL
r	UTINYINT	YES	NULL	NULL	NULL
g	UTINYINT	YES	NULL	NULL	NULL
b	UTINYINT	YES	NULL	NULL	NULL

No surprises there. How many rows are there?

FROM movie SELECT count(*);

count_star()
47857461120

Ah yes, 47 billion. What are the numeric properties of the columns? DuckDB has this neat SUMMARIZE statement that computes single-pass summary statistics on a table (or arbitrary query).

SUMMARIZE movie;

This one is admittedly a bit of a flex. DuckDB can compute elaborate summary statistics on all the 47 billion rows in ca. 20 minutes on a MacBook. Here are the results:

column_name	column_type	max	approx_unique	avg	std	q25	q50	q75	count
i	BIGINT	169562	150076	84781.0	48948.621846957954	42429	84751	127137	47857461120
y	USMALLINT	391	430	195.5	113.16028455346597	98	196	294	47857461120
x	USMALLINT	719	840	359.5	207.84589644146592	180	359	540	47857461120
r	UTINYINT	255	252	65.32575855816732	44.85627602555231	27	54	96	47857461120
g	UTINYINT	249	249	56.79713844669577	37.03562456032193	28	44	77	47857461120
b	UTINYINT	255	252	43.249715985643995	38.39218963268899	16	28	61	47857461120

Since we're basically storing a lot of colors, just how many different combinations of red, green and blue are there, DuckDB?

FROM (FROM movie SELECT DISTINCT r, g, b)
SELECT count(*);

Any seasoned data engineer would rightfully caution you against running a DISTINCT on this many rows. There have just been too many production outages caused by overflowing aggregations. But thanks to DuckDB's larger-than-memory aggregate hash table, we can confidently issue this query. We even get a nice progress bar and (since 1.4.0) a surprisingly accurate estimate of how long the query will take.

count_star()
826568

So roughly 800 thousand different colors. Computing this took about 2 minutes in the end. But what are the frequencies of those colors? Let's compute a histogram of the 10 most-used colors!

FROM movie
SELECT r, g, b, count(*) AS ct
GROUP BY ALL
ORDER BY ct DESC
LIMIT 10;

r	g	b	ct
17	20	15	106521429
23	25	15	93004303
23	25	13	85552738
13	22	15	81734796
22	24	13	76560295
24	26	15	75376896
15	19	8	74285763
23	24	19	72904497
22	24	12	69269099
24	26	16	62230136

The most common colors here seem to be dark shades of grey. Makes sense! Keep in mind that the MPEG-4 compression is lossy and will probably produce some odd colors as rounding artifacts.

But we can also have some more fun. We have an analytical database system. How about we compute the average frame for every thousand frames and stitch the results back into a movie? It's just a big aggregation. We first create the actual averages:

CREATE TABLE averages AS
    FROM movie
    SELECT
        i // 1000 AS idx,
        y,
        x,
        avg(r)::UTINYINT AS r,
        avg(g)::UTINYINT AS g,
        avg(b)::UTINYINT AS b
GROUP BY ALL
ORDER BY idx, y, x;
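As a rough sanity check on what this aggregation produces, the sketch below mirrors the i // 1000 bucketing in plain Python. It takes the frame count quoted earlier and works out how many averaged frames to expect, then runs the same grouping on a handful of invented rows for a single pixel position and one channel. It is an analogy for the GROUP BY, not a description of how DuckDB executes it.

```python
import math
from collections import defaultdict

total_frames = 169_563   # frame count quoted earlier in the post
bucket_size = 1000       # from `i // 1000`

num_buckets = math.ceil(total_frames / bucket_size)
frames_in_last_bucket = total_frames - (num_buckets - 1) * bucket_size
print(num_buckets, frames_in_last_bucket)

# Toy version of the GROUP BY: (i, y, x, r) rows for one pixel position.
toy_rows = [(0, 0, 0, 10), (999, 0, 0, 20), (1000, 0, 0, 50), (1500, 0, 0, 70)]
sums = defaultdict(lambda: [0, 0])   # (idx, y, x) -> [sum, count]
for i, y, x, r in toy_rows:
    key = (i // bucket_size, y, x)
    sums[key][0] += r
    sums[key][1] += 1
averages = {key: total / count for key, (total, count) in sums.items()}
print(averages)
```

Note that the last bucket covers fewer than 1000 frames, so its average is computed over a shorter stretch of the movie than the others.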

Then, we use Python again to turn this averages table into a movie:

# some setup omitted

# fetch a bunch of frames in bulk
res = con.execute("SELECT r, g, b FROM averages ORDER BY idx, y, x").fetchnumpy()

# split the rgb arrays by frame again
r_splits = np.split(res['r'], num_frames)
g_splits = np.split(res['g'], num_frames)
b_splits = np.split(res['b'], num_frames)

# generate pictures
image_files = []
for i in range(num_frames):
    v = np.zeros(dim_y * dim_x * 3, dtype=np.uint8)
    v[0:len(v):3] = r_splits[i]
    v[1:len(v):3] = g_splits[i]
    v[2:len(v):3] = b_splits[i]
    image_files.append(v.reshape((dim_y, dim_x, 3), order='C'))

# write movie file
clip = moviepy.video.io.ImageSequenceClip.ImageSequenceClip(image_files, fps=25)
clip.write_videofile('averages.mp4')

There is some wrangling here because we want to retrieve the whole frame dataset in bulk and not run a query for every single one. We then use NumPy to split them into frames and stitch the RGB channels together into the three-dimensional array that the image libraries like. This does not achieve any business purpose but the results are kind of funny. Here is average frame #68, with apologies to the actors:

We can also stitch all the averages together to make a somewhat twitchy average movie:

Frozen DuckLakes for Multi-User, Serverless Data Access

2025-10-24T00:00:00+00:00

We show how you can build high-performance data lakes with no moving parts.

Uncovering Financial Crime with DuckDB and Graph Queries

2025-10-22T00:00:00+00:00

Following the money is harder than it looks. Sophisticated criminals hide their tracks using long, complex chains of transactions, hoping to obscure the origin of illicit funds. Unraveling these networks is a classic graph problem: you're looking for suspicious patterns and hidden paths in a vast web of accounts and transactions.

For years, this kind of analysis often meant exporting data to a specialized graph database, adding complexity and overhead. But what if you could perform this powerful graph analysis directly within your daily driver database?

This is where DuckDB's extensibility shines. In this blog post, we'll dive into a financial dataset and use DuckDB with a graph query extension to identify the kinds of patterns that could indicate a money laundering scheme or otherwise high-risk accounts.

From Relational Tables to a Property Graph

Before we can hunt for suspicious activity, we need to understand our data. We're using the LDBC Financial Benchmark dataset, which simulates a financial network. To attach to the database with the dataset, run:

ATTACH 'https://blobs.duckdb.org/data/finbench.duckdb' AS finbench;
USE finbench;

To follow along with the examples in this post, it is recommended to use DuckDB v1.4.1.

In this blog post we will use a subset of the dataset with tables for Person, Account, and the AccountTransferAccount table that links them.

Let's start by getting a feel for the scale of our network:

SELECT
    (SELECT count(*) FROM Person) AS num_persons,
    (SELECT count(*) FROM Account) AS num_accounts,
    (SELECT count(*) FROM AccountTransferAccount) AS num_transfers;

┌─────────────┬──────────────┬───────────────┐
│ num_persons │ num_accounts │ num_transfers │
│    int64    │    int64     │     int64     │
├─────────────┼──────────────┼───────────────┤
│     785     │     2055     │     8132      │
└─────────────┴──────────────┴───────────────┘

This query gives us a quick overview of the number of entities and connections we're dealing with. As the schema diagram above illustrates, these tables of accounts and transfers already form a graph, a structure made of nodes or vertexes (the entities), connected by edges (representing relations – hint hint – between the entities).

To make our queries more powerful, we'll use the Property Graph model. This is just a formal way of saying we can add descriptive labels, which are general types like Account and Person, as well as specific properties, like accountId and nickname.

If you're thinking this sounds a lot like the relational model, you're exactly right. A Person table is just a collection of nodes with the label Person, and its columns are the properties. This natural mapping is what makes a high-performance relational database like DuckDB a perfect foundation for graph analytics.

Property Graphs in DuckDB

To write our graph queries, we could use DuckDB and the SQL we are familiar with. But let us make life a little simpler for ourselves and leverage DuckDB's rich extension ecosystem. We will be using DuckPGQ, a community extension that adds support to DuckDB's parser for a new visual graph syntax. This new syntax is SQL / Property Graph Queries (SQL/PGQ), which is part of the official SQL:2023 standard. SQL/PGQ is partially inspired by the popular graph query language Cypher.

The DuckPGQ extension started out as a research prototype and is now available as a community extension.

Installing and loading the extension is as simple as it gets:

INSTALL duckpgq FROM community;
LOAD duckpgq;

In DuckPGQ, the first step is to create the property graph, acting as a layer on top of the tables we have created earlier.

CREATE PROPERTY GRAPH finbench
VERTEX TABLES (
    Person,
    Account
)
EDGE TABLES (
    AccountTransferAccount
        SOURCE KEY (fromId) REFERENCES Account (accountId)
        DESTINATION KEY (toId) REFERENCES Account (accountId)
        LABEL Transfer,
    PersonOwnAccount
        SOURCE KEY (personId) REFERENCES Person (personId)
        DESTINATION KEY (accountId) REFERENCES Account (accountId)
        LABEL PersonOwn
);

During the creation of the property graph, we make a clear distinction between VERTEX tables and EDGE tables. For VERTEX tables, we only have to specify the name of the table. For EDGE tables, a little more work is required since for both the SOURCE and the DESTINATION, we need to specify the column in the edge table that forms the key for the SOURCE or DESTINATION. This is the same principle as defining a FOREIGN KEY constraint, linking our edge table back to the node tables it connects. The LABEL clause gives a clean name to the relationship type. While our table is named AccountTransferAccount, the edges within it represent a Transfer relationship. This is the name we'll use in our graph queries.
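The mapping from tables to a graph can be sketched in a few lines of plain Python. This is only an analogy for what the property graph definition expresses, not DuckPGQ's implementation. The rows are invented, and the column names follow the Account and AccountTransferAccount tables used in this post.

```python
# Hypothetical rows; column names follow the tables used in this post.
account_rows = [
    {"accountId": 1, "nickname": "savings"},
    {"accountId": 2, "nickname": "checking"},
]
transfer_rows = [{"fromId": 1, "toId": 2, "amount": 250.0}]

# Vertex table -> nodes: the table name is the label, the columns are properties.
nodes = {("Account", row["accountId"]): row for row in account_rows}

# Edge table -> edges: the SOURCE KEY / DESTINATION KEY columns point at vertex keys.
edges = []
for row in transfer_rows:
    source = ("Account", row["fromId"])
    destination = ("Account", row["toId"])
    # Same idea as a foreign key check: both endpoints must exist.
    if source not in nodes or destination not in nodes:
        raise KeyError(f"dangling edge: {row}")
    edges.append({"label": "Transfer", "source": source,
                  "destination": destination, "amount": row["amount"]})

print(edges)
```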

Now that we have created our property graph, we are ready to investigate the financial data and uncover its secrets!

Graph Processing

When talking about graph processing in databases, we typically refer to these types of operations:

Pattern matching, finding a pattern in our data.
Path-finding, finding a path in our data, potentially of variable length.

Let's see how we can leverage DuckDB and DuckPGQ for these two tasks.

Hunting for Suspicious Activities

As previously mentioned, SQL/PGQ introduces a visual graph syntax to formulate graph patterns more naturally. Now, let's use it to hunt for patterns that might indicate a money laundering scheme.

A common technique used to hide illicit funds is called smurfing. The goal of smurfing is to break down a single large transfer, potentially triggering reporting requirements, into smaller transactions over time.

We can search for this behavior by looking for pairs of accounts with a high number of transactions but a relatively low average amount. Let's set the threshold for the average amount at $50,000 and see if we can find any high-frequency relationships:

SELECT
    fromName,
    count(amount) AS number_of_transactions,
    round(avg(amount), 2) AS avg_amount,
    toName
FROM GRAPH_TABLE (finbench
    MATCH (a:Account)-[t:Transfer]->(a2:Account)
    COLUMNS (a.nickname AS fromName,
             t.amount,
             a2.nickname AS toName
            )
)
GROUP BY ALL
HAVING avg_amount < 50_000
ORDER BY number_of_transactions DESC, avg_amount ASC
LIMIT 5;

Running the query leads us to the following result:

┌───────────────────┬────────────────────────┬────────────┬───────────────────┐
│     fromName      │ number_of_transactions │ avg_amount │      toName       │
│      varchar      │         int64          │   double   │      varchar      │
├───────────────────┼────────────────────────┼────────────┼───────────────────┤
│ Noe Trites        │                      1 │   49365.04 │ Dale Croucher     │
│ Madeleine Bussing │                      1 │   46663.56 │ Delphine Primiano │
│ Bonnie Centeno    │                      1 │   46663.56 │ Maile Boon        │
│ Darci Sheedy      │                      1 │   44856.02 │ Carmella Estelle  │
│ Marguerita Gurne  │                      1 │   44393.68 │ Delphine Primiano │
└───────────────────┴────────────────────────┴────────────┴───────────────────┘

The query worked, but the result does not show any signs of suspicious activity with the number of transactions always being 1. To understand why, let's break down how the query was constructed.

The magic happens inside the FROM clause. The GRAPH_TABLE (finbench ...) function allows us to run a graph query over the property graph we have just created and treat its output like a regular table.

The MATCH (a:Account)-[t:Transfer]->(a2:Account) clause is the core of our pattern. It visually describes what we're looking for: a simple transfer from one account (a:Account) to another (a2:Account). The () denote nodes and the [] denotes the connecting edge with the ASCII-style arrow -> showing the direction of the edge. The COLUMNS(...) clause then acts like a SELECT list for our pattern, pulling out the nickname from the accounts and the amount from the transfer.

The beauty of SQL/PGQ is that the result of this graph pattern match can be seamlessly returned into the standard SQL we already know. We use GROUP BY ALL to aggregate all transfers between the same two people, and our HAVING avg_amount < 50_000 clause filters for the smurfing pattern we defined.
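For readers who think more easily in code than in SQL, the aggregation part of the query can be mirrored in plain Python. The transfers below are invented and deliberately contain one pair that would match the smurfing pattern. The dataset in this post does not contain such a pair, as the result above shows.

```python
from collections import defaultdict

# Hypothetical transfers as (from_name, to_name, amount); not taken from the dataset.
transfers = [
    ("Alice", "Bob", 9_000.0),
    ("Alice", "Bob", 9_500.0),
    ("Alice", "Bob", 8_700.0),
    ("Carol", "Dave", 120_000.0),
    ("Erin", "Bob", 45_000.0),
]

# GROUP BY ALL groups by the non-aggregated columns: (from_name, to_name).
groups = defaultdict(list)
for from_name, to_name, amount in transfers:
    groups[(from_name, to_name)].append(amount)

# HAVING avg_amount < 50_000, then ORDER BY count DESC, avg ASC, LIMIT 5.
candidates = [
    (from_name, len(amounts), round(sum(amounts) / len(amounts), 2), to_name)
    for (from_name, to_name), amounts in groups.items()
    if sum(amounts) / len(amounts) < 50_000
]
candidates.sort(key=lambda row: (-row[1], row[2]))
top = candidates[:5]
for row in top:
    print(row)
```

In this toy data, the pair with three small transfers sorts to the top, which is the kind of row the query was hoping to surface.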

We know our query is correct but also that this simple smurfing pattern is not present in our dataset. This means we must investigate further, using potentially more complex patterns. This leads us to a more powerful feature of graph queries: finding structural patterns that are very difficult to express with traditional SQL JOINs, such as transaction paths.

Finding Paths in the Transactions

Another classical example of possible fraudulent behavior is a cycle of transactions where the money circles back to the person who sent the first transaction in the chain. Writing a SQL query to answer this question is, using the traditional syntax, incredibly difficult. Try it yourself, after first reading this section! We will show the answer in the next section.

With SQL/PGQ, writing queries that involve finding paths has become significantly easier. The diagram above illustrates the query pattern we'll use with the goal of finding a path of one or more transfers between two different accounts (A1 and A2) owned by the same person (P). Remember that persons can own multiple accounts. With the following query we will try to find cycles between all the accounts that the person with id 125 owns:

FROM GRAPH_TABLE(finbench
    MATCH p = ANY SHORTEST
                  (p:Person)-[o1:PersonOwn]->(a1:Account)
                  -[t:Transfer]->+
                  (a2:Account)<-[o2:PersonOwn]-(p:Person)
WHERE
    p.personId = 125 AND a1.accountId <> a2.accountId
    COLUMNS (
        path_length(p) AS path_length,
        a1.accountId AS start_account,
        a2.accountId AS end_account
    )
)
ORDER BY path_length;

The result shows cycles of varying lengths for this person, who owns multiple accounts:

┌─────────────┬─────────────────────┬─────────────────────┐
│ path_length │    start_account    │     end_account     │
│    int64    │        int64        │        int64        │
├─────────────┼─────────────────────┼─────────────────────┤
│           8 │ 4753267931712848113 │ 4794926228266025204 │
│           8 │ 4769874955338776819 │ 4794926228266025204 │
│           8 │ 4796615078126289138 │ 4769874955338776819 │
│           9 │ 4753267931712848113 │ 4769874955338776819 │
│           9 │ 4769874955338776819 │ 4753267931712848113 │
│           9 │ 4794926228266025204 │ 4753267931712848113 │
│           9 │ 4796615078126289138 │ 4753267931712848113 │
│           9 │ 4796615078126289138 │ 4794926228266025204 │
│          12 │ 4794926228266025204 │ 4769874955338776819 │
└─────────────┴─────────────────────┴─────────────────────┘

Once again, the magic happens inside the FROM clause where we now create a MATCH that finds ANY SHORTEST path along the given pattern. The first part of the pattern finds all the accounts owned by the Person 125: (p:Person)-[o1:PersonOwn]->(a1:Account). In the second part the path-finding occurs. Take special note of the +, which indicates that for the pattern (a1:Account)-[t:Transfer]->+(a2:Account) the two nodes (a1) and (a2) do not necessarily need to be directly connected. With the +, we indicate there must be one or more Transfers between these two accounts, with no upper bound set. Finally, to tie it all up, we take the destination account and check whether it is owned by the same person p.

The results confirm our suspicions. Our query found multiple non-obvious paths between the accounts owned by Person 125, with path lengths ranging from 8 to 12 edges. Each path includes the two PersonOwn edges, so that corresponds to 6 to 10 transfers.

Each row represents a hidden chain of transactions connecting two of the person's accounts. More interestingly, we can see clear cyclical patterns. For instance, the query found a 9-step path from account 4753267931712848113 to 4769874955338776819, and another 9-step path flowing in the opposite direction. This suggests a sophisticated and intentional effort to move money between accounts, a strong indicator that warrants further investigation.
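For intuition about what ANY SHORTEST combined with ->+ asks for, here is a minimal breadth-first search in plain Python over an invented set of transfers. It illustrates the "one or more hops, shortest first" idea only. It makes no claim about how DuckPGQ evaluates the pattern internally, and the account names are placeholders.

```python
from collections import deque

# Hypothetical transfer edges between accounts (not from the dataset).
transfers = {
    "A1": ["B"],
    "B": ["C"],
    "C": ["A2", "B"],
    "A2": [],
}

def shortest_transfer_path(start, goal):
    """Breadth-first search for a shortest path of one or more transfers."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        for nxt in transfers.get(path[-1], []):
            if nxt == goal:
                return path + [nxt]
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_transfer_path("A1", "A2")
transfer_hops = len(path) - 1
# The full pattern also contains two ownership edges, one at each end.
print(path, transfer_hops, transfer_hops + 2)
```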

Doing it the Old-Fashioned Way

Earlier, we challenged you to think about how you would find these ownership cycles using traditional SQL. As promised, here is the answer.

Before we dive into the query, there are two important notes to keep in mind when comparing it to the SQL/PGQ version:

Performance Safeguard: The query requires a manual upper bound on the path length (ps.depth < 11) to prevent infinite recursion and potentially quadratic runtimes on dense graphs. The SQL/PGQ ->+ syntax does not require this.
Path Length Difference: You'll notice the path_length in this query's result is two hops shorter than the result from DuckPGQ. This is because this query only counts the Transfer edges, whereas the DuckPGQ query also includes the two PersonOwn edges in its path calculation.

With that in mind, here is the traditional recursive CTE to find the shortest path between any two accounts owned by the same person:

WITH RECURSIVE
    owned_accounts AS (
        SELECT accountId
        FROM PersonOwnAccount
        WHERE personId = 125
    ),
    path_search(start_node, end_node, path, depth) AS (
        -- Base case: a direct transfer from one of the person's accounts
        SELECT
            fromId,
            toId,
            [fromId, toId],
            1
        FROM
            accounttransferaccount
        WHERE
            fromId IN (SELECT accountId FROM owned_accounts)
        UNION ALL
        -- Recursive step: find the next transfer in the path
        SELECT
            ps.start_node,
            t.toId,
            list_append(ps.path, t.toId),
            ps.depth + 1
        FROM path_search ps
        JOIN accounttransferaccount t ON ps.end_node = t.fromId
        WHERE
            t.toId NOT IN (SELECT unnest(ps.path)) AND ps.depth < 11
    )
SELECT DISTINCT start_node, end_node, min(depth) AS path_length
FROM path_search
WHERE end_node IN (SELECT accountId FROM owned_accounts)
  AND start_node <> end_node
GROUP BY ALL
ORDER BY path_length;

As you can see, the logic requires a WITH RECURSIVE clause, manual path tracking in a list, and explicit cycle detection. This is exactly the kind of verbose and complex query that the visual syntax of SQL/PGQ is designed to eliminate.
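The three ingredients named above (iteration, path tracking, cycle detection) can also be spelled out in plain Python, which may help when reading the CTE. The sketch below follows the same steps on an invented set of transfers. The data, the account names and the MAX_DEPTH constant are illustrative stand-ins, with MAX_DEPTH playing the role of the ps.depth < 11 guard.

```python
# Hypothetical transfer edges and owned accounts (not from the dataset).
transfers = [("A1", "B"), ("B", "C"), ("C", "A2"), ("C", "B"), ("A2", "A1")]
owned_accounts = {"A1", "A2"}
MAX_DEPTH = 11  # stand-in for the `ps.depth < 11` guard

# Base case: direct transfers leaving one of the owned accounts.
frontier = [(src, dst, [src, dst], 1) for src, dst in transfers if src in owned_accounts]
all_paths = list(frontier)

# Recursive step: extend each path by one transfer, skipping visited accounts.
while frontier:
    next_frontier = []
    for start, end, path, depth in frontier:
        if depth >= MAX_DEPTH:
            continue
        for src, dst in transfers:
            if src == end and dst not in path:
                next_frontier.append((start, dst, path + [dst], depth + 1))
    all_paths.extend(next_frontier)
    frontier = next_frontier

# Final SELECT: smallest depth per (start, end) pair of distinct owned accounts.
shortest = {}
for start, end, _path, depth in all_paths:
    if end in owned_accounts and start != end:
        shortest[(start, end)] = min(depth, shortest.get((start, end), depth))

print(sorted(shortest.items(), key=lambda kv: kv[1]))
```

Unlike a breadth-first search that stops at the first hit, this enumerates every simple path up to the depth bound and only then takes the minimum, which is why the bound matters on dense graphs.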

Wrapping Up

We began this post with a simple goal: to see if we could use DuckDB to hunt for the complex patterns and hidden paths typical of graph analysis. After diving into the Financial Benchmark dataset, the conclusion is clear: you can.

The key takeaway is the drastic improvement in usability. We saw how the visual syntax of SQL/PGQ, enabled by the DuckPGQ extension, transformed a sophisticated "ownership cycle" query from a monstrous recursive CTE into a few readable lines of code. This is exactly the kind of expressive power needed for real-world analytical tasks. For more information and complete documentation on DuckPGQ, be sure to visit its official website: duckpgq.org.

Just as importantly, this entire investigation was performed directly within DuckDB. By leveraging the extension ecosystem, we tapped into the power of graph queries without ever needing to export our data or manage a separate, specialized system. Everything runs on top of DuckDB's high-performance vectorized engine, right where the data lives.

For powerful, analytical graph queries, DuckDB isn't just a viable alternative, it's a powerful, natural solution. The next time you think about analyzing connections in your data, remember that the tools you need are just an INSTALL away.

Streaming Patterns with DuckDB

2025-10-13T00:00:00+00:00

The words “DuckDB” and “streaming” don't usually make it into the same sentence. Maybe this is because DuckDB has been positioned as an all-powerful (but very lightweight) OLAP database. Or maybe this is because the ecosystem of streaming analytics has centered around names such as Kafka, Flink and Spark Streaming, and most recently players trying to change the game like Materialize or RisingWave. But can DuckDB be used in the context of streaming analytics? What is streaming analytics in the first place?

Streaming Analytics Patterns

The simplest definition: streaming analytics is the act of updating an analytical view of your data at near real-time speed as new data comes in. For example, if three new sessions have just started on your website, the process of gathering those session events and updating the count (+3) is streaming analytics. Streaming analytics is not, in my modest opinion, just about inserting those 3 session events in a table – that would fit well within the realm of a transactional workload. Streaming analytics is also not pushing these events to a Kafka topic and sinking them to another system. If you don't update the analytical view of your data, I wouldn't call it streaming analytics.
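The session example can be written down as a tiny sketch. It is plain Python with invented page names and starting counts, and it only illustrates the distinction drawn above: the new events are folded into an existing aggregate rather than merely being stored somewhere.

```python
from collections import Counter

# The "analytical view": session counts per page, kept up to date as events arrive.
session_counts = Counter({"home": 10, "pricing": 4})

def apply_delta(view, new_events):
    """Fold a batch of new session events into the view instead of recomputing it."""
    view.update(Counter(event["page"] for event in new_events))
    return view

# Three hypothetical sessions that have just started.
new_sessions = [{"page": "home"}, {"page": "home"}, {"page": "docs"}]
apply_delta(session_counts, new_sessions)
print(sum(session_counts.values()))
```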

Now that we have a definition, let's take a look at three common architectural patterns in streaming analytics. The names given to these patterns are of my own making, but I think they help differentiate them from one another.

In the Materialized View Pattern, it is very common to use a cloud data warehouse with support for materialized views (such as BigQuery or Snowflake). The stream of events is usually sunk to a raw table and a materialized view is created on top. This pattern is generally perceived as having a higher latency than the next two. However, there is not that much benchmarking around to conclude anything.
The Streaming Engine Pattern uses the more traditional ETL approach. A separate process using a streaming engine consumes the messages from the source, queries are then done on the fly and results are stored in a persisted table. Common engines are Spark Streaming, Flink, Kafka Streams or most recently Arroyo. This has traditionally come with a set of complications (e.g., dealing with watermarks, state management, increased memory load for infinite queries, etc.).
The Streaming Database Pattern is similar to the previous one in terms of latency but drastically simplifies the experience. Streaming databases like RisingWave or Materialize can directly read from the streaming source and update your materialized view on the fly. They aim at keeping ACID consistency and allowing clients to query data using the PostgreSQL wire protocol.

“Where does DuckDB fit in all this?” – you may ask. Well, DuckDB fits well with patterns one and two. Even if DuckDB does not support materialized views (yet), we can work around this limitation and implement these patterns to still get very good results.

Interestingly, the streaming engine industry doesn't have many official benchmarks. The Nexmark benchmark seems to be the most common, but there are not many published results comparing engines using this benchmark.

Materialized View Pattern: Cooking Our Own Materialized View with DuckDB

We know that DuckDB is very fast at aggregating data on the fly and also performs very well in transactional workloads (for an OLAP system). So does DuckLake's lakehouse format, thanks to its data inlining feature. In this section we are going to see both DuckDB and DuckLake in action, acting as a sink for Kafka and calculating new metric values based on deltas.

All the patterns are going to do the same thing in different ways. That is, read events from a Kafka topic and update the analytical view, which can be a persisted table or a view on top of a raw table. What happens in between is what differentiates these patterns.

Querying Deltas with DuckDB

The key component in this diagram is what I call “Delta Processor”. This component is basically a function that loops periodically and runs a query to aggregate new data inserted in the raw_events table and to update the analytical view, in this case a persisted table called user_clicks. This is the query that runs periodically to update user_clicks with the new delta:

MERGE INTO user_clicks AS dest
USING (
    SELECT 
        user_id,
        user_name,
        count(*) AS count_of_clicks,
        max(timestamp) AS updated_at
    FROM raw_events
    WHERE event_type = 'CLICK'
      AND (LATEST_UPDATED_AT IS NULL
       OR timestamp > LATEST_UPDATED_AT)
    GROUP BY user_id, user_name
) AS src
ON dest.user_id = src.user_id
WHEN MATCHED THEN 
    UPDATE SET 
        count_of_clicks = dest.count_of_clicks + src.count_of_clicks,
        updated_at = src.updated_at
WHEN NOT MATCHED THEN
    INSERT (user_id, user_name, count_of_clicks, updated_at)
    VALUES (src.user_id, src.user_name, src.count_of_clicks, src.updated_at);

You can check the full pipeline in this repository.
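To make the watermark logic of the Delta Processor more concrete, here is a minimal in-memory model of one iteration, written in plain Python rather than against DuckDB. It is only a sketch: the `Event` class, the `apply_delta` function and the dictionary standing in for `user_clicks` are hypothetical names of my own, and the caller is assumed to keep the returned watermark (the `LATEST_UPDATED_AT` placeholder in the query above) between iterations.

```python
from dataclasses import dataclass


@dataclass
class Event:
    user_id: int
    user_name: str
    event_type: str
    timestamp: int  # assumption: any monotonically comparable value, e.g., epoch millis


def apply_delta(raw_events, user_clicks, latest_updated_at):
    """Fold CLICK events newer than the watermark into user_clicks.

    Mirrors the MERGE above: existing users get their count incremented,
    new users get a fresh row. Returns the watermark for the next iteration.
    """
    new_watermark = latest_updated_at
    for ev in raw_events:
        if ev.event_type != "CLICK":
            continue
        if latest_updated_at is not None and ev.timestamp <= latest_updated_at:
            continue  # already counted by an earlier iteration
        row = user_clicks.setdefault(
            ev.user_id,
            {"user_name": ev.user_name, "count_of_clicks": 0, "updated_at": ev.timestamp},
        )
        row["count_of_clicks"] += 1
        row["updated_at"] = max(row["updated_at"], ev.timestamp)
        if new_watermark is None or ev.timestamp > new_watermark:
            new_watermark = ev.timestamp
    return new_watermark
```

Running `apply_delta` twice over the same events with the watermark returned by the first call leaves the counts unchanged, which is the property that lets the processor loop periodically without double counting.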

Using DuckLake's Change Data Feed

This pattern is very similar to the previous one (Querying Deltas with DuckDB) but with some specifics to DuckLake:

We are using DuckLake's Data Inlining to speed up insertion without writing too many small files.
The Delta Processor component can take advantage of DuckLake's Data Change Feed to avoid scanning unnecessary data.
We have an extra component, called “Inline Flusher”, that periodically flushes inlined data from the metadata catalog to parquet files of the specified file size (512 MB by default). This is a maintenance operation that will keep DuckLake performant.

You can check the full pipeline in this repository.

In order to make better use of filter pushdowns and file pruning, partitioning the data by timestamp is recommended.

Streaming Engine Pattern: Streaming Engines and DuckDB

Most established streaming engines (Spark Streaming, Flink, Kafka Streams) are JVM based. They can therefore insert data in DuckDB using the JDBC protocol. This pattern is usually a bit difficult to manage. Long running streaming queries tend to consume a lot of memory and restarting interrupted streaming queries always makes me skip a beat. However, it can be a very low latency solution for very large streams of data.

Using Spark Streaming and Sink to DuckDB

In this diagram we can see that most of the components are managed by the Spark Streaming runtime. In Spark Streaming, all of this is contained in a streaming query. When the micro-batching mode is being used (as is the case in this example), you can pass a custom function to the writer that allows you to write each batch in the way you desire. In our case, we just use a JDBC connection and overwrite the destination table (user_clicks).

We can also see that there are no intermediate results being saved, meaning in this particular case we do not have a raw_events table. This is not a pattern that I love since for auditing purposes I would prefer to store the raw data to ensure that my streaming job isn't doing something funky. In this case, Spark Streaming relies on checkpoints to keep the state and make sure that data is processed just once and queries are able to restart without missing or duplicating data consumed from the Kafka topic.

You can check the full pipeline in this repository.

Bonus: Using DuckDB Tributary Extension to Directly Query Kafka

This setup is the closest thing to the Streaming Database Pattern that you can do right now with DuckDB. Powered by the tributary DuckDB community extension, you can create a view or a table that reads directly from a Kafka topic. To simulate materialized views, we are using views for this specific example. The following query showcases how simple this process is:

CREATE VIEW IF NOT EXISTS raw_events_view AS
    SELECT
        * EXCLUDE message, 
        decode(message)::JSON AS message 
    FROM
        tributary_scan_topic(TOPIC, "bootstrap.servers":="localhost:9092");

Currently this extension has no state management. Every time this view is queried, we would be reading the whole topic from offset 0. This is not ideal since Kafka has a limited retention policy, which means that at some point it will start flushing messages. A way around this is to materialize these messages to tables and use the offset (or a timestamp) to keep track of what has been ingested.
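The offset-tracking workaround can be sketched in a few lines of plain Python. This is a model of the idea, not of the extension: I am assuming that each scanned message carries a `partition` and an `offset` field (hypothetical names), that messages arrive in offset order within a partition, and that `table` and `committed_offsets` stand in for a persisted table and its bookkeeping.

```python
def ingest_new_messages(messages, table, committed_offsets):
    """Append only messages that have not been ingested yet.

    committed_offsets maps partition -> highest offset already stored,
    so rescanning the topic from offset 0 does not create duplicates.
    """
    for msg in messages:
        last_seen = committed_offsets.get(msg["partition"], -1)
        if msg["offset"] <= last_seen:
            continue  # already materialized by an earlier scan
        table.append(msg)
        committed_offsets[msg["partition"]] = msg["offset"]
    return committed_offsets
```

With this bookkeeping in place, messages that Kafka later drops due to retention are still available in the materialized table.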

You can check the full pipeline in this repository.

This is an experimental extension from Query.Farm.

Some Thoughts

Conclusions always feel very subjective, so I would rather write about some of my thoughts regarding streaming patterns in general and particularly around DuckDB.

The Materialized View Pattern is usually good enough. My hot take is that most use cases for analytics are covered by the Materialized View Pattern without the complexity that comes with the other patterns. I believe that DuckDB is very well suited for this pattern because, for a small OLAP system, it does incredibly well at handling large amounts of streaming inserts. In this article DuckDB was pushed to the limit and was able to handle more than one million rows inserted per second. Also note that materialized views are on DuckDB's long-term roadmap, so this pattern should become even simpler in the future.

If you are streaming to a lakehouse, you should know that DuckLake's Data Inlining feature was specifically designed to deal with high-throughput inserts while solving the small file problem. This makes DuckLake a great candidate for this pattern if you have a lakehouse-like architecture.

Streaming Engines and Streaming Databases can be hard (or expensive). At scale, Streaming Engines can be hard to manage. It is an evolving field and some work is being done to make forever-running streaming queries easier to operate. For example, Apache Fluss is being built with the aim of solving some of the shortcomings described in this post. However, it does add another layer of complexity to an already complex streaming architecture.

Streaming databases are a very elegant solution and have the potential to be very nice to use. However, if you are looking to host the solution, this will require some expertise since these systems are considerably complex (see RisingWave's architecture). This pushes practitioners to buy rather than host and maintain this complex system, which can be costly.

Whatever you choose for your architecture, make sure that the effort you put into it corresponds to your needs. And next time you think of streaming, make sure you also think about DuckDB.

Adoption Metrics and Benchmark Results for DuckDB v1.4 LTS

2025-10-09T00:00:00+00:00

#1 on ClickBench

On October 9, 2025, DuckDB's in-memory variant hit #1 on the popular ClickBench database benchmark:

This result was made possible due to several performance optimizations in DuckDB v1.4.

Update: As of October 26, the rules of ClickBench changed. The new rules prevent in-memory databases from showing “cold” (and “combined”) results. After this change, DuckDB is the #1 open-source system in hot runs, closely trailing the leader in that category, Umbra, a closed-source research prototype.

#3 Most Admired System on Stack Overflow

In Stack Overflow's 2024 Developer Survey, DuckDB was named among the top-3 most admired database systems. In the 2025 survey, it achieved position #4 (just 0.2% behind SQLite) but it made up for this by a significant increase in usage, which jumped from 1.4% in 2024 to 3.3% in 2025.

20+ Fortune-100 Companies Use DuckDB

We estimated the number of Fortune-100 companies who use DuckDB by cross-checking self-reported affiliations in the DuckDB issue tracker against a list of Fortune-100 companies.

25M+ Downloads / Month

DuckDB's Python package has almost 25 million monthly downloads on PyPI alone. This is complemented by the downloads of other popular clients such as the CLI, Go, Node.js, Java, R, Rust and so on.

TPC-H SF 100,000

DuckDB is not only fast but it is also scalable. We have recently run the queries of the TPC-H workload on the scale factor 100,000 dataset, which is equivalent to 100,000 GB of CSV files. Obviously, such a data set size requires disk-based execution.

We ran the experiment on an i8g.48xlarge EC2 instance, which has 1.5 TB of RAM and 192 CPU cores (AWS Graviton4, Arm64). This instance has 12 NVMe SSD disks, each 3750 GB in size. We created a RAID-0 array from them to have a single 45 TB partition and formatted it using XFS.

We generated the dataset with the tpchgen-cli tool, a pure Rust implementation of the TPC-H generator. We configured the generator to produce chunks of Parquet files and loaded them into DuckDB. The final DuckDB database was about 27 TB in size (as a single file!).

DuckDB completed all 22 queries of the benchmark using its larger-than-memory processing. For some queries, this required spilling about 7 terabytes of data to disk. The median query runtime was 1.19 hours and the geometric mean runtime was 1.13 hours.

We will publish a detailed write-up on this experiment in the coming weeks.

Announcing DuckDB 1.4.1 LTS

2025-10-07T00:00:00+00:00

In this blog post, we highlight a few important fixes and convenience improvements in DuckDB v1.4.1, the first patch release in DuckDB's 1.4 LTS line. You can find the complete release notes on GitHub.

To install the new version, please visit the installation page.

Iceberg Improvements

The DuckDB iceberg extension received a number of patches:

You can now attach to an Iceberg REST Catalog and specify an access delegation mode. This fixes a bug when using catalogs that did not vend credentials. The ATTACH statement will now look like this:
```
ATTACH 'warehouse_name' AS my_datalake (
    TYPE iceberg,
    ENDPOINT 'endpoint',
    ACCESS_DELEGATION_MODE 'delegation_mode_option',
    SECRET 'my_secret'
);
```
The current ACCESS_DELEGATION_MODE options are vended_credentials (default) and none.
When attaching to AWS-managed REST Catalogs, the http_timeout setting is now respected.
Attempting to rename or replace a table within a transaction now throws a clear error message.
AWS Athena can now read Iceberg tables written by DuckDB.

AWS Improvements

The aws extension received a number of changes, which makes it easier to configure and troubleshoot. See the aws documentation page for more details.

Secret Validation

Since DuckDB v1.4.0, the AWS credential_chain provider looks for any required credentials during CREATE SECRET time, failing if absent/unavailable. Since v1.4.1 this behavior can also be configured via the VALIDATION option as follows:

CREATE OR REPLACE SECRET secret (
    TYPE s3,
    PROVIDER credential_chain,
    VALIDATION 'exists'
);

Two validation modes are supported:

exists (default) requires present credentials.
none allows CREATE SECRET to succeed for credential_chains with no available credentials.

S3 Default Region

Previously, setting the S3 region incorrectly could result in difficult-to-debug situations (Unknown error for HTTP HEAD to ...).

DuckDB v1.4.1 removes us-east-1 as the default S3 region and returns a 301 error code if an incorrect region is used.

Fixes for Missing Data

Users reported two cases where DuckDB omitted some data:

The Parquet reader had a regression which caused it to omit some rows when using predicate pushdown on certain string columns.
In certain edge cases, DuckDB’s ART index could omit rows non-deterministically when running on multiple threads. Note that this index is only used when you manually specify an index with CREATE INDEX.

DuckDB v1.4.1 fixes both of these issues.

Autoloading

In DuckDB v1.4.0, the httpfs extension was not always autoloaded. For example, running:

COPY (SELECT 42 AS answer) TO 's3://my_bucket/my_file.parquet';

without loading httpfs manually returned the following error:

Cannot open file "s3://my_bucket/my_file.parquet": No such file or directory

With v1.4.1, autoloading works and DuckDB can write to the bucket right away.

Docker Image

We now officially distribute a Docker image, making it easy to run DuckDB in a containerized environment:

docker run --rm -it -v "$(pwd):/workspace" -w /workspace duckdb/duckdb

DuckDB v1.4.1 (Andium) b390a7c376
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.
D

For more details including operational considerations and using the UI, read the Docker image page.

Redesigning DuckDB's Sort, Again

2025-09-24T00:00:00+00:00

DuckDB v1.4.0 was just released, which includes a complete redesign of DuckDB's sort implementation. We redesigned DuckDB's sort just four years ago, which allowed DuckDB to sort more data than fits in main memory, in parallel, with highly efficient comparisons. This implementation served us well, but since then we've implemented larger-than-memory query processing for more operators, such as the hash join and hash aggregation, which both use a new and improved spillable page layout. We presented this layout in an earlier blog post. We decided to integrate this layout in DuckDB's sort, and completely redesigned the implementation.

Not interested in the implementation? Jump straight to the benchmark!

Two-Phase Sorting

DuckDB implements parallel query execution using Morsel-Driven Parallelism. In DuckDB's implementation of this framework, blocking operators, i.e., operators that must read the entire input before they can output, such as the hash aggregation and sort operators, have the following phases:

Sink: Thread-local accumulation of data from a pipeline
Combine: Signals a thread finishing its Sink phase
Finalize: Called once when all threads have called Combine
GetData: Output data to the next pipeline

For many decades, the preferred option to implement larger-than-memory sorting in database systems has been to generate multiple sorted runs, followed by a merge sort. Specifically, a k-way merge sort produces the lowest amount of I/O during larger-than-memory sorting, which is critical to performance. This approach maps well to Morsel-Driven Parallelism: DuckDB performs thread-local sorting in the Sink phase, followed by a parallel merge sort in the Finalize or GetData phase. Both have been redesigned for DuckDB v1.4.0. We first discuss the new thread-local sort implementation before presenting the new merge design.

Thread-Local Sorting

Sorted runs are generated thread-locally in the Sink phase. The way DuckDB parallelizes this has not changed in v1.4.0: threads generate sorted runs independently, in parallel. What has changed is the physical sorting implementation.

Key Normalization

Database systems that do not compile the required types into the query plan – e.g., DuckDB – suffer from interpretation overhead, especially when comparing tuples while sorting. One way to get around this is Key Normalization. DuckDB's sort already used an ad-hoc version of this prior to v1.4.0, but the new implementation uses the more generic create_sort_key function that is available through SQL.

This function takes any number of inputs and sort conditions, and constructs a BLOB field that produces the specified order. An example from the description of the PR that implemented create_sort_key:

SELECT
    s,
    create_sort_key(s, 'asc nulls last') AS k1,
    create_sort_key(s, 'asc nulls first') AS k2
FROM
    (VALUES ('hello'), ('world'), (NULL)) t(s);

┌─────────┬───────────────┬───────────────┐
│    s    │      k1       │      k2       │
│ varchar │     blob      │     blob      │
├─────────┼───────────────┼───────────────┤
│ hello   │ \x01ifmmp\x00 │ \x02ifmmp\x00 │
│ world   │ \x01xpsme\x00 │ \x02xpsme\x00 │
│ NULL    │ \x02          │ \x01          │
└─────────┴───────────────┴───────────────┘

Because of the binary-comparable nature of the constructed BLOB, the following queries are equivalent:

SELECT * FROM tbl
ORDER BY x DESC NULLS LAST, y ASC NULLS FIRST;

SELECT * FROM tbl
ORDER BY create_sort_key(x, 'DESC NULLS LAST', y, 'ASC NULLS FIRST');

This fixes the problem of interpretation overhead when comparing tuples, as we now only have to consider comparing BLOBs, instead of arbitrary combinations of types in an ORDER BY clause.
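To illustrate why such keys are binary-comparable, the sketch below reproduces the encoding that can be read off the example output above for ascending VARCHAR keys: a validity prefix byte, every byte of the string incremented by one, and a `\x00` terminator. This is an assumption inferred from that output, not DuckDB's actual implementation; the `sort_key` function is a hypothetical helper and is restricted to ASCII input to sidestep byte overflow.

```python
def sort_key(value, nulls_last=True):
    """Build a byte string whose lexicographic order matches
    ascending string order with the requested NULL placement."""
    # The prefix byte decides where NULLs sort relative to valid values.
    valid_prefix, null_prefix = (b"\x01", b"\x02") if nulls_last else (b"\x02", b"\x01")
    if value is None:
        return null_prefix
    raw = value.encode("ascii")
    # Shifting every byte by one frees up \x00 as a terminator, so a
    # string always sorts before any longer string it is a prefix of.
    return valid_prefix + bytes(b + 1 for b in raw) + b"\x00"


values = ["world", None, "hello", "hell"]
nulls_last_order = sorted(values, key=sort_key)
nulls_first_order = sorted(values, key=lambda v: sort_key(v, nulls_last=False))
```

Because Python compares `bytes` lexicographically, sorting by the key alone is enough; no type-specific or NULL-specific comparison logic is needed at comparison time.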

Static Integer Comparisons

It's well known that processing strings is a lot slower than processing fixed-size types such as integers. If we would always use the create_sort_key function, even for integers, we'd be leaving a lot of performance on the table. However, if we know the size of the resulting BLOB, we can convert it back to one or more unsigned integers, and use integer comparisons instead.

For example, if we have the following query:

SELECT *
FROM tbl
ORDER BY
    c0::INTEGER ASC NULLS LAST,
    c1::DOUBLE ASC NULLS LAST;

The resulting BLOB from create_sort_key(c0::INTEGER, 'ASC NULLS LAST', c1::DOUBLE, 'ASC NULLS LAST') is less than 16 bytes, so the new sorting implementation will swap the bytes (for big-endian integer comparisons) and store them in two 64-bit unsigned integers. A simplified version of the data structure we use in C++:

struct FixedSortKeyNoPayload {
    uint64_t part0;
    uint64_t part1;
};
struct FixedSortKeyPayload {
    uint64_t part0;
    uint64_t part1;
    data_ptr_t payload;
};

Which can be compared like so:

bool LessThan(const FixedSortKeyPayload &lhs, const FixedSortKeyPayload &rhs) {
    return lhs.part0 < rhs.part0 || (lhs.part0 == rhs.part0 && lhs.part1 < rhs.part1);
}

The payload field is only present if more columns are selected, i.e.:

SELECT many columns
FROM tbl
ORDER BY a few columns;

If only columns are selected that also occur in the ORDER BY clause, the payload field is not needed, as DuckDB can decode the normalized keys.
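The equivalence between comparing the normalized BLOB and comparing two unsigned integers can be checked with a small Python sketch. The 14-byte key layout used in `make_key` (a validity byte followed by a 4-byte value, then a validity byte followed by an 8-byte value) is purely hypothetical and only serves to produce fixed-size keys; `to_fixed_key` and `less_than` are likewise illustrative names.

```python
import struct


def make_key(a, b):
    """Hypothetical fixed-size normalized key (14 bytes, big-endian)."""
    return struct.pack(">BIBQ", 1, a, 1, b)


def to_fixed_key(blob):
    """Pad a key of at most 16 bytes and read it as two big-endian uint64s."""
    if len(blob) > 16:
        raise ValueError("key does not fit in two 64-bit integers")
    return struct.unpack(">QQ", blob.ljust(16, b"\x00"))


def less_than(lhs, rhs):
    """Same comparison as the C++ LessThan above, on (part0, part1) tuples."""
    return lhs[0] < rhs[0] or (lhs[0] == rhs[0] and lhs[1] < rhs[1])


keys = [make_key(a, b) for a in (0, 1, 2**32 - 1) for b in (0, 5, 2**64 - 1)]
# For equally sized keys, integer comparison agrees with byte-wise comparison.
agrees = all(
    less_than(to_fixed_key(x), to_fixed_key(y)) == (x < y)
    for x in keys
    for y in keys
)
```

Reading the bytes as big-endian integers is what preserves the byte-wise order: the most significant byte of each integer is the first byte of the key.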

Non-Contiguous Iteration

Prior to v1.4.0, DuckDB used fixed-size sort keys, but their size was only known when executing the query. This necessitates comparing and moving sort keys dynamically while sorting, which is much less efficient than statically compiled code. The C++ struct that DuckDB uses now, shown above, is known at compile time, which allows it to be sorted with sorting algorithms that implement the C++ std::iterator interface. This means that DuckDB no longer needs to implement a sorting algorithm: it can grab an off-the-shelf C++ implementation!

C++ comes with std::iterator implementations for various data structures such as std::array and std::vector. These data structures, however, require storing all data in a contiguous block of memory. DuckDB uses a default page allocation (= contiguous block of memory) size of 256 KiB. The FixedSortKeyPayload shown above is 24 bytes, so only ~10k tuples fit in a page. We want sorted runs to be much longer than that (for performance reasons that we will not get into in this blog post). To be able to generate longer sorted runs, we implemented an std::iterator that can iterate over non-contiguous blocks of memory:

While this iterator is great at sequential access, some sorting algorithms require random access. With this design, we cannot simply add an offset to a pointer to get the address of a tuple. Instead, we compute the page index and offset within the page using integer division/modulo, as the number of tuples per page is always the same (except for the last page). However, integer division/modulo is not cheap compared to the simple pointer arithmetic that can be used for contiguous blocks of memory, so we use fastmod to reduce the cost.
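The index arithmetic can be sketched as follows. This is a simplified Python model rather than DuckDB's C++ iterator: `PagedArray` and `insertion_sort` are hypothetical names, pages are plain lists, and `divmod` stands in for the division/modulo that fastmod speeds up.

```python
class PagedArray:
    """Random access over fixed-capacity pages that are not contiguous."""

    def __init__(self, tuples_per_page):
        self.tuples_per_page = tuples_per_page
        self.pages = []
        self.count = 0

    def append(self, value):
        if self.count % self.tuples_per_page == 0:
            self.pages.append([])  # allocate a new page when the last one is full
        self.pages[-1].append(value)
        self.count += 1

    def _locate(self, index):
        if not 0 <= index < self.count:
            raise IndexError(index)
        # Page index and offset within the page; every page except the
        # last one holds exactly tuples_per_page entries.
        return divmod(index, self.tuples_per_page)

    def __len__(self):
        return self.count

    def __getitem__(self, index):
        page, offset = self._locate(index)
        return self.pages[page][offset]

    def __setitem__(self, index, value):
        page, offset = self._locate(index)
        self.pages[page][offset] = value


def insertion_sort(seq):
    """Any algorithm that only needs len/get/set works across pages."""
    for i in range(1, len(seq)):
        value = seq[i]
        j = i - 1
        while j >= 0 and seq[j] > value:
            seq[j + 1] = seq[j]
            j -= 1
        seq[j + 1] = value
```

For the numbers quoted above, a 256 KiB page holds `256 * 1024 // 24 = 10922` keys of 24 bytes, which is where the figure of roughly 10k tuples per page comes from.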

Sorting Algorithm

With the components described so far, we are able to generate large sorted runs, that can be spilled to storage page-by-page, rather than in an all-or-nothing fashion. We use a combination of three sorting algorithms to achieve good sorting performance and high adaptivity to pre-sorted data:

Vergesort detects and merges runs of (almost) sorted data, which greatly reduces the effort it takes to process, e.g., time-series data, which is often stored in sorted order already. If Vergesort cannot detect any patterns, it falls back to Ska Sort, which performs an adaptive Most Significant Digit (MSD) radix sort on the first 64-bit integer of the sort key. If radix partitions in the recursion become too small, or if the data is not fully sorted after the first 64-bit integer, it falls back to Pattern-defeating quicksort.

Merging

Prior to v1.4.0, DuckDB would materialize the fully-merged data. However, with a k-way merge, it is possible to output chunks of sorted data directly from the sorted runs, in somewhat of a streaming fashion. This means that data can be output before the full merge has been computed. We visualize this for four sorted runs:

Chunk 1 can be output to the next pipeline before all sorted runs have been merged. One of the reasons that this is useful is large ORDER BY ... LIMIT ... queries. If the LIMIT is small, DuckDB uses a min-heap, which is much faster than sorting the entire input. However, for large LIMITs, the min-heap approach becomes worse than fully sorting and then applying the LIMIT. With a k-way merge, the merge can be stopped by a LIMIT at any point, meaning that the cost of fully merging the sorted runs is never incurred.
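The early-termination property can be demonstrated with Python's standard library, whose `heapq.merge` performs a lazy, sequential k-way merge. This is only an analogy for the behavior described above, not DuckDB's implementation; `counting_run` and `merge_with_limit` are hypothetical helpers.

```python
import heapq
from itertools import islice


def counting_run(values, counter):
    """Yield a sorted run while counting how many elements were read."""
    for v in values:
        counter[0] += 1
        yield v


def merge_with_limit(sorted_runs, limit):
    """Stop the k-way merge as soon as `limit` rows have been produced."""
    return list(islice(heapq.merge(*sorted_runs), limit))


consumed = [0]
runs = [
    counting_run(range(0, 1000, 3), consumed),
    counting_run(range(1, 1000, 3), consumed),
    counting_run(range(2, 1000, 3), consumed),
]
top = merge_with_limit(runs, 5)
```

After this call, `consumed[0]` is far smaller than the 1000 input values: only the heads of the runs needed to produce five output rows were ever read.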

Traditionally, the k-way merge is evaluated sequentially using a tournament tree. However, with modern multi-core CPUs, this leaves a lot of performance on the table. The question is, how can we do this in parallel?

K-Way Merge Path

Various algorithms to parallelize merge sort exist, such as Merge Path, which DuckDB's sort used prior to v1.4.0, and Bitonic Merge Sort. However, these algorithms parallelize a cascading two-way merge sort, not a k-way merge sort. So, while these algorithms are parallel and skew-resistant, they are unattractive for larger-than-memory sorting, as they produce much more I/O.

For k-way merging, fewer options for parallelization exist. The work can be divided using value-based splitting. However, it is easy to see that parallelism breaks down when the input distribution is extremely skewed, e.g., when half of the input has the same value, as there is no splitting value that can divide the work into evenly-sized tasks. After searching the web, the only skew-resistant parallel k-way merge that we could find is a bachelor thesis from 2014. We wanted a very fine-grained approach, so instead, we generalized Merge Path to k sorted runs.

In the previous figure, there is a horizontal line in each sorted run that indicates how much of each sorted run went into the output chunk. The general idea of Merge Path, as explained in our blog post on sorting four years ago, is to compute where these lines are, i.e., where the sorted runs intersect. Merge Path does this efficiently for merging two sorted runs using binary search.

We generalize this approach to k sorted runs, which allows us to choose an arbitrary output chunk size, and compute where the sorted runs intersect such that when they are merged, the resulting chunk will be of the chosen size. This allows for very fine-grained skew-resistant parallelism, which is not possible when choosing specific splitting values, as the size of the chunks that this produces depends on the data distribution. This is the pseudo-code for k-way merge path:

def compute_intersections(sorted_runs, chunk_size):
    intersections = [0 for _ in range(len(sorted_runs))]
    while chunk_size != 0:
        delta = ceil(chunk_size / len(sorted_runs))
        min_idx = 0
        min_val = sorted_runs[0][intersections[0] + delta]
        for run_idx in range(1, len(sorted_runs)):
            val = sorted_runs[run_idx][intersections[run_idx] + delta]
            if val < min_val:
                min_idx = run_idx
                min_val = val
        intersections[min_idx] += delta
        chunk_size -= delta
    return intersections

This has been greatly simplified, as this does not take into account any edge cases or going out-of-bounds on the sorted runs. The general idea is that we move up the lower bound for the intersection of one sorted run in each iteration of the while loop. This has a worse complexity than the binary search used in the original Merge Path, but it also has to be called fewer times because a k-way merge can merge all sorted runs in a single pass, rather than in many passes. Profiling shows that this computation takes up just 1-2% of the overall execution time.
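For readers who want to experiment, here is one way the pseudo-code could be turned into runnable Python with bounds handling. It is a sketch under our own assumptions, not the code that ships in DuckDB: exhausted runs are skipped, the probe is clamped to the end of each run, and the probe looks at the last of the `step` candidate elements (offset `step - 1`), so that every element moved below the boundary is guaranteed to belong to the output chunk.

```python
from math import ceil


def compute_intersections(sorted_runs, chunk_size):
    """For each run, return how many leading elements fall into the
    first `chunk_size` elements of the merged output."""
    total = sum(len(run) for run in sorted_runs)
    remaining = min(chunk_size, total)
    intersections = [0] * len(sorted_runs)
    while remaining > 0:
        # Only runs with unconsumed elements can still contribute.
        active = [i for i, run in enumerate(sorted_runs) if intersections[i] < len(run)]
        delta = ceil(remaining / len(active))
        best_idx, best_val, best_step = None, None, 0
        for i in active:
            # Clamp the probe so it never reads past the end of a run.
            step = min(delta, len(sorted_runs[i]) - intersections[i])
            val = sorted_runs[i][intersections[i] + step - 1]
            if best_idx is None or val < best_val:
                best_idx, best_val, best_step = i, val, step
        # The run with the smallest probe value can safely advance.
        intersections[best_idx] += best_step
        remaining -= best_step
    return intersections


runs = [[1, 4, 7, 10], [2, 5, 8], [3, 6, 9, 12, 15]]
cuts = compute_intersections(runs, 5)
```

For these three runs and a chunk size of 5, the result is `[2, 2, 1]`: the elements 1 and 4, 2 and 5, and 3 form the first five values of the merged output.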

Threads can compute the intersections independently, and, therefore, in parallel. Once threads have computed the intersections, they are free to merge the data between the intersections, as the data is guaranteed not to overlap with that of other threads. The merged chunks can immediately be output in parallel due to DuckDB's order-preserving parallelism.

Benchmark

So, how does the new sorting implementation perform compared to the old one? We run a few experiments on my laptop (M1 Max MacBook Pro with 10 threads and 64 GB of memory).

Raw Performance

We first benchmark raw integer sorting performance. We have three types of inputs (pre-sorted ascending, pre-sorted descending, and randomly ordered), at three different sizes (10, 100, and 1000 million rows). We've generated the data using the following queries:

CREATE TABLE ascending10m AS
    SELECT range AS i FROM range(10_000_000);

CREATE TABLE descending10m AS
    SELECT range AS i FROM range(9_999_999, 0, -1);

CREATE TABLE random10m AS
    SELECT range AS i FROM range(10_000_000) ORDER BY random();

-- and so on for 100m and 1000m

We took the median of 5 runs of each of these queries, for each table size:

SELECT any_value(i)
FROM (FROM ascending10m ORDER BY i);

SELECT any_value(i)
FROM (FROM descending10m ORDER BY i);

SELECT any_value(i)
FROM (FROM random10m ORDER BY i);

-- and so on for 100m and 1000m

This query causes DuckDB to evaluate the entire sort, without materializing the whole table as a query result. This allows us to better isolate the performance of the sorting implementation.

Results

Table	Rows [Millions]	Old [s]	New [s]	Speedup vs. Old [x]
Ascending	10	0.110	0.033	3.333
Ascending	100	0.912	0.181	5.038
Ascending	1000	15.302	1.475	10.374
Descending	10	0.121	0.034	3.558
Descending	100	0.908	0.207	4.386
Descending	1000	15.789	1.712	9.222
Random	10	0.120	0.094	1.276
Random	100	1.028	0.587	1.751
Random	1000	17.554	6.493	2.703

This shows that the new implementation is highly adaptive to pre-sorted data: at 1000 million rows, it is roughly 10× faster at ascending/descending data than the old implementation. It also has much better raw sorting performance: it is more than 2× faster at sorting randomly ordered data (at 1000 million rows).

We also plot the results on a log-log scale:

Here, we can see that the new implementation scales much better: its execution time increases less steeply with input size than that of the old implementation.

Wide Table

The first benchmark evaluated raw sorting performance. In this next benchmark, we sort a wide table, i.e., we select many columns to be sorted by the ORDER BY clause. We sort the lineitem table from TPC-H which has 15 columns, by the l_shipdate column, at scale factors 1 (~6 million rows), 10 (~60 million rows) and 100 (~600 million rows), generated using DuckDB's TPC-H extension.

We took the median execution time of 5 runs of this query for each scale factor:

SELECT any_value(COLUMNS(*))
FROM (FROM lineitem ORDER BY l_shipdate);

Results

Table	SF	Old [s]	New [s]	Speedup vs. Old [x]
TPC-H SF 1 `lineitem` by `l_shipdate`	1	0.328	0.189	1.735
TPC-H SF 10 `lineitem` by `l_shipdate`	10	3.353	1.520	2.205
TPC-H SF 100 `lineitem` by `l_shipdate`	100	273.982	80.919	3.385

We have set the memory limit to 30 GB, so the data no longer fits in memory at scale factor 100. The new implementation is roughly 2× faster at scale factors 1 and 10, and more than 3× faster at scale factor 100. This shows that the new k-way merge sort reduces data movement and I/O, and is much more efficient at sorting wide tables than the old cascaded 2-way merge sort.

Again, we plot the results on a log-log scale:

And we can see that the new implementation scales much better, especially when the data no longer fits in main memory.

Thread Scaling

Finally, we benchmark how well the sorting implementation scales with threads. We sort the table with 100 million randomly ordered integers from before, with 1, 2, 4, and 8 threads. We use the same data and query as in the first benchmark, and take the median of five runs.

Results

Threads	Old [s]	New [s]	Old Speedup vs. 1 Thread [x]	New Speedup vs. 1 Thread [x]
1	3.240	4.234	1.000	1.000
2	2.121	2.193	1.527	1.930
4	1.401	1.216	2.312	3.481
8	0.920	0.654	3.521	6.474

As we can see, the new, single-threaded sorting performance is around 30% slower than the old one. This is due to the new sorting implementation using an in-place MSD radix sort, rather than an out-of-place Least Significant Digit (LSD) radix sort. This makes the old implementation perform better specifically on this workload, at the cost of using much more memory.

However, if we increase the number of threads to 2, this advantage is already gone. At 8 threads, the old implementation has a speedup of only ~3.5× over 1 thread, while this speedup is ~6.5× for the new implementation.

Again, we plot the results on a log-log scale:

This shows that the new implementation's parallel scaling is much better than that of the old implementation.

Conclusion

DuckDB's new sorting implementation has greatly improved performance over the old sorting implementation. It is highly adaptive to pre-sorted data, performs less I/O when sorting data that does not fit in main memory, and scales much better with additional threads.

If you've upgraded to v1.4.0, you can enjoy the improved performance when using the ORDER BY clause. The new sorting implementation has already been integrated into the window operator, so we expect to see performance improvements when using the OVER clause as well. For v1.5.0, we aim to integrate the new sorting implementation into the joins that use sorting such as the ASOF join.

x	r	g	b
0	4	5	1
1	4	5	1
2	5	6	2
3	8	9	4
4	9	10	5
5	11	12	8
6	11	12	8
7	11	12	8
8	9	10	5
9	9	10	5