warc

Parse WARC (Web ARChive) records for Common Crawl data processing

Maintainer(s): onnimonni

Installing and Loading

INSTALL warc FROM community;
LOAD warc;

Example

-- Parse a WARC record from a gzip-compressed file
SELECT parse_warc(content) FROM read_blob('record.warc.gz');
┌─────────────────────────────────────────────────────────────────────────────┐
│                             parse_warc(content)                             │
│ struct(warc_version varchar, warc_headers varchar, http_version varchar,    │
│        http_status integer, http_headers varchar, http_body blob)           │
├─────────────────────────────────────────────────────────────────────────────┤
│ {'warc_version': '1.0', 'warc_headers': '{"WARC-Type": "response", ...}',   │
│  'http_version': 'HTTP/1.1', 'http_status': 200,                            │
│  'http_headers': '{"content-type": "text/html", ...}',                      │
│  'http_body': <!doctype html>...}                                           │
└─────────────────────────────────────────────────────────────────────────────┘

-- Extract specific fields
SELECT
    (parse_warc(content)).http_status,
    (parse_warc(content)).http_body
FROM read_blob('record.warc.gz');

About warc

The WARC extension parses WARC (Web ARChive) records, the standard format used by Common Crawl and web archiving tools. It enables efficient processing of web archive data directly in DuckDB.

Function

parse_warc(data)

Parse a WARC record and return a struct with all components.

Parameters:

  • data (BLOB or VARCHAR): WARC record data (auto-detects gzip compression)

Returns: STRUCT with fields:

  • warc_version (VARCHAR): WARC format version (e.g., "1.0")
  • warc_headers (VARCHAR): JSON object of WARC headers
  • http_version (VARCHAR): HTTP version (e.g., "HTTP/1.1")
  • http_status (INTEGER): HTTP status code (e.g., 200)
  • http_headers (VARCHAR): JSON object of HTTP headers (lowercase keys)
  • http_body (BLOB): Response body content
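
Since warc_headers and http_headers come back as JSON strings, individual values can be pulled out with DuckDB's JSON functions. A minimal sketch, assuming the json extension is available and a local record.warc.gz file; the header keys shown are standard WARC/HTTP names, not guaranteed to be present in every record:

```sql
-- Sketch: extract individual headers from the JSON strings
WITH rec AS (
    SELECT parse_warc(content) AS r FROM read_blob('record.warc.gz')
)
SELECT
    r.http_status,
    json_extract_string(r.warc_headers, 'WARC-Target-URI') AS url,
    json_extract_string(r.http_headers, 'content-type')    AS content_type
FROM rec;
```

Wrapping parse_warc in a subquery (or CTE) also avoids re-parsing the record once per selected field.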

Common Crawl Workflow

The recommended workflow for processing Common Crawl data:

  1. Query the columnar index (Parquet) to find records of interest
  2. Fetch only the specific byte ranges you need using HTTP Range requests
  3. Parse the fetched records with this extension

-- Example: Parse a downloaded Common Crawl record
-- First download: curl -r "46376769-46377713" "https://data.commoncrawl.org/crawl-data/..." > record.warc.gz
SELECT
    (parse_warc(content)).http_status,
    decode((parse_warc(content)).http_body) as html
FROM read_blob('record.warc.gz');
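
Step 1 can itself be done in DuckDB. A sketch, assuming the httpfs extension; the crawl partition path is illustrative, and the column names (url, warc_filename, warc_record_offset, warc_record_length) follow Common Crawl's columnar index schema:

```sql
-- Sketch: find byte ranges in the Common Crawl columnar index
INSTALL httpfs; LOAD httpfs;
SELECT
    url,
    warc_filename,
    warc_record_offset                          AS range_start,
    warc_record_offset + warc_record_length - 1 AS range_end
FROM read_parquet('s3://commoncrawl/cc-index/table/cc-main/warc/crawl=CC-MAIN-2024-33/subset=warc/*.parquet')
WHERE url LIKE 'https://example.org/%'
LIMIT 10;
```

The resulting range_start/range_end pairs are what the curl -r byte-range download in step 2 consumes.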

Features

  • Auto-detects gzip compression
  • Handles binary content (skips body for non-text responses)
  • HTTP header keys are lowercased for consistent access
  • Works with both BLOB and VARCHAR input types
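
Because header keys are lowercased, filtering on HTTP headers needs no case juggling. A sketch over a glob of local records (assumes the json extension and a records/ directory of .warc.gz files):

```sql
-- Sketch: keep only successful HTML responses from a set of local records
SELECT
    filename,
    decode(r.http_body) AS html
FROM (
    SELECT filename, parse_warc(content) AS r
    FROM read_blob('records/*.warc.gz')
)
WHERE r.http_status = 200
  AND json_extract_string(r.http_headers, 'content-type') LIKE 'text/html%';
```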

Added Functions

function_name  function_type  description  comment  examples
parse_warc     scalar         NULL         NULL