crawler

SQL-native web crawler with HTML extraction and MERGE support

Maintainer(s): onnimonni

Installing and Loading

INSTALL crawler FROM community;
LOAD crawler;

Example

SELECT url, jq(html.document, 'h1').text as title
FROM crawl(['https://example.com']);

About crawler

The crawler extension provides SQL-native web crawling capabilities for DuckDB.

Features:

  • crawl() table function with automatic rate limiting and robots.txt compliance
  • crawl_url() for LATERAL joins (combined with sitemap() in the sketch after this list)
  • sitemap() for XML sitemap parsing
  • jq() and htmlpath() functions for CSS selector-based extraction
  • html.readability for article extraction
  • html.schema for JSON-LD/microdata parsing
  • CRAWLING MERGE INTO syntax for upsert operations
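
crawl_url() and sitemap() compose naturally in a single query. The sketch below is speculative: the single-URL argument to crawl_url() and the url output column of sitemap() are assumptions, not documented behavior.

-- Speculative sketch: sitemap() column name (url) and the
-- crawl_url(url) signature are assumptions.
SELECT c.url, jq(c.html.document, 'h1').text as title
FROM sitemap('https://example.com/sitemap.xml') AS s,
     LATERAL crawl_url(s.url) AS c;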

Example with extraction:

SELECT
    url,
    jq(html.document, '.price', 'data-amount') as price,
    html.readability.title as article_title
FROM crawl(['https://example.com/products']);
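
The html.readability and html.schema fields use the same dot-access pattern as html.document. A minimal sketch, assuming html.schema is exposed as a single value per page (its exact shape is not documented here):

SELECT
    url,
    -- Assumption: html.schema holds the parsed JSON-LD/microdata
    html.schema as structured_data
FROM crawl(['https://example.com/products']);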

Example with MERGE:

CRAWLING MERGE INTO pages
USING crawl(['https://example.com']) AS src
ON (src.url = pages.url)
WHEN MATCHED THEN UPDATE BY NAME
WHEN NOT MATCHED THEN INSERT BY NAME;
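
UPDATE BY NAME and INSERT BY NAME match source and target columns by name, so the pages table must already exist with columns matching the crawl() output. One way to bootstrap it, assuming you want to keep every column crawl() returns:

-- Sketch: seed the target table from a first crawl so that
-- subsequent BY NAME merges find matching column names.
CREATE TABLE pages AS
SELECT * FROM crawl(['https://example.com']);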

For full documentation, see https://github.com/midwork-finds-jobs/duckdb-crawler

Added Functions

function_name          function_type  description  comment  examples
crawl                  table          NULL         NULL
crawl_stream           table          NULL         NULL
crawl_url              table          NULL         NULL
css_select             scalar         NULL         NULL
discover               scalar         NULL         NULL
htmlpath               scalar         NULL         NULL
jq                     scalar         NULL         NULL
sitemap                table          NULL         NULL
stream_merge_internal  table          NULL         NULL

Added Settings

name                        description                                           input_type  scope   aliases
crawler_default_delay       Default crawl delay in seconds if not in robots.txt  DOUBLE      GLOBAL  []
crawler_max_response_bytes  Maximum response body size in bytes (0 = unlimited)  BIGINT      GLOBAL  []
crawler_respect_robots      Whether to respect robots.txt directives             BOOLEAN     GLOBAL  []
crawler_timeout_ms          HTTP request timeout in milliseconds                 BIGINT      GLOBAL  []
crawler_user_agent          User agent string for crawler HTTP requests          VARCHAR     GLOBAL  []
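
These settings are changed with DuckDB's standard SET statement. The values below are illustrative, not the extension's defaults:

SET crawler_user_agent = 'my-crawler/1.0';
SET crawler_default_delay = 1.0;
SET crawler_timeout_ms = 10000;
SET crawler_respect_robots = true;
SET crawler_max_response_bytes = 0;  -- 0 = unlimited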