crawler

Search Shortcut cmd + k | ctrl + k

Documentation

crawler

Downloads 391this week

GitHub stars 6

Extension repository on GitHub

Extension descriptor (YAML)

SQL-native web crawler with HTML extraction and MERGE support

Maintainer(s): onnimonni

Installing and Loading

INSTALL crawler FROM community;
LOAD crawler;

Example

SELECT url, jq(html.document, 'h1').text as title
FROM crawl(['https://example.com']);

About crawler

The crawler extension provides SQL-native web crawling capabilities for DuckDB.

Features:

crawl() table function with automatic rate limiting and robots.txt compliance
crawl_url() for LATERAL joins
sitemap() for XML sitemap parsing
read_html() for Google Sheets-style IMPORTHTML (tables, lists, JS variables)
jq() and htmlpath() functions for CSS selector-based extraction
html.readability for article extraction
html.schema for JSON-LD/microdata parsing
CRAWLING MERGE INTO syntax for upsert operations

Example with read_html (like Google Sheets =IMPORTHTML):

SELECT * FROM read_html('https://en.wikipedia.org/wiki/...', 'table.wikitable', 1);
SELECT * FROM read_html('https://example.com/page', 'js=jobs');

Example with extraction:

SELECT
    url,
    jq(html.document, '.price', 'data-amount') as price,
    html.readability.title as article_title
FROM crawl(['https://example.com/products']);

Example with MERGE:

CRAWLING MERGE INTO pages
USING crawl(['https://example.com']) AS src
ON (src.url = pages.url)
WHEN MATCHED THEN UPDATE BY NAME
WHEN NOT MATCHED THEN INSERT BY NAME;

For full documentation see: https://github.com/midwork-finds-jobs/duckdb-crawler

Added Functions

function_name	function_type	description	comment
crawl	table	NULL	NULL
crawl_stream	table	NULL	NULL
crawl_url	table	NULL	NULL
css_select	scalar	NULL	NULL
discover	scalar	NULL	NULL
htmlpath	scalar	NULL	NULL
jq	scalar	NULL	NULL
read_html	table	NULL	NULL
sitemap	table	NULL	NULL
stream_merge_internal	table	NULL	NULL

Overloaded Functions

| function_name | function_type | description | comment | examples | |—————|—————|————-|———|———-|

Added Types

| type_name | type_size | logical_type | type_category | internal | |———–|———-:|————–|—————|———-|

Added Settings

name	description	input_type	scope	aliases
crawler_default_delay	Default crawl delay in seconds if not in robots.txt	DOUBLE	GLOBAL	[]
crawler_max_response_bytes	Maximum response body size in bytes (0 = unlimited)	BIGINT	GLOBAL	[]
crawler_respect_robots	Whether to respect robots.txt directives	BOOLEAN	GLOBAL	[]
crawler_timeout_ms	HTTP request timeout in milliseconds	BIGINT	GLOBAL	[]
crawler_user_agent	User agent string for crawler HTTP requests	VARCHAR	GLOBAL	[]