SQL-native web crawler with HTML extraction and MERGE support
Maintainer(s):
onnimonni
Installing and Loading
INSTALL crawler FROM community;
LOAD crawler;
Example
SELECT url, jq(html.document, 'h1').text as title
FROM crawl(['https://example.com']);
About crawler
The crawler extension provides SQL-native web crawling capabilities for DuckDB.
Features:
crawl()table function with automatic rate limiting and robots.txt compliancecrawl_url()for LATERAL joinssitemap()for XML sitemap parsingjq()andhtmlpath()functions for CSS selector-based extractionhtml.readabilityfor article extractionhtml.schemafor JSON-LD/microdata parsingCRAWLING MERGE INTOsyntax for upsert operations
Example with extraction:
SELECT
url,
jq(html.document, '.price', 'data-amount') as price,
html.readability.title as article_title
FROM crawl(['https://example.com/products']);
Example with MERGE:
CRAWLING MERGE INTO pages
USING crawl(['https://example.com']) AS src
ON (src.url = pages.url)
WHEN MATCHED THEN UPDATE BY NAME
WHEN NOT MATCHED THEN INSERT BY NAME;
For full documentation see: https://github.com/midwork-finds-jobs/duckdb-crawler
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| crawl | table | NULL | NULL | |
| crawl_stream | table | NULL | NULL | |
| crawl_url | table | NULL | NULL | |
| css_select | scalar | NULL | NULL | |
| discover | scalar | NULL | NULL | |
| htmlpath | scalar | NULL | NULL | |
| jq | scalar | NULL | NULL | |
| sitemap | table | NULL | NULL | |
| stream_merge_internal | table | NULL | NULL |
Added Settings
| name | description | input_type | scope | aliases |
|---|---|---|---|---|
| crawler_default_delay | Default crawl delay in seconds if not in robots.txt | DOUBLE | GLOBAL | [] |
| crawler_max_response_bytes | Maximum response body size in bytes (0 = unlimited) | BIGINT | GLOBAL | [] |
| crawler_respect_robots | Whether to respect robots.txt directives | BOOLEAN | GLOBAL | [] |
| crawler_timeout_ms | HTTP request timeout in milliseconds | BIGINT | GLOBAL | [] |
| crawler_user_agent | User agent string for crawler HTTP requests | VARCHAR | GLOBAL | [] |