Read High Throughput Sequencing Files Formats in DuckDB via htslib
Installing and Loading
INSTALL duckhts FROM community;
LOAD duckhts;
Example
-- Load the extension
LOAD duckhts;
-- Read a VCF/BCF file (tidy FORMAT columns)
SELECT CHROM, POS, REF, ALT, SAMPLE_ID
FROM read_bcf('test/data/formatcols.vcf.gz', tidy_format := true)
LIMIT 5;
-- Read a BAM/SAM file
SELECT QNAME, RNAME, POS, READ_GROUP_ID, SAMPLE_ID
FROM read_bam('test/data/rg.sam.gz')
LIMIT 5;
About duckhts
DuckHTS provides table functions for common high-throughput sequencing (HTS) formats using htslib. Query VCF/BCF/BAM/CRAM/FASTA/FASTQ/GTF/GFF and tabix-indexed files directly in SQL.
Functions include:
- read_bcf(path, [region, tidy_format])
- read_bam(path, [region, reference, standard_tags, auxiliary_tags])
- read_fasta(path, [region, index_path])
- fasta_index(path, [index_path])
- read_fastq(path, [mate_path, interleaved])
- read_gff(path, [region, attributes_map])
- read_gtf(path, [region, attributes_map])
- read_tabix(path, [region, header, header_names, auto_detect, column_types])
- hts_header(path, [format, mode])
- hts_index(path, [format, index_path])
- hts_index_spans(path, [format, index_path])
- hts_index_raw(path, [format, index_path])
Paired FASTQ is supported via mate_path or interleaved := true. CRAM is supported with an explicit reference file. For GTF/GFF, attributes can be returned as a parsed MAP using attributes_map := true. Optional SAMtags columns and an auxiliary tag map are available via standard_tags and auxiliary_tags. Tabix files can use header/header_names and type inference via auto_detect or explicit column_types.
MSVC builds (windows_amd64/windows_arm64) are not supported. MinGW/RTools is supported on Windows.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| fasta_index | table | NULL | NULL | |
| read_bam | table | NULL | NULL | |
| read_bcf | table | NULL | NULL | |
| read_fasta | table | NULL | NULL | |
| read_fastq | table | NULL | NULL | |
| read_gff | table | NULL | NULL | |
| read_gtf | table | NULL | NULL | |
| read_hts_header | table | NULL | NULL | |
| read_hts_index | table | NULL | NULL | |
| read_hts_index_raw | table_macro | NULL | NULL | |
| read_hts_index_spans | table_macro | NULL | NULL | |
| read_tabix | table | NULL | NULL |
Overloaded Functions
| function_name | function_type | description | comment | examples | |—————|—————|————-|———|———-|
Added Types
| type_name | type_size | logical_type | type_category | internal | |———–|———-:|————–|—————|———-|
Added Settings
| name | description | input_type | scope | aliases | |——|————-|————|——-|———|