Search Shortcut cmd + k | ctrl + k
duckhts

Read High Throughput Sequencing Files Formats in DuckDB via htslib

Maintainer(s): Sounkou Mahamane Toure

Installing and Loading

INSTALL duckhts FROM community;
LOAD duckhts;

Example

-- Load the extension
LOAD duckhts;

-- Read a VCF/BCF file (tidy FORMAT columns)
SELECT CHROM, POS, REF, ALT, SAMPLE_ID
FROM read_bcf('test/data/formatcols.vcf.gz', tidy_format := true)
LIMIT 5;

-- Read a BAM/SAM file
SELECT QNAME, RNAME, POS, READ_GROUP_ID, SAMPLE_ID
FROM read_bam('test/data/rg.sam.gz')
LIMIT 5;

About duckhts

DuckHTS provides table functions for common high-throughput sequencing (HTS) formats using htslib. Query VCF/BCF/BAM/CRAM/FASTA/FASTQ/GTF/GFF and tabix-indexed files directly in SQL.

Functions include:

  • read_bcf(path, [region, tidy_format])
  • read_bam(path, [region, reference, standard_tags, auxiliary_tags])
  • read_fasta(path, [region, index_path])
  • fasta_index(path, [index_path])
  • read_fastq(path, [mate_path, interleaved])
  • read_gff(path, [region, attributes_map])
  • read_gtf(path, [region, attributes_map])
  • read_tabix(path, [region, header, header_names, auto_detect, column_types])
  • hts_header(path, [format, mode])
  • hts_index(path, [format, index_path])
  • hts_index_spans(path, [format, index_path])
  • hts_index_raw(path, [format, index_path])

Paired FASTQ is supported via mate_path or interleaved := true. CRAM is supported with an explicit reference file. For GTF/GFF, attributes can be returned as a parsed MAP using attributes_map := true. Optional SAMtags columns and an auxiliary tag map are available via standard_tags and auxiliary_tags. Tabix files can use header/header_names and type inference via auto_detect or explicit column_types.

MSVC builds (windows_amd64/windows_arm64) are not supported. MinGW/RTools is supported on Windows.

Added Functions

function_name function_type description comment examples
fasta_index table NULL NULL  
read_bam table NULL NULL  
read_bcf table NULL NULL  
read_fasta table NULL NULL  
read_fastq table NULL NULL  
read_gff table NULL NULL  
read_gtf table NULL NULL  
read_hts_header table NULL NULL  
read_hts_index table NULL NULL  
read_hts_index_raw table_macro NULL NULL  
read_hts_index_spans table_macro NULL NULL  
read_tabix table NULL NULL  

Overloaded Functions

| function_name | function_type | description | comment | examples | |—————|—————|————-|———|———-|

Added Types

| type_name | type_size | logical_type | type_category | internal | |———–|———-:|————–|—————|———-|

Added Settings

| name | description | input_type | scope | aliases | |——|————-|————|——-|———|