Search Shortcut cmd + k | ctrl + k
duckhts

DuckDB extension for reading HTS file formats via htslib

Maintainer(s): sounkou-bioinfo

Installing and Loading

INSTALL duckhts FROM community;
LOAD duckhts;

Example

-- Load the extension
LOAD duckhts;

-- Read a VCF/BCF file (tidy FORMAT columns)
SELECT CHROM, POS, REF, ALT, SAMPLE_ID
FROM read_bcf('test/data/formatcols.vcf.gz', tidy_format := true)
LIMIT 5;

-- Read a BAM/SAM file
SELECT QNAME, RNAME, POS, READ_GROUP_ID, SAMPLE_ID
FROM read_bam('test/data/rg.sam.gz')
LIMIT 5;

About duckhts

DuckHTS provides DuckDB table functions for common high-throughput sequencing (HTS) formats using htslib. Query VCF, BCF, BAM, CRAM, FASTA, FASTQ, GTF, GFF, and tabix-indexed files directly in SQL.

The extension also includes sequence utility UDFs, SAM flag predicate helpers, and HTS metadata helpers.

Functions included in this extension:

Diagnostics

  • duckhts_simd_backend(): Return the current DuckHTS SIMD dispatch label. For explicit scalar or concrete backend requests this is the requested policy; for auto it is the single selected backend when all logical kernels resolve to the same backend, or mixed when per-kernel auto-dispatch resolves to multiple backends. Use duckhts_simd_kernel_info() for per-kernel details.
  • duckhts_simd_requested_backend(): Return the current explicit SIMD backend request, usually auto unless SELECT backend FROM duckhts_simd_set_backend('auto'|'scalar'|backend) was called. The selected per-kernel backend may differ under auto-dispatch across x86, ARM, wasm, and scalar-only builds.
  • duckhts_simd_backend_compiled(backend): Return whether a concrete DuckHTS SIMD backend was compiled into this build. This is independent of whether the current CPU/runtime supports executing that backend; for example avx512 can be compiled but not CPU-supported on the running host.
  • duckhts_simd_backend_cpu_supported(backend): Return whether the current CPU/runtime supports a concrete DuckHTS SIMD backend, independent of whether DuckHTS compiled an implementation for it. Availability is the intersection of compiled and CPU-supported.
  • duckhts_simd_backend_available(backend): Return whether a concrete SIMD backend is usable in the current process. Availability means the backend is compiled into DuckHTS and supported by the current CPU/runtime. auto is a selection request rather than a concrete backend and is not reported as available here.
  • duckhts_simd_info(): Return one row per known concrete DuckHTS SIMD backend with extension-owned selectable, compiled, CPU-supported, available, selected, requested, and dispatch-mode diagnostics. Availability is the intersection of compiled and CPU/runtime-supported. selectable reports whether the backend has a selectable implementation path; explicit selection still requires available = TRUE. selected is TRUE when the current dispatch table uses that backend for at least one logical kernel. auto is a selection request and is not a concrete backend row.
  • duckhts_simd_kernel_info(): Return one row per logical DuckHTS SIMD kernel showing the concrete backend selected by the current immutable dispatch table, the selected capability, the requested backend policy, whether scalar was used as a per-kernel fallback, and the dispatch mode. This is the authoritative diagnostic for mixed auto-dispatch when different kernels resolve to different backends.
  • duckhts_simd_set_backend(backend): Explicitly select the DuckHTS SIMD dispatch policy for this process using a one-row table-function call and return the current dispatch label in a backend column. Use auto for per-kernel runtime dispatch or scalar for a portable baseline; unavailable platform-specific requests such as avx512 on non-AVX-512 CPUs raise an error instead of silently falling back.

Readers

  • read_bcf(path, region := NULL, index_path := NULL, tidy_format := FALSE, additional_csq_column_types := NULL, decompression_threads := 0): Read VCF and BCF variant data with typed INFO, FORMAT, typed CSQ/ANN/BCSQ subfields, optional tidy sample output, optional bcftools-style CSQ type overrides, and optional htslib decompression worker threads via decompression_threads (default 0 for single-threaded reads).
  • read_bam(path, standard_tags := FALSE, auxiliary_tags := FALSE, region := NULL, index_path := NULL, reference := NULL, sequence_encoding := 'string', quality_representation := 'string', cigar_representation := 'string', decompression_threads := 2): Read SAM, BAM, and CRAM alignments with optional typed SAMtags and auxiliary tag maps. Use sequence_encoding := 'nt16' to return SEQ as UTINYINT[], quality_representation := 'phred' to return QUAL as UTINYINT[], and cigar_representation := 'binary' to return packed BAM CIGAR operations as UINTEGER[] instead of SAM text. decompression_threads controls per-file htslib worker threads and defaults to 2; use 0 to disable worker threads.
  • read_fasta(path, region := NULL, index_path := NULL, gzi_path := NULL, sequence_encoding := 'string'): Read FASTA records or indexed FASTA regions as sequence rows. Use sequence_encoding := 'nt16' to return SEQUENCE as UTINYINT[] (htslib nt16 4-bit codes) instead of VARCHAR. For bgzipped FASTA, gzi_path may point to an explicit .gzi sidecar when it is not colocated with the FASTA.
  • read_bed(path, region := NULL, index_path := NULL): Read BED3-BED12 interval files with canonical typed columns and optional tabix-backed region filtering.
  • fasta_nuc(path, bed_path := NULL, bin_width := NULL, region := NULL, index_path := NULL, gzi_path := NULL, bed_index_path := NULL, include_seq := FALSE): Compute bedtools nuc-style nucleotide composition for supplied BED intervals or generated fixed-width bins over a FASTA reference. For bgzipped FASTA, gzi_path may point to an explicit .gzi sidecar when it is not colocated with the FASTA.
  • read_fastq(path, interleaved := FALSE, mate_path := NULL, sequence_encoding := 'string', quality_representation := 'string', input_quality_encoding := 'phred33'): Read single-end, paired-end, or interleaved FASTQ files with optional legacy quality decoding. By default, FASTQ qualities are interpreted as modern Phred+33 input. Use sequence_encoding := 'nt16' to return SEQUENCE as UTINYINT[] and quality_representation := 'phred' to return QUALITY as UTINYINT[] instead of VARCHAR. input_quality_encoding accepts 'phred33', 'auto', 'phred64', or 'solexa64'.
  • read_gff(path, header_names := NULL, header := FALSE, column_types := NULL, auto_detect := FALSE, attributes_map := FALSE, attributes_list := FALSE, attributes_pairs := FALSE, strict := FALSE, region := NULL, index_path := NULL): Read GFF annotations with optional raw scalar and richer list/pair parsed attribute columns, strict GFF3 structural validation, and indexed region filtering.
  • read_gtf(path, header_names := NULL, header := FALSE, column_types := NULL, auto_detect := FALSE, attributes_map := FALSE, attributes_list := FALSE, attributes_pairs := FALSE, region := NULL, index_path := NULL): Read GTF annotations with optional raw scalar and richer list/pair parsed attribute columns and indexed region filtering.
  • read_tabix(path, header_names := NULL, header := FALSE, column_types := NULL, auto_detect := FALSE, region := NULL, index_path := NULL): Read generic tabix-indexed text data with optional header handling and type inference.
  • fasta_index(path, index_path := NULL): Build a FASTA index (.fai) and return a single row with columns success (BOOLEAN) and index_path (VARCHAR).
  • hts_union_query(reader, pattern, params := ''): Generate a UNION ALL BY NAME query string that reads every file matching a glob pattern through the named reader function. The result includes a 'filename' column identifying the source file for each row. Assign to a variable with SET VARIABLE and execute via query(getvariable(…)). Optional params string is appended to each reader call. In R, use the typed rduckhts_*_multi() helpers instead, which accept file vectors with optional per-file parameters and create DuckDB tables directly.

Coverage

  • read_pileup(path, region := NULL, index_path := NULL, min_mapq := 0, flag_mask := 1796): Construct a region-scoped BAM pileup with one row per covered position, emitting chrom, 1-based position, depth, observed bases, and Phred+33 qualities after SAM flag and MAPQ filtering. This is a compact htslib pileup view, not samtools mpileup text parity.
  • bam_bin_counts(path, bin_width, chrom := NULL, reference := NULL, index_path := NULL, mapq := 0, require_flags := 0, exclude_flags := 0, rmdup := 'none', stats := NULL): Count BAM or CRAM read starts into fixed-width bins. Returns one row per bin across the selected contig span, including zero-count bins, with total, forward, and reverse counts; rmdup := 'streaming' applies the WisecondorX-style larp/larp2 consecutive-position deduplication, rmdup := 'flag' drops SAM duplicate-flagged reads, and stats := 'gc', 'mq', or 'gc,mq' adds per-bin pre/post-filter GC and MAPQ sufficient statistics, including reference GC when reference is provided.
  • duckhts_bam_bed_coverage(path, bed_path, reference := NULL, index_path := NULL, bed_index_path := NULL, mapq := 0, min_baseq := 0, min_read_len := 0, require_flags := 0, exclude_flags := 1796, min_depth := 1, max_depth := 1000000, decompression_threads := 0, fragment_mode := FALSE, strand_outputs := TRUE, processing_threads := 0): Compute samtools coverage-like regional summaries for BAM or CRAM input over a BED target set, returning one row per BED interval with DuckHTS-specific pre/post-filter read counts, covered bases, percentage covered, mean depth, mean baseQ, mean mapQ, and strand-specific post-filter summaries in read mode. Indexed BAM/CRAM input is required in the current implementation. decompression_threads controls htslib worker threads for BAM/CRAM decoding; use 0 to disable them.
  • duckhts_mosdepth(prefix, path, chrom := NULL, by := NULL, fasta := NULL, read_groups := NULL, no_per_base := FALSE, threads := 2, processing_threads := 2, flag := 1796, include_flag := 0, fast_mode := FALSE, fragment_mode := FALSE, use_median := FALSE, mapq := 0, min_frag_len := -1, max_frag_len := -1, precision_digits := 2, quantize := NULL, thresholds := NULL, index_path := NULL, overwrite := FALSE): Write native mosdepth-compatible coverage outputs for indexed BAM or CRAM input. Produces mosdepth-style summary, global distribution, per-base BED.gz + CSI, optional window/BED region outputs, optional quantized BED.gz + CSI, and optional threshold counts for by; fast_mode defaults to FALSE to match upstream mosdepth, default mode performs CIGAR-aware coverage with mate-overlap correction, fragment_mode switches coverage to full-fragment insert spans for proper pairs, use_median switches by outputs from mean to median, read_groups filters by comma-separated RG IDs, min_frag_len and max_frag_len filter on absolute template length, fasta is required for CRAM when htslib needs a reference, precision_digits controls decimal places in the text outputs, and processing_threads enables parallel contig processing (0 = sequential, >0 = number of worker threads).

Intervals

  • duckhts_cgranges_create(name): Create an empty session-scoped cgranges registry entry that can be populated with intervals and finalized for overlap queries.
  • duckhts_cgranges_add(name, chrom, start, end[, label]): Append an interval to a session-scoped cgranges registry entry before finalization. Labels may be BIGINT-like, DOUBLE, VARCHAR, or BOOLEAN.
  • duckhts_cgranges_index(name): Finalize a populated cgranges registry entry and build its immutable overlap index for subsequent queries.
  • duckhts_cgranges_destroy(name): Destroy a session-scoped cgranges registry entry and release its indexed interval storage when it is not in active use.
  • duckhts_cgranges_from_query(name, query, chrom_col, start_col, end_col[, label_col]): Execute a SQL query on an extension-owned DuckDB connection, append its interval rows into a session-scoped cgranges registry entry, and leave the populated index ready for explicit finalization with duckhts_cgranges_index(…).
  • duckhts_cgranges_from_table(name, table_name, chrom_col, start_col, end_col[, label_col]): Reserved convenience constructor for bulk cgranges population from a table name. The current implementation is intentionally deferred and directs callers to duckhts_cgranges_from_query(…).
  • duckhts_cgranges_has_overlap(name, chrom, start, end[, mode]): Vectorized scalar predicate for streaming provider rows through a finalized session-scoped cgranges index. Returns TRUE when the query interval overlaps at least one indexed interval, or when mode = 'contain' and it fully contains at least one indexed interval; NULL inputs return NULL.
  • duckhts_cgranges_count_overlaps(name, chrom, start, end[, mode]): Vectorized scalar overlap counter for streaming provider rows through a finalized session-scoped cgranges index. Returns the number of indexed intervals that overlap the query interval, or with mode = 'contain' the number fully contained by it; NULL inputs return NULL.
  • duckhts_cgranges_overlaps_list(name, chrom, start, end[, mode]): Vectorized scalar overlap expander for streaming provider rows through a finalized session-scoped cgranges index. Returns a LIST of hit STRUCTs that can be expanded with UNNEST, preserving provider columns while emitting one row per matching indexed interval. Because scalar return types are fixed, labels are returned as text with label_type describing the original cgranges label kind; NULL inputs return NULL.
  • duckhts_cgranges_overlaps(name, chrom, start, end, mode := 'overlap', query_row_id := NULL): Query a finalized session-scoped cgranges registry entry and return one row per overlapping or containing indexed interval, preserving the original label type and interval coordinates.
  • duckhts_cgranges_overlaps_bulk(name, query, chrom_col, start_col, end_col, mode := 'overlap', query_row_id_col := NULL): Run a SQL query that yields overlap probes, stream those rows through a finalized session-scoped cgranges registry entry, and return one row per matching indexed interval. The probe query runs on the extension-owned helper connection, so it must reference regular tables/views rather than connection-local temp tables. When query_row_id_col is omitted, query_row_id defaults to the 1-based probe row ordinal.
  • regionkey(chrom, start, end, strand := 0): Encode a genomic interval as an official RegionKey-compatible 64-bit unsigned integer. Start and end use 0-based half-open interval semantics, matching BED-style coordinates; strand accepts -1, 0, or 1.
  • regionkey_hex(rk): Render a RegionKey as its lowercase 16-character hexadecimal string representation.
  • parse_regionkey_hex(hex): Parse a 16-character hexadecimal RegionKey string back into its UBIGINT code. Invalid or non-hex strings return NULL.
  • encode_regionkey(chrom_code, start, end, strand_code): Encode the raw upstream RegionKey fields directly: chromosome code, 0-based start, 0-based end, and strand code (0 = unknown, 1 = +, 2 = -).
  • extract_regionkey_chrom(rk): Extract the raw upstream RegionKey chromosome code.
  • extract_regionkey_startpos(rk): Extract the raw upstream RegionKey 0-based start position.
  • extract_regionkey_endpos(rk): Extract the raw upstream RegionKey 0-based end position.
  • extract_regionkey_strand(rk): Extract the raw upstream RegionKey strand code (0 = unknown, 1 = +, 2 = -).
  • decode_regionkey(rk): Decode a RegionKey into its raw upstream numeric fields: chrom_code, start, end, and strand_code.
  • reverse_regionkey(rk): Decode a RegionKey into a STRUCT with chrom, chrom_code, start, end, strand, and strand_code.
  • extend_regionkey(rk, size): Extend a RegionKey interval by a fixed number of bases on both sides, clamping to the official 28-bit RegionKey position range.
  • are_overlapping_regions(chrom_a, start_a, end_a, chrom_b, start_b, end_b): Return TRUE when two explicit 0-based half-open intervals overlap on the same canonical chromosome.
  • are_overlapping_region_regionkey(chrom, start, end, rk): Return TRUE when a 0-based half-open interval overlaps the supplied RegionKey interval.
  • are_overlapping_regionkeys(rka, rkb): Return TRUE when two RegionKeys overlap.

Metadata

  • detect_quality_encoding(path, max_records := 10000): Inspect a FASTQ file's observed quality ASCII range and report compatible legacy encodings with a heuristic guessed encoding.
  • duckhts_samtools_idxstats(path, output := NULL, index_path := NULL, threads := 0, overwrite := FALSE): Write samtools idxstats-compatible TAB-delimited output for BAM, CRAM, or SAM input. Indexed BAM uses hts_idx_get_stat(...) for the fast path; CRAM, SAM, and unindexed BAM fall back to a full scan while preserving samtools-style contig rows plus the final * row.
  • read_hts_header(path, format := NULL, mode := NULL): Inspect HTS headers in parsed, raw, or combined form across supported formats.
  • read_hts_index(path, format := NULL, index_path := NULL): Inspect high-level HTS index metadata such as sequence names and mapped counts.
  • read_hts_index_spans(path, format := NULL, index_path := NULL): Expand index metadata into span and chunk rows suitable for low-level index inspection.
  • read_hts_index_raw(path, format := NULL, index_path := NULL): Return the raw on-disk HTS index blob together with basic identifying metadata.

Compression

  • bgzip(path, output_path := NULL, threads := 4, level := -1, keep := TRUE, overwrite := FALSE): Compress a plain file to BGZF and return the created output path and byte counts.
  • bgunzip(path, output_path := NULL, threads := 4, keep := TRUE, overwrite := FALSE): Decompress a BGZF-compressed file and return the created output path and byte counts.

Indexing

  • bam_index(path, index_path := NULL, min_shift := 0, threads := 4): Build a BAM or CRAM index and report the written index path and format.
  • bcf_index(path, index_path := NULL, min_shift := NULL, threads := 4): Build a TBI or CSI index for a VCF or BCF file and report the written index path and format.
  • tabix_index(path, preset := 'vcf', index_path := NULL, min_shift := 0, threads := 4, seq_col := NULL, start_col := NULL, end_col := NULL, comment_char := NULL, skip_lines := NULL): Build a tabix index for a BGZF-compressed text file using a preset or explicit coordinate columns.

Variants

  • variantkey(chrom, pos, ref, alt): Encode a normalized biallelic variant as an official VariantKey-compatible 64-bit unsigned integer. This DuckHTS wrapper accepts 1-based VCF/DuckHTS POS to match bcftools %VKX / +add-variantkey, internally converts to the upstream 0-based field, and preserves the official hashed nonreversible mode for large, ambiguous, and symbolic REF/ALT strings. Only CHROM, POS, REF, and ALT are encoded; END, SVLEN, mate breakend coordinates, and other SV metadata are not.
  • variantkey_hex(vk): Render a VariantKey as its lowercase 16-character hexadecimal string representation.
  • parse_variantkey_hex(hex): Parse a 16-character hexadecimal VariantKey string back into its UBIGINT code. Invalid or non-hex strings return NULL.
  • encode_variantkey(chrom_code, pos0, refalt_code): Encode the raw upstream VariantKey fields directly: chromosome code, 0-based position, and 31-bit REF+ALT code.
  • extract_variantkey_chrom(vk): Extract the raw upstream VariantKey chromosome code.
  • extract_variantkey_pos(vk): Extract the raw upstream VariantKey 0-based position field.
  • extract_variantkey_refalt(vk): Extract the raw upstream 31-bit VariantKey REF+ALT code.
  • decode_variantkey(vk): Decode a VariantKey into its raw upstream numeric fields: chrom_code, pos0, and refalt_code.
  • reverse_variantkey(vk): Decode a VariantKey into a STRUCT with chrom, chrom_code, 1-based pos, upstream 0-based pos0, ref, alt, refalt_code, and reversible. For hashed nonreversible keys, reversible is FALSE and ref/alt are returned as NULL because DuckHTS v1 does not ship the optional NRVK lookup sidecar.
  • variantkey_range(chrom, pos_min, pos_max): Return the inclusive minimum and maximum VariantKey bounds for a chromosome plus 1-based VCF position range, suitable for numeric range filtering on precomputed VariantKeys.
  • bcftools_liftover(chrom, pos, ref, alt, chain_path, dst_fasta_ref, src_fasta_ref, max_snp_gap, max_indel_inc, lift_mt, end_pos, no_left_align): Row-oriented liftover kernel intended to mirror bcftools +liftover semantics as closely as possible while returning one STRUCT per input row with fields: src_chrom, src_pos, src_ref, src_alt, dest_chrom, dest_pos, dest_end, dest_ref, dest_alt, mapped, reverse_complemented, swap, reject_reason, and note. Set no_left_align := true to skip post-liftover left-alignment of lifted indels (mirrors –no-left-align in bcftools +liftover).
  • duckdb_liftover(table_name, chrom_col, pos_col, ref_col := NULL, alt_col := NULL, chain_path := NULL, dst_fasta_ref := NULL, src_fasta_ref := NULL, max_snp_gap := 1, max_indel_inc := 250, lift_mt := false, end_pos_col := NULL, no_left_align := false): DuckDB-specific wrapper over bcftools_liftover that takes either a table name or a derived-table expression plus column-name strings for chrom/pos/ref/alt and returns the lifted table. The no_left_align parameter mirrors –no-left-align in bcftools +liftover.
  • bcftools_norm_row(chrom, pos, ref, alt, fasta_ref, end_pos := NULL, svlen := NULL, fasta_index_path := NULL, gzi_path := NULL): Normalize one variant row with bcftools/vt-style left-alignment semantics against a FASTA reference. The alt argument may be either a comma-delimited VARCHAR or a VARCHAR[] list. The returned STRUCT contains pos_normed, end_pos_normed, ref_normed, alt_normed (always VARCHAR[]), normed (TRUE/FALSE/NULL), and norm_status. Symbolic rows can use end_pos, and symbolic rows can use svlen.
  • duckhts_bcftools_norm(table_name, fasta_ref, chrom_col := 'chrom', pos_col := 'pos', ref_col := 'ref', alt_col := 'alt', split_multiallelic := FALSE, end_pos_col := NULL, svlen_col := NULL, fasta_index_path := NULL, gzi_path := NULL): DuckDB table macro wrapper over bcftools_norm_row that normalizes variants from a table or derived-table expression while preserving the original columns. The input ALT column may be either VARCHAR or VARCHAR[]. The result appends pos_normed, end_pos_normed, ref_normed, alt_normed, normed, and norm_status; with split_multiallelic := TRUE, multiallelic sites are split before normalization and alt_normed becomes VARCHAR plus alt_index.
  • bcftools_score(bcf_path, summary_path_or_list, use := NULL, columns := 'PLINK', columns_file := NULL, q_score_thr := NULL, summaries_list_file := NULL, log_path := NULL, use_variant_id := FALSE, counts := FALSE, samples := NULL, force_samples := FALSE, regions := NULL, regions_file := NULL, regions_overlap := 1, targets := NULL, targets_file := NULL, targets_overlap := 0, apply_filters := NULL, include := NULL, exclude := NULL): Compute polygenic scores from one genotype BCF/VCF and one or more summary-statistics files with bcftools +score-compatible GT/DS/HDS/AP/GP/AS dosage semantics, sample subsetting, and region/target/FILTER-string controls. The second argument accepts a scalar path or a DuckDB LIST/array of paths; TSV/SSF summaries produce one PRS column per file in a single genotype scan, while GWAS-VCF summaries still produce one PRS column per FORMAT sample. Use summaries_list_file with a NULL second argument to read paths from a file or directory; list-file entries are interpreted as written, matching upstream bcftools +score --summaries behavior, while directory inputs scan supported regular summary files in lexicographic order and ignore index sidecars. Use log_path to write per-PRS loaded/matched/allele-mismatch/duplicate-marker audit counts.
  • bcftools_munge_row(chrom, pos, a1, a2, id, p, z, or, beta, n, n_cas, n_con, info, frq, se, lp, ac, neff, neffdiv2, het_i2, het_p, het_lp, dire, fasta_ref, iffy_tag := 'IFFY', mismatch_tag := 'REF_MISMATCH', ns := NULL, nc := NULL, ne := NULL): Normalize one summary-statistics row into GWAS-VCF-style fields (chrom/pos/ref/alt/effect metrics), resolving REF/ALT orientation against a FASTA reference and applying swap-aware sign/frequency/count transforms. The output flag alleles_swapped means REF/ALT orientation was swapped to match the FASTA reference.
  • duckdb_munge(table_name, preset := '', column_map := map([''], ['']), column_map_file := '', fasta_ref := NULL, iffy_tag := 'IFFY', mismatch_tag := 'REF_MISMATCH', ns := NULL, nc := NULL, ne := NULL): DuckDB macro wrapper over bcftools_munge_row that maps source columns (via preset or explicit map) and returns normalized GWAS-VCF-style rows with lean outputs and explicit alleles_swapped semantics. Output columns: chrom, pos, id, ref, alt, alleles_swapped, filter, ns, ez, nc, es, se, lp, af, ac, ne (16 columns). For METAL meta-analysis output with SI/I2/CQ/ED columns, use duckdb_munge_metal.
  • duckdb_munge_metal(table_name, preset := '', column_map := map([''], ['']), column_map_file := '', fasta_ref := NULL, iffy_tag := 'IFFY', mismatch_tag := 'REF_MISMATCH', ns := NULL, nc := NULL, ne := NULL): Extended munge macro with METAL meta-analysis output columns. Same as duckdb_munge but additionally emits: si (imputation info, from INFO input), i2 (Cochran's I² heterogeneity, from HET_I2), cq (Cochran's Q -log10 p, from HET_LP or -log10(HET_P)), and ed (effect direction string, from DIRE; +/- flipped on allele swap). The R wrapper rduckhts_munge() auto-dispatches to this macro when metal keys (INFO, HET_I2, HET_P, HET_LP, DIRE) are present in the resolved column map.

Sequence UDFs

  • seq_revcomp(sequence): Compute the reverse complement of a DNA sequence using A, C, G, T, and N bases.
  • seq_canonical(sequence): Return the lexicographically smaller of a sequence and its reverse complement.
  • seq_hash_2bit(sequence): Encode a short DNA sequence as a 2-bit unsigned integer hash.
  • seq_encode_4bit(sequence): Encode an IUPAC DNA sequence as a list of 4-bit base codes, preserving ambiguity symbols including N.
  • seq_decode_4bit(codes): Decode a list of 4-bit IUPAC DNA base codes back into a sequence string.
  • seq_gc_content(sequence): Compute GC fraction for a DNA sequence as a value between 0 and 1.
  • seq_kmers(sequence, k, canonical := FALSE): Expand a sequence into positional k-mers with optional canonicalization.

SAM Flag UDFs

  • sam_flag_bits(flag): Decode a SAM flag into a struct of boolean bit fields using explicit SAM-oriented names such as is_paired, is_proper_pair, is_next_segment_unmapped, and is_supplementary.
  • sam_flag_has(flag, mask): Test whether any bits from the provided SAM flag mask are set in a flag value.
  • is_forward_aligned(flag): Test whether a mapped segment is aligned to the forward strand. Returns NULL for unmapped segments because SAM flag 0x10 does not define genomic strand when 0x4 is set.
  • is_paired(flag): Test whether the SAM flag indicates that the template has multiple segments in sequencing (0x1).
  • is_proper_pair(flag): Test whether the SAM flag indicates that each segment is properly aligned according to the aligner (0x2).
  • is_unmapped(flag): Test whether the read itself is unmapped according to the SAM flag.
  • is_next_segment_unmapped(flag): Test whether the next segment in the template is flagged as unmapped (0x8).
  • is_reverse_complemented(flag): Test whether SEQ is stored reverse complemented (0x10); for mapped reads this corresponds to reverse-strand alignment.
  • is_next_segment_reverse_complemented(flag): Test whether SEQ of the next segment in the template is stored reverse complemented (0x20).
  • is_first_segment(flag): Test whether the read is marked as the first segment in the template.
  • is_last_segment(flag): Test whether the read is marked as the last segment in the template.
  • is_secondary(flag): Test whether the alignment is marked as secondary.
  • is_qc_fail(flag): Test whether the read failed vendor or pipeline quality checks.
  • is_duplicate(flag): Test whether the alignment is flagged as a duplicate.
  • is_supplementary(flag): Test whether the alignment is marked as supplementary.

CIGAR Utils

  • cigar_has_soft_clip(cigar): Test whether a CIGAR string contains any soft-clipped segment (S).
  • cigar_has_hard_clip(cigar): Test whether a CIGAR string contains any hard-clipped segment (H).
  • cigar_left_soft_clip(cigar): Return the left-end soft-clipped length from a CIGAR string, or zero if the alignment does not start with S.
  • cigar_right_soft_clip(cigar): Return the right-end soft-clipped length from a CIGAR string, or zero if the alignment does not end with S.
  • cigar_query_length(cigar): Return the query-consuming length from a CIGAR string, counting M, I, S, =, and X.
  • cigar_aligned_query_length(cigar): Return the aligned query length from a CIGAR string, counting M, =, and X but excluding clips and insertions.
  • cigar_reference_length(cigar): Return the reference-consuming length from a CIGAR string, counting M, D, N, =, and X.
  • cigar_has_op(cigar, op): Test whether a CIGAR string contains at least one instance of the requested operator.

Operational notes:

  • Paired FASTQ is supported via mate_path or interleaved := true.
  • CRAM reads are supported with an explicit reference file.
  • GTF/GFF attributes can be returned as a parsed MAP with attributes_map := true via read_gff or read_gtf.
  • Optional SAM tag columns and an auxiliary tag map are available through standard_tags and auxiliary_tags.
  • Tabix readers support header, header_names, type inference with auto_detect, and explicit column_types.
  • MSVC builds (windows_amd64/windows_arm64) are not supported; use MinGW/RTools on Windows.

Added Functions

function_name function_type description comment examples
are_overlapping_region_regionkey scalar NULL NULL  
are_overlapping_regionkeys scalar NULL NULL  
are_overlapping_regions scalar NULL NULL  
bam_bin_counts table NULL NULL  
bam_index table NULL NULL  
bcf_index table NULL NULL  
bcftools_liftover scalar NULL NULL  
bcftools_munge_row scalar NULL NULL  
bcftools_norm_row scalar NULL NULL  
bcftools_score table NULL NULL  
bgunzip table NULL NULL  
bgzip table NULL NULL  
cigar_aligned_query_length scalar NULL NULL  
cigar_has_hard_clip scalar NULL NULL  
cigar_has_op scalar NULL NULL  
cigar_has_soft_clip scalar NULL NULL  
cigar_left_soft_clip scalar NULL NULL  
cigar_query_length scalar NULL NULL  
cigar_reference_length scalar NULL NULL  
cigar_right_soft_clip scalar NULL NULL  
decode_regionkey scalar NULL NULL  
decode_variantkey scalar NULL NULL  
detect_quality_encoding table NULL NULL  
duckdb_liftover table_macro NULL NULL  
duckdb_munge table_macro NULL NULL  
duckdb_munge_metal table_macro NULL NULL  
duckdb_munge_preset_map macro NULL NULL  
duckdb_munge_resolved_map macro NULL NULL  
duckhts_alt_to_list scalar NULL NULL  
duckhts_bam_bed_coverage table NULL NULL  
duckhts_bcftools_norm table_macro NULL NULL  
duckhts_cgranges_add scalar NULL NULL  
duckhts_cgranges_count_overlaps scalar NULL NULL  
duckhts_cgranges_create scalar NULL NULL  
duckhts_cgranges_destroy scalar NULL NULL  
duckhts_cgranges_from_query scalar NULL NULL  
duckhts_cgranges_from_table scalar NULL NULL  
duckhts_cgranges_has_overlap scalar NULL NULL  
duckhts_cgranges_index scalar NULL NULL  
duckhts_cgranges_overlaps table NULL NULL  
duckhts_cgranges_overlaps_bulk table NULL NULL  
duckhts_cgranges_overlaps_list scalar NULL NULL  
duckhts_mosdepth table NULL NULL  
duckhts_quote_ident macro NULL NULL  
duckhts_quote_string macro NULL NULL  
duckhts_samtools_idxstats table NULL NULL  
duckhts_simd_backend scalar NULL NULL  
duckhts_simd_backend_available scalar NULL NULL  
duckhts_simd_backend_compiled scalar NULL NULL  
duckhts_simd_backend_cpu_supported scalar NULL NULL  
duckhts_simd_info table NULL NULL  
duckhts_simd_kernel_info table NULL NULL  
duckhts_simd_requested_backend scalar NULL NULL  
duckhts_simd_set_backend table NULL NULL  
encode_regionkey scalar NULL NULL  
encode_variantkey scalar NULL NULL  
extend_regionkey scalar NULL NULL  
extract_regionkey_chrom scalar NULL NULL  
extract_regionkey_endpos scalar NULL NULL  
extract_regionkey_startpos scalar NULL NULL  
extract_regionkey_strand scalar NULL NULL  
extract_variantkey_chrom scalar NULL NULL  
extract_variantkey_pos scalar NULL NULL  
extract_variantkey_refalt scalar NULL NULL  
fasta_index table NULL NULL  
fasta_nuc table NULL NULL  
hts_union_query macro NULL NULL  
is_duplicate scalar NULL NULL  
is_first_segment scalar NULL NULL  
is_forward_aligned scalar NULL NULL  
is_last_segment scalar NULL NULL  
is_next_segment_reverse_complemented scalar NULL NULL  
is_next_segment_unmapped scalar NULL NULL  
is_paired scalar NULL NULL  
is_proper_pair scalar NULL NULL  
is_qc_fail scalar NULL NULL  
is_reverse_complemented scalar NULL NULL  
is_secondary scalar NULL NULL  
is_supplementary scalar NULL NULL  
is_unmapped scalar NULL NULL  
parse_regionkey_hex scalar NULL NULL  
parse_variantkey_hex scalar NULL NULL  
read_bam table NULL NULL  
read_bcf table NULL NULL  
read_bed table NULL NULL  
read_fasta table NULL NULL  
read_fastq table NULL NULL  
read_gff table NULL NULL  
read_gtf table NULL NULL  
read_hts_header table NULL NULL  
read_hts_index table NULL NULL  
read_hts_index_raw table_macro NULL NULL  
read_hts_index_spans table NULL NULL  
read_pileup table NULL NULL  
read_tabix table NULL NULL  
regionkey scalar NULL NULL  
regionkey_hex scalar NULL NULL  
reverse_regionkey scalar NULL NULL  
reverse_variantkey scalar NULL NULL  
sam_flag_bits scalar NULL NULL  
sam_flag_has scalar NULL NULL  
seq_canonical scalar NULL NULL  
seq_decode_4bit scalar NULL NULL  
seq_encode_4bit scalar NULL NULL  
seq_gc_content scalar NULL NULL  
seq_hash_2bit scalar NULL NULL  
seq_kmers table NULL NULL  
seq_revcomp scalar NULL NULL  
tabix_index table NULL NULL  
variantkey scalar NULL NULL  
variantkey_hex scalar NULL NULL  
variantkey_range scalar NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.