ducksmiles

Cheminformatics toolkit for DuckDB - SMILES, InChI, MOL/SDF, PDB, SELFIES, Wildman-Crippen LogP, Morgan/ECFP fingerprints, and Tanimoto similarity from SQL

Maintainer(s): nkwork9999

Installing and Loading

INSTALL ducksmiles FROM community;
LOAD ducksmiles;

Example

-- Parse SMILES and get molecular formula
SELECT mol_formula('CCO');
-- C2H6O

-- Calculate molecular weight
SELECT round(mol_weight('c1ccccc1'), 2);
-- 78.11

-- Wildman-Crippen LogP (matches RDKit)
SELECT round(logp_crippen('CCO'), 4);
-- -0.0014

-- Morgan/ECFP fingerprint as 2048-bit BLOB (ECFP4 default)
SELECT bit_count(CAST(morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O') AS BIT));
-- 26   (aspirin popcount)

-- Tanimoto similarity between two fingerprint BLOBs (no CAST AS BIT needed)
SELECT round(tanimoto_bit(morgan_fp_bits('CCO'), morgan_fp_bits('CCN')), 4);
-- 0.3333

-- Extract the formula layer from an InChI string
SELECT inchi_formula('InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)');
-- C2H4O2

-- Convert SMILES to SELFIES (ML-friendly notation)
SELECT smiles_to_selfies('CCO');

About ducksmiles

Cheminformatics toolkit for DuckDB - analyze molecular structures directly from SQL without leaving your database. Pure Rust implementation with no external chemistry library dependencies (no RDKit required).

Supported Formats:

  • SMILES: Molecular validation, formula, weight, atom/bond counts, LogP
  • InChI/InChIKey: Layer extraction, stereochemistry detection, skeleton matching
  • MOL/SDF: V2000/V3000 block parsing, molecule counting
  • PDB/CIF/XYZ: Protein structure analysis (atom, chain, residue, model counts)
  • SELFIES: Bidirectional SMILES-SELFIES conversion for ML pipelines

The extension provides 93 scalar SQL functions for molecular property extraction, format conversion, structural comparison, and physicochemical property prediction. It is aimed at cheminformatics datasets, drug discovery pipelines, and molecular ML feature engineering.

LogP: logp_crippen() implements the Wildman-Crippen atom-contribution method (110 SMARTS patterns, 68 atom types) and matches RDKit's Crippen.MolLogP exactly for small molecules.
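As a worked illustration of the atom-contribution idea, the sketch below sums per-atom logP contributions for ethanol (CCO). It is not the extension's code: the atom types are assigned by hand rather than by SMARTS matching, the function names are hypothetical, and the five parameter values are quoted from the published Wildman-Crippen table as best recalled, so check them against the original table before relying on them.

```rust
// Wildman-Crippen logP contributions for the atom types that occur in ethanol.
// Assumption: values quoted from the published parameter table.
const C1: f64 = 0.1441; // aliphatic carbon bonded only to C/H (the CH3 group)
const C3: f64 = -0.2035; // aliphatic carbon bonded to a heteroatom (the CH2-O carbon)
const O2: f64 = -0.2893; // alcohol oxygen
const H1: f64 = 0.1230; // hydrogen on a hydrocarbon carbon
const H2: f64 = -0.2677; // hydrogen on an alcohol oxygen

/// Sum the per-atom contributions; this is the whole of the "prediction" step
/// once every atom (hydrogens included) has been assigned a type.
fn crippen_logp(contributions: &[f64]) -> f64 {
    contributions.iter().sum()
}

/// Hand-assigned atom types for ethanol: 2 carbons, 1 oxygen, 5 C-H and 1 O-H.
fn ethanol_contributions() -> Vec<f64> {
    vec![C1, C3, O2, H1, H1, H1, H1, H1, H2]
}
```

With these values, `crippen_logp(&ethanol_contributions())` comes to about -0.0014, the figure shown for `logp_crippen('CCO')` in the Example section. A real implementation spends most of its effort on the typing step, which this sketch skips.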

Morgan / ECFP fingerprint: morgan_fp_bits() ports RDKit's MorganGenerator (layered BFS + hash_combine + dead-atom dedup) to Rust and returns a fixed-width bit vector as a BLOB. It defaults to ECFP4 (radius=2, 2048 bits); the three-argument form morgan_fp_bits(smi, radius, n_bits) exposes full control.
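Two building blocks of such a fingerprint can be sketched in isolation: a Boost-style hash_combine and the folding of 32-bit environment identifiers into a fixed-width bit vector. This is a minimal sketch under stated assumptions, not the extension's code. The function names, the bit order within each byte, and the use of a plain modulo for folding are all assumptions, and the BFS over atom environments is omitted, so it will not reproduce the bit positions that morgan_fp_bits() emits.

```rust
/// Boost-style hash_combine on 32-bit values:
/// seed ^= value + 0x9e3779b9 + (seed << 6) + (seed >> 2), with wrapping arithmetic.
fn hash_combine(seed: u32, value: u32) -> u32 {
    seed ^ value
        .wrapping_add(0x9e37_79b9)
        .wrapping_add(seed << 6)
        .wrapping_add(seed >> 2)
}

/// Fold environment identifiers into an n_bits-wide fingerprint stored as bytes.
/// Assumption: bit i lives in byte i / 8 at position i % 8 (least significant first).
fn fold_to_bits(ids: &[u32], n_bits: usize) -> Vec<u8> {
    assert!(n_bits > 0 && n_bits % 8 == 0, "n_bits must be a positive multiple of 8");
    let mut fp = vec![0u8; n_bits / 8];
    for &id in ids {
        let bit = (id as usize) % n_bits;
        fp[bit / 8] |= 1u8 << (bit % 8);
    }
    fp
}

/// Number of set bits in the fingerprint.
fn popcount(fp: &[u8]) -> u32 {
    fp.iter().map(|b| b.count_ones()).sum()
}
```

Folding means that distinct identifiers can collide on the same bit, so the popcount of a fingerprint can be lower than the number of distinct environments. A larger n_bits reduces the collision rate at the cost of a wider BLOB.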

Tanimoto similarity: tanimoto_bit(BLOB, BLOB) -> DOUBLE computes popcount(a & b) / popcount(a | b) directly on the raw BLOB bytes, with no CAST AS BIT round-trip. It processes 8 bytes at a time via count_ones(), which lowers to POPCNT on x86_64 and CNT on aarch64. Mismatched lengths raise an InvalidInputException; empty-vs-empty returns 0.0 (RDKit convention). The SQL-level expression bit_count(a & b)::DOUBLE / bit_count(a | b) is still available and produces bit-for-bit identical results.
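The word-at-a-time popcount approach described above can be sketched as follows. This is an illustrative sketch, not the extension's source: the function names and the String error type are hypothetical stand-ins, and "empty" is read here as "no bits set in either input", which also covers zero-length inputs.

```rust
/// Load up to 8 bytes into a u64, zero-padding a short tail chunk.
/// Zero padding adds no set bits, so popcounts are unaffected.
fn load_word(chunk: &[u8]) -> u64 {
    let mut buf = [0u8; 8];
    buf[..chunk.len()].copy_from_slice(chunk);
    u64::from_le_bytes(buf)
}

/// Tanimoto similarity popcount(a & b) / popcount(a | b) over raw bytes.
/// Returns Err on mismatched lengths (standing in for InvalidInputException)
/// and 0.0 when neither input has a bit set.
fn tanimoto_bytes(a: &[u8], b: &[u8]) -> Result<f64, String> {
    if a.len() != b.len() {
        return Err(format!(
            "fingerprint length mismatch: {} vs {} bytes",
            a.len(),
            b.len()
        ));
    }
    let mut both: u64 = 0; // bits set in a AND b
    let mut either: u64 = 0; // bits set in a OR b
    // Equal lengths guarantee the two chunk iterators stay aligned.
    for (x, y) in a.chunks(8).zip(b.chunks(8)) {
        let (wa, wb) = (load_word(x), load_word(y));
        both += u64::from((wa & wb).count_ones());
        either += u64::from((wa | wb).count_ones());
    }
    if either == 0 {
        return Ok(0.0);
    }
    Ok(both as f64 / either as f64)
}
```

For example, the single-byte inputs 0b1100 and 0b0110 share one set bit out of three in their union, so the function returns 1/3. Working on u64 words keeps the inner loop to two bitwise operations and two popcounts per 8 bytes, which is 32 iterations for a 2048-bit fingerprint.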

Architecture: Rust (core logic, 5 crates) + C++ (DuckDB integration via FFI)

Added Functions

function_name function_type description comment examples
add_hydrogens scalar NULL NULL  
canonical_smiles scalar NULL NULL  
fraction_csp3 scalar NULL NULL  
fragment_parent scalar NULL NULL  
generic_scaffold scalar NULL NULL  
inchi_charge scalar NULL NULL  
inchi_connections scalar NULL NULL  
inchi_formula scalar NULL NULL  
inchi_has_stereo scalar NULL NULL  
inchi_hydrogens scalar NULL NULL  
inchi_is_standard scalar NULL NULL  
inchi_is_valid scalar NULL NULL  
inchi_num_stereo_centers scalar NULL NULL  
inchi_skeleton_match scalar NULL NULL  
inchi_stereo_bond scalar NULL NULL  
inchi_stereo_tetrahedral scalar NULL NULL  
inchi_version scalar NULL NULL  
inchikey_connectivity scalar NULL NULL  
inchikey_is_valid scalar NULL NULL  
inchikey_protonation scalar NULL NULL  
inchikey_stereo scalar NULL NULL  
largest_fragment scalar NULL NULL  
logp_crippen scalar NULL NULL  
maccs_keys scalar NULL NULL  
mcs_json scalar NULL NULL  
mcs_smarts scalar NULL NULL  
mol_block_atoms_json scalar NULL NULL  
mol_block_bonds_json scalar NULL NULL  
mol_block_centroid_x scalar NULL NULL  
mol_block_centroid_y scalar NULL NULL  
mol_block_centroid_z scalar NULL NULL  
mol_block_formula scalar NULL NULL  
mol_block_has_3d scalar NULL NULL  
mol_block_json scalar NULL NULL  
mol_block_max_x scalar NULL NULL  
mol_block_max_y scalar NULL NULL  
mol_block_max_z scalar NULL NULL  
mol_block_min_x scalar NULL NULL  
mol_block_min_y scalar NULL NULL  
mol_block_min_z scalar NULL NULL  
mol_block_name scalar NULL NULL  
mol_block_num_atoms scalar NULL NULL  
mol_block_num_bonds scalar NULL NULL  
mol_block_properties_json scalar NULL NULL  
mol_block_property scalar NULL NULL  
mol_block_radius_of_gyration scalar NULL NULL  
mol_block_weight scalar NULL NULL  
mol_exact_mass scalar NULL NULL  
mol_formula scalar NULL NULL  
mol_has_substructure scalar NULL NULL  
mol_hash scalar NULL NULL  
mol_hash_methods scalar NULL NULL  
mol_is_valid scalar NULL NULL  
mol_num_atoms scalar NULL NULL  
mol_num_bonds scalar NULL NULL  
mol_substructure_count scalar NULL NULL  
mol_substructure_matches_json scalar NULL NULL  
mol_weight scalar NULL NULL  
morgan_fp_bits scalar NULL NULL  
murcko_scaffold scalar NULL NULL  
neutralize_charges scalar NULL NULL  
normalize_smiles scalar NULL NULL  
num_aromatic_rings scalar NULL NULL  
num_h_acceptors scalar NULL NULL  
num_h_donors scalar NULL NULL  
num_heteroatoms scalar NULL NULL  
num_rotatable_bonds scalar NULL NULL  
ring_count scalar NULL NULL  
ring_systems_json scalar NULL NULL  
scaffold_network_json scalar NULL NULL  
sdf_count scalar NULL NULL  
sdf_properties_json scalar NULL NULL  
sdf_property scalar NULL NULL  
selfies_is_valid scalar NULL NULL  
selfies_to_smiles scalar NULL NULL  
smiles_to_selfies scalar NULL NULL  
strip_salts scalar NULL NULL  
structure_atom_count scalar NULL NULL  
structure_centroid_x scalar NULL NULL  
structure_centroid_y scalar NULL NULL  
structure_centroid_z scalar NULL NULL  
structure_chain_count scalar NULL NULL  
structure_max_x scalar NULL NULL  
structure_max_y scalar NULL NULL  
structure_max_z scalar NULL NULL  
structure_min_x scalar NULL NULL  
structure_min_y scalar NULL NULL  
structure_min_z scalar NULL NULL  
structure_model_count scalar NULL NULL  
structure_radius_of_gyration scalar NULL NULL  
structure_residue_count scalar NULL NULL  
tanimoto_bit scalar NULL NULL  
tpsa scalar NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.