Cheminformatics toolkit for DuckDB - SMILES, InChI, MOL/SDF, PDB, SELFIES, Wildman-Crippen LogP, Morgan/ECFP fingerprints, and Tanimoto similarity from SQL
Installing and Loading
INSTALL ducksmiles FROM community;
LOAD ducksmiles;
Example
-- Parse SMILES and get molecular formula
SELECT mol_formula('CCO');
-- C2H6O
-- Calculate molecular weight
SELECT round(mol_weight('c1ccccc1'), 2);
-- 78.11
-- Wildman-Crippen LogP (matches RDKit)
SELECT round(logp_crippen('CCO'), 4);
-- -0.0014
-- Morgan/ECFP fingerprint as 2048-bit BLOB (ECFP4 default)
SELECT bit_count(CAST(morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O') AS BIT));
-- 26 (aspirin popcount)
-- Tanimoto similarity between two fingerprint BLOBs (no CAST AS BIT needed)
SELECT round(tanimoto_bit(morgan_fp_bits('CCO'), morgan_fp_bits('CCN')), 4);
-- 0.3333
-- Extract the formula layer from an InChI string
SELECT inchi_formula('InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)');
-- C2H4O2
-- Convert SMILES to SELFIES (ML-friendly notation)
SELECT smiles_to_selfies('CCO');
About ducksmiles
Cheminformatics toolkit for DuckDB - analyze molecular structures directly from SQL without leaving your database. The chemistry core is implemented in pure Rust with no external chemistry library dependencies (no RDKit required).
Supported Formats:
- SMILES: Molecular validation, formula, weight, atom/bond counts, LogP
- InChI/InChIKey: Layer extraction, stereochemistry detection, skeleton matching
- MOL/SDF: V2000/V3000 block parsing, molecule counting
- PDB/CIF/XYZ: Protein structure analysis (atom, chain, residue, model counts)
- SELFIES: Bidirectional SMILES-SELFIES conversion for ML pipelines
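The bidirectional SELFIES conversion listed above can be sketched as a round trip. This is a minimal sketch, not a documented usage pattern: smiles_to_selfies appears in the example section, while the single-argument form of selfies_to_smiles is an assumption based on its name in the function list.

```sql
-- Assumption: selfies_to_smiles takes one SELFIES string, mirroring smiles_to_selfies
SELECT smiles_to_selfies('CCO')                    AS selfies,
       selfies_to_smiles(smiles_to_selfies('CCO')) AS round_trip_smiles;
```

The round-tripped SMILES may be written differently from the input (for example in atom order) while still describing the same molecule, so compare canonical forms rather than raw strings.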
93 scalar SQL functions (listed under Added Functions) cover molecular property extraction, format conversion, structural comparison, and physicochemical property prediction. Well suited to cheminformatics datasets, drug discovery pipelines, and molecular ML feature engineering.
LogP: logp_crippen() implements the Wildman-Crippen atom-contribution
method (110 SMARTS patterns, 68 atom types) and matches RDKit's
Crippen.MolLogP exactly for small molecules.
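To illustrate the atom-contribution idea, the ethanol value from the example section can be reproduced by hand: each atom is assigned a type, and the per-type contributions are summed. The type labels and contribution values below are assumed from the published Wildman-Crippen parameter table, not taken from the ducksmiles source, so treat them as illustrative.

```sql
-- Ethanol (CCO) as a sum of assumed Wildman-Crippen atom contributions:
--   C1  CH3 carbon bonded to carbon          +0.1441
--   C3  CH2 carbon bonded to a heteroatom    -0.2035
--   O2  alcohol oxygen                       -0.2893
--   H1  hydrogen on carbon (5 of them)       +0.1230 each
--   H2  hydroxyl hydrogen                    -0.2677
SELECT round(0.1441 - 0.2035 - 0.2893 + 5 * 0.1230 - 0.2677, 4) AS logp_by_hand,
       round(logp_crippen('CCO'), 4)                             AS logp_from_extension;
```

The hand-computed sum is -0.0014, which agrees with the logp_crippen('CCO') result shown in the example section.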
Morgan / ECFP fingerprint: morgan_fp_bits() ports RDKit's MorganGenerator
(layered BFS + hash_combine + dead-atom dedup) to Rust and returns a fixed-width
bit vector as a BLOB. It defaults to ECFP4 (radius=2, 2048 bits); the three-argument
overload morgan_fp_bits(smi, radius, n_bits) exposes full control.
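A short sketch of the two call forms described above. Both signatures come from the text. The idea that the BLOB packs eight bits per byte, so that octet_length would be n_bits / 8, is an assumption and is not stated in the source.

```sql
-- Default call: ECFP4 (radius = 2, 2048 bits)
SELECT morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O') AS ecfp4_default;

-- Explicit arguments; given the stated defaults this should equal the call above
SELECT morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O')
     = morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O', 2, 2048) AS same_as_default;

-- A wider-radius, shorter fingerprint (ECFP6-style, 1024 bits)
SELECT octet_length(morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O', 3, 1024)) AS n_bytes,
       bit_count(CAST(morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O', 3, 1024) AS BIT)) AS n_set_bits;
```

Fingerprints are only comparable when they were generated with the same radius and n_bits, so it is usually worth fixing both once per dataset.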
Tanimoto similarity: tanimoto_bit(BLOB, BLOB) -> DOUBLE computes
popcount(a & b) / popcount(a | b) directly on raw BLOB bytes (no
CAST AS BIT round-trip), processing 8 bytes at a time via count_ones()
so it lowers to POPCNT on x86_64 / CNT on aarch64. Mismatched lengths raise
a clear InvalidInputException; empty-vs-empty returns 0.0 (RDKit
convention). The SQL-level bit_count(a & b)::DOUBLE / bit_count(a | b)
is still available and produces bit-exact identical results.
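The following sketch shows a small similarity search that computes both forms side by side. The compounds table, its column names, and its rows are hypothetical; only morgan_fp_bits, tanimoto_bit, and the bit_count expression come from the text above.

```sql
-- Hypothetical table of compounds; names and SMILES are illustrative only
CREATE TABLE compounds AS
SELECT * FROM (VALUES
    (1, 'aspirin',        'CC(=O)Oc1ccccc1C(=O)O'),
    (2, 'salicylic acid', 'O=C(O)c1ccccc1O'),
    (3, 'ethanol',        'CCO')
) AS t(id, name, smiles);

-- Rank compounds by similarity to a query structure (aspirin)
WITH fps AS (
    SELECT id, name,
           morgan_fp_bits(smiles)                  AS fp,
           morgan_fp_bits('CC(=O)Oc1ccccc1C(=O)O') AS query_fp
    FROM compounds
)
SELECT id, name,
       -- Native form: operates directly on the BLOB bytes
       tanimoto_bit(fp, query_fp) AS tanimoto_native,
       -- SQL-level form: requires casting each BLOB to BIT first
       bit_count(CAST(fp AS BIT) & CAST(query_fp AS BIT))::DOUBLE
         / bit_count(CAST(fp AS BIT) | CAST(query_fp AS BIT)) AS tanimoto_sql
FROM fps
ORDER BY tanimoto_native DESC;
```

Per the description above, the two columns should agree row by row. For larger tables, materializing the fingerprint column once avoids recomputing it for every query.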
Architecture: Rust (core logic, 5 crates) + C++ (DuckDB integration via FFI)
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| add_hydrogens | scalar | NULL | NULL | |
| canonical_smiles | scalar | NULL | NULL | |
| fraction_csp3 | scalar | NULL | NULL | |
| fragment_parent | scalar | NULL | NULL | |
| generic_scaffold | scalar | NULL | NULL | |
| inchi_charge | scalar | NULL | NULL | |
| inchi_connections | scalar | NULL | NULL | |
| inchi_formula | scalar | NULL | NULL | |
| inchi_has_stereo | scalar | NULL | NULL | |
| inchi_hydrogens | scalar | NULL | NULL | |
| inchi_is_standard | scalar | NULL | NULL | |
| inchi_is_valid | scalar | NULL | NULL | |
| inchi_num_stereo_centers | scalar | NULL | NULL | |
| inchi_skeleton_match | scalar | NULL | NULL | |
| inchi_stereo_bond | scalar | NULL | NULL | |
| inchi_stereo_tetrahedral | scalar | NULL | NULL | |
| inchi_version | scalar | NULL | NULL | |
| inchikey_connectivity | scalar | NULL | NULL | |
| inchikey_is_valid | scalar | NULL | NULL | |
| inchikey_protonation | scalar | NULL | NULL | |
| inchikey_stereo | scalar | NULL | NULL | |
| largest_fragment | scalar | NULL | NULL | |
| logp_crippen | scalar | NULL | NULL | |
| maccs_keys | scalar | NULL | NULL | |
| mcs_json | scalar | NULL | NULL | |
| mcs_smarts | scalar | NULL | NULL | |
| mol_block_atoms_json | scalar | NULL | NULL | |
| mol_block_bonds_json | scalar | NULL | NULL | |
| mol_block_centroid_x | scalar | NULL | NULL | |
| mol_block_centroid_y | scalar | NULL | NULL | |
| mol_block_centroid_z | scalar | NULL | NULL | |
| mol_block_formula | scalar | NULL | NULL | |
| mol_block_has_3d | scalar | NULL | NULL | |
| mol_block_json | scalar | NULL | NULL | |
| mol_block_max_x | scalar | NULL | NULL | |
| mol_block_max_y | scalar | NULL | NULL | |
| mol_block_max_z | scalar | NULL | NULL | |
| mol_block_min_x | scalar | NULL | NULL | |
| mol_block_min_y | scalar | NULL | NULL | |
| mol_block_min_z | scalar | NULL | NULL | |
| mol_block_name | scalar | NULL | NULL | |
| mol_block_num_atoms | scalar | NULL | NULL | |
| mol_block_num_bonds | scalar | NULL | NULL | |
| mol_block_properties_json | scalar | NULL | NULL | |
| mol_block_property | scalar | NULL | NULL | |
| mol_block_radius_of_gyration | scalar | NULL | NULL | |
| mol_block_weight | scalar | NULL | NULL | |
| mol_exact_mass | scalar | NULL | NULL | |
| mol_formula | scalar | NULL | NULL | |
| mol_has_substructure | scalar | NULL | NULL | |
| mol_hash | scalar | NULL | NULL | |
| mol_hash_methods | scalar | NULL | NULL | |
| mol_is_valid | scalar | NULL | NULL | |
| mol_num_atoms | scalar | NULL | NULL | |
| mol_num_bonds | scalar | NULL | NULL | |
| mol_substructure_count | scalar | NULL | NULL | |
| mol_substructure_matches_json | scalar | NULL | NULL | |
| mol_weight | scalar | NULL | NULL | |
| morgan_fp_bits | scalar | NULL | NULL | |
| murcko_scaffold | scalar | NULL | NULL | |
| neutralize_charges | scalar | NULL | NULL | |
| normalize_smiles | scalar | NULL | NULL | |
| num_aromatic_rings | scalar | NULL | NULL | |
| num_h_acceptors | scalar | NULL | NULL | |
| num_h_donors | scalar | NULL | NULL | |
| num_heteroatoms | scalar | NULL | NULL | |
| num_rotatable_bonds | scalar | NULL | NULL | |
| ring_count | scalar | NULL | NULL | |
| ring_systems_json | scalar | NULL | NULL | |
| scaffold_network_json | scalar | NULL | NULL | |
| sdf_count | scalar | NULL | NULL | |
| sdf_properties_json | scalar | NULL | NULL | |
| sdf_property | scalar | NULL | NULL | |
| selfies_is_valid | scalar | NULL | NULL | |
| selfies_to_smiles | scalar | NULL | NULL | |
| smiles_to_selfies | scalar | NULL | NULL | |
| strip_salts | scalar | NULL | NULL | |
| structure_atom_count | scalar | NULL | NULL | |
| structure_centroid_x | scalar | NULL | NULL | |
| structure_centroid_y | scalar | NULL | NULL | |
| structure_centroid_z | scalar | NULL | NULL | |
| structure_chain_count | scalar | NULL | NULL | |
| structure_max_x | scalar | NULL | NULL | |
| structure_max_y | scalar | NULL | NULL | |
| structure_max_z | scalar | NULL | NULL | |
| structure_min_x | scalar | NULL | NULL | |
| structure_min_y | scalar | NULL | NULL | |
| structure_min_z | scalar | NULL | NULL | |
| structure_model_count | scalar | NULL | NULL | |
| structure_radius_of_gyration | scalar | NULL | NULL | |
| structure_residue_count | scalar | NULL | NULL | |
| tanimoto_bit | scalar | NULL | NULL | |
| tpsa | scalar | NULL | NULL | |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.