Semantic type classification — detects 240 data types (emails, URLs, dates, UUIDs, currencies, etc.) from raw strings
Installing and Loading
INSTALL finetype FROM community;
LOAD finetype;
Example
-- Profile every column of a table: one row per column, with the detected
-- semantic type, model confidence, and recommended DuckDB storage type.
D CREATE TABLE people AS SELECT * FROM (VALUES
('[email protected]', '+44 20 7946 0958', {'city': 'London'}),
('[email protected]', '0117 496 0123', {'city': 'Bristol'}),
('not-an-email', 'not-a-phone', {'city': 'Leeds'})
) AS t(email, phone, addr);
D SELECT * FROM ft_profile('people');
┌─────────────┬────────────────────────────┬────────────────────┬──────────────┐
│ column_name │ detected_type │ confidence │ duckdb_type │
│ varchar │ varchar │ double │ varchar │
├─────────────┼────────────────────────────┼────────────────────┼──────────────┤
│ email │ identity.person.email │ 0.5812865495681763 │ VARCHAR │
│ phone │ identity.person.phone_num… │ 0.5 │ VARCHAR │
│ addr │ │ │ │
└─────────────┴────────────────────────────┴────────────────────┴──────────────┘
-- (addr is a nested STRUCT column — profiled as NULL rather than
-- silently rejected; see the nested-column note below.)
-- Validate a table against a JSON Schema: one row per column, counting
-- how many values the schema rejects.
D SELECT * FROM ft_validate('people',
'{"email": {"type":"string","format":"email"}}');
-- Classify a single value (per-value, no column context)
D SELECT ft_infer('https://example.com') AS detected_type;
┌──────────────────────────┐
│ detected_type │
│ varchar │
├──────────────────────────┤
│ technology.internet.url │
└──────────────────────────┘
About finetype
FineType is a semantic type classifier that detects 240 data types from raw string values. A
multi-branch neural network labels each value, mapping it into a three-level taxonomy:
domain.category.type.
FineType is column-oriented: its accuracy comes from seeing a whole column at once, so the
primary surface is the ft_profile / ft_validate table verbs below. Per-value scalar
functions are also provided for ad-hoc classification, but a lone value that normally leans on its
column neighbours to disambiguate (a bare UUID, a phone number) is classified less confidently
than the same value seen in column context.
Table verbs
These take a table (or a LIST of column values) and return one row per column — the natural
grain for profiling and validation.
ft_profile(table_name VARCHAR)
Profile every column of a table. Returns (column_name, detected_type, confidence, duckdb_type),
one row per column. The table is named as a string literal; FineType reaches into the catalog to
read its columns.
SELECT * FROM ft_profile('my_table');
Nested STRUCT / LIST columns are guarded, not silently rejected: they profile as NULL
rather than being forced through per-value classification that would misreport them. Flatten the
column first if you need it typed.
ft_validate(table_name VARCHAR, schema_json VARCHAR)
Validate a table's columns against a JSON Schema. Returns one row per validated column with the
total values checked, the reject count, and a sample rejection message. The schema can be supplied
inline, via getvariable, or read from a file with DuckDB's read_text:
-- inline
SELECT * FROM ft_validate('my_table', '{"email": {"type":"string","format":"email"}}');
-- from a file
SELECT * FROM ft_validate('my_table', (SELECT content FROM read_text('schema.json')));
ft_profile(values LIST)
Profile a single column passed as a list — useful when the values are already in hand rather than in a named table.
SELECT * FROM ft_profile((SELECT list(email) FROM people));
Scalar functions
Per-value classification, one value at a time. Strongest on values unambiguous in isolation (URLs, ISO dates, well-formed emails, IP addresses).
ft_infer(value VARCHAR) → VARCHAR
Classify a single value. Returns the full semantic type label.
SELECT ft_infer('https://example.com'); -- technology.internet.url
SELECT ft_infer('2024-01-15'); -- datetime.date.iso
SELECT ft_infer('true'); -- representation.boolean.terms
ft_detail(value VARCHAR) → VARCHAR
Classify with full detail. Returns JSON with the type, confidence (0.0–1.0), the recommended DuckDB type, the number of samples seen, the disambiguation path, and the per-label vote map.
SELECT ft_detail('192.168.1.1');
-- {"type": "technology.internet.ip_v4", "confidence": 0.901, "duckdb_type": "INET", ...}
ft_cast(value VARCHAR) → VARCHAR
Normalize a value for safe TRY_CAST() to its detected DuckDB type. Handles date format
conversion (US/EU → ISO), boolean normalization, UUID lowercasing, numeric cleanup.
SELECT ft_cast('01/15/2024'); -- 2024-01-15 (US date → ISO)
ft_validate_text(value VARCHAR, schema_json VARCHAR) → VARCHAR
Validate a single value against a JSON Schema fragment. Returns valid when the value conforms,
or the first validation error otherwise.
SELECT ft_validate_text('[email protected]', '{"type":"string","format":"email"}'); -- valid
ft_unpack(json VARCHAR) → VARCHAR
Recursively classify every scalar value in a JSON document. Returns annotated JSON carrying the type, confidence, recommended DuckDB type, and original value for each field.
ft_version() → VARCHAR
Returns the extension version string.
Aliases. The earlier un-prefixed scalar names (
finetype,finetype_detail,finetype_cast,finetype_unpack,finetype_validate,finetype_version) remain registered as aliases of theft_scalars for one release, so a v0.6.22 install keeps working. New code should use theft_names.
Type Taxonomy
240 types organized into 7 domains:
- container: JSON, XML, CSV, arrays, key-value (11 types)
- datetime: dates, times, timestamps, epochs, durations, offsets (84 types)
- finance: currencies, accounting, market identifiers, transactions (28 types)
- geography: coordinates, locations, addresses, transportation (25 types)
- identity: names, emails, phones, payments, medical (33 types)
- representation: booleans, numbers, text, files, scientific (33 types)
- technology: URLs, IPs, UUIDs, versions, codes (26 types)
For more information, see the FineType documentation.
Added Functions
| function_name | function_type | description | comment | examples |
|---|---|---|---|---|
| finetype | scalar | NULL | NULL | |
| finetype_cast | scalar | NULL | NULL | |
| finetype_detail | scalar | NULL | NULL | |
| finetype_spike | table | NULL | NULL | |
| finetype_unpack | scalar | NULL | NULL | |
| finetype_validate | scalar | NULL | NULL | |
| finetype_version | scalar | NULL | NULL | |
| ft_cast | scalar | NULL | NULL | |
| ft_detail | scalar | NULL | NULL | |
| ft_infer | scalar | NULL | NULL | |
| ft_profile | scalar | NULL | NULL | |
| ft_profile | table_macro | NULL | NULL | |
| ft_unpack | scalar | NULL | NULL | |
| ft_validate | table_macro | NULL | NULL | |
| ft_validate_text | scalar | NULL | NULL | |
| ft_version | scalar | NULL | NULL |
Overloaded Functions
This extension does not add any function overloads.
Added Types
This extension does not add any types.
Added Settings
This extension does not add any settings.