Search Shortcut cmd + k | ctrl + k
finetype

Semantic type classification — detects 240 data types (emails, URLs, dates, UUIDs, currencies, etc.) from raw strings

Maintainer(s): hughcameron

Installing and Loading

INSTALL finetype FROM community;
LOAD finetype;

Example

-- Profile every column of a table: one row per column, with the detected
-- semantic type, model confidence, and recommended DuckDB storage type.
D CREATE TABLE people AS SELECT * FROM (VALUES
    ('[email protected]', '+44 20 7946 0958', {'city': 'London'}),
    ('[email protected]', '0117 496 0123',    {'city': 'Bristol'}),
    ('not-an-email',           'not-a-phone',       {'city': 'Leeds'})
  ) AS t(email, phone, addr);

D SELECT * FROM ft_profile('people');
┌─────────────┬────────────────────────────┬────────────────────┬──────────────┐
 column_name        detected_type             confidence      duckdb_type  
   varchar             varchar                  double          varchar    
├─────────────┼────────────────────────────┼────────────────────┼──────────────┤
 email        identity.person.email       0.5812865495681763  VARCHAR      
 phone        identity.person.phone_num                 0.5  VARCHAR      
 addr                                                                      
└─────────────┴────────────────────────────┴────────────────────┴──────────────┘
-- (addr is a nested STRUCT column — profiled as NULL rather than
--  silently rejected; see the nested-column note below.)

-- Validate a table against a JSON Schema: one row per column, counting
-- how many values the schema rejects.
D SELECT * FROM ft_validate('people',
    '{"email": {"type":"string","format":"email"}}');

-- Classify a single value (per-value, no column context)
D SELECT ft_infer('https://example.com') AS detected_type;
┌──────────────────────────┐
      detected_type       
         varchar          
├──────────────────────────┤
 technology.internet.url  
└──────────────────────────┘

About finetype

FineType is a semantic type classifier that detects 240 data types from raw string values. A multi-branch neural network labels each value, mapping it into a three-level taxonomy: domain.category.type.

FineType is column-oriented: its accuracy comes from seeing a whole column at once, so the primary surface is the ft_profile / ft_validate table verbs below. Per-value scalar functions are also provided for ad-hoc classification, but a lone value that normally leans on its column neighbours to disambiguate (a bare UUID, a phone number) is classified less confidently than the same value seen in column context.

Table verbs

These take a table (or a LIST of column values) and return one row per column — the natural grain for profiling and validation.

ft_profile(table_name VARCHAR)

Profile every column of a table. Returns (column_name, detected_type, confidence, duckdb_type), one row per column. The table is named as a string literal; FineType reaches into the catalog to read its columns.

SELECT * FROM ft_profile('my_table');

Nested STRUCT / LIST columns are guarded, not silently rejected: they profile as NULL rather than being forced through per-value classification that would misreport them. Flatten the column first if you need it typed.

ft_validate(table_name VARCHAR, schema_json VARCHAR)

Validate a table's columns against a JSON Schema. Returns one row per validated column with the total values checked, the reject count, and a sample rejection message. The schema can be supplied inline, via getvariable, or read from a file with DuckDB's read_text:

-- inline
SELECT * FROM ft_validate('my_table', '{"email": {"type":"string","format":"email"}}');

-- from a file
SELECT * FROM ft_validate('my_table', (SELECT content FROM read_text('schema.json')));

ft_profile(values LIST)

Profile a single column passed as a list — useful when the values are already in hand rather than in a named table.

SELECT * FROM ft_profile((SELECT list(email) FROM people));

Scalar functions

Per-value classification, one value at a time. Strongest on values unambiguous in isolation (URLs, ISO dates, well-formed emails, IP addresses).

ft_infer(value VARCHAR) → VARCHAR

Classify a single value. Returns the full semantic type label.

SELECT ft_infer('https://example.com');  -- technology.internet.url
SELECT ft_infer('2024-01-15');           -- datetime.date.iso
SELECT ft_infer('true');                 -- representation.boolean.terms

ft_detail(value VARCHAR) → VARCHAR

Classify with full detail. Returns JSON with the type, confidence (0.0–1.0), the recommended DuckDB type, the number of samples seen, the disambiguation path, and the per-label vote map.

SELECT ft_detail('192.168.1.1');
-- {"type": "technology.internet.ip_v4", "confidence": 0.901, "duckdb_type": "INET", ...}

ft_cast(value VARCHAR) → VARCHAR

Normalize a value for safe TRY_CAST() to its detected DuckDB type. Handles date format conversion (US/EU → ISO), boolean normalization, UUID lowercasing, numeric cleanup.

SELECT ft_cast('01/15/2024');  -- 2024-01-15 (US date → ISO)

ft_validate_text(value VARCHAR, schema_json VARCHAR) → VARCHAR

Validate a single value against a JSON Schema fragment. Returns valid when the value conforms, or the first validation error otherwise.

SELECT ft_validate_text('[email protected]', '{"type":"string","format":"email"}');  -- valid

ft_unpack(json VARCHAR) → VARCHAR

Recursively classify every scalar value in a JSON document. Returns annotated JSON carrying the type, confidence, recommended DuckDB type, and original value for each field.

ft_version() → VARCHAR

Returns the extension version string.

Aliases. The earlier un-prefixed scalar names (finetype, finetype_detail, finetype_cast, finetype_unpack, finetype_validate, finetype_version) remain registered as aliases of the ft_ scalars for one release, so a v0.6.22 install keeps working. New code should use the ft_ names.

Type Taxonomy

240 types organized into 7 domains:

  • container: JSON, XML, CSV, arrays, key-value (11 types)
  • datetime: dates, times, timestamps, epochs, durations, offsets (84 types)
  • finance: currencies, accounting, market identifiers, transactions (28 types)
  • geography: coordinates, locations, addresses, transportation (25 types)
  • identity: names, emails, phones, payments, medical (33 types)
  • representation: booleans, numbers, text, files, scientific (33 types)
  • technology: URLs, IPs, UUIDs, versions, codes (26 types)

For more information, see the FineType documentation.

Added Functions

function_name function_type description comment examples
finetype scalar NULL NULL  
finetype_cast scalar NULL NULL  
finetype_detail scalar NULL NULL  
finetype_spike table NULL NULL  
finetype_unpack scalar NULL NULL  
finetype_validate scalar NULL NULL  
finetype_version scalar NULL NULL  
ft_cast scalar NULL NULL  
ft_detail scalar NULL NULL  
ft_infer scalar NULL NULL  
ft_profile scalar NULL NULL  
ft_profile table_macro NULL NULL  
ft_unpack scalar NULL NULL  
ft_validate table_macro NULL NULL  
ft_validate_text scalar NULL NULL  
ft_version scalar NULL NULL  

Overloaded Functions

This extension does not add any function overloads.

Added Types

This extension does not add any types.

Added Settings

This extension does not add any settings.