DuckDB Internals

On this page is a brief description of the internals of the DuckDB engine.

Parser

The parser converts a query string into the following tokens:

The parser is not aware of the catalog or any other aspect of the database. It will not throw errors if tables do not exist, and will not resolve any types of columns yet. It only transforms a query string into a set of tokens as specified.

ParsedExpression

The ParsedExpression represents an expression within a SQL statement. This can be e.g. a reference to a column, an addition operator or a constant value. The type of the ParsedExpression indicates what it represents, e.g. a comparison is represented as a ComparisonExpression.

ParsedExpressions do not have types, except for nodes with explicit types such as CAST statements. The types for expressions are resolved in the Binder, not in the Parser.

TableRef

The TableRef represents any table source. This can be a reference to a base table, but it can also be a join, a table-producing function or a subquery.

QueryNode

The QueryNode represents either (1) a SELECT statement, or (2) a set operation (i.e. UNION, INTERSECT or DIFFERENCE).

SQL Statement

The SQLStatement represents a complete SQL statement. The type of the SQL Statement represents what kind of statement it is (e.g. StatementType::SELECT represents a SELECT statement). A single SQL string can be transformed into multiple SQL statements in case the original query string contains multiple queries.

Binder

The binder converts all nodes into their bound equivalents. In the binder phase:

  • The tables and columns are resolved using the catalog
  • Types are resolved
  • Aggregate/window functions are extracted

The following conversions happen:

Logical Planner

The logical planner creates LogicalOperator nodes from the bound statements. In this phase, the actual logical query tree is created.

Optimizer

After the logical planner has created the logical query tree, the optimizers are run over that query tree to create an optimized query plan. The following query optimizers are run:

  • Expression Rewriter: Simplifies expressions, performs constant folding
  • Filter Pushdown: Pushes filters down into the query plan and duplicates filters over equivalency sets. Also prunes subtrees that are guaranteed to be empty (because of filters that statically evaluate to false).
  • Join Order Optimizer: Reorders joins using dynamic programming. Specifically, the DPcpp algorithm from the paper Dynamic Programming Strikes Back is used.
  • Common Sub Expressions: Extracts common subexpressions from projection and filter nodes to prevent unnecessary duplicate execution.
  • In Clause Rewriter: Rewrites large static IN clauses to a MARK join or INNER join.

Column Binding Resolver

The column binding resolver converts logical BoundColumnRefExpresion nodes that refer to a column of a specific table into BoundReferenceExpression nodes that refer to a specific index into the DataChunks that are passed around in the execution engine.

Physical Plan Generator

The physical plan generator converts the resulting logical operator tree into a PhysicalOperator tree.

Execution

In the execution phase, the physical operators are executed to produce the query result. The execution model is a vectorized volcano model, where DataChunks are pulled from the root node of the physical operator tree. Each PhysicalOperator itself defines how it grants its result. A PhysicalTableScan node will pull the chunk from the base tables on disk, whereas a PhysicalHashJoin will perform a hash join between the output obtained from its child nodes.