This is an extension for DuckDB that automatically captures and emits OpenLineage events for every query executed. This enables automated data lineage, governance, and observability for DuckDB workloads.
DuckLineage is developed by ILUM, the free data lakehouse platform for a cloud-native world.
This extension currently implements the following OpenLineage capabilities:
- Automatic events: Emits START, COMPLETE and FAIL events for each query execution.
- Dataset detection: Extracts input and output datasets from query plans (tables, CREATE/INSERT/UPDATE/DELETE/COPY, basic file table functions).
- Schema capture: Records dataset schema (column names and types) as a dataset facet.
- Column-level lineage: Emits the OpenLineage `columnLineage` dataset facet on output datasets, tracking which input columns flow into each output column and whether the transformation is direct or indirect. Supports column refs, expressions, aliases, CAST, star expansion, JOINs, aggregation, UNION/INTERSECT/EXCEPT, window functions, CTAS, INSERT INTO SELECT, file scans (CSV/Parquet), PIVOT, and UNNEST.
- Facets: Job `sql` facet, run `parent` facet (via `OPENLINEAGE_PARENT_*` env vars), `processing_engine`, `dataSource` and `catalog` facets, lifecycle change facets (CREATE/DROP/ALTER/OVERWRITE/RENAME/TRUNCATE), and basic `outputStatistics` (row count).
- DuckLake support: Works with DuckLake catalogs. For DuckLake with external storage (S3, local path), the dataset namespace is automatically resolved from the DuckLake DATA_PATH.
- Asynchronous delivery: Background HTTP client with configurable OpenLineage URL, API key, retries, queueing and debug logging.
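To make the event shape concrete, here is a minimal sketch of the kind of START event described above, assembled with the Python standard library. The field values (job name, dataset name, producer URI) are illustrative only, and the exact set of facets the extension attaches may differ:

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage RunEvent skeleton; all values below are illustrative.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "duckdb", "name": "create_table_users"},
    "inputs": [],
    "outputs": [
        {
            "namespace": "duckdb",
            "name": "main.users",
            "facets": {
                # Schema capture: column names and types as a dataset facet
                "schema": {
                    "fields": [
                        {"name": "id", "type": "INTEGER"},
                        {"name": "name", "type": "VARCHAR"},
                    ]
                }
            },
        }
    ],
    "producer": "https://example.com/duck_lineage",  # hypothetical producer URI
}

print(json.dumps(event, indent=2))
```

A matching COMPLETE (or FAIL) event with the same `runId` closes out the run.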
Note: This extension works with both direct SQL execution and PreparedStatements (used by embedded drivers like Java JDBC and Python SQLAlchemy). When a query is executed via PreparedStatement, DuckDB does not expose the original SQL string to the optimizer. In this case, the extension still captures full lineage (inputs, outputs, schemas, facets), but view reference detection is unavailable — tables accessed through views will appear as direct inputs rather than being grouped under their view. The SQL job facet is also omitted when the original query string is not available.
The fastest way to see DuckLineage in action is the quickstart demo. It starts Marquez, runs a sample ETL pipeline, and drops you into an interactive DuckDB session with lineage tracking enabled, all in one command:

```shell
./test/demo.sh
```

or

```shell
make demo
```

Then open http://localhost:3000 to explore the lineage graph. When you're done, stop the infrastructure with `make demo-down`.
Prerequisites: Docker (with Compose) and curl. The demo will auto-download DuckDB if it's not already on your PATH.
The first step is to load the extension in DuckDB from the community extension repository:
```sql
INSTALL duck_lineage FROM community;
LOAD duck_lineage;
```

Next, configure the OpenLineage backend URL (e.g., Marquez):
```sql
SET duck_lineage_url='http://localhost:5000/api/v1/lineage';

-- Set API key (optional)
-- SET duck_lineage_api_key='your-api-key';

-- Set namespace (default: duckdb)
-- SET duck_lineage_namespace='my-data-warehouse';

-- Enable debug mode (logs JSON events to console)
-- SET duck_lineage_debug=true;

-- Advanced configuration (optional)
-- SET duck_lineage_max_retries=3;        -- HTTP retry attempts (default: 3)
-- SET duck_lineage_max_queue_size=10000; -- Max pending events before dropping (default: 10000)
-- SET duck_lineage_timeout=10;           -- HTTP request timeout in seconds (default: 10)
-- SET duck_lineage_exclude_dataset_prefixes='__ducklake_metadata_'; -- Comma-separated prefixes to exclude from lineage events
```

Execute your analytical queries. The extension will automatically trace them.
```sql
CREATE TABLE users (id INT, name VARCHAR);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
SELECT count(*) FROM users;
```

Check your OpenLineage backend (e.g., the Marquez UI) to see the lineage graph and run details!
The extension automatically detects standard OpenLineage environment variables to link queries to a parent job run:
- `OPENLINEAGE_PARENT_RUN_ID`
- `OPENLINEAGE_PARENT_JOB_NAMESPACE`
- `OPENLINEAGE_PARENT_JOB_NAME`
If these are set (e.g., by your Airflow operator), the DuckDB query will appear as a child run in the lineage graph.
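As a sketch of the mapping (the function name and example values here are illustrative, not part of the extension's API), the three variables translate into an OpenLineage `parent` run facet roughly like this:

```python
import os

def parent_facet_from_env(env=None):
    """Build an OpenLineage `parent` run facet from the standard env vars.

    Illustrative sketch of the mapping; returns None when any variable is
    unset, in which case no parent facet would be attached.
    """
    env = os.environ if env is None else env
    run_id = env.get("OPENLINEAGE_PARENT_RUN_ID")
    namespace = env.get("OPENLINEAGE_PARENT_JOB_NAMESPACE")
    name = env.get("OPENLINEAGE_PARENT_JOB_NAME")
    if not (run_id and namespace and name):
        return None
    return {
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": name},
    }

# Example values an Airflow operator might export (hypothetical)
facet = parent_facet_from_env({
    "OPENLINEAGE_PARENT_RUN_ID": "3f6b0d8e-1c2a-4f5e-9b7d-0a1b2c3d4e5f",
    "OPENLINEAGE_PARENT_JOB_NAMESPACE": "airflow",
    "OPENLINEAGE_PARENT_JOB_NAME": "daily_etl.duckdb_task",
})
print(facet)
```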
For building DuckDB extensions, you will need the following tools installed on your system:
- CMake (3.11+)
- A C++11 compliant compiler (GCC/Clang)
Additionally, this extension uses vcpkg to manage dependencies. You can omit it, but then you will need to manually install the development libraries for OpenSSL, curl and JSON (on Ubuntu-based systems: `libssl-dev libcurl4-openssl-dev nlohmann-json3-dev`).
Now, to build the extension, run:

```shell
make
```

NOTE: You may additionally need to set the `VCPKG_TOOLCHAIN_PATH` environment variable to point to the `vcpkg.cmake` file if using vcpkg.
The main binaries that will be built are:

```
./build/release/duckdb
./build/release/extension/duck_lineage/duck_lineage.duckdb_extension
```

- `duckdb` is the binary for the DuckDB shell with the extension code automatically loaded.
- `duck_lineage.duckdb_extension` is the loadable binary as it would be distributed.
This extension includes comprehensive integration tests that verify lineage events are correctly sent to Marquez (OpenLineage server).
Since the tests need a running Marquez server, they require Docker with Docker Compose and uv to be installed. uv is used to manage the Python dependencies, as the tests are written in Python.
Quick start:

```shell
make test-all
```

See test/README.md for detailed testing documentation.
This repository is based on https://github.com/duckdb/extension-template, check it out if you want to build and ship your own DuckDB extension.
When cloning the template, the target version of DuckDB should be the latest stable release of DuckDB. However, there will inevitably come a time when a new DuckDB is released and the extension repository needs updating. This process goes as follows:
- Bump submodules
  - `./duckdb` should be set to the latest tagged release.
  - `./extension-ci-tools` should be set to the updated branch corresponding to the latest DuckDB release. So if you're building for DuckDB `v1.1.0`, there will be a branch in `extension-ci-tools` named `v1.1.0` which you should check out.
- Bump versions in `.github/workflows`
  - The `duckdb_version` input in the `duckdb-stable-build` job in `MainDistributionPipeline.yml` should be set to the latest tagged release.
  - The reusable workflow `duckdb/extension-ci-tools/.github/workflows/_extension_distribution.yml` for the `duckdb-stable-build` job should be set to the latest tagged release.
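For orientation, the relevant lines in `MainDistributionPipeline.yml` look roughly like this after bumping to a hypothetical DuckDB `v1.1.0` (a sketch; the actual job layout and inputs in your workflow file may differ):

```yaml
# Sketch of the version bump (illustrative; check against your workflow file)
jobs:
  duckdb-stable-build:
    uses: duckdb/extension-ci-tools/.github/workflows/_extension_distribution.yml@v1.1.0
    with:
      duckdb_version: v1.1.0
      extension_name: duck_lineage
```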
DuckDB extensions built with this extension template are built against the internal C++ API of DuckDB. This API is not guaranteed to be stable, so when you update your extension's DuckDB target version using the steps above, you may find that your extension no longer builds.
Currently, DuckDB does not (yet) provide a specific change log for these API changes, but it is generally not too hard to figure out what has changed.
For figuring out how and why the C++ API changed, we recommend using the following resources:
- DuckDB's Release Notes
- DuckDB's history of Core extension patches
- The git history of the relevant C++ Header file of the API that has changed
Configuring CLion with this extension requires a little work. Firstly, make sure that the DuckDB submodule is available.
Then make sure to open ./duckdb/CMakeLists.txt (so not the top level CMakeLists.txt file from this repo) as a project in CLion.
Now, to fix your project path, go to Tools -> CMake -> Change Project Root (docs) and set the project root to the root directory of this repo.
To set up debugging in CLion, two simple steps are required. Firstly, in CLion -> Settings / Preferences -> Build, Execution, Deploy -> CMake you will need to add the desired builds (e.g. Debug, Release, RelDebug, etc.). There are different ways to configure this, but the easiest is to leave everything empty except the build path, which needs to be set to `../build/{build type}`, and the CMake Options, to which the following flag should be added, with the path to the extension's CMakeLists:

```
-DDUCKDB_EXTENSION_CONFIGS=<path_to_the_extension_CMakeLists.txt>
```
The second step is to configure the unittest runner as a run/debug configuration. To do this, go to Run -> Edit Configurations and click + -> CMake Application. The target and executable should be `unittest`. This will run all the DuckDB tests. To run only the extension-specific tests, add `--test-dir ../../.. [sql]` to the Program Arguments. Note that it is recommended to use the `unittest` executable for testing/development within CLion. The actual DuckDB CLI currently does not reliably work as a run target in CLion.

