
DuckLineage Extension for DuckDB

This is an extension for DuckDB that automatically captures and emits OpenLineage events for every query executed. This enables automated data lineage, governance, and observability for DuckDB workloads.

DuckLineage is developed by ILUM, the free data lakehouse platform for a cloud native world.

Supported Features

This extension currently implements the following OpenLineage capabilities:

  • Automatic events: Emits START, COMPLETE and FAIL events for each query execution.
  • Dataset detection: Extracts input and output datasets from query plans (tables, CREATE/INSERT/UPDATE/DELETE/COPY, basic file table functions).
  • Schema capture: Records dataset schema (column names and types) as a dataset facet.
  • Column-level lineage: Emits the OpenLineage columnLineage dataset facet on output datasets, tracking which input columns flow into each output column and whether the transformation is direct or indirect. Supports column refs, expressions, aliases, CAST, star expansion, JOINs, aggregation, UNION/INTERSECT/EXCEPT, window functions, CTAS, INSERT INTO SELECT, file scans (CSV/Parquet), PIVOT, and UNNEST.
  • Facets: Job sql facet, run parent facet (via OPENLINEAGE_PARENT* env vars), processing_engine, dataSource and catalog facets, lifecycle change facets (CREATE/DROP/ALTER/OVERWRITE/RENAME/TRUNCATE), and basic outputStatistics (row count).
  • DuckLake support: Works with DuckLake catalogs. For DuckLake with external storage (S3, local path), the dataset namespace is automatically resolved from the DuckLake DATA_PATH.
  • Asynchronous delivery: Background HTTP client with configurable OpenLineage URL, API key, retries, queueing and debug logging.
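To make the column-level lineage feature concrete, here is a hand-written sketch of the `columnLineage` dataset facet an output dataset might carry for a query like `CREATE TABLE t AS SELECT a AS x FROM s`. The shape follows the OpenLineage facet schema; the specific names, namespace, and schema version below are illustrative, not captured from a real run:

```python
import json

# Illustrative columnLineage dataset facet: each output field lists the
# input fields it was derived from, plus the transformation kind.
# All values below are hypothetical examples, not real extension output.
column_lineage = {
    "_producer": "https://github.com/ilum-cloud/duck_lineage",
    "_schemaURL": "https://openlineage.io/spec/facets/1-2-0/ColumnLineageDatasetFacet.json",
    "fields": {
        "x": {  # output column "x" ...
            "inputFields": [
                {
                    "namespace": "duckdb",
                    "name": "main.s",   # ... comes from table s ...
                    "field": "a",       # ... column a,
                    "transformations": [
                        {"type": "DIRECT", "subtype": "IDENTITY"}
                    ],
                }
            ]
        }
    },
}

print(json.dumps(column_lineage, indent=2))
```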

Note: This extension works with both direct SQL execution and PreparedStatements (used by embedded drivers like Java JDBC and Python SQLAlchemy). When a query is executed via PreparedStatement, DuckDB does not expose the original SQL string to the optimizer. In this case, the extension still captures full lineage (inputs, outputs, schemas, facets), but view reference detection is unavailable — tables accessed through views will appear as direct inputs rather than being grouped under their view. The SQL job facet is also omitted when the original query string is not available.

Try It Out

The fastest way to see DuckLineage in action is the quickstart demo. It starts Marquez, runs a sample ETL pipeline, and drops you into an interactive DuckDB session with lineage tracking enabled, all in one command:

./test/demo.sh

or

make demo

Then open http://localhost:3000 to explore the lineage graph. When you're done, stop the infrastructure with make demo-down.

Prerequisites: Docker (with Compose) and curl. The demo will auto-download DuckDB if it's not already on your PATH.

Quick Start

First, install and load the extension in DuckDB from the community extension repository:

INSTALL duck_lineage FROM community;
LOAD duck_lineage;

Next, configure the OpenLineage backend URL (e.g., Marquez):

SET duck_lineage_url='http://localhost:5000/api/v1/lineage';

-- Set API Key (Optional)
-- SET duck_lineage_api_key='your-api-key';

-- Set Namespace (Default: duckdb)
-- SET duck_lineage_namespace='my-data-warehouse';

-- Enable Debug Mode (Logs JSON events to console)
-- SET duck_lineage_debug=true;

-- Advanced Configuration (Optional)
-- SET duck_lineage_max_retries=3;           -- HTTP retry attempts (default: 3)
-- SET duck_lineage_max_queue_size=10000;    -- Max pending events before dropping (default: 10000)
-- SET duck_lineage_timeout=10;              -- HTTP request timeout in seconds (default: 10)
-- SET duck_lineage_exclude_dataset_prefixes='__ducklake_metadata_'; -- Comma-separated prefixes to exclude from lineage events
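The exclude setting takes a comma-separated list of name prefixes: any dataset whose name starts with one of them is left out of emitted events. A minimal Python sketch of that matching logic (an illustration of the setting's semantics, not the extension's actual C++ implementation):

```python
def excluded(dataset_name: str, exclude_prefixes: str) -> bool:
    """Return True if the dataset name starts with any configured prefix."""
    prefixes = [p.strip() for p in exclude_prefixes.split(",") if p.strip()]
    return any(dataset_name.startswith(p) for p in prefixes)

setting = "__ducklake_metadata_,tmp_"
print(excluded("__ducklake_metadata_ducklake", setting))  # True
print(excluded("users", setting))                          # False
```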

Execute your analytical queries. The extension will automatically trace them.

CREATE TABLE users (id INT, name VARCHAR);
INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
SELECT count(*) FROM users;

Check your OpenLineage backend (e.g., Marquez UI) to see the lineage graph and run details!
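Under the hood, each traced query produces OpenLineage run events posted to the configured URL. A hand-written sketch of a minimal START event envelope for the `INSERT` above (the job name, namespace, and dataset names here are illustrative; a real event also carries the facets listed earlier):

```python
import json
import uuid
from datetime import datetime, timezone

# Minimal OpenLineage RunEvent envelope, roughly the shape POSTed to
# duck_lineage_url. Job/dataset names are hypothetical examples.
event = {
    "eventType": "START",
    "eventTime": datetime.now(timezone.utc).isoformat(),
    "run": {"runId": str(uuid.uuid4())},
    "job": {"namespace": "duckdb", "name": "duckdb-query"},
    "inputs": [],
    "outputs": [{"namespace": "duckdb", "name": "main.users"}],
    "producer": "https://github.com/ilum-cloud/duck_lineage",
}

print(json.dumps(event, indent=2))
```

A matching COMPLETE (or FAIL) event with the same `runId` closes the run, which is how the backend pairs the two.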

Parent Run Integration (Airflow/Dagster)

The extension automatically detects standard OpenLineage environment variables to link queries to a parent job run:

  • OPENLINEAGE_PARENT_RUN_ID
  • OPENLINEAGE_PARENT_JOB_NAMESPACE
  • OPENLINEAGE_PARENT_JOB_NAME

If these are set (e.g., by your Airflow operator), the DuckDB query will appear as a child run in the lineage graph.
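The parent linkage boils down to reading those three variables and attaching a parent run facet when all are present. A Python sketch of that logic (the environment values below are illustrative; in practice your orchestrator's operator sets them):

```python
import os

# Simulate what an Airflow/Dagster operator would export. These values
# are hypothetical examples for illustration.
os.environ["OPENLINEAGE_PARENT_RUN_ID"] = "11111111-2222-3333-4444-555555555555"
os.environ["OPENLINEAGE_PARENT_JOB_NAMESPACE"] = "airflow"
os.environ["OPENLINEAGE_PARENT_JOB_NAME"] = "daily_etl.load_users"

def parent_facet():
    """Build an OpenLineage parent run facet from the standard env vars."""
    run_id = os.environ.get("OPENLINEAGE_PARENT_RUN_ID")
    namespace = os.environ.get("OPENLINEAGE_PARENT_JOB_NAMESPACE")
    name = os.environ.get("OPENLINEAGE_PARENT_JOB_NAME")
    if not (run_id and namespace and name):
        return None  # no parent context: emit the run without a parent facet
    return {
        "run": {"runId": run_id},
        "job": {"namespace": namespace, "name": name},
    }

print(parent_facet())
```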

Building

Prerequisites

For building DuckDB extensions, you will need the following tools installed on your system:

  • CMake (3.11+)
  • A C++11 compliant compiler (GCC/Clang)

Additionally, this extension uses vcpkg to manage dependencies. You can skip vcpkg, but then you must install the development libraries for OpenSSL, curl, and JSON yourself (on Ubuntu-based systems: libssl-dev, libcurl4-openssl-dev, nlohmann-json3-dev).

Build steps

To build the extension, run:

make

NOTE: If using vcpkg, you may additionally need to set the VCPKG_TOOLCHAIN_PATH environment variable to point at the vcpkg.cmake file.

The main binaries that will be built are:

./build/release/duckdb
./build/release/extension/duck_lineage/duck_lineage.duckdb_extension
  • duckdb is the binary for the duckdb shell with the extension code automatically loaded.
  • duck_lineage.duckdb_extension is the loadable binary as it would be distributed.

Running the tests

Integration Tests with Marquez

This extension includes comprehensive integration tests that verify lineage events are correctly sent to Marquez (OpenLineage server).

Because the tests need a running Marquez server, they require Docker (with Docker Compose) and uv to be installed. The tests are written in Python, and uv manages their dependencies.

Quick start:

make test-all

See test/README.md for detailed testing documentation.

Extension development

This repository is based on https://github.com/duckdb/extension-template, check it out if you want to build and ship your own DuckDB extension.

Extension updating

When cloning the template, the target version of DuckDB should be the latest stable release of DuckDB. However, there will inevitably come a time when a new DuckDB is released and the extension repository needs updating. This process goes as follows:

  • Bump submodules
    • ./duckdb should be set to latest tagged release
  • ./extension-ci-tools should be set to the updated branch corresponding to the latest DuckDB release. So if you're building for DuckDB v1.1.0, there will be a branch in extension-ci-tools named v1.1.0 that you should check out.
  • Bump versions in ./.github/workflows
    • duckdb_version input in duckdb-stable-build job in MainDistributionPipeline.yml should be set to latest tagged release
    • the reusable workflow duckdb/extension-ci-tools/.github/workflows/_extension_distribution.yml for the duckdb-stable-build job should be set to latest tagged release

API changes

DuckDB extensions built with this extension template are built against the internal C++ API of DuckDB. This API is not guaranteed to be stable. In practice, this means that when you update your extension's DuckDB target version using the steps above, you may find that the extension no longer builds.

Currently, DuckDB does not (yet) provide a specific change log for these API changes, but it is generally not too hard to figure out what has changed.

For figuring out how and why the C++ API changed, comparing against the DuckDB source history at the new release tag is the most reliable reference.

Setting up CLion

Opening project

Configuring CLion with this extension requires a little work. Firstly, make sure that the DuckDB submodule is available. Then open ./duckdb/CMakeLists.txt (not the top-level CMakeLists.txt file from this repo) as a project in CLion. To fix your project path, go to Tools -> CMake -> Change Project Root and set the project root to the root dir of this repo.

Debugging

To set up debugging in CLion, two simple steps are required. Firstly, in CLion -> Settings / Preferences -> Build, Execution, Deploy -> CMake, add the desired builds (e.g. Debug, Release, RelDebug, etc.). There are different ways to configure this, but the easiest is to leave everything empty except the build path, which should be set to ../build/{build type}, and the CMake Options field, to which the following flag should be added (with the path to the extension's CMakeLists.txt):

-DDUCKDB_EXTENSION_CONFIGS=<path_to_the_extension_CMakeLists.txt>

The second step is to configure the unittest runner as a run/debug configuration. To do this, go to Run -> Edit Configurations and click + -> CMake Application. The target and executable should be unittest. This will run all the DuckDB tests. To run only the extension-specific tests, add --test-dir ../../.. [sql] to the Program Arguments. Note that it is recommended to use the unittest executable for testing/development within CLion; the actual DuckDB CLI currently does not reliably work as a run target in CLion.
