
haduck

DuckDB extension for block-level replication to S3-compatible storage. Intercepts DuckDB's file I/O, tracks dirty 256KB blocks, and ships them to S3 on checkpoint.

Part of the hadb ecosystem for making embedded databases highly available.

How it works

haduck registers a custom FileSystem with DuckDB that wraps the local filesystem:

  1. Write interception: every Write() call marks the affected 256KB blocks as dirty in a bitmap
  2. Sync/checkpoint: on FileSync() (triggered by DuckDB's CHECKPOINT), dirty blocks are read from disk and uploaded to S3
  3. S3 key layout: {prefix}/{db_file_key}/block_{block_id} per block

All reads and writes hit local disk at full speed. S3 uploads happen asynchronously during checkpoint, not on the write path.
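To make the block math concrete, here is a minimal sketch (hypothetical helper, not haduck's actual code) of how a write at a given offset and length maps to 256KB block ids:

#include <cstdint>
#include <set>

// Illustrative only: map a write to the 256KB block indices it touches.
static constexpr uint64_t kBlockSize = 256 * 1024;

std::set<uint64_t> BlocksTouched(uint64_t offset, uint64_t length) {
    std::set<uint64_t> blocks;
    if (length == 0) {
        return blocks;
    }
    uint64_t first = offset / kBlockSize;
    uint64_t last = (offset + length - 1) / kBlockSize;
    for (uint64_t b = first; b <= last; b++) {
        blocks.insert(b);
    }
    return blocks;
}

A write of 10 bytes at offset 262140 straddles a block boundary and therefore dirties blocks 0 and 1; both would be re-uploaded on the next checkpoint.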

Usage

LOAD haduck;
ATTACH 'haduck:///path/to/my.duckdb' AS mydb;

-- Use normally. Writes go to local disk.
CREATE TABLE mydb.t AS SELECT * FROM range(1000000);

-- CHECKPOINT ships dirty blocks to S3
CHECKPOINT mydb;

Configuration

Set environment variables before loading the extension:

Variable              Required  Description
HADUCK_S3_BUCKET      Yes       S3 bucket name
HADUCK_S3_ENDPOINT    Yes       S3 endpoint URL (e.g., https://fly.storage.tigris.dev)
HADUCK_S3_ACCESS_KEY  Yes       AWS access key ID
HADUCK_S3_SECRET_KEY  Yes       AWS secret access key
HADUCK_S3_PREFIX      No        Key prefix (default: haduck)

Without S3 credentials, haduck runs in local-only mode (dirty block tracking with no uploads).
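A rough sketch of how that decision could be made from the documented variables (illustrative only; the extension's actual startup logic may differ):

#include <cstdlib>
#include <string>

// Illustrative only: choose between S3 replication and local-only mode
// based on the documented environment variables.
bool HasS3Config() {
    const char *required[] = {"HADUCK_S3_BUCKET", "HADUCK_S3_ENDPOINT",
                              "HADUCK_S3_ACCESS_KEY", "HADUCK_S3_SECRET_KEY"};
    for (const char *var : required) {
        const char *value = std::getenv(var);
        if (value == nullptr || std::string(value).empty()) {
            return false; // missing credentials: local-only mode
        }
    }
    return true; // all set: upload dirty blocks on checkpoint
}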

Architecture

DuckDB
  |
  v
HaduckFileSystem (C++ extension)
  |-- Write() -> local disk + DirtyTracker bitmap
  |-- Read()  -> local disk (unchanged)
  |-- FileSync() -> drain bitmap, upload dirty blocks via FFI
  |
  v
haduck-s3 (Rust staticlib, linked via CMake)
  |-- haduck_s3_put_block() -> S3 PutObject
  |-- haduck_s3_init/shutdown() -> lifecycle
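To illustrate the boundary, the sketch below declares a hypothetical FFI surface and a drain-and-upload loop. The function names come from the diagram above, but the signatures, the helper function, and its parameters are assumptions, not the real contract:

#include <cstdint>
#include <string>
#include <vector>

// Hypothetical FFI surface for the Rust staticlib; real signatures may differ.
extern "C" {
int haduck_s3_init();
int haduck_s3_put_block(const char *key, const uint8_t *data, uint64_t len);
void haduck_s3_shutdown();
}

// Sketch of what FileSync() could do: walk the drained dirty block ids and
// upload each block under {prefix}/{db_file_key}/block_{block_id}.
// (In practice the final block of a file may be shorter than 256KB.)
void UploadDirtyBlocks(const std::string &prefix, const std::string &db_file_key,
                       const std::vector<uint64_t> &dirty_blocks,
                       const uint8_t *file_data) {
    constexpr uint64_t kBlockSize = 256 * 1024;
    for (uint64_t block_id : dirty_blocks) {
        std::string key = prefix + "/" + db_file_key + "/block_" + std::to_string(block_id);
        haduck_s3_put_block(key.c_str(), file_data + block_id * kBlockSize, kBlockSize);
    }
}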

Components

  • DirtyTracker: thread-safe bitmap of modified 256KB blocks per file handle (sketched after this list)
  • HaduckFileSystem: DuckDB FileSystem subclass, delegates to LocalFileSystem for I/O
  • HaduckFileHandle: wraps inner file handle, owns a DirtyTracker
  • haduck-s3 (rust/): Rust FFI crate providing S3 upload via aws-sdk-s3
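As a rough picture of the DirtyTracker component, here is a minimal thread-safe bitmap with mark and drain operations (illustrative only; not the real class in the C++ extension):

#include <cstdint>
#include <mutex>
#include <vector>

// Illustrative DirtyTracker sketch: one instance per file handle.
class DirtyTracker {
public:
    // Mark every 256KB block covered by [offset, offset + length) as dirty.
    void Mark(uint64_t offset, uint64_t length) {
        if (length == 0) {
            return;
        }
        std::lock_guard<std::mutex> guard(mutex_);
        uint64_t last = (offset + length - 1) / kBlockSize;
        if (bits_.size() <= last) {
            bits_.resize(last + 1, false);
        }
        for (uint64_t b = offset / kBlockSize; b <= last; b++) {
            bits_[b] = true;
        }
    }

    // Return the dirty block ids and clear the bitmap (used on FileSync).
    std::vector<uint64_t> Drain() {
        std::lock_guard<std::mutex> guard(mutex_);
        std::vector<uint64_t> dirty;
        for (uint64_t b = 0; b < bits_.size(); b++) {
            if (bits_[b]) {
                dirty.push_back(b);
            }
        }
        bits_.assign(bits_.size(), false);
        return dirty;
    }

private:
    static constexpr uint64_t kBlockSize = 256 * 1024;
    std::mutex mutex_;
    std::vector<bool> bits_;
};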

Related crates

  • duckblock: segment format, checksum chaining, and replication primitives for DuckDB blocks (uses hadb-changeset)
  • hadb: HA building blocks for embedded databases (leases, coordination, replication traits)

Building

Requires the DuckDB source (as a git submodule), CMake, and a Rust toolchain.

# Clone with submodules
git clone --recurse-submodules https://github.com/russellromney/haduck.git
cd haduck

# Build
make release

# Test
make test

Roadmap

See ROADMAP.md for the full plan. Current status:

  • Phase Anvil (dirty block tracking): done
  • Phase Anvil-b (S3 block shipping): done
  • Phase Kestrel-Osprey (segment format, storage, apply): done (in duckblock)
  • Phase Harrier (replace raw S3 with duckblock segments): next
  • Phase Peregrine (follower mode with hadb coordination): planned

License

Apache-2.0
