
Native Python binding for Apache Hudi, based on hudi-rs.

Project description


The native Rust implementation for Apache Hudi, with C++ & Python API bindings.


The hudi-rs project aims to standardize the core Apache Hudi APIs and broaden Hudi's integration across the data ecosystem for a diverse range of users and projects.

Source      Installation Command
PyPI.org    pip install hudi
Crates.io   cargo add hudi
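
To verify the Python installation, the import below should succeed; this is just a quick sanity check and not part of the official setup steps.

Python

# A successful import confirms the binding is installed.
from hudi import HudiTableBuilder
print(HudiTableBuilder)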

Usage Examples

[!NOTE] These examples expect a Hudi table to exist at /tmp/trips_table, created using the quick start guide.

Snapshot Query

Snapshot query reads the latest version of the data from the table. The table API also accepts partition filters.

Python

from hudi import HudiTableBuilder
import pyarrow as pa

hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])

# convert to PyArrow table
arrow_table = pa.Table.from_batches(batches)
result = arrow_table.select(["rider", "city", "ts", "fare"])
print(result)

Rust

use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;
use arrow::compute::concat_batches;

#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
    let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
    let batch = concat_batches(&batches[0].schema(), &batches)?;
    let columns = vec!["rider", "city", "ts", "fare"];
    for col_name in columns {
        let idx = batch.schema().index_of(col_name).unwrap();
        println!("{}: {}", col_name, batch.column(idx));
    }
    Ok(())
}

To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set hoodie.read.use.read_optimized.mode to true when creating the table.

Python

hudi_table = (
    HudiTableBuilder
    .from_base_uri("/tmp/trips_table")
    .with_option("hoodie.read.use.read_optimized.mode", "true")
    .build()
)

Rust

let hudi_table = 
    HudiTableBuilder::from_base_uri("/tmp/trips_table")
    .with_option("hoodie.read.use.read_optimized.mode", "true")
    .build().await?;

Time-Travel Query

Time-travel query reads the data at a specific timestamp from the table. The table API also accepts partition filters.

Python

batches = (
    hudi_table
    .read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
)

Rust

let batches = 
    hudi_table
    .read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;

Supported timestamp formats

The supported formats for the timestamp argument are:

  • Hudi Timeline format (highest matching precedence): yyyyMMddHHmmssSSS or yyyyMMddHHmmss.
  • Unix epoch time in seconds, milliseconds, microseconds, or nanoseconds.
  • ISO 8601 format including but not limited to:
    • yyyy-MM-dd'T'HH:mm:ss.SSS+00:00
    • yyyy-MM-dd'T'HH:mm:ss.SSSZ
    • yyyy-MM-dd'T'HH:mm:ss.SSS
    • yyyy-MM-dd'T'HH:mm:ss+00:00
    • yyyy-MM-dd'T'HH:mm:ssZ
    • yyyy-MM-dd'T'HH:mm:ss
    • yyyy-MM-dd
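
For illustration, the calls below pass an instant in several of the accepted formats; this is a minimal sketch that reuses the hudi_table built earlier, the timestamp values are made up, and it assumes the filters argument is optional.

Python

# Hudi Timeline instant
batches = hudi_table.read_snapshot_as_of("20241231123456789")
# Unix epoch time in seconds
batches = hudi_table.read_snapshot_as_of("1735648496")
# ISO 8601
batches = hudi_table.read_snapshot_as_of("2024-12-31T12:34:56.789Z")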

Incremental Query

Incremental query reads the changed data from the table for a given time range.

Python

# read the records between t1 (exclusive) and t2 (inclusive)
batches = hudi_table.read_incremental_records(t1, t2)

# read the records after t1
batches = hudi_table.read_incremental_records(t1)

Rust

// read the records between t1 (exclusive) and t2 (inclusive)
let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;

// read the records after t1
let batches = hudi_table.read_incremental_records(t1, None).await?;

Incremental queries support the same timestamp formats as time-travel queries.
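
For example, the range bounds can be given as ISO 8601 dates instead of Timeline instants; a minimal sketch with illustrative values:

Python

# records changed after 2024-12-01 (exclusive) up to 2024-12-31 (inclusive)
batches = hudi_table.read_incremental_records("2024-12-01", "2024-12-31")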

File Group Reading (Experimental)

File group reading allows you to read data from a specific file slice. This is useful when integrating with query engines, where the plan provides file paths.

Python

from hudi import HudiFileGroupReader

reader = HudiFileGroupReader(
    "/table/base/path", {"hoodie.read.file_group.start_timestamp": "0"})

# Returns a PyArrow RecordBatch
record_batch = reader.read_file_slice_by_base_file_path("relative/path.parquet")

Rust

use hudi::file_group::reader::FileGroupReader;

let reader = FileGroupReader::new_with_options(
    "/table/base/path", [("hoodie.read.file_group.start_timestamp", "0")])?;

// Returns an Arrow RecordBatch
let record_batch = reader.read_file_slice_by_base_file_path("relative/path.parquet").await?;

C++

#include "cxx.h"
#include "src/lib.rs.h"
#include "arrow/c/abi.h"

auto reader = new_file_group_reader_with_options(
    "/table/base/path", {"hoodie.read.file_group.start_timestamp=0"});

// Returns an ArrowArrayStream pointer
ArrowArrayStream* stream_ptr = reader->read_file_slice_by_base_file_path("relative/path.parquet");

Query Engine Integration

Hudi-rs provides APIs to support integration with query engines. The sections below highlight some commonly used APIs.

Table API

Create a Hudi table instance using its constructor or the TableBuilder API.

Query planning
  • get_file_slices(): for snapshot queries, get a list of file slices.
  • get_file_slices_splits(): for snapshot queries, get a list of file slices in splits.
  • get_file_slices_as_of(): for time-travel queries, get a list of file slices as of a given time.
  • get_file_slices_splits_as_of(): for time-travel queries, get a list of file slices in splits as of a given time.
  • get_file_slices_between(): for incremental queries, get a list of file slices changed within a given time range.

Query execution
  • create_file_group_reader_with_options(): create a file group reader instance with the table instance's configs.
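
As a rough illustration of how these stages fit together, the sketch below assumes the Python binding exposes the table methods above under the same names and with the argument shapes shown; check the API documentation for the exact signatures.

Python

from hudi import HudiTableBuilder

hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()

# Query planning: enumerate file slices so an engine can distribute them across tasks.
# (The exact arguments, e.g. partition filters, are assumptions.)
file_slices = hudi_table.get_file_slices()

# Query execution: each task reads its assigned file slices through a file group
# reader created from the table's configs (extra reader options may be passed here).
reader = hudi_table.create_file_group_reader_with_options()
for file_slice in file_slices:
    record_batch = reader.read_file_slice(file_slice)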

File Group API

Create a Hudi file group reader instance using its constructor or the Hudi table API create_file_group_reader_with_options().

Query execution
  • read_file_slice(): read records from a given file slice; based on the configs, read from only the base file, or from the base file and log files, merging records with the configured strategy.
  • read_file_slice_by_base_file_path(): read records from a given base file path; log files are ignored.

Apache DataFusion

Enabling the datafusion feature of the hudi crate provides a DataFusion extension for querying Hudi tables.

Add the hudi crate with the datafusion feature to your application:
cargo new my_project --bin && cd my_project
cargo add tokio@1 datafusion@45
cargo add hudi --features datafusion

Update src/main.rs with the code snippet below, then run cargo run.

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    let hudi = HudiDataSource::new_with_options(
        "/tmp/trips_table",
        [("hoodie.read.input.partitions", "5")]).await?;
    ctx.register_table("trips_table", Arc::new(hudi))?;
    let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
    df.show().await?;
    Ok(())
}

Other Integrations

Hudi-rs also integrates with other query engines and tools; see the project repository for the full list.

Work with cloud storage

Ensure cloud storage credentials are set properly as environment variables, e.g., AWS_*, AZURE_*, or GOOGLE_*; the relevant storage variables will then be picked up automatically. The target table's base URI, with schemes such as s3://, az://, or gs://, will be processed accordingly.
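
For example, with AWS the credentials can be exported as environment variables before building the table; a minimal sketch where the placeholder values must be replaced with real credentials.

Python

import os
from hudi import HudiTableBuilder

# Standard AWS environment variables; replace the placeholders with real values.
os.environ["AWS_ACCESS_KEY_ID"] = "<access-key-id>"
os.environ["AWS_SECRET_ACCESS_KEY"] = "<secret-access-key>"
os.environ["AWS_REGION"] = "us-west-2"

hudi_table = HudiTableBuilder.from_base_uri("s3://bucket/trips_table").build()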

Alternatively, you can pass the storage configuration as options via Table APIs.

Python

from hudi import HudiTableBuilder

hudi_table = (
    HudiTableBuilder
    .from_base_uri("s3://bucket/trips_table")
    .with_option("aws_region", "us-west-2")
    .build()
)

Rust

use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;

#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table =
        HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
        .with_option("aws_region", "us-west-2")
        .build().await?;
    Ok(())
}

Contributing

Check out the contributing guide for all the details about making contributions to the project.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

hudi-0.4.0.tar.gz (504.3 kB)

Uploaded: Source

Built Distributions

If you're not sure about the file name format, learn more about wheel file names.

hudi-0.4.0-cp39-abi3-win_amd64.whl (7.3 MB)

Uploaded: CPython 3.9+, Windows x86-64

hudi-0.4.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (8.7 MB)

Uploaded: CPython 3.9+, manylinux: glibc 2.17+ x86-64

hudi-0.4.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl (8.4 MB)

Uploaded: CPython 3.9+, manylinux: glibc 2.17+ ARM64

hudi-0.4.0-cp39-abi3-macosx_11_0_arm64.whl (7.6 MB)

Uploaded: CPython 3.9+, macOS 11.0+ ARM64

hudi-0.4.0-cp39-abi3-macosx_10_12_x86_64.whl (8.0 MB)

Uploaded: CPython 3.9+, macOS 10.12+ x86-64

File details

Details for the file hudi-0.4.0.tar.gz.

File metadata

  • Download URL: hudi-0.4.0.tar.gz
  • Size: 504.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.0

File hashes

Hashes for hudi-0.4.0.tar.gz
Algorithm Hash digest
SHA256 fdcb208b25d7b0c2a31b9c20b59f171401701a6cce04f18e642ec941598bf53c
MD5 52a49b1112159898d52f9e428a85fdd0
BLAKE2b-256 b67025d61174865f47d66b082988f608ecd57d6890646ec07f986eba0a7023f5

File details

Details for the file hudi-0.4.0-cp39-abi3-win_amd64.whl.

File metadata

  • Download URL: hudi-0.4.0-cp39-abi3-win_amd64.whl
  • Size: 7.3 MB
  • Tags: CPython 3.9+, Windows x86-64
  • Uploaded using Trusted Publishing? No
  • Uploaded via: maturin/1.9.0

File hashes

Hashes for hudi-0.4.0-cp39-abi3-win_amd64.whl
Algorithm Hash digest
SHA256 fbbe0f91cdb54e389c812330aa0dae1796f83cf8e4e75c28d73e620a289802dc
MD5 0ce4e4fd163d3e7871a532c83fdaca29
BLAKE2b-256 be2d5e915f32783b193608f184ad2ac6335fd53c0a93d73ccb9841abce356ff3

File details

Details for the file hudi-0.4.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.

File metadata

File hashes

Hashes for hudi-0.4.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
Algorithm Hash digest
SHA256 fcc98a740fd54a111415c9acad309a6573598d4c7fb7d5ab5b6e44cb50ada192
MD5 c7d8a1fcfac1b0c32cc9d0df08377bd7
BLAKE2b-256 b1fd44ca1da4849348a7915f379247d86e399dc4f28973c49da66f96538b05ea

File details

Details for the file hudi-0.4.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.

File metadata

File hashes

Hashes for hudi-0.4.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
Algorithm Hash digest
SHA256 075d9f6550f7b48555b54270bafc09b82d52148831858ffbd760ae635a1c4418
MD5 3dc4d0979953e44cf0c76e278942ac32
BLAKE2b-256 bdc0bd2f8bebb46bbda9aaf2c6a727f48f4f1764a60f504ca0302939390077a0

File details

Details for the file hudi-0.4.0-cp39-abi3-macosx_11_0_arm64.whl.

File metadata

File hashes

Hashes for hudi-0.4.0-cp39-abi3-macosx_11_0_arm64.whl
Algorithm Hash digest
SHA256 345a5bb5867daba721641bd985cfafefebd0eae3d538c724f69f55f5ba25ef9b
MD5 8290cf721903a280ede2cf00e17769f7
BLAKE2b-256 0d36b480adc6f9524270ffcb9f78299e3257cbe9803b12b76fc5b318011ff13a

File details

Details for the file hudi-0.4.0-cp39-abi3-macosx_10_12_x86_64.whl.

File metadata

File hashes

Hashes for hudi-0.4.0-cp39-abi3-macosx_10_12_x86_64.whl
Algorithm Hash digest
SHA256 6beae8f0ce73a6114d61124dccebbe5e6097cecb554019b4ece186f74e3cd3e1
MD5 2f9ab7bcecd137a6fef2ae990f7746bf
BLAKE2b-256 ab8c91a1a21261f64587198e6e57731d33c4a376459842678e79ca7af2bef6d8
