Native Python binding for Apache Hudi, based on hudi-rs.
Project description
The native Rust implementation for Apache Hudi, with C++ & Python API bindings.
The Hudi-rs project aims to standardize the core Apache Hudi APIs and broaden Hudi integration across the data ecosystem for a diverse range of users and projects.
| Source | Installation Command |
|---|---|
| PyPI.org | pip install hudi |
| Crates.io | cargo add hudi |
Usage Examples
[!NOTE] These examples expect that a Hudi table exists at /tmp/trips_table, created using the quick start guide.
Snapshot Query
A snapshot query reads the latest version of the data from the table. The table API also accepts partition filters.
Python
from hudi import HudiTableBuilder
import pyarrow as pa
hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()
batches = hudi_table.read_snapshot(filters=[("city", "=", "san_francisco")])
# convert to PyArrow table
arrow_table = pa.Table.from_batches(batches)
result = arrow_table.select(["rider", "city", "ts", "fare"])
print(result)
Rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;
use arrow::compute::concat_batches;
#[tokio::main]
async fn main() -> Result<()> {
let hudi_table = HudiTableBuilder::from_base_uri("/tmp/trips_table").build().await?;
let batches = hudi_table.read_snapshot(&[("city", "=", "san_francisco")]).await?;
let batch = concat_batches(&batches[0].schema(), &batches)?;
let columns = vec!["rider", "city", "ts", "fare"];
for col_name in columns {
let idx = batch.schema().index_of(col_name).unwrap();
println!("{}: {}", col_name, batch.column(idx));
}
Ok(())
}
To run a read-optimized (RO) query on Merge-on-Read (MOR) tables, set hoodie.read.use.read_optimized.mode to true when creating the table instance.
Python
hudi_table = (
HudiTableBuilder
.from_base_uri("/tmp/trips_table")
.with_option("hoodie.read.use.read_optimized.mode", "true")
.build()
)
Rust
let hudi_table =
HudiTableBuilder::from_base_uri("/tmp/trips_table")
.with_option("hoodie.read.use.read_optimized.mode", "true")
.build().await?;
Time-Travel Query
A time-travel query reads the data from the table as of a specific timestamp. The table API also accepts partition filters.
Python
batches = (
hudi_table
.read_snapshot_as_of("20241231123456789", filters=[("city", "=", "san_francisco")])
)
Rust
let batches =
hudi_table
.read_snapshot_as_of("20241231123456789", &[("city", "=", "san_francisco")]).await?;
Supported timestamp formats
The supported formats for the timestamp argument are:
- Hudi Timeline format (highest matching precedence): yyyyMMddHHmmssSSS or yyyyMMddHHmmss.
- Unix epoch time in seconds, milliseconds, microseconds, or nanoseconds.
- ISO 8601 formats, including but not limited to:
  - yyyy-MM-dd'T'HH:mm:ss.SSS+00:00
  - yyyy-MM-dd'T'HH:mm:ss.SSSZ
  - yyyy-MM-dd'T'HH:mm:ss.SSS
  - yyyy-MM-dd'T'HH:mm:ss+00:00
  - yyyy-MM-dd'T'HH:mm:ssZ
  - yyyy-MM-dd'T'HH:mm:ss
  - yyyy-MM-dd
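For illustration, here is a minimal sketch that reuses read_snapshot_as_of from the example above to show each accepted format; the timestamp values are made up, and the optional filters argument is omitted for brevity.
Python
# Each call demonstrates one accepted timestamp format; values are illustrative.
batches = hudi_table.read_snapshot_as_of("20241231123456789")         # Hudi Timeline format
batches = hudi_table.read_snapshot_as_of("1735648496789")             # Unix epoch milliseconds
batches = hudi_table.read_snapshot_as_of("2024-12-31T12:34:56.789Z")  # ISO 8601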
Incremental Query
An incremental query reads the changed data from the table for a given time range.
Python
# read the records between t1 (exclusive) and t2 (inclusive)
batches = hudi_table.read_incremental_records(t1, t2)
# read the records after t1
batches = hudi_table.read_incremental_records(t1)
Rust
// read the records between t1 (exclusive) and t2 (inclusive)
let batches = hudi_table.read_incremental_records(t1, Some(t2)).await?;
// read the records after t1
let batches = hudi_table.read_incremental_records(t1, None).await?;
Incremental queries support the same timestamp formats as time-travel queries.
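For example, here is a minimal sketch that reads a concrete time range and converts the result to a PyArrow table; the timestamps are illustrative, it reuses the hudi_table built earlier, and it assumes read_incremental_records returns Arrow record batches like read_snapshot does.
Python
import pyarrow as pa

# read the changes committed after t1 (exclusive) and up to t2 (inclusive),
# expressed in the Hudi Timeline format
t1 = "20241231123456789"
t2 = "20250101000000000"
batches = hudi_table.read_incremental_records(t1, t2)
incremental_table = pa.Table.from_batches(batches)
print(incremental_table.select(["rider", "city", "ts", "fare"]))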
File Group Reading (Experimental)
File group reading allows you to read data from a specific file slice. This is useful when integrating with query engines, where the plan provides file paths.
Python
from hudi import HudiFileGroupReader
reader = HudiFileGroupReader(
"/table/base/path", {"hoodie.read.file_group.start_timestamp": "0"})
# Returns a PyArrow RecordBatch
record_batch = reader.read_file_slice_by_base_file_path("relative/path.parquet")
Rust
use hudi::file_group::reader::FileGroupReader;
let reader = FileGroupReader::new_with_options(
"/table/base/path", [("hoodie.read.file_group.start_timestamp", "0")])?;
// Returns an Arrow RecordBatch
let record_batch = reader.read_file_slice_by_base_file_path("relative/path.parquet").await?;
C++
#include "cxx.h"
#include "src/lib.rs.h"
#include "arrow/c/abi.h"
auto reader = new_file_group_reader_with_options(
"/table/base/path", {"hoodie.read.file_group.start_timestamp=0"});
// Returns an ArrowArrayStream pointer
ArrowArrayStream* stream_ptr = reader->read_file_slice_by_base_file_path("relative/path.parquet");
Query Engine Integration
Hudi-rs provides APIs to support integration with query engines. The sections below highlight some commonly used APIs.
Table API
Create a Hudi table instance using its constructor or the TableBuilder API.
| Stage | API | Description |
|---|---|---|
| Query planning | get_file_slices() | For snapshot query, get a list of file slices. |
| | get_file_slices_splits() | For snapshot query, get a list of file slices in splits. |
| | get_file_slices_as_of() | For time-travel query, get a list of file slices at a given time. |
| | get_file_slices_splits_as_of() | For time-travel query, get a list of file slices in splits at a given time. |
| | get_file_slices_between() | For incremental query, get a list of changed file slices within a given time range. |
| Query execution | create_file_group_reader_with_options() | Create a file group reader instance with the table instance's configs. |
File Group API
Create a Hudi file group reader instance using its constructor or the Hudi table API create_file_group_reader_with_options().
| Stage | API | Description |
|---|---|---|
| Query execution | read_file_slice() | Read records from a given file slice; based on the configs, read records from the base file only, or from the base file and log files, merging records using the configured strategy. |
| | read_file_slice_by_base_file_path() | Read records from a given base file path; log files will be ignored. |
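As a rough illustration of how a query engine might wire these APIs together, here is a minimal Python sketch. The method names follow the tables above, but the exact Python signatures, argument names, and return types are assumptions and may differ between releases.
Python
from hudi import HudiTableBuilder

hudi_table = HudiTableBuilder.from_base_uri("/tmp/trips_table").build()

# Query planning: group the table's file slices into splits, e.g. one per task.
# The split count (4) and the positional argument are assumptions for illustration.
splits = hudi_table.get_file_slices_splits(4)

# Query execution: a file group reader created from the table inherits its configs.
reader = hudi_table.create_file_group_reader_with_options()
for split in splits:
    for file_slice in split:
        # Reads the base file and, depending on the configs, merges in log files.
        record_batch = reader.read_file_slice(file_slice)
In practice, an engine would typically run the planning calls on its coordinator and the per-slice reads on its workers.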
Apache DataFusion
Enabling the hudi crate's datafusion feature provides a DataFusion extension for querying Hudi tables. Add the hudi crate with the datafusion feature to your application:
cargo new my_project --bin && cd my_project
cargo add tokio@1 datafusion@45
cargo add hudi --features datafusion
Update src/main.rs with the code snippet below, then run cargo run.
use std::sync::Arc;
use datafusion::error::Result;
use datafusion::prelude::{DataFrame, SessionContext};
use hudi::HudiDataSource;
#[tokio::main]
async fn main() -> Result<()> {
let ctx = SessionContext::new();
let hudi = HudiDataSource::new_with_options(
"/tmp/trips_table",
[("hoodie.read.input.partitions", "5")]).await?;
ctx.register_table("trips_table", Arc::new(hudi))?;
let df: DataFrame = ctx.sql("SELECT * from trips_table where city = 'san_francisco'").await?;
df.show().await?;
Ok(())
}
Other Integrations
Hudi-rs also integrates with other query engines and libraries in the data ecosystem.
Work with cloud storage
Ensure cloud storage credentials are set properly as environment variables, e.g., AWS_*, AZURE_*, or GOOGLE_*. The relevant storage environment variables will then be picked up, and the target table's base URI with a scheme such as s3://, az://, or gs:// will be processed accordingly.
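For example, with AWS-style credentials, a minimal sketch looks like the following; AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_REGION are standard AWS variable names, and the exact set your deployment needs may differ.
Python
import os
from hudi import HudiTableBuilder

# Set credentials before building the table so they are picked up from the environment.
os.environ["AWS_ACCESS_KEY_ID"] = "..."
os.environ["AWS_SECRET_ACCESS_KEY"] = "..."
os.environ["AWS_REGION"] = "us-west-2"

hudi_table = HudiTableBuilder.from_base_uri("s3://bucket/trips_table").build()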
Alternatively, you can pass the storage configuration as options via Table APIs.
Python
from hudi import HudiTableBuilder
hudi_table = (
HudiTableBuilder
.from_base_uri("s3://bucket/trips_table")
.with_option("aws_region", "us-west-2")
.build()
)
Rust
use hudi::error::Result;
use hudi::table::builder::TableBuilder as HudiTableBuilder;
#[tokio::main]
async fn main() -> Result<()> {
    let hudi_table =
        HudiTableBuilder::from_base_uri("s3://bucket/trips_table")
            .with_option("aws_region", "us-west-2")
            .build().await?;
    Ok(())
}
Contributing
Check out the contributing guide for all the details about making contributions to the project.
Download files
Download the file for your platform.
File details
Details for the file hudi-0.4.0.tar.gz.
File metadata
- Download URL: hudi-0.4.0.tar.gz
- Upload date:
- Size: 504.3 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fdcb208b25d7b0c2a31b9c20b59f171401701a6cce04f18e642ec941598bf53c |
| MD5 | 52a49b1112159898d52f9e428a85fdd0 |
| BLAKE2b-256 | b67025d61174865f47d66b082988f608ecd57d6890646ec07f986eba0a7023f5 |
File details
Details for the file hudi-0.4.0-cp39-abi3-win_amd64.whl.
File metadata
- Download URL: hudi-0.4.0-cp39-abi3-win_amd64.whl
- Upload date:
- Size: 7.3 MB
- Tags: CPython 3.9+, Windows x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fbbe0f91cdb54e389c812330aa0dae1796f83cf8e4e75c28d73e620a289802dc |
| MD5 | 0ce4e4fd163d3e7871a532c83fdaca29 |
| BLAKE2b-256 | be2d5e915f32783b193608f184ad2ac6335fd53c0a93d73ccb9841abce356ff3 |
File details
Details for the file hudi-0.4.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
File metadata
- Download URL: hudi-0.4.0-cp39-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
- Upload date:
- Size: 8.7 MB
- Tags: CPython 3.9+, manylinux: glibc 2.17+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | fcc98a740fd54a111415c9acad309a6573598d4c7fb7d5ab5b6e44cb50ada192 |
| MD5 | c7d8a1fcfac1b0c32cc9d0df08377bd7 |
| BLAKE2b-256 | b1fd44ca1da4849348a7915f379247d86e399dc4f28973c49da66f96538b05ea |
File details
Details for the file hudi-0.4.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl.
File metadata
- Download URL: hudi-0.4.0-cp39-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl
- Upload date:
- Size: 8.4 MB
- Tags: CPython 3.9+, manylinux: glibc 2.17+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 075d9f6550f7b48555b54270bafc09b82d52148831858ffbd760ae635a1c4418 |
| MD5 | 3dc4d0979953e44cf0c76e278942ac32 |
| BLAKE2b-256 | bdc0bd2f8bebb46bbda9aaf2c6a727f48f4f1764a60f504ca0302939390077a0 |
File details
Details for the file hudi-0.4.0-cp39-abi3-macosx_11_0_arm64.whl.
File metadata
- Download URL: hudi-0.4.0-cp39-abi3-macosx_11_0_arm64.whl
- Upload date:
- Size: 7.6 MB
- Tags: CPython 3.9+, macOS 11.0+ ARM64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 345a5bb5867daba721641bd985cfafefebd0eae3d538c724f69f55f5ba25ef9b |
| MD5 | 8290cf721903a280ede2cf00e17769f7 |
| BLAKE2b-256 | 0d36b480adc6f9524270ffcb9f78299e3257cbe9803b12b76fc5b318011ff13a |
File details
Details for the file hudi-0.4.0-cp39-abi3-macosx_10_12_x86_64.whl.
File metadata
- Download URL: hudi-0.4.0-cp39-abi3-macosx_10_12_x86_64.whl
- Upload date:
- Size: 8.0 MB
- Tags: CPython 3.9+, macOS 10.12+ x86-64
- Uploaded using Trusted Publishing? No
- Uploaded via: maturin/1.9.0
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 6beae8f0ce73a6114d61124dccebbe5e6097cecb554019b4ece186f74e3cd3e1 |
| MD5 | 2f9ab7bcecd137a6fef2ae990f7746bf |
| BLAKE2b-256 | ab8c91a1a21261f64587198e6e57731d33c4a376459842678e79ca7af2bef6d8 |