Skip to content

Comments

feat: Implement an AsyncReader for avro using ObjectStore#8930

Merged
alamb merged 40 commits intoapache:mainfrom
EmilyMatt:avro-async-reader
Feb 2, 2026
Merged

feat: Implement an AsyncReader for avro using ObjectStore#8930
alamb merged 40 commits intoapache:mainfrom
EmilyMatt:avro-async-reader

Conversation

@EmilyMatt
Copy link
Contributor

Which issue does this PR close?

Rationale for this change

Allows for proper file splitting within an asynchronous context.

What changes are included in this PR?

The raw implementation, allowing for file splitting, starting mid-block(read until sync marker is found), and further reading until end of block is found.
This reader currently requires a reader_schema is provided if type-promotion, schema-evolution, or projection are desired.
This is done so because #8928 is currently blocking proper parsing from an ArrowSchema

Are these changes tested?

Yes

Are there any user-facing changes?

Only the addition.
Other changes are internal to the crate (namely the way Decoder is created from parts)

Copy link
Contributor

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing a partial review with some high level thoughts.

I'll wait for you to finish before resuming.

@EmilyMatt
Copy link
Contributor Author

Flushing a partial review with some high level thoughts.

I'll wait for you to finish before resuming.

Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved as its a public API and I don't want it to be volatile

@jecsand838
Copy link
Contributor

jecsand838 commented Nov 26, 2025

Flushing a partial review with some high level thoughts.
I'll wait for you to finish before resuming.

Honestly I think my main blocker is the schema thing here. I don't want to commit to the constructor before it is resolved as its a public API and I don't want it to be volatile

100% I'm working on that right now and won't stop until I have a PR. That was a solid catch.

The schema logic is an area of the code I mean to (or would welcome) a full refactor of. I knew it would eventually come back.

@EmilyMatt
Copy link
Contributor Author

Sorry, I haven't dropped it, just found myself in a really busy week! The generic reader support does not seem to hard to implement from the dabbling I made, and I still need to get to the builder pattern change

…, separate object store file reader into a featuregated struct and use a generic async file reader trait
@EmilyMatt
Copy link
Contributor Author

@jecsand838 I believe this is now ready for a proper review^

Copy link
Contributor

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EmilyMatt Thank you so much for getting these changes up!

I left a few comments. Let me know what you think.

EDIT: Should have mentioned that this is looking really good overall and I'm very excited for the AsyncReader!

@alamb
Copy link
Contributor

alamb commented Jan 10, 2026

@jecsand838 and @EmilyMatt -- how is this PR looking?

@EmilyMatt
Copy link
Contributor Author

@jecsand838 and @EmilyMatt -- how is this PR looking?

I had actually just returned to work on it 2 days ago, still having some issues with the schema now being provided, due to the problems I've described, @jecsand838 suggested removing the arrow schema and I'm starting to think that is the only viable way for now.
Making the fetch API a bit closer to the one parquet uses is the smaller issue, I do wish to keep the seperate semantics for the original fetch and extra fetch(for parquet for example, that will be the row groups ranges, and the footer range), will try a couple ways to do this

@EmilyMatt
Copy link
Contributor Author

Hope to push another version today and address some of the things above

@github-actions github-actions bot removed the parquet Changes to the parquet crate label Jan 25, 2026
@EmilyMatt
Copy link
Contributor Author

@jecsand838 I've removed the parquet changes, and synced with main, I believe this is ready for last reviews and test runs before merged.

CC: @alamb

@alamb
Copy link
Contributor

alamb commented Jan 25, 2026

I started the tests

@EmilyMatt
Copy link
Contributor Author

I started the tests

Thx, most failures were technicalities, believe I fixed all of them

Copy link
Contributor

@jecsand838 jecsand838 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@EmilyMatt LGTM!

At this point I'm fine with anything else that comes up being a follow-up issue if you are @mzabaluev (unless it's major).

I just left a few final comments related to improving the docs for this PR.

@alamb

//! is enabled, [`AvroObjectReader`] provides integration with object storage services
//! such as S3 via the [object_store] crate.
//!
//! ```ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this runnable

Suggested change
//! ```ignore
//! ```

Copy link
Contributor Author

@EmilyMatt EmilyMatt Jan 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know how to do it without failing the tests, because all the code here is featuregated, and the doctests also runs without features enabled

# Conflicts:
#	arrow-avro/Cargo.toml
#	arrow-avro/src/reader/mod.rs
Copy link
Contributor

@mzabaluev mzabaluev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good and works well.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took a look at this and it looks fine to me -- I mostly deferring the reviews content to @jecsand838 and @mzabaluev and focused on the API design

Thanks @jecsand838 and @mzabaluev for the review

@EmilyMatt
Copy link
Contributor Author

Are there any further blockers? This is ready on my end.
@jecsand838 @alamb @mzabaluev

@jecsand838
Copy link
Contributor

Are there any further blockers? This is ready on my end. @jecsand838 @alamb @mzabaluev

LGTM! Thank you for all of this @EmilyMatt. Really excited for the AsyncReader!

@alamb
Copy link
Contributor

alamb commented Feb 2, 2026

Let's gogogogogogogo

@alamb alamb merged commit c1033d1 into apache:main Feb 2, 2026
24 checks passed
@alamb
Copy link
Contributor

alamb commented Feb 2, 2026

Thank you @EmilyMatt @jecsand838 and @mzabaluev -- epic

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate arrow-avro arrow-avro crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Implement an async AvroReader

5 participants