Conversation

@jadewang-db (Contributor)

Initial implementation of the CloudFetch feature in the Databricks Spark driver.

  • Create a new CloudFetchReader to handle CloudFetch file download and decompression (a hypothetical skeleton follows this list).
  • Test cases for small and large results.
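For orientation, a minimal sketch of what such a reader could look like, assuming the file is fetched over HTTP from a presigned URL and read with Apache.Arrow's ArrowStreamReader. Every member name here is hypothetical, not taken from the PR:

```csharp
// Hypothetical skeleton only -- member names are illustrative, not the PR's.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal sealed class CloudFetchReaderSketch
{
    private static readonly HttpClient httpClient = new HttpClient();
    private readonly bool isLz4Compressed;

    internal CloudFetchReaderSketch(bool isLz4Compressed) =>
        this.isLz4Compressed = isLz4Compressed;

    // Download one CloudFetch result file and read its first record batch.
    internal async Task<RecordBatch> ReadFileAsync(Uri presignedUrl)
    {
        byte[] fileData = await httpClient.GetByteArrayAsync(presignedUrl);
        MemoryStream dataStream = this.isLz4Compressed
            ? DecompressLz4(new MemoryStream(fileData)) // see the LZ4 sketch later in the thread
            : new MemoryStream(fileData);
        using var reader = new ArrowStreamReader(dataStream);
        return await reader.ReadNextRecordBatchAsync();
    }

    // Placeholder; a possible implementation is sketched later in the thread.
    private static MemoryStream DecompressLz4(Stream compressed) =>
        throw new NotImplementedException();
}
```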

Coming changes after this:

  • Adding prefetch to the downloader
  • Adding renewal for expired presigned URLs
  • Retries (a hedged sketch of retrying downloads follows this list)
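As a hedged illustration of the retry item: a download helper with exponential backoff. DownloadWithRetryAsync and maxAttempts are invented names for this sketch, not part of this PR:

```csharp
// Hypothetical sketch of the planned retry behavior: download a CloudFetch
// result file from its presigned URL, retrying transient failures with
// exponential backoff (1s, 2s, 4s, ...).
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

internal static class CloudFetchDownloadSketch
{
    private static readonly HttpClient httpClient = new HttpClient();

    internal static async Task<byte[]> DownloadWithRetryAsync(
        Uri presignedUrl, int maxAttempts = 3, CancellationToken cancellationToken = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using HttpResponseMessage response = await httpClient.GetAsync(
                    presignedUrl, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsByteArrayAsync(cancellationToken);
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Back off exponentially before the next attempt.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)), cancellationToken);
            }
        }
    }
}
```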

@github-actions github-actions bot added this to the ADBC Libraries 18 milestone Mar 20, 2025
@davidhcoe (Contributor)

I don’t think we want to do this in the Spark driver, do we?

@jadewang-db (Contributor, Author)

> I don’t think we want to do this in the Spark driver, do we?

The change should be backward compatible. How can I run the tests?

@CurtHagenlocher (Contributor) left a review

Thanks for the change! I've left some feedback.

```csharp
MemoryStream dataStream;

// If the data is LZ4 compressed, decompress it
if (this.isLz4Compressed)
```
@CurtHagenlocher (Contributor) commented on this hunk:

Is it possible to leverage the Apache.Arrow.Compression assembly to do decompression? It works by passing a CompressionCodecFactory to the ArrowStreamReader constructor.
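For reference, a minimal sketch of that suggestion. Note that Apache.Arrow.Compression decodes buffer-level compression declared inside the IPC stream (LZ4 Frame or Zstd), so it would not apply if the whole file is LZ4-compressed as a single stream:

```csharp
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Compression;
using Apache.Arrow.Ipc;

static RecordBatch ReadCompressedBatch(Stream dataStream)
{
    // The factory decompresses record-batch buffers that the IPC stream
    // itself marks as compressed; the outer stream is still plain Arrow
    // IPC framing.
    using var reader = new ArrowStreamReader(dataStream, new CompressionCodecFactory());
    return reader.ReadNextRecordBatch();
}
```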

@jadewang-db (Contributor, Author) replied:

Do you have code pointers? I tried it, but it doesn't seem to work.

@CurtHagenlocher (Contributor) replied:

I don't, no. I can try to figure it out later; this doesn't need to be blocking.
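Independent of that, a plausible sketch of the whole-file decompression the hunk above refers to, assuming the CloudFetch file is LZ4 Frame-compressed end to end. The use of the K4os.Compression.LZ4 package here is an assumption, not confirmed by this PR:

```csharp
using System.IO;
using K4os.Compression.LZ4.Streams; // assumed LZ4 library, not confirmed

// Decompress an entire LZ4 Frame stream into a seekable MemoryStream so it
// can be handed to ArrowStreamReader.
static MemoryStream DecompressLz4(Stream compressed)
{
    var decompressed = new MemoryStream();
    using (var decoder = LZ4Stream.Decode(compressed))
    {
        decoder.CopyTo(decompressed);
    }
    decompressed.Position = 0;
    return decompressed;
}
```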

@CurtHagenlocher (Contributor) left a review

The file artifacts/Apache.Arrow.Adbc.TetsDrivers/Apache/Debug/net8.0/.msCoverageSourceRootsMapping_Apache.Arrow.Adbc.Tests.Drivers.Apache is still present in the latest iteration. Could you please remove it from the PR?

Looks fine to me otherwise. One thing we might consider in a future change is to (configurably) fetch more than one link in parallel in order to maximize throughput.
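A rough sketch of that idea, bounding parallel downloads with a SemaphoreSlim; FetchAllAsync and maxParallelism are hypothetical names for illustration:

```csharp
// Hypothetical sketch of configurably fetching several CloudFetch links in
// parallel to maximize throughput.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

internal static class ParallelFetchSketch
{
    private static readonly HttpClient httpClient = new HttpClient();

    internal static async Task<byte[][]> FetchAllAsync(
        IReadOnlyList<Uri> presignedUrls, int maxParallelism = 4)
    {
        // The semaphore caps how many downloads run concurrently.
        using var gate = new SemaphoreSlim(maxParallelism);
        var downloads = presignedUrls.Select(async url =>
        {
            await gate.WaitAsync();
            try
            {
                return await httpClient.GetByteArrayAsync(url);
            }
            finally
            {
                gate.Release();
            }
        });
        return await Task.WhenAll(downloads);
    }
}
```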


@CurtHagenlocher (Contributor) left a review

Thanks!

@CurtHagenlocher merged commit 9eee98d into apache:main on Mar 31, 2025
7 checks passed
colin-rogers-dbt pushed a commit to dbt-labs/arrow-adbc that referenced this pull request on Jun 10, 2025:
…e#2634)
