Conversation

@jadewang-db (Contributor)

Initial implementation of the CloudFetch feature in the Databricks Spark driver.

  • Create a new CloudFetchReader to handle CloudFetch file download and decompression (a hypothetical skeleton follows this list).
  • Test cases for small and large results.
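For orientation, a minimal sketch of what such a reader could look like, assuming the file is fetched over HTTP from a presigned URL and read with Apache.Arrow's ArrowStreamReader. Every member name here is hypothetical, not taken from the PR:

```csharp
// Hypothetical skeleton only -- member names are illustrative, not the PR's.
using System;
using System.IO;
using System.Net.Http;
using System.Threading.Tasks;
using Apache.Arrow;
using Apache.Arrow.Ipc;

internal sealed class CloudFetchReaderSketch
{
    private static readonly HttpClient httpClient = new HttpClient();
    private readonly bool isLz4Compressed;

    internal CloudFetchReaderSketch(bool isLz4Compressed) =>
        this.isLz4Compressed = isLz4Compressed;

    // Download one CloudFetch result file and read its first record batch.
    internal async Task<RecordBatch> ReadFileAsync(Uri presignedUrl)
    {
        byte[] fileData = await httpClient.GetByteArrayAsync(presignedUrl);
        MemoryStream dataStream = this.isLz4Compressed
            ? DecompressLz4(new MemoryStream(fileData)) // see the LZ4 sketch later in the thread
            : new MemoryStream(fileData);
        using var reader = new ArrowStreamReader(dataStream);
        return await reader.ReadNextRecordBatchAsync();
    }

    // Placeholder; a possible implementation is sketched later in the thread.
    private static MemoryStream DecompressLz4(Stream compressed) =>
        throw new NotImplementedException();
}
```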

Coming changes after this:

  • Adding prefetch to the downloader
  • Adding renewal for expired presigned URLs
  • Retries (a hedged sketch of retrying downloads follows this list)
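As a hedged illustration of the retry item: a download helper with exponential backoff. DownloadWithRetryAsync and maxAttempts are invented names for this sketch, not part of this PR:

```csharp
// Hypothetical sketch of the planned retry behavior: download a CloudFetch
// result file from its presigned URL, retrying transient failures with
// exponential backoff (1s, 2s, 4s, ...).
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

internal static class CloudFetchDownloadSketch
{
    private static readonly HttpClient httpClient = new HttpClient();

    internal static async Task<byte[]> DownloadWithRetryAsync(
        Uri presignedUrl, int maxAttempts = 3, CancellationToken cancellationToken = default)
    {
        for (int attempt = 1; ; attempt++)
        {
            try
            {
                using HttpResponseMessage response = await httpClient.GetAsync(
                    presignedUrl, HttpCompletionOption.ResponseHeadersRead, cancellationToken);
                response.EnsureSuccessStatusCode();
                return await response.Content.ReadAsByteArrayAsync(cancellationToken);
            }
            catch (HttpRequestException) when (attempt < maxAttempts)
            {
                // Back off exponentially before the next attempt.
                await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)), cancellationToken);
            }
        }
    }
}
```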

@github-actions github-actions bot added this to the ADBC Libraries 18 milestone Mar 20, 2025
@davidhcoe (Contributor)

I don’t think we want to do this in the Spark driver, do we?

@jadewang-db (Contributor, Author)

> I don’t think we want to do this in the Spark driver, do we?

The change should be backward compatible. How can I run the tests?

@CurtHagenlocher (Contributor) left a review

Thanks for the change! I've left some feedback.

```csharp
MemoryStream dataStream;

// If the data is LZ4 compressed, decompress it
if (this.isLz4Compressed)
```
@CurtHagenlocher (Contributor) commented on this hunk:

Is it possible to leverage the Apache.Arrow.Compression assembly to do decompression? It works by passing a CompressionCodecFactory to the ArrowStreamReader constructor.
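For reference, a minimal sketch of that suggestion. Note that Apache.Arrow.Compression decodes buffer-level compression declared inside the IPC stream (LZ4 Frame or Zstd), so it would not apply if the whole file is LZ4-compressed as a single stream:

```csharp
using System.IO;
using Apache.Arrow;
using Apache.Arrow.Compression;
using Apache.Arrow.Ipc;

static RecordBatch ReadCompressedBatch(Stream dataStream)
{
    // The factory decompresses record-batch buffers that the IPC stream
    // itself marks as compressed; the outer stream is still plain Arrow
    // IPC framing.
    using var reader = new ArrowStreamReader(dataStream, new CompressionCodecFactory());
    return reader.ReadNextRecordBatch();
}
```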

@jadewang-db (Contributor, Author) replied:

Do you have code pointers? I tried it, but it doesn't seem to work.

@CurtHagenlocher (Contributor) replied:

I don't, no. I can try to figure it out later; this doesn't need to be blocking.
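Independent of that, a plausible sketch of the whole-file decompression the hunk above refers to, assuming the CloudFetch file is LZ4 Frame-compressed end to end. The use of the K4os.Compression.LZ4 package here is an assumption, not confirmed by this PR:

```csharp
using System.IO;
using K4os.Compression.LZ4.Streams; // assumed LZ4 library, not confirmed

// Decompress an entire LZ4 Frame stream into a seekable MemoryStream so it
// can be handed to ArrowStreamReader.
static MemoryStream DecompressLz4(Stream compressed)
{
    var decompressed = new MemoryStream();
    using (var decoder = LZ4Stream.Decode(compressed))
    {
        decoder.CopyTo(decompressed);
    }
    decompressed.Position = 0;
    return decompressed;
}
```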

@CurtHagenlocher (Contributor) left a review

The file artifacts/Apache.Arrow.Adbc.TetsDrivers/Apache/Debug/net8.0/.msCoverageSourceRootsMapping_Apache.Arrow.Adbc.Tests.Drivers.Apache is still present in the latest iteration. Could you please remove it from the PR?

Looks fine to me otherwise. One thing we might consider in a future change is to (configurably) fetch more than one link in parallel in order to maximize throughput.
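A rough sketch of that idea, bounding parallel downloads with a SemaphoreSlim; FetchAllAsync and maxParallelism are hypothetical names for illustration:

```csharp
// Hypothetical sketch of configurably fetching several CloudFetch links in
// parallel to maximize throughput.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

internal static class ParallelFetchSketch
{
    private static readonly HttpClient httpClient = new HttpClient();

    internal static async Task<byte[][]> FetchAllAsync(
        IReadOnlyList<Uri> presignedUrls, int maxParallelism = 4)
    {
        // The semaphore caps how many downloads run concurrently.
        using var gate = new SemaphoreSlim(maxParallelism);
        var downloads = presignedUrls.Select(async url =>
        {
            await gate.WaitAsync();
            try
            {
                return await httpClient.GetByteArrayAsync(url);
            }
            finally
            {
                gate.Release();
            }
        });
        return await Task.WhenAll(downloads);
    }
}
```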


@CurtHagenlocher (Contributor) left a review

Thanks!

@CurtHagenlocher merged commit 9eee98d into apache:main on Mar 31, 2025
7 checks passed
colin-rogers-dbt pushed a commit to dbt-labs/arrow-adbc that referenced this pull request on Jun 10, 2025:
…e#2634)
