feat(csharp): Implement CloudFetch for Databricks Spark driver #2634
Conversation
I don't think we want to do this in the Spark driver, do we?

The change should be backward compatible. How can I run the test?
CurtHagenlocher left a comment
Thanks for the change! I've left some feedback.
...ers.Apache/Debug/net8.0/.msCoverageSourceRootsMapping_Apache.Arrow.Adbc.Tests.Drivers.Apache
csharp/src/Drivers/Apache/Spark/CloudFetch/SparkCloudFetchReader.cs
MemoryStream dataStream;

// If the data is LZ4 compressed, decompress it
if (this.isLz4Compressed)
Is it possible to leverage the Apache.Arrow.Compression assembly to do decompression? It works by passing a CompressionCodecFactory to the ArrowStreamReader constructor.
Do you have code pointers? I tried it, but it didn't seem to work.
I don't, no. I can try to figure it out later; this doesn't need to be blocking.
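For context on the suggestion above, a minimal sketch of what wiring `Apache.Arrow.Compression` into the reader would presumably look like (assuming the `Apache.Arrow` and `Apache.Arrow.Compression` NuGet packages; `dataStream` stands in for the downloaded CloudFetch payload):

```csharp
// Sketch, not the driver's actual code: pass a CompressionCodecFactory
// (from the Apache.Arrow.Compression package) to ArrowStreamReader so it
// can decompress LZ4/Zstd-compressed record-batch buffers in the IPC stream.
using Apache.Arrow.Compression;
using Apache.Arrow.Ipc;

using var reader = new ArrowStreamReader(
    dataStream,                      // stream holding the Arrow IPC data
    new CompressionCodecFactory());  // decompresses compressed buffers
```

One caveat worth noting: this mechanism handles buffer-level compression as described by the Arrow IPC format. If the CloudFetch result file is LZ4-compressed as a whole, outside the IPC framing, the codec factory would never be invoked, which could explain the "seems not working" observation above.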
csharp/src/Drivers/Apache/Spark/CloudFetch/SparkCloudFetchReader.cs
csharp/src/Drivers/Apache/Spark/CloudFetch/SparkCloudFetchReader.cs
csharp/test/Drivers/Apache/Apache.Arrow.Adbc.Tests.Drivers.Apache.csproj
CurtHagenlocher left a comment
The file artifacts/Apache.Arrow.Adbc.Tests.Drivers.Apache/Debug/net8.0/.msCoverageSourceRootsMapping_Apache.Arrow.Adbc.Tests.Drivers.Apache is still present in the latest iteration. Could you please remove it from the PR?
Looks fine to me otherwise. One thing we might consider in a future change is to (configurably) fetch more than one link in parallel in order to maximize throughput.
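The parallel-fetch idea mentioned above could be sketched roughly as follows. This is a hypothetical helper, not part of the driver; the method name, `maxParallel` parameter, and use of `HttpClient` are all assumptions for illustration:

```csharp
// Sketch: download several CloudFetch presigned links concurrently,
// bounded by a configurable degree of parallelism, to overlap network
// latency and maximize throughput.
using System.Collections.Generic;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

static async Task<byte[][]> DownloadLinksAsync(
    HttpClient client, IReadOnlyList<string> presignedUrls, int maxParallel)
{
    using var gate = new SemaphoreSlim(maxParallel);
    var tasks = presignedUrls.Select(async url =>
    {
        await gate.WaitAsync();
        try
        {
            // Each presigned URL points at one result file.
            return await client.GetByteArrayAsync(url);
        }
        finally
        {
            gate.Release();
        }
    });
    // Task.WhenAll preserves link order, so the reader can still
    // consume the downloaded batches sequentially.
    return await Task.WhenAll(tasks);
}
```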
CurtHagenlocher left a comment
Thanks!
…e#2634)

Initial implementation of the CloudFetch feature in the Databricks Spark driver.
- Create a new CloudFetchReader to handle CloudFetch file download and decompression.
- Test cases for small and large results.

Coming changes after this:
- Add prefetch to the downloader
- Add renewal for expired presigned URLs
- Retries