Conversation

@CurtHagenlocher (Contributor) commented Oct 29, 2025

For illustration purposes.


```diff
  /// <inheritdoc />
- public void SetCompleted(Stream dataStream, long size)
+ public void SetCompleted(ReadOnlyMemory<byte> data, long size)
```
@CurtHagenlocher (Contributor, Author), Oct 29, 2025
With this change, we probably don't need to take size separately anymore. That said, if any of these uncompressed buffers are more than 2 GB in size, things are going to break in general due to CLR limitations. That's a much larger topic.
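As a rough illustration of the point above, a minimal sketch of a size-free overload. The type and member names here are hypothetical, not the driver's actual API; the idea is just that `ReadOnlyMemory<byte>` carries its own length:

```csharp
using System;
using System.Threading.Tasks;

// Hypothetical result holder, for illustration only.
internal sealed class DownloadResult
{
    private readonly TaskCompletionSource<bool> _completion = new();
    private ReadOnlyMemory<byte> _data;

    public ReadOnlyMemory<byte> Data => _data;

    // No separate size parameter: the buffer knows its own length.
    // Note that data.Length is an int, so buffers over 2 GB are ruled
    // out by CLR array limits anyway, as discussed above.
    public void SetCompleted(ReadOnlyMemory<byte> data)
    {
        _data = data;
        _completion.TrySetResult(true);
    }
}
```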

Contributor
For CloudFetch, each file should be around 20 MB uncompressed.
Are you going to check this in, BTW?

@CurtHagenlocher (Contributor, Author)
Boy, I sure got the size thing wrong. Anyhow, I need to find the time to benchmark this to confirm that the numbers go in the right direction. I've created a new PR as #3652.

eric-wang-1990 and others added 5 commits October 29, 2025 13:42
…gic and improve resource disposal (apache#3649)

## Summary
This PR consolidates LZ4 decompression code paths and ensures proper
resource cleanup across both CloudFetch and non-CloudFetch readers in
the Databricks C# driver.

## Changes
- **Lz4Utilities.cs**
  - Add configurable buffer size parameter to `DecompressLz4()`
  - Add async variant `DecompressLz4Async()` for the CloudFetch pipeline (sketched below)
  - Add proper `using` statements for MemoryStream disposal
  - Add default buffer size constant (80KB)

- **CloudFetchDownloader.cs**
  - Update to use shared `Lz4Utilities.DecompressLz4Async()`
  - Reduce code duplication (~12 lines consolidated)
  - Improve telemetry with compression ratio calculation
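
For reference, a minimal sketch of the shared async path, assuming the K4os.Compression.LZ4 package's `LZ4Stream.Decode` API; names and signatures here are illustrative, not the driver's actual code:

```csharp
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using K4os.Compression.LZ4.Streams;

internal static class Lz4UtilitiesSketch
{
    // Matches the described default of 80 KB (81920 bytes).
    private const int DefaultBufferSize = 81920;

    // Decompress an LZ4-framed payload and return the backing buffer plus
    // the count of valid bytes, so the caller can wrap it in a MemoryStream
    // without another copy.
    public static async Task<(byte[] Buffer, int Length)> DecompressLz4Async(
        byte[] compressed,
        int bufferSize = DefaultBufferSize,
        CancellationToken cancellationToken = default)
    {
        using var input = new MemoryStream(compressed, writable: false);
        using var decoder = LZ4Stream.Decode(input);
        using var output = new MemoryStream();
        await decoder.CopyToAsync(output, bufferSize, cancellationToken)
                     .ConfigureAwait(false);
        // GetBuffer() avoids a copy; the array stays valid after disposal.
        return (output.GetBuffer(), (int)output.Length);
    }
}
```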

## Benefits
- **Code Quality**: Both code paths now share the same decompression
implementation, reducing duplication
- **Resource Management**: Explicit MemoryStream disposal improves
memory hygiene (though GC would handle cleanup)
- **Maintainability**: Single source of truth for LZ4 decompression
logic
- **Consistency**: Same error handling and telemetry patterns across
both paths

## Technical Details
- Default buffer size remains 80KB (81920 bytes) - no behavioral changes
- Async version returns a `(byte[] buffer, int length)` tuple for
efficient MemoryStream wrapping in CloudFetch (usage sketched after this list)
- Buffer remains valid after MemoryStream disposal because the backing
array is returned directly; disposing the stream does not free it
- Maintains cancellation token support in async path
- No performance impact - purely refactoring and cleanup
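
A possible call site on the CloudFetch side, reusing the hypothetical sketch above; `compressedBytes` and `cancellationToken` stand in for the downloader's local state:

```csharp
// Wrap the decompressed bytes without an extra copy; the returned buffer
// remains valid after the producing MemoryStream has been disposed.
var (buffer, length) = await Lz4UtilitiesSketch.DecompressLz4Async(
    compressedBytes, cancellationToken: cancellationToken);
using var dataStream = new MemoryStream(buffer, 0, length, writable: false);
```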

## Testing
- Existing unit tests pass
- No functional changes to decompression logic
- Telemetry output remains consistent

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude <[email protected]>

Expands Arrow support to include the latest version, 57.

Also, the minor version of DataFusion specified in the lock file has
been updated.

Supersedes apache#3634.
@CurtHagenlocher (Contributor, Author)
Bah, I botched the rebase yet again. Will resubmit.
