fix(csharp/src/Drivers/Databricks): Reduce LZ4 decompression memory by using Custom Array Pool #3654
Conversation
…y by 96%

⚠️ DISCUSSION REQUIRED: This reduces accumulated allocations but not peak concurrent memory. Requires discussion on the benefit vs. complexity trade-off.

Reduces LZ4 internal buffer memory allocation from ~900MB to ~40MB (96% reduction) by implementing a custom ArrayPool that supports 4MB buffers.

Problem:
- Databricks uses LZ4 frames with 4MB maxBlockSize
- .NET's ArrayPool.Shared has a 1MB limit
- 222 decompressions × 4MB fresh allocations = 888MB on the LOH

Solution:
- CustomLZ4FrameReader extends StreamLZ4FrameReader
- Overrides AllocBuffer() to use a custom ArrayPool (4MB max, 10 buffers)
- CustomLZ4DecoderStream wrapper using CustomLZ4FrameReader
- Updated Lz4Utilities to use the custom implementation

Results:
- Total allocations: 900MB → 40MB (96% reduction)
- GC pressure: significantly reduced (fewer LOH allocations)
- Peak concurrent memory: unchanged (~8-16MB with parallelDownloads=1)

Note: This primarily optimizes accumulated allocations over time, not peak concurrent memory usage. With sequential decompression (default), peak memory remains similar but with better reuse and fewer GC Gen2 collections.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
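For reference, a minimal sketch of the pool this commit describes, assuming `ArrayPool<byte>.Create` from System.Buffers; the class name and the 4MB / 10-buffer limits mirror the commit text, not the driver's exact source.

```csharp
using System.Buffers;

// Sketch of the pooling idea (names are illustrative): per the commit message,
// ArrayPool<byte>.Shared will not retain 4MB arrays (its 1MB limit), so each
// 4MB Rent becomes a fresh LOH allocation. A dedicated pool created with a
// larger maxArrayLength lets those buffers be reused instead.
internal static class Lz4BufferPoolSketch
{
    // Up to 4MB per array, at most 10 arrays retained per size bucket.
    public static readonly ArrayPool<byte> Instance =
        ArrayPool<byte>.Create(maxArrayLength: 4 * 1024 * 1024, maxArraysPerBucket: 10);
}
```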
…ooling

Change CustomLZ4FrameReader to clear buffers when returning them to the pool (clearArray: true) to prevent stale data from previous decompressions from corrupting subsequent operations.

Without clearing, buffers retain old data, which can cause Arrow IPC stream corruption when the decompressor doesn't fully overwrite the buffer or there are length tracking issues.

Performance impact is minimal (~1-2ms per 4MB buffer clear) compared to network I/O (10-100ms) and decompression time (5-20ms).

Fixes: Power BI error "Corrupted IPC message. Received a continuation token at the end of the message."

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
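As a small illustration of the clearing behavior the commit describes (the pool and buffer size are placeholders, not the driver's actual code):

```csharp
using System.Buffers;

internal static class ClearedReturnSketch
{
    // Rent a 4MB scratch buffer, use it, and return it cleared so stale bytes from
    // an earlier decompression can never be observed by a later rental of the array.
    public static void UseScratchBuffer(ArrayPool<byte> pool)
    {
        byte[] buffer = pool.Rent(4 * 1024 * 1024);
        try
        {
            // ... decompress into buffer here ...
        }
        finally
        {
            // clearArray: true zeroes the array before it re-enters the pool.
            pool.Return(buffer, clearArray: true);
        }
    }
}
```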
…on of cloud downloads (apache#3652)"

This reverts commit eae39dd.
Add missing Stream implementation details and comprehensive documentation:
- Check inner stream CanRead state in addition to the disposed flag
- Add explicit timeout property implementations (not supported)
- Add FlushAsync() override for a complete Stream contract
- Document why K4os base classes cannot be inherited (private protected constructor)
- Document intentionally omitted features (timeouts, write operations, DisposeAsync)
- Add inline comments explaining disposal logic and buffer pooling

These changes ensure CustomLZ4DecoderStream is a complete, well-documented Stream implementation that properly uses CustomLZ4FrameReader for 4MB buffer pooling.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
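A hedged sketch of the Stream contract points listed above, not the PR's actual CustomLZ4DecoderStream: read-only, no seeking, no timeout support, and CanRead reflecting both the wrapper's disposed state and the inner stream's readability.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Illustrative read-only decoder-stream wrapper (not the driver's actual class).
internal abstract class ReadOnlyDecoderStreamSketch : Stream
{
    private readonly Stream _inner;
    private bool _disposed;

    protected ReadOnlyDecoderStreamSketch(Stream inner) =>
        _inner = inner ?? throw new ArgumentNullException(nameof(inner));

    // CanRead reflects both this wrapper's disposal and the inner stream's state.
    public override bool CanRead => !_disposed && _inner.CanRead;
    public override bool CanSeek => false;
    public override bool CanWrite => false;
    public override bool CanTimeout => false;

    // Timeouts are intentionally unsupported, matching the commit's note.
    public override int ReadTimeout
    {
        get => throw new InvalidOperationException("Timeouts are not supported.");
        set => throw new InvalidOperationException("Timeouts are not supported.");
    }

    // Nothing is buffered for writing, so FlushAsync is a no-op that still honors cancellation.
    public override Task FlushAsync(CancellationToken cancellationToken) =>
        cancellationToken.IsCancellationRequested
            ? Task.FromCanceled(cancellationToken)
            : Task.CompletedTask;

    public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
    public override void SetLength(long value) => throw new NotSupportedException();
    public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();

    protected override void Dispose(bool disposing)
    {
        _disposed = true; // after this, CanRead reports false
        base.Dispose(disposing);
        // Whether the inner stream should also be disposed here is an open question;
        // this sketch does not assume ownership.
    }
}
```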
…ize estimation

Remove pre-allocation and telemetry code from LZ4 decompression:
- Remove estimated size calculation based on compression ratios
- Remove LZ4 frame header peeking in the async version
- Remove Activity telemetry events tracking buffer waste
- Remove unused imports (System.Diagnostics, Apache.Arrow.Adbc.Tracing)

Rationale:
- MemoryStream grows efficiently on its own (doubles capacity when needed)
- Estimation adds complexity without significant benefit
- Focus is on pooling 4MB LZ4 internal buffers, not output buffer sizing

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@CurtHagenlocher This is ready to merge, please let me know if you have any concerns.

There's a little bit of overlap between this PR and #3683. Is that deliberate? This one just changes decompression while the other one just replaces the implementation of the stream being returned?
CurtHagenlocher left a comment
Thanks! Again, I'm worried about the lack of eventual cleanup in environments where things other than Databricks connections might be happening in the same process.
/// This allows the 4MB buffers required by Databricks LZ4 frames to be pooled and reused.
/// maxArraysPerBucket=10 means we keep up to 10 buffers of each size in the pool.
/// </summary>
private static readonly ArrayPool<byte> LargeBufferPool =
I'm going to echo my comment on the other PR and point out that this will keep the memory allocated for the remainder of the process lifetime. It's not quite as bad as RecyclableMemoryStream because it has a default per-bucket limit and will discard memory past that point once it's freed, but with ten arrays per bucket and a maximum bucket size of 40MB, it could (in the worst case) still end up being over 70 MB.
Assuming that testing doesn't show a problem, I'd recommend a similar strategy here as there -- which is to associate the pool with either the driver, the database or the connection so that it will eventually get freed.
Sounds good, I attached this to the Database level.
Connection level would still be bad: if we have multiple connections in the connection pool, we would be creating the same ArrayPool inside each connection.
This PR and the other one you tagged serve different purposes.
…se instance

Address code review feedback by moving ArrayPool lifecycle management from process-level (static) to Database instance-level.

**Changes:**
- DatabricksDatabase: Creates ArrayPool (one per database instance)
- DatabricksConnection: Receives pool from database via property
- CustomLZ4FrameReader: Accepts ArrayPool as constructor parameter
- CustomLZ4DecoderStream: Accepts ArrayPool as constructor parameter
- Lz4Utilities: Methods require ArrayPool parameter (no fallback)
- CloudFetchDownloader: Changed to use IHiveServer2Statement (was ITracingStatement)
- Readers: Get pool directly from connection and pass to LZ4 utilities

**Benefits:**
- Pool lifecycle tied to Database, not process lifetime
- Memory freed when Database is disposed/GC'd
- Works well with connection pooling: multiple connections share one pool
- Cleaner design: no optional/nullable complexity

**Memory footprint:**
- With connection pooling: 10 connections → 1 shared pool = 40MB
- Without: each Database instance → 1 pool = 40MB

Addresses review comment: apache#3654 (comment)

🤖 Generated with [Claude Code](https://claude.com/claude-code)
Force-pushed from c28a2d7 to eb08d41.
Improves on the property setter approach by passing Lz4BufferPool via an internal constructor parameter, matching the RecyclableMemoryStreamManager pattern.

Changes:
- Added internal constructor that accepts Lz4BufferPool parameter
- Public constructor chains to internal constructor with null (creates own pool)
- DatabricksDatabase passes shared pool via internal constructor
- Property changed from mutable to immutable (get-only)
- No public API changes - backward compatible

Benefits:
- Compile-time safety - Database must explicitly pass the pool
- Immutable after construction - can't be changed accidentally
- Predictable initialization - pool created at construction time
- Consistent pattern with RecyclableMemoryStreamManager

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
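A hedged sketch of the pattern this commit describes; the class and member names echo the commit text (a database-owned Lz4BufferPool handed to connections via an internal constructor), but the signatures are illustrative, not the driver's actual API.

```csharp
using System.Buffers;

// Illustrative only: the database owns one pool and hands it to every connection it
// creates, so the 4MB LZ4 buffers are released when the database instance goes away
// instead of living for the whole process (as a static pool would).
internal sealed class DatabaseSketch
{
    private readonly ArrayPool<byte> _lz4BufferPool =
        ArrayPool<byte>.Create(4 * 1024 * 1024, 10);

    public ConnectionSketch Connect() => new ConnectionSketch(_lz4BufferPool);
}

internal sealed class ConnectionSketch
{
    // Public constructor: a standalone connection creates its own pool.
    public ConnectionSketch() : this(null) { }

    // Internal constructor: the database passes its shared pool in (null = create own).
    internal ConnectionSketch(ArrayPool<byte> lz4BufferPool)
    {
        Lz4BufferPool = lz4BufferPool ?? ArrayPool<byte>.Create(4 * 1024 * 1024, 10);
    }

    // Get-only, so the pool cannot be swapped after construction.
    internal ArrayPool<byte> Lz4BufferPool { get; }
}
```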
…mory-stream-lz4

Resolved conflicts by accepting the ArrayPool approach from PR apache#3654:
- DatabricksConnection.cs: Use Lz4BufferPool (ArrayPool<byte>)
- DatabricksDatabase.cs: Use Lz4BufferPool instance
- Lz4Utilities.cs: Use ArrayPool<byte> parameter
- CloudFetchDownloader.cs: Use Lz4BufferPool
- Added CustomLZ4DecoderStream and CustomLZ4FrameReader

This replaces the RecyclableMemoryStreamManager approach with ArrayPool for better memory management of LZ4 buffers.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
This merge combines both memory management approaches:
- RecyclableMemoryStreamManager (for output streams)
- ArrayPool (for internal LZ4 decompression buffers)

Changes:
- DatabricksConnection: Added both RecyclableMemoryStreamManager and Lz4BufferPool properties
- DatabricksDatabase: Instantiates both pools and passes them to connections
- Lz4Utilities: DecompressLz4Async now accepts both managers
- CloudFetchDownloader: Uses both for decompression operations
- Added CustomLZ4DecoderStream and CustomLZ4FrameReader from PR apache#3654
Fix: Reduce LZ4 decompression memory by using a custom ArrayPool.
Summary
Reduces LZ4 internal buffer memory allocation from ~900MB to ~40MB (96% reduction) for large Databricks query results by implementing a custom ArrayPool that supports buffer sizes larger than .NET's default 1MB limit.
Important: This optimization primarily reduces:
- Accumulated allocations over the lifetime of a query
- GC pressure (fewer LOH allocations and Gen2 collections)

But does NOT significantly reduce:
- Peak concurrent memory usage: with parallelDownloads=1, peak is still ~8-16MB (1-2 buffers in use)

Solution
Created a custom ArrayPool by overriding K4os.Compression.LZ4's buffer allocation methods:
- CustomLZ4FrameReader extends StreamLZ4FrameReader with a custom ArrayPool (4MB max, 10 buffers)
- CustomLZ4DecoderStream wraps CustomLZ4FrameReader and is used instead of the default LZ4Stream.Decode()

Key Implementation
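The snippet below is a minimal sketch of the surrounding decompression flow and the rent/return discipline against a 4MB-capable pool. The class and method names are illustrative; the PR itself goes further and routes K4os's internal frame buffers through the pool via CustomLZ4FrameReader, which is not reproduced here.

```csharp
using System.Buffers;
using System.IO;
using K4os.Compression.LZ4.Streams;

internal static class Lz4DecompressSketch
{
    // A pool sized for Databricks' 4MB LZ4 frame blocks, keeping up to 10 buffers per bucket.
    private static readonly ArrayPool<byte> Pool =
        ArrayPool<byte>.Create(maxArrayLength: 4 * 1024 * 1024, maxArraysPerBucket: 10);

    // Decompress an LZ4-framed payload into a MemoryStream using a pooled copy buffer,
    // returned cleared so no stale bytes survive between decompressions.
    public static MemoryStream Decompress(Stream compressed)
    {
        var output = new MemoryStream();
        byte[] copyBuffer = Pool.Rent(4 * 1024 * 1024);
        try
        {
            using Stream decoder = LZ4Stream.Decode(compressed);
            int read;
            while ((read = decoder.Read(copyBuffer, 0, copyBuffer.Length)) > 0)
            {
                output.Write(copyBuffer, 0, read);
            }
            output.Position = 0;
            return output;
        }
        finally
        {
            Pool.Return(copyBuffer, clearArray: true);
        }
    }
}
```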
Performance
- Total allocations: ~900MB → ~40MB (96% reduction)
- GC pressure: significantly reduced (fewer LOH allocations)
- Peak concurrent memory: unchanged (~8-16MB with parallelDownloads=1)
Why This Works
K4os Library Design:
- LZ4FrameReader has virtual methods: AllocBuffer() and ReleaseBuffer()
- Default allocation goes through BufferPool.Alloc() → DefaultArrayPool (1MB limit)

Buffer Lifecycle:
- With parallelDownloads=1 (default), only 1-2 buffers are active at once

Concurrency Considerations
Recommendation: If using parallel_downloads > 4, consider increasing maxArraysPerBucket in a future enhancement.

Files Changed
New Files
- src/Drivers/Databricks/CustomLZ4FrameReader.cs (~80 lines)
- src/Drivers/Databricks/CustomLZ4DecoderStream.cs (~118 lines)

Modified Files
- src/Drivers/Databricks/Lz4Utilities.cs - Use CustomLZ4DecoderStream, add telemetry

Testing & Validation
Before:

After: