ARROW-12110: [Java] Implement ZSTD compression#9822
ARROW-12110: [Java] Implement ZSTD compression#9822emkornfield wants to merge 6 commits intoapache:masterfrom
Conversation
emkornfield
commented
Mar 27, 2021
- Add ZSTD codec implementation.
- Removes dependency on netty by using ArrowBuf methods.
- Updates docs and archery test (will wait on CI to run archery if that fails will do more debugging.)
| skip.add("Go") | ||
| skip.add("JS") | ||
| skip.add("Rust") | ||
| if name == 'zstd': |
There was a problem hiding this comment.
A downside to removing this comment completely is that someone might not easily figure out how to skip tests based on name (when working on another implementation).
Do you mind adding a comment to the effect that if an implementer is skipping one of the compression formats, they could use if name == 'format'?
There was a problem hiding this comment.
Added a link back to this PR>
Add link back to this PR for disabling specific codecs.
check decompressed length.
|
I ran archery tests for java locally and zstd seems to pass with this PR. I'm not sure how these changes could have affected the archery/crossbow tests. @dianaclarke I think you might have been the last person to touch this area? Any thoughts? |
I'm pretty new to Arrow, and I'm not quite sure I follow. What's failing? The Are you referring to the I'm probably too n00b to be of much help on this issue, but good luck! That said, I do hope to find some time to collaborate on the following PR with you all. ARROW-10031: [CI][Java] Support Java benchmark in Archery (#8210) I would be fun to get these Java benchmarks hooked up to Conbench. I have a blog post coming soon that better explains the Arrow/Archery/Conbench integration. Until then: https://conbench.ursa.dev/ Cheers & stay safe! |
|
@dianaclarke hmm, it looks like "Archery & Crossbow / Archery Unittests and Crossbow Check Config (pull_request" passed this time, it seems like it might be flaky (i don't know that there is a great way to see old runs). Yeah, I think the integration test failure is completely unrelated and there is already an open JIRA for it. |
|
|
||
| @Override | ||
| protected ArrowBuf doCompress(BufferAllocator allocator, ArrowBuf uncompressedBuffer) { | ||
| long maxSize = Zstd.compressBound(uncompressedBuffer.writerIndex()); |
There was a problem hiding this comment.
Is it possible to cause too much waste of memroy in the worst case?
There was a problem hiding this comment.
I think this is just a fact of life with compression algorithms. There are potentially ways of mitigating memory waste if changed the APIs around a little bit (trying to pack more data into a single ArrowBuf) or consolidating buffers into a larger buffer if compression level is high. But this is the guidance the documented methods give for sizing buffers.
| long bytesWritten = Zstd.compressUnsafe( | ||
| compressedBuffer.memoryAddress() + CompressionUtil.SIZE_OF_UNCOMPRESSED_LENGTH, dstSize, | ||
| /*src*/uncompressedBuffer.memoryAddress(), /*srcSize=*/uncompressedBuffer.writerIndex(), | ||
| /*level=*/3); |
There was a problem hiding this comment.
Should we make the level configurable?
There was a problem hiding this comment.
maybe at some point, the default is 3. LZ4 also has a compression level which I don't think we are setting.
There was a problem hiding this comment.
I'll open up a JIRA to track configuration.
|
I am curious to know about the performance of this codec (I suppose it should be good, as there are no on-heap/off-heap data copies). |
|
Yes, I expect it to be good. I'm going to try to run some end-to-end experiments and I can report back numbers. |
Ah, now I see. I didn't write that test. I think @bkietz did, but I can try to find some time this week to dig into it. |
I see what the issue is. That test is assuming order, when it can't. I'll create a ticket and submit a PR shortly. [Developer][Archery] Flaky test: test_static_runner_from_json |
|
@liyafan82 I opened https://issues.apache.org/jira/browse/ARROW-12163 to track making compression levels configurable, any other concerns? |
@emkornfield My only concern is performance. If we are going to address it in ARROW-11901, this PR is ready for merge IMO. |
Yes, I think functionality should come first and we can measure and iterate on performance. Thank you for your review. |
|
going to merge. |
* Add ZSTD codec implementation. * Removes dependency on netty by using ArrowBuf methods. * Updates docs and archery test (will wait on CI to run archery if that fails will do more debugging.) Closes apache#9822 from emkornfield/zstd Lead-authored-by: emkornfield <[email protected]> Co-authored-by: Micah Kornfield <[email protected]> Signed-off-by: Micah Kornfield <[email protected]>