ORC-1866: Avoid zlib decompression infinite loop #2127
Conversation
dongjoon-hyun left a comment

+1, LGTM. Thank you, @cxzl25.
### What changes were proposed in this pull request?

This PR aims to make ZlibCodec decompression of damaged files fail fast.

### Why are the changes needed?

This is a long-standing issue. The decompress method implemented by ZlibCodec may enter an infinite loop when encountering some corrupt files.

jstack:

```java
"main" #1 [4611] prio=5 os_prio=31 cpu=55921.47ms elapsed=57.53s tid=0x0000000139014600 nid=4611 runnable [0x000000016d9fa000]
   java.lang.Thread.State: RUNNABLE
	at java.util.zip.Inflater.inflateBytesBytes(java.base@21.0.5/Native Method)
	at java.util.zip.Inflater.inflate(java.base@21.0.5/Inflater.java:376)
	- locked <0x00000004367befc0> (a java.util.zip.Inflater$InflaterZStreamRef)
	at org.apache.orc.impl.ZlibCodec.decompress(ZlibCodec.java:168)
	at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:521)
	at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:548)
	at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:535)
	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:2052)
	at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2071)
	at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2169)
	at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:2001)
	at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
	at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1432)
	at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:208)
	at org.apache.orc.tools.PrintData.main(PrintData.java:288)
	at org.apache.orc.tools.Driver.main(Driver.java:120)
```

### How was this patch tested?

1. Local test
2. Add UT

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #2127 from cxzl25/zlib_infinite_loop.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8eaf92d)
Signed-off-by: Dongjoon Hyun <[email protected]>
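For context on the fix, below is a minimal sketch of the fail-fast idea, assuming the whole compressed chunk is already in memory (as it is for ORC's compressed stream chunks). The class name and exception message are illustrative; this is not the exact patch applied to ZlibCodec.decompress.

```java
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public final class FailFastInflate {
  private FailFastInflate() {}

  /**
   * Decompresses raw-deflate data, throwing instead of spinning when the
   * input is corrupt. ORC writes its ZLIB streams without the zlib header,
   * hence the "nowrap" Inflater(true).
   */
  public static int decompress(byte[] input, byte[] output) throws IOException {
    Inflater inflater = new Inflater(true);
    inflater.setInput(input);
    int offset = 0;
    try {
      while (!inflater.finished() && offset < output.length) {
        int count = inflater.inflate(output, offset, output.length - offset);
        if (count == 0) {
          // No progress even though the stream is not finished and the
          // output buffer has room: since all input was supplied up front,
          // the data must be truncated or corrupt. Fail fast instead of
          // looping forever.
          throw new IOException("Failed to decompress: corrupt zlib data");
        }
        offset += count;
      }
    } catch (DataFormatException e) {
      throw new IOException("Bad compression data", e);
    } finally {
      inflater.end();
    }
    return offset;
  }
}
```

The key observation is that with all input provided up front, Inflater.inflate() returning 0 on an unfinished stream with spare output room can never resolve itself, so throwing is always safe.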
@cxzl25 Could you please help me understand how a producer could create such corrupted files? How did you create the corrupted test ORC file? Context: we have a Spark job that produced one such corrupted file, and I wanted to know how it could happen. Our Spark application uses ...
Usually when I encounter this problem, it is not necessarily a problem at the ORC level. For example, it may be data corruption caused by the HDFS EC storage policy, which makes the stream impossible to decompress.

I created the damaged file by adjusting the buffer size when writing, because I had encountered such a case in the production environment: the buffer size written at the time was incorrect due to a hardware failure, which made the file unreadable.
Thanks @cxzl25 for the quick response! Are you aware of any other scenario where this could happen? In our case, the Spark app wrote 4000 files but only one is corrupted. When I tried reading the file, I was able to read up to 350,208 records but got stuck after that; from the file stats, it has 3M+ rows. IMO this looks like a hardware failure similar to your scenario. By the way, how did you confirm the hardware failure?

Is it possible for you to share a code snippet showing how you adjusted the buffer size? Any code reference you might have?
I used the dcdiag tool (Intel's Data Center Diagnostic Tool) to diagnose it.

Since it's been a while, I don't have the specific code, but I vaguely remember that during debugging I adjusted the buffer size when writing the PostScript:

orc/java/core/src/java/org/apache/orc/impl/WriterImpl.java, lines 632 to 633 in cff8877
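For anyone wanting a starting point, here is a hypothetical sketch of one way to produce a corrupt ZLIB-compressed ORC file for testing. Rather than patching the PostScript buffer size in a debugger as described above, it writes a valid file with the public writer API and then flips one byte in the middle; the path, row count, and corruption offset are all illustrative, and this is not the unit test from the actual patch.

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.CompressionKind;
import org.apache.orc.OrcFile;
import org.apache.orc.TypeDescription;
import org.apache.orc.Writer;

public class CorruptOrcFileRepro {
  public static void main(String[] args) throws Exception {
    String file = "/tmp/corrupt-test.orc";  // illustrative path
    Configuration conf = new Configuration();
    TypeDescription schema = TypeDescription.fromString("struct<x:string>");

    // Write a normal ZLIB-compressed file first.
    try (Writer writer = OrcFile.createWriter(new Path(file),
        OrcFile.writerOptions(conf)
            .setSchema(schema)
            .compress(CompressionKind.ZLIB))) {
      VectorizedRowBatch batch = schema.createRowBatch();
      BytesColumnVector col = (BytesColumnVector) batch.cols[0];
      for (int i = 0; i < 100_000; i++) {
        col.setVal(batch.size++, ("row-" + i).getBytes(StandardCharsets.UTF_8));
        if (batch.size == batch.getMaxSize()) {
          writer.addRowBatch(batch);
          batch.reset();
        }
      }
      if (batch.size > 0) {
        writer.addRowBatch(batch);
      }
    }

    // Flip one byte in the middle of the file, which is very likely inside
    // a compressed stream, so reading the file back should hit the
    // corrupt-data path in ZlibCodec.
    try (RandomAccessFile raf = new RandomAccessFile(file, "rw")) {
      long offset = raf.length() / 2;  // illustrative offset
      raf.seek(offset);
      int original = raf.read();
      raf.seek(offset);
      raf.write(original ^ 0xFF);
    }

    // Hadoop's local FileSystem is checksummed; delete the .crc sidecar so
    // the corruption reaches the ORC reader instead of a ChecksumException.
    new File("/tmp/.corrupt-test.orc.crc").delete();
  }
}
```

Depending on where the flipped byte lands, the reader may still fail with a format error before reaching the codec, so a real test would pick the offset deterministically.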
What kind of hardware failure did you see? Can you explain a little bit about it?
The problem at that time was that some data failed to be read; there was no problem using ... Finally, I used dcdiag to confirm it. Output:

Intel(R) Data Center Diagnostic Tool Version v630 (1aa16c950f34b774738531af947f307f77456356)
Testing started. This takes about 45 minutes.
Test failed (#d88c8fa).
Test completed and an error was detected on the physical processor containing /sys/devices/system/cpu/cpu20 (family-model-stepping 06-55-04)