Conversation

@cxzl25 (Contributor) commented Feb 8, 2025

What changes were proposed in this pull request?

This PR aims to make ZlibCodec decompression fail fast on damaged files.

Why are the changes needed?

This is a long-standing issue: the decompress method implemented by ZlibCodec may enter an infinite loop when it encounters certain corrupt files.

jstack

"main" #1 [4611] prio=5 os_prio=31 cpu=55921.47ms elapsed=57.53s tid=0x0000000139014600 nid=4611 runnable  [0x000000016d9fa000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.Inflater.inflateBytesBytes(java.base@21.0.5/Native Method)
        at java.util.zip.Inflater.inflate(java.base@21.0.5/Inflater.java:376)
        - locked <0x00000004367befc0> (a java.util.zip.Inflater$InflaterZStreamRef)
        at org.apache.orc.impl.ZlibCodec.decompress(ZlibCodec.java:168)
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:521)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:548)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:535)
        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:2052)        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2071)
        at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2169)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:2001)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1432)
        at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:208)
        at org.apache.orc.tools.PrintData.main(PrintData.java:288)
        at org.apache.orc.tools.Driver.main(Driver.java:120)
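
The description above says the goal is to fail fast instead of looping. Below is a minimal illustrative sketch of one way to guard the inflate loop, assuming an ORC-style decompress loop over java.util.zip.Inflater; the class and method names are placeholders, and this is not necessarily the exact patch applied to ZlibCodec.decompress.

```java
// Illustrative fail-fast guard; placeholder names, not the exact ZlibCodec patch.
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public final class FailFastZlibDecompress {
  static int decompress(byte[] input, byte[] output) throws IOException {
    Inflater inflater = new Inflater(true); // assume raw ("nowrap") zlib streams
    inflater.setInput(input);
    int offset = 0;
    try {
      while (!inflater.finished() && offset < output.length) {
        int count = inflater.inflate(output, offset, output.length - offset);
        offset += count;
        if (count == 0 && !inflater.finished()) {
          // On corrupt or truncated input, inflate() can keep returning 0
          // while finished() stays false -- the infinite loop shown in the
          // jstack above. Throw instead of spinning.
          throw new IOException("Inflater made no progress (needsInput="
              + inflater.needsInput() + ", needsDictionary="
              + inflater.needsDictionary() + "); stream is corrupt or truncated");
        }
      }
    } catch (DataFormatException e) {
      throw new IOException("Bad compression data", e);
    } finally {
      inflater.end();
    }
    return offset;
  }
}
```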

How was this patch tested?

  1. Local test
  2. Added a unit test (a hypothetical sketch follows below)
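
The actual unit test is not shown in this thread; below is a hypothetical sketch of what such a test could look like. The class name, the file corrupt-zlib.orc, and the expected exception type are assumptions, not taken from the patch.

```java
// Hypothetical test shape: reading a corrupt zlib-compressed ORC file should
// throw instead of hanging. File name and exception type are assumptions.
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.junit.jupiter.api.Test;

public class TestCorruptZlibRead {
  @Test
  public void readingCorruptFileFailsFast() throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("corrupt-zlib.orc"),
        OrcFile.readerOptions(conf));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    RecordReader rows = reader.rows();
    try {
      assertThrows(IOException.class, () -> {
        while (rows.nextBatch(batch)) {
          // drain all batches; before the fix this loop never returned
        }
      });
    } finally {
      rows.close();
    }
  }
}
```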

Was this patch authored or co-authored using generative AI tooling?

No

@cxzl25 cxzl25 marked this pull request as draft February 8, 2025 08:35
@github-actions github-actions bot added the JAVA label Feb 8, 2025
@cxzl25 cxzl25 force-pushed the zlib_infinite_loop branch from 68c88c9 to c02c82c on March 26, 2025 03:51
@cxzl25 cxzl25 changed the title from "Avoid zlib decompression infinite loop" to "ORC-1866: Avoid zlib decompression infinite loop" Mar 26, 2025
@cxzl25 cxzl25 marked this pull request as ready for review March 26, 2025 04:54
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @cxzl25.

@dongjoon-hyun dongjoon-hyun added this to the 1.9.6 milestone Mar 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
### What changes were proposed in this pull request?
This PR aims to make ZlibCodec decompression fail fast on damaged files.

### Why are the changes needed?

This is a long-standing issue: the decompress method implemented by ZlibCodec may enter an infinite loop when it encounters certain corrupt files.

jstack
```java
"main" #1 [4611] prio=5 os_prio=31 cpu=55921.47ms elapsed=57.53s tid=0x0000000139014600 nid=4611 runnable  [0x000000016d9fa000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.Inflater.inflateBytesBytes(java.base@21.0.5/Native Method)
        at java.util.zip.Inflater.inflate(java.base@21.0.5/Inflater.java:376)
        - locked <0x00000004367befc0> (a java.util.zip.Inflater$InflaterZStreamRef)
        at org.apache.orc.impl.ZlibCodec.decompress(ZlibCodec.java:168)
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:521)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:548)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:535)
        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:2052)
        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2071)
        at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2169)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:2001)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1432)
        at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:208)
        at org.apache.orc.tools.PrintData.main(PrintData.java:288)
        at org.apache.orc.tools.Driver.main(Driver.java:120)
```

### How was this patch tested?
1. Local test
2. Added a unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #2127 from cxzl25/zlib_infinite_loop.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8eaf92d)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
@dongjoon-hyun dongjoon-hyun modified the milestones: 1.9.6, 1.8.9 Mar 30, 2025
@dushyantk1509 commented Oct 8, 2025

@cxzl25 Could you please help me understand how a producer could create such corrupted files? How did you create the corrupted test ORC file?

Context: We have a Spark job that produced one such corrupted file, and I wanted to understand how that could happen. The Spark application uses the rdd.saveAsNewAPIHadoopFile API with org.apache.orc.mapreduce.OrcOutputFormat to write ORC files. The orc-core and orc-mapreduce version used is 1.6.10.

@cxzl25 (Contributor, Author) commented Oct 8, 2025

> Could you please help me understand how a producer could create such corrupted files?

When I encounter this problem, it is not necessarily an issue at the ORC level. For example, it may be data corruption caused by the HDFS EC storage policy, which then causes decompression to fail.

> How did you create the corrupted test ORC file?

I created the damaged test file by adjusting the buffer size when writing, based on a case I encountered in a production environment: the buffer size written at that time was incorrect due to a hardware failure, which made the file unreadable.
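
For readers who want to reproduce this, one generic way to manufacture a corrupt test file is sketched below: write a valid zlib-compressed ORC file first, then invert a run of bytes inside its data region so the compressed stream no longer inflates. The class name, path, offset, and length are illustrative assumptions, not how the file in this PR was actually produced.

```java
// Hypothetical helper: damage an ORC file by inverting bytes in its data
// region so a zlib-compressed stream no longer inflates. Offset and length
// are illustrative and depend on the file's actual layout.
import java.io.IOException;
import java.io.RandomAccessFile;

public final class CorruptOrcFile {
  public static void flipBytes(String path, long offset, int len) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
      for (int i = 0; i < len; i++) {
        raf.seek(offset + i);
        int b = raf.read();   // read one byte (advances the file pointer)
        raf.seek(offset + i); // move back to overwrite the same byte
        raf.write(b ^ 0xFF);  // invert it
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. corrupt 16 bytes starting past the ORC magic and stream header
    flipBytes("test-zlib.orc", 64, 16);
  }
}
```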

@dushyantk1509

Thanks @cxzl25 for the quick response!

> For example, it may be data corruption caused by the HDFS EC storage policy, which then causes decompression to fail.

Are you aware of any other scenarios where this could happen? In our case, the Spark app wrote 4000 files but only one is corrupted. I tried reading the file and was able to read up to 350,208 records but got stuck after that. According to the file stats, it has 3M+ rows. IMO this looks like a hardware failure similar to your scenario. By the way, how did you confirm the hardware failure?

> I created the damaged test file by adjusting the buffer size when writing

Is it possible for you to share a code snippet showing how you adjusted the buffer size? Any code reference you might have?

@cxzl25 (Contributor, Author) commented Oct 8, 2025

> How did you confirm the hardware failure?

I used the dcdiag tool to diagnose it:

https://www.intel.com/content/www/us/en/support/articles/000098269/processors.html

> How did you adjust the buffer size? Any code reference you might have?

It's been a while, so I don't have the specific code, but I vaguely remember adjusting the buffer size when writing the PostScript during debugging:

```java
builder.setCompression(writeCompressionKind(codec.getKind()))
       .setCompressionBlockSize(unencryptedOptions.getBufferSize());
```

@dushyantk1509

> The buffer size written at that time was incorrect due to a hardware failure, which made the file unreadable.

What kind of hardware failure did you see? Can you explain it a little?

@cxzl25 (Contributor, Author) commented Oct 9, 2025

> What kind of hardware failure did you see? Can you explain it a little?

The problem at that time was that some data failed to be read. Checking the memory with memtester found no problem. Finally, dcdiag detected a possible problem with the CPU.

dcdiag output:

```
Intel(R) Data Center Diagnostic Tool Version v630 (1aa16c950f34b774738531af947f307f77456356)
Testing started. This takes about 45 minutes.
Test failed (#d88c8fa).
Test completed and an error was detected on the physical processor containing /sys/devices/system/cpu/cpu20 (family-model-stepping 06-55-04)
```
