Conversation

@cxzl25 (Contributor) commented Feb 8, 2025

What changes were proposed in this pull request?

This PR aims to make ZlibCodec decompression fail fast on damaged files.

Why are the changes needed?

This is a long-standing issue: the decompress method implemented by ZlibCodec may enter an infinite loop when it encounters certain corrupt files.

jstack

"main" #1 [4611] prio=5 os_prio=31 cpu=55921.47ms elapsed=57.53s tid=0x0000000139014600 nid=4611 runnable  [0x000000016d9fa000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.Inflater.inflateBytesBytes(java.base@21.0.5/Native Method)
        at java.util.zip.Inflater.inflate(java.base@21.0.5/Inflater.java:376)
        - locked <0x00000004367befc0> (a java.util.zip.Inflater$InflaterZStreamRef)
        at org.apache.orc.impl.ZlibCodec.decompress(ZlibCodec.java:168)
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:521)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:548)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:535)
        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:2052)        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2071)
        at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2169)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:2001)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1432)
        at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:208)
        at org.apache.orc.tools.PrintData.main(PrintData.java:288)
        at org.apache.orc.tools.Driver.main(Driver.java:120)
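
The description above says the goal is to fail fast instead of looping. Below is a minimal illustrative sketch of one way to guard the inflate loop, assuming an ORC-style decompress loop over java.util.zip.Inflater; the class and method names are placeholders, and this is not necessarily the exact patch applied to ZlibCodec.decompress.

```java
// Illustrative fail-fast guard; placeholder names, not the exact ZlibCodec patch.
import java.io.IOException;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

public final class FailFastZlibDecompress {
  static int decompress(byte[] input, byte[] output) throws IOException {
    Inflater inflater = new Inflater(true); // assume raw ("nowrap") zlib streams
    inflater.setInput(input);
    int offset = 0;
    try {
      while (!inflater.finished() && offset < output.length) {
        int count = inflater.inflate(output, offset, output.length - offset);
        offset += count;
        if (count == 0 && !inflater.finished()) {
          // On corrupt or truncated input, inflate() can keep returning 0
          // while finished() stays false -- the infinite loop shown in the
          // jstack above. Throw instead of spinning.
          throw new IOException("Inflater made no progress (needsInput="
              + inflater.needsInput() + ", needsDictionary="
              + inflater.needsDictionary() + "); stream is corrupt or truncated");
        }
      }
    } catch (DataFormatException e) {
      throw new IOException("Bad compression data", e);
    } finally {
      inflater.end();
    }
    return offset;
  }
}
```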

How was this patch tested?

  1. Local test
  2. Added a unit test (a hypothetical sketch follows below)
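
The actual unit test is not shown in this thread; below is a hypothetical sketch of what such a test could look like. The class name, the file corrupt-zlib.orc, and the expected exception type are assumptions, not taken from the patch.

```java
// Hypothetical test shape: reading a corrupt zlib-compressed ORC file should
// throw instead of hanging. File name and exception type are assumptions.
import static org.junit.jupiter.api.Assertions.assertThrows;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
import org.apache.orc.OrcFile;
import org.apache.orc.Reader;
import org.apache.orc.RecordReader;
import org.junit.jupiter.api.Test;

public class TestCorruptZlibRead {
  @Test
  public void readingCorruptFileFailsFast() throws Exception {
    Configuration conf = new Configuration();
    Reader reader = OrcFile.createReader(new Path("corrupt-zlib.orc"),
        OrcFile.readerOptions(conf));
    VectorizedRowBatch batch = reader.getSchema().createRowBatch();
    RecordReader rows = reader.rows();
    try {
      assertThrows(IOException.class, () -> {
        while (rows.nextBatch(batch)) {
          // drain all batches; before the fix this loop never returned
        }
      });
    } finally {
      rows.close();
    }
  }
}
```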

Was this patch authored or co-authored using generative AI tooling?

No

@cxzl25 cxzl25 marked this pull request as draft February 8, 2025 08:35
@github-actions github-actions bot added the JAVA label Feb 8, 2025
@cxzl25 cxzl25 force-pushed the zlib_infinite_loop branch from 68c88c9 to c02c82c on March 26, 2025 03:51
@cxzl25 cxzl25 changed the title from "Avoid zlib decompression infinite loop" to "ORC-1866: Avoid zlib decompression infinite loop" Mar 26, 2025
@cxzl25 cxzl25 marked this pull request as ready for review March 26, 2025 04:54
@dongjoon-hyun (Member) left a comment

+1, LGTM. Thank you, @cxzl25.

@dongjoon-hyun dongjoon-hyun added this to the 1.9.6 milestone Mar 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
### What changes were proposed in this pull request?
This PR aims to make ZlibCodec decompression fail fast on damaged files.

### Why are the changes needed?

This is a long-standing issue: the decompress method implemented by ZlibCodec may enter an infinite loop when it encounters certain corrupt files.

jstack
```java
"main" #1 [4611] prio=5 os_prio=31 cpu=55921.47ms elapsed=57.53s tid=0x0000000139014600 nid=4611 runnable  [0x000000016d9fa000]
   java.lang.Thread.State: RUNNABLE
        at java.util.zip.Inflater.inflateBytesBytes(java.base@21.0.5/Native Method)
        at java.util.zip.Inflater.inflate(java.base@21.0.5/Inflater.java:376)
        - locked <0x00000004367befc0> (a java.util.zip.Inflater$InflaterZStreamRef)
        at org.apache.orc.impl.ZlibCodec.decompress(ZlibCodec.java:168)
        at org.apache.orc.impl.InStream$CompressedStream.readHeader(InStream.java:521)
        at org.apache.orc.impl.InStream$CompressedStream.ensureUncompressed(InStream.java:548)
        at org.apache.orc.impl.InStream$CompressedStream.read(InStream.java:535)
        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.commonReadByteArrays(TreeReaderFactory.java:2052)
        at org.apache.orc.impl.TreeReaderFactory$BytesColumnVectorUtil.readOrcByteArrays(TreeReaderFactory.java:2071)
        at org.apache.orc.impl.TreeReaderFactory$StringDirectTreeReader.nextVector(TreeReaderFactory.java:2169)
        at org.apache.orc.impl.TreeReaderFactory$StringTreeReader.nextVector(TreeReaderFactory.java:2001)
        at org.apache.orc.impl.reader.tree.StructBatchReader.readBatchColumn(StructBatchReader.java:65)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatchForLevel(StructBatchReader.java:100)
        at org.apache.orc.impl.reader.tree.StructBatchReader.nextBatch(StructBatchReader.java:77)
        at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1432)
        at org.apache.orc.tools.PrintData.printJsonData(PrintData.java:208)
        at org.apache.orc.tools.PrintData.main(PrintData.java:288)
        at org.apache.orc.tools.Driver.main(Driver.java:120)
```

### How was this patch tested?
1. Local test
2. Added a unit test

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #2127 from cxzl25/zlib_infinite_loop.

Authored-by: sychen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8eaf92d)
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
dongjoon-hyun pushed a commit that referenced this pull request Mar 30, 2025
@dongjoon-hyun dongjoon-hyun modified the milestones: 1.9.6, 1.8.9 Mar 30, 2025
@dushyantk1509 commented Oct 8, 2025

@cxzl25 Could you please help me understand how a producer could create such corrupted files? How did you create the corrupted test ORC file?

Context: We have a Spark job that produced one such corrupted file, and I wanted to understand how that could happen. The Spark application uses the rdd.saveAsNewAPIHadoopFile API with org.apache.orc.mapreduce.OrcOutputFormat to write ORC files. The orc-core and orc-mapreduce version used is 1.6.10.

@cxzl25 (Contributor, Author) commented Oct 8, 2025

> Could you please help me understand how a producer could create such corrupted files?

When I encounter this problem, it is not necessarily an issue at the ORC level. For example, it may be data corruption caused by the HDFS EC storage policy, which then causes decompression to fail.

> How did you create the corrupted test ORC file?

I created the damaged test file by adjusting the buffer size when writing, based on a case I encountered in a production environment: the buffer size written at that time was incorrect due to a hardware failure, which made the file unreadable.
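
For readers who want to reproduce this, one generic way to manufacture a corrupt test file is sketched below: write a valid zlib-compressed ORC file first, then invert a run of bytes inside its data region so the compressed stream no longer inflates. The class name, path, offset, and length are illustrative assumptions, not how the file in this PR was actually produced.

```java
// Hypothetical helper: damage an ORC file by inverting bytes in its data
// region so a zlib-compressed stream no longer inflates. Offset and length
// are illustrative and depend on the file's actual layout.
import java.io.IOException;
import java.io.RandomAccessFile;

public final class CorruptOrcFile {
  public static void flipBytes(String path, long offset, int len) throws IOException {
    try (RandomAccessFile raf = new RandomAccessFile(path, "rw")) {
      for (int i = 0; i < len; i++) {
        raf.seek(offset + i);
        int b = raf.read();   // read one byte (advances the file pointer)
        raf.seek(offset + i); // move back to overwrite the same byte
        raf.write(b ^ 0xFF);  // invert it
      }
    }
  }

  public static void main(String[] args) throws IOException {
    // e.g. corrupt 16 bytes starting past the ORC magic and stream header
    flipBytes("test-zlib.orc", 64, 16);
  }
}
```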

@dushyantk1509

Thanks @cxzl25 for the quick response!

> For example, it may be data corruption caused by the HDFS EC storage policy, which then causes decompression to fail.

Are you aware of any other scenarios where this could happen? In our case, the Spark app wrote 4000 files but only one is corrupted. I tried reading the file and was able to read up to 350,208 records but got stuck after that. According to the file stats, it has 3M+ rows. IMO this looks like a hardware failure similar to your scenario. By the way, how did you confirm the hardware failure?

> I created the damaged test file by adjusting the buffer size when writing

Is it possible for you to share a code snippet showing how you adjusted the buffer size? Any code reference you might have?

@cxzl25 (Contributor, Author) commented Oct 8, 2025

> How did you confirm the hardware failure?

I used the dcdiag tool to diagnose it:

https://www.intel.com/content/www/us/en/support/articles/000098269/processors.html

> How did you adjust the buffer size? Any code reference you might have?

It's been a while, so I don't have the specific code, but I vaguely remember adjusting the buffer size when writing the PostScript during debugging:

```java
builder.setCompression(writeCompressionKind(codec.getKind()))
       .setCompressionBlockSize(unencryptedOptions.getBufferSize());
```

@dushyantk1509

> The buffer size written at that time was incorrect due to a hardware failure, which made the file unreadable.

What kind of hardware failure did you see? Can you explain it a little?

@cxzl25 (Contributor, Author) commented Oct 9, 2025

> What kind of hardware failure did you see? Can you explain it a little?

The problem at that time was that some data failed to be read. Checking the memory with memtester found no problem. Finally, dcdiag detected a possible problem with the CPU.

dcdiag output:

```
Intel(R) Data Center Diagnostic Tool Version v630 (1aa16c950f34b774738531af947f307f77456356)
Testing started. This takes about 45 minutes.
Test failed (#d88c8fa).
Test completed and an error was detected on the physical processor containing /sys/devices/system/cpu/cpu20 (family-model-stepping 06-55-04)
```
