Description
The config spark.sql.files.ignoreCorruptFiles can be used to skip corrupt files when reading files in SQL. Currently the config has two issues and doesn't work for Parquet (illustrative sketches of both follow the list below):
1. We only ignore corrupt files in FileScanRDD. In fact, those files are read as early as schema inference: for a corrupt file we can't read the schema, so the job fails before FileScanRDD's corrupt-file handling is ever reached. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
2. FileScanRDD assumes that files are only read once we start consuming the iterator. However, a file may be read before that, e.g. while the iterator is being constructed; in that case the ignoreCorruptFiles config doesn't work either.
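
A minimal repro sketch for issue 1 (the directory path and app name are hypothetical). Even with ignoreCorruptFiles enabled, reading a directory containing a Parquet file with a corrupt footer fails during schema inference, before FileScanRDD's handling runs:

{code:scala}
import org.apache.spark.sql.SparkSession

object IgnoreCorruptFilesRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignoreCorruptFilesRepro")
      .master("local[*]")
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    // /data/mixed/ is assumed to contain both valid Parquet files and one
    // with a corrupt footer. Expected: the corrupt file is skipped.
    // Observed: schema inference throws and fails the job.
    val df = spark.read.parquet("/data/mixed/")
    println(df.count())

    spark.stop()
  }
}
{code}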
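And a self-contained toy illustration of issue 2 (not Spark's actual FileScanRDD code): when the guard only wraps iterator consumption, an exception thrown while the file is opened escapes it entirely.

{code:scala}
object EagerReadDemo {
  // Simulates a reader that does eager work at construction time,
  // e.g. reading a Parquet footer, and throws for a corrupt file.
  def openFile(path: String): Iterator[String] = {
    if (path.contains("corrupt")) sys.error(s"corrupt footer: $path")
    Iterator(s"$path: row1", s"$path: row2")
  }

  def main(args: Array[String]): Unit = {
    val files = Seq("/data/ok.parquet", "/data/corrupt.parquet")
    val rows = files.iterator.flatMap { f =>
      val underlying = openFile(f) // throws here for the corrupt file...
      new Iterator[String] {
        def hasNext: Boolean =
          try underlying.hasNext
          catch { case _: RuntimeException => false } // ...so this guard never fires
        def next(): String = underlying.next()
      }
    }
    rows.foreach(println) // crashes despite the try/catch around consumption
  }
}
{code}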
Issue Links
- relates to SPARK-19885 The config ignoreCorruptFiles doesn't work for CSV (Resolved)