Description
The config spark.sql.files.ignoreCorruptFiles can be used to skip corrupt files when reading files in SQL. Currently the config has two issues and doesn't work for Parquet (illustrative sketches of both follow the list below):
1. We only ignore corrupt files in FileScanRDD. In fact, those files are read as early as schema inference: for a corrupt file we can't read the schema, so the job fails before FileScanRDD's corrupt-file handling is ever reached. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
2. FileScanRDD assumes that files are only read once we start consuming the iterator. However, a file may be read before that, e.g. while the iterator is being constructed; in that case the ignoreCorruptFiles config doesn't work either.
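
A minimal repro sketch for issue 1 (the directory path and app name are hypothetical). Even with ignoreCorruptFiles enabled, reading a directory containing a Parquet file with a corrupt footer fails during schema inference, before FileScanRDD's handling runs:

{code:scala}
import org.apache.spark.sql.SparkSession

object IgnoreCorruptFilesRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ignoreCorruptFilesRepro")
      .master("local[*]")
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    // /data/mixed/ is assumed to contain both valid Parquet files and one
    // with a corrupt footer. Expected: the corrupt file is skipped.
    // Observed: schema inference throws and fails the job.
    val df = spark.read.parquet("/data/mixed/")
    println(df.count())

    spark.stop()
  }
}
{code}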
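And a self-contained toy illustration of issue 2 (not Spark's actual FileScanRDD code): when the guard only wraps iterator consumption, an exception thrown while the file is opened escapes it entirely.

{code:scala}
object EagerReadDemo {
  // Simulates a reader that does eager work at construction time,
  // e.g. reading a Parquet footer, and throws for a corrupt file.
  def openFile(path: String): Iterator[String] = {
    if (path.contains("corrupt")) sys.error(s"corrupt footer: $path")
    Iterator(s"$path: row1", s"$path: row2")
  }

  def main(args: Array[String]): Unit = {
    val files = Seq("/data/ok.parquet", "/data/corrupt.parquet")
    val rows = files.iterator.flatMap { f =>
      val underlying = openFile(f) // throws here for the corrupt file...
      new Iterator[String] {
        def hasNext: Boolean =
          try underlying.hasNext
          catch { case _: RuntimeException => false } // ...so this guard never fires
        def next(): String = underlying.next()
      }
    }
    rows.foreach(println) // crashes despite the try/catch around consumption
  }
}
{code}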
Issue Links
- relates to SPARK-19885 The config ignoreCorruptFiles doesn't work for CSV (Resolved)