GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 #39528
mapleFU merged 8 commits into apache:main
Conversation
…o int32 Be defensive instead of writing invalid data.
Co-authored-by: Gang Wu <[email protected]>
```cpp
DataPageV1 over_compressed_limit(buffer, /*num_values=*/100, Encoding::BIT_PACKED,
                                 Encoding::BIT_PACKED, Encoding::BIT_PACKED,
                                 /*uncompressed_size=*/100);
EXPECT_THROW(pager->WriteDataPage(over_compressed_limit), ParquetException);
```
Should we use EXPECT_THAT to guarantee "overflows to INT32_MAX"?
cpp/src/parquet/column_writer.cc (Outdated)

```cpp
const uint8_t* output_data_buffer = compressed_data->data();
if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
  throw ParquetException("Compressed page size overflows to INT32_MAX.");
```
Should we note the page type?

(Also, can we print the page size?)
@emkornfield Could we move forward? I'll approve once the comments are fixed.

Yes, will try to update this week.

@mapleFU apologies for the delay, I think I've addressed your feedback.
```cpp
const uint8_t* output_data_buffer = compressed_data->data();
if (compressed_data->size() > std::numeric_limits<int32_t>::max()) {
  throw ParquetException(
```
Co-authored-by: Antoine Pitrou <[email protected]>
pitrou left a comment:

LGTM except for one last suggestion
```cpp
            ThrowsMessage<ParquetException>(HasSubstr("overflows INT32_MAX")));
DictionaryPage dictionary_over_compressed_limit(buffer, /*num_values=*/100,
                                                Encoding::PLAIN);
EXPECT_THROW(pager->WriteDictionaryPage(dictionary_over_compressed_limit),
```
Sorry, didn't spot this, but perhaps also use EXPECT_THAT to check the exception message here?
After merging your PR, Conbench analyzed the 5 benchmarking runs that have been run so far on merge-commit 5d7f661. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 51 possible false positives for unstable benchmarks that are known to sometimes produce them.
GH-39527: [C++][Parquet] Validate page sizes before truncating to int32 (#39528)

Lead-authored-by: emkornfield <[email protected]>
Co-authored-by: Micah Kornfield <[email protected]>
Co-authored-by: mwish <[email protected]>
Co-authored-by: Antoine Pitrou <[email protected]>
Co-authored-by: Gang Wu <[email protected]>
Signed-off-by: mwish <[email protected]>
Be defensive instead of writing invalid data.
Rationale for this change
Users can provide this API with pages that are too large to write validly, and we silently truncate their lengths before writing.
What changes are included in this PR?
Add validations and throw an exception if sizes are too large (this was previously checked only if page indexes are being built).
Are these changes tested?
Unit tested
Are there any user-facing changes?
This might start raising exceptions instead of writing out invalid Parquet files.
Closes #39527
This PR contains a "Critical Fix".