Add ParquetError::NeedMoreData mark ParquetError as non_exhaustive#6630
Add ParquetError::NeedMoreData mark ParquetError as non_exhaustive#6630alamb merged 5 commits intoapache:masterfrom
ParquetError::NeedMoreData mark ParquetError as non_exhaustive#6630Conversation
| "file size of {} is less than footer + metadata {}", | ||
| file_size, | ||
| length + 8 | ||
| length + FOOTER_SIZE |
| /// An external error variant | ||
| External(Box<dyn Error + Send + Sync>), | ||
| /// Returned when a function needs more data to complete properly. | ||
| NeedMoreData(usize), |
There was a problem hiding this comment.
Can we also mark ParquetError non exhaustive at the same time (so future additions won't be breaking API changes)?
Like this:
arrow-rs/object_store/src/lib.rs
Line 1228 in 56525ef
There was a problem hiding this comment.
@alamb If add the #[non_exhaustive] to ParquetError, when the downstream matches ParquetError, we have to add the _, https://doc.rust-lang.org/reference/attributes/type_system.html#r-attributes.type-system.non_exhaustive.match
such as
match parquet_error {
... => {},
_ => {} // We have to add the `_`, https://doc.rust-lang.org/reference/attributes/type_system.html#r-attributes.type-system.non_exhaustive.match
}
I'm worried that if we add a new error type into ParquetError, the downstream will be unsensible due to _.
parquet/src/errors.rs
Outdated
| IndexOutOfBound(usize, usize), | ||
| /// An external error variant | ||
| External(Box<dyn Error + Send + Sync>), | ||
| /// Returned when a function needs more data to complete properly. |
There was a problem hiding this comment.
Could you also add some additional comments on how to interpret the argument. Is it the number of bytes that is needed?
Maybe we can return a Range or something more specific about what is needed if it isn't always clear
There was a problem hiding this comment.
In an earlier POC I did have two separate errors, one with a range, but we thought that a bit redundant at the time. I suppose we could add an Option<Range<usize>> to NeedMoreData to handle the case of insufficient buffer to read the page indexes. There we know the exact range we need. Right now the needed value is set to the number of bytes read from the tail of the file. If try_parse fails when reading the page indexes, then we really only need the page index range...there's no need to read the footer all over again. But then the error handling becomes more complex...if the range in NeedMoreData is None, then read the required bytes from the tail and call try_parse again; if range is not None, then read the range and call read_page_indexes. Ofc a user is free to ignore the range if they want to keep the error handling simple at the cost of more I/O (but that would likely be negligible since those bytes should still be in buffer cache).
For now I'll try to improve the documentation. Let me know if you want me to add the optional Range to NeedMoreData.
| /// Ok(_) => break, | ||
| /// Err(ParquetError::NeedMoreData(needed)) => { | ||
| /// // Read the needed number of bytes from the end of the file | ||
| /// bytes = get_bytes(&file, len - needed..len); |
ParquetError::NeedMoreDataParquetError::NeedMoreData mark ParquetError as non_exhaustive
|
Thank you for your help and patience @etseidl |
Which issue does this PR close?
Part of #6447.
Rationale for this change
ParquetMetaDataReadercurrently usesParquetError::IndexOutOfBoundto indicate the need for more data. This is not the intended use forIndexOutOfBound.What changes are included in this PR?
Adds a new enum value
NeedMoreData.Are there any user-facing changes?
Yes, this changes the
ParquetErrorAPI as well asParquetMetaDataReader.