Fix reading of dictionary encoded pages with null values (#1111) #1130
Conversation
Codecov Report
@@ Coverage Diff @@
## master #1130 +/- ##
==========================================
+ Coverage 82.33% 82.38% +0.04%
==========================================
Files 169 169
Lines 49773 50247 +474
==========================================
+ Hits 40981 41395 +414
- Misses 8792 8852 +60
Continue to review full report at Codecov.
Thank you @yordan-pavlov -- I plan to review / test this patch later today
FWIW I tested with datafusion on the case where we initially observed this issue and this patch fixes the issue: apache/datafusion#1441 (comment)
alamb
left a comment
Thank you so much @yordan-pavlov -- as I mentioned, I verified that this change fixes the issues we had been seeing in DataFusion.
The basic idea of this PR looks great and makes sense to me, but I am not comfortable enough with the implementation of the parquet reader to fully understand it.
I wonder if @tustvold, @sunchao or @nevi-me could have a look?
Also, I wonder if there are any benchmark results to demonstrate the performance change related to this PR?
    num_values: usize,
) -> Result<usize> {
    let mut def_level_decoder = LevelValueDecoder::new(level_decoder);
    let def_level_array =
does this mean an entirely new array of definition levels is created for each column? Might this result in a non-trivial amount of extra runtime overhead?
Yes - for this fix, the definition levels are decoded a second time, and an i16 array plus a boolean array are created to count the non-null values. These arrays only live for a very short time, and the negative effect on performance is surprisingly small (3% to 8% in my benchmark run); see here: #1111 (comment). Even after this change the ArrowArrayReader is still often several times faster for decoding strings than the old ArrayReader; that hasn't changed much.
It's probably possible to make this more efficient, but it would require more thinking and more time for a bigger change.
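A minimal sketch of the counting approach described above (illustrative only -- the function name and kernel choice are my assumptions, not the PR's exact code): the definition levels are materialized as an Int16Array, compared against the column's max definition level to produce a BooleanArray, and the true slots are counted as the non-null values.

```rust
use arrow::array::Int16Array;
use arrow::compute::eq_scalar;

/// Count how many slots hold an actual value: a definition level equal to
/// the column's max definition level means the value is non-null.
fn count_non_null_values(def_levels: &Int16Array, max_def_level: i16) -> usize {
    // The comparison kernel yields a BooleanArray: `true` for non-null slots.
    let mask = eq_scalar(def_levels, max_def_level).expect("eq_scalar kernel");
    // Count the `true` entries; these are the values the decoder must read.
    mask.iter().filter(|v| *v == Some(true)).count()
}
```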
makes sense -- thank you for running the numbers
Thank you for this 🎉. I will take a look today if I have time
tustvold
left a comment
This change makes sense to me, and seems like a pragmatic quick fix for the bug 👍
    offset,
    def_levels_byte_len,
);
let value_count = Self::count_def_level_values(
You might be able to do num_values - num_nulls here as the DataPageV2 has this information. Unfortunately very few implementations in practice seem to produce such pages, and tbh I'm not entirely sure if num_nulls is what I think it is...
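A minimal sketch of this suggestion (an assumption on my part, not code from this PR; the field names follow the parquet crate's Page::DataPageV2 variant): for V2 data pages the header already records the null count, so the non-null value count could be derived without decoding the definition levels at all.

```rust
use parquet::column::page::Page;

/// Hypothetical helper: derive the non-null value count directly from the
/// page header where possible. DataPageV2 records `num_nulls`; for other
/// page kinds the caller still has to count definition levels.
fn header_value_count(page: &Page) -> Option<usize> {
    match page {
        Page::DataPageV2 { num_values, num_nulls, .. } => {
            Some((num_values - num_nulls) as usize)
        }
        _ => None,
    }
}
```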
Thanks again @yordan-pavlov
Which issue does this PR close?
Closes #1111.
Rationale for this change
As explained in #1111, `RleDecoder`, as used in `VariableLenDictionaryDecoder` as part of the implementation of `ArrowArrayReader`, incorrectly returns more keys than are actually available; at the same time, when the page contains NULLs, `VariableLenDictionaryDecoder` also requests more keys than are available, because `num_values` is inclusive of NULLs. This then results in incorrectly decoding a dictionary-encoded page which also contains NULLs, and returning more values than necessary.
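To make the off-by-NULLs problem concrete, here is a small illustrative example (mine, not from the PR): for an optional column, a page holding ["a", NULL, "b", NULL] reports `num_values` = 4 in its header and has definition levels [1, 0, 1, 0], but only two dictionary keys exist in the encoded data, so asking the key decoder for `num_values` keys over-reads by two.

```rust
// Illustration: count how many keys are actually encoded in a page whose
// header reports num_values inclusive of NULLs.
fn main() {
    let def_levels: Vec<i16> = vec![1, 0, 1, 0]; // 1 = value present, 0 = NULL
    let max_def_level = 1;
    let num_values = def_levels.len(); // what the page header reports: 4
    let encoded_keys = def_levels
        .iter()
        .filter(|&&l| l == max_def_level)
        .count(); // only 2 keys are actually in the encoded data
    assert_eq!(num_values, 4);
    assert_eq!(encoded_keys, 2);
}
```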
What changes are included in this PR?
This PR contains:
- a fix: the number of non-null values is counted and passed (instead of `num_values` from the data page) when creating the value decoder, so that it knows how many values are actually available; this count is then used in existing code in `VariableLenDictionaryDecoder` to limit how many keys are requested from the nested `RleDecoder`
- a new test, `test_arrow_array_reader_dict_enc_string`, for `ArrowArrayReader`
- a new test, `test_complex_array_reader_dict_enc_string`, for `ArrayReader`

Are there any user-facing changes?
No
@alamb @tustvold