ensure null-counts are written for all-null columns #307

sunchao merged 1 commit into apache:master from
Conversation
Comment on the new `#[test]` block added in this PR:
I am not entirely happy with this test since it requires an Arrow data structure to actually test a Parquet part. It would probably be nice to have a more "pure" test as well; however, I wasn't able to formulate an easy one.
I would say this looks more like an "integration" test rather than a "unit" test (in the sense that you connect some higher-level APIs and ensure the output is reasonable). I personally don't see any problems with this approach.

To remove the dependency on Arrow, you would probably have to use the `SerializedFileWriter` API directly: https://docs.rs/parquet/4.0.0/parquet/file/writer/trait.FileWriter.html

But I think that would end up being quite a bit more code.
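For reference, a rough sketch of what that `SerializedFileWriter`-based approach could look like against the parquet 4.0.0 API linked above (untested; the schema string, output file name, and the four-NULL batch are illustrative assumptions, not from this PR):

```rust
use std::{fs::File, sync::Arc};

use parquet::{
    column::writer::ColumnWriter,
    file::{
        properties::WriterProperties,
        writer::{FileWriter, SerializedFileWriter},
    },
    schema::parser::parse_message_type,
};

fn write_all_null_column() -> parquet::errors::Result<()> {
    // single OPTIONAL column so we can write NULLs
    let schema = Arc::new(parse_message_type("message schema { OPTIONAL INT64 v; }")?);
    let props = Arc::new(WriterProperties::builder().build());
    let file = File::create("all_null.parquet")?;

    let mut writer = SerializedFileWriter::new(file, schema, props)?;
    let mut row_group = writer.next_row_group()?;
    while let Some(mut col_writer) = row_group.next_column()? {
        if let ColumnWriter::Int64ColumnWriter(ref mut typed) = col_writer {
            // no values, four definition levels of 0 => four NULLs
            typed.write_batch(&[], Some(&[0, 0, 0, 0]), None)?;
        }
        row_group.close_column(col_writer)?;
    }
    writer.close_row_group(row_group)?;
    writer.close()?;
    Ok(())
}
```

The schema string and the explicit definition-level handling (level 0 marks a NULL for the OPTIONAL field) are where the "quite a bit more code" comes from.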
I am pretty sure the integration test failure is unrelated.

I have restarted the integration tests on this PR.
alamb left a comment:
This looks good to me -- thank you @crepererum
Comment on this hunk:

```rust
let max_rep_level = self.descr.max_rep_level();

// always update column NULL count, no matter if page stats are used
self.num_column_nulls += self.num_page_nulls;
```
Hmm, just curious why we don't need to update `self.num_page_nulls` accordingly; it is updated as follows:

```rust
for &level in levels {
    if level == self.descr.max_def_level() {
        values_to_write += 1;
    } else if calculate_page_stats {
        self.num_page_nulls += 1
    }
}
```
There are two variables here: `num_page_nulls` (which IS updated) and `num_column_nulls` (which, prior to this PR, was only updated if min/max values were set, which cannot happen for all-NULL columns). The latter one (`num_column_nulls`) is the one used to write the stats.
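A simplified sketch of that two-counter flow (an illustration only, not the actual `ColumnWriter` code; the names mirror the real fields, but the real writer carries much more state):

```rust
// `num_page_nulls` accumulates NULLs for the page currently being built;
// `num_column_nulls` is the running total that ends up in the column
// chunk statistics.
struct Counters {
    num_page_nulls: u64,
    num_column_nulls: u64,
}

impl Counters {
    /// Called once per NULL value while buffering a page.
    fn record_null(&mut self) {
        self.num_page_nulls += 1;
    }

    /// Called when a page is finalized (`add_data_page`). The fix in this
    /// PR: the page count is always folded into the column count, even
    /// when the page has no min/max values (the all-NULL case).
    fn add_data_page(&mut self) {
        self.num_column_nulls += self.num_page_nulls;
        self.num_page_nulls = 0;
    }
}

fn main() {
    let mut c = Counters { num_page_nulls: 0, num_column_nulls: 0 };
    for _ in 0..4 {
        c.record_null();
    }
    c.add_data_page();
    assert_eq!(c.num_column_nulls, 4); // an all-NULL page is still counted
}
```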
> which IS updated

(Sorry for the slow response.) This is the part I don't quite understand. `add_data_page` is called in two places: `flush_data_pages` and `write_mini_batch`. I can see this handles the first case, but I don't see how it applies to the second case too.
I'm not sure I follow:

- Is your point that `self.num_column_nulls += self.num_page_nulls;` should only be triggered during `flush_data_pages`, not during `write_mini_batch`?
- Or is the question under which condition `write_mini_batch` calls `add_data_page`? For that: as far as I can tell, it is NOT called during the test, at least because `should_add_data_page` decides not to do so (likely because of the small test case).
I meant the second: `write_mini_batch` will call `add_data_page` if the encoded data page size exceeds the limit. In this case it seems `num_page_nulls` and subsequently `num_column_nulls` will not be set.
I've just checked:

```rust
#[test]
fn statistics_null_counts_big_mixed() {
    // check that null-count statistics work for larger data sizes
    let len = 1_000_000u64;
    let data: Vec<_> = (0..len)
        .map(|x| if x % 2 == 0 { Some(x) } else { None })
        .collect();
    let values = Arc::new(UInt64Array::from(data));
    let file = one_column_roundtrip("null_counts_big_mixed", values, true);

    // check statistics are valid
    let reader = SerializedFileReader::new(file).unwrap();
    let metadata = reader.metadata();
    assert_eq!(metadata.num_row_groups(), 1);
    let row_group = metadata.row_group(0);
    assert_eq!(row_group.num_columns(), 1);
    let column = row_group.column(0);
    let stats = column.statistics().unwrap();
    assert_eq!(stats.null_count(), len / 2);
}
```

This triggers `add_data_page` via the `write_mini_batch` code path (with a small patch to the roundtrip test code to set a higher batch size). That works (with this PR) and both values are set. TBH that's what I expected, since I don't really understand your reasoning behind "In this case it seems num_page_nulls and subsequently num_column_nulls will not be set."
Oops, you're right. I misread `calculate_page_stats` in the code. My apologies.
Fixes #306. Co-authored-by: Marco Neumann <[email protected]>
Which issue does this PR close?
Fixes #306.
Rationale for this change
All-NULL columns are somewhat special because they don't have min/max values. However, this shouldn't stop us from correctly calculating the null counts.
What changes are included in this PR?
Correctly populate null-counts from pages to columns.
Are there any user-facing changes?
Yes, Parquet files now have a correct null count.
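As a hedged illustration of that user-facing change (this is not part of the PR's test suite; the file path and schema are made up, and the Arrow writer from this crate is assumed):

```rust
use std::{fs::File, sync::Arc};

use arrow::{
    array::{ArrayRef, UInt64Array},
    datatypes::{DataType, Field, Schema},
    record_batch::RecordBatch,
};
use parquet::{
    arrow::ArrowWriter,
    file::reader::{FileReader, SerializedFileReader},
};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let path = "/tmp/all_null_demo.parquet";

    // one all-NULL UInt64 column
    let schema = Arc::new(Schema::new(vec![Field::new("v", DataType::UInt64, true)]));
    let values: ArrayRef = Arc::new(UInt64Array::from(vec![None::<u64>; 4]));
    let batch = RecordBatch::try_new(schema.clone(), vec![values])?;

    let mut writer = ArrowWriter::try_new(File::create(path)?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;

    // with this PR, the column chunk statistics report all 4 values as NULL
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let meta = reader.metadata();
    let stats = meta
        .row_group(0)
        .column(0)
        .statistics()
        .expect("statistics should be present");
    assert_eq!(stats.null_count(), 4);
    Ok(())
}
```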