Respect max rowgroup size in Arrow writer#381
Conversation
CC @crepererum @houqp (we've spoken about this before)
Codecov Report — coverage diff:

```
@@            Coverage Diff             @@
##           master     #381      +/-   ##
==========================================
+ Coverage   82.60%   82.62%   +0.02%
==========================================
  Files         162      162
  Lines       44199    44275      +76
==========================================
+ Hits        36509    36583      +74
- Misses       7690     7692       +2
```
```rust
// We currently do not have a way of slicing nested arrays, thus we
// track this manually.
let num_rows = batch.num_rows();
let batches = (num_rows + self.max_row_group_size - 1) / self.max_row_group_size;
```
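The ceiling division above determines how many row groups a single batch yields. A stdlib-only sketch of the resulting group sizes (the function name `row_group_sizes` is illustrative, not the crate's API):

```rust
// Illustrative sketch: how one incoming batch is split into row groups
// using the same ceiling division as the writer snippet above.
fn row_group_sizes(num_rows: usize, max_row_group_size: usize) -> Vec<usize> {
    // Guard against division by zero (see the review question below).
    assert!(max_row_group_size > 0, "max_row_group_size must be > 0");
    let batches = (num_rows + max_row_group_size - 1) / max_row_group_size;
    (0..batches)
        .map(|i| {
            let offset = i * max_row_group_size;
            // The last group may be smaller than the maximum.
            usize::min(max_row_group_size, num_rows - offset)
        })
        .collect()
}

fn main() {
    // A 25-row batch with max_row_group_size = 10 produces 3 groups.
    println!("{:?}", row_group_sizes(25, 10)); // [10, 10, 5]
}
```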
Do we ensure that max_row_group_size > 0?
I can check if there are ways of bypassing that
I think that's sufficient :)
Am I correct in thinking this code could result in non-uniform row group sizes?
like if we had max_row_group_size=10 and wrote a RecordBatch with 25 rows, would we get row groups like
(row_group 1: 10 rows)
(row_group 2: 10 rows)
(row_group 3: 5 rows)
(row_group 4: 10 rows)
(row_group 5: 10 rows)
(row_group 6: 5 rows)
?
If so I think this is fine (in that it is technically respecting max_row_group_size), but it might be unexpected for a user, who might expect something more like
(row_group 1: 10 rows)
(row_group 2: 10 rows)
(row_group 3: 10 rows)
(row_group 4: 10 rows)
(row_group 5: 10 rows)
Perhaps it is worth a doc comment?
Yes, that's a fair observation. I think it's a bit tricky because we would need to get the other 5 records from the next batch.
If we passed all batches at once, we would be able to segment them into equal-sized row groups.
This is something we can think about, as I think it's a valid expectation from a user.
I can check if we are able to keep row groups open, so that when the next batch comes in, we take its 5 records
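The carry-over idea mentioned here could look roughly like this (a hypothetical sketch, not part of this PR): buffer the remainder between batches so that all row groups except the final one stay uniform.

```rust
// Hypothetical sketch: keep a row group "open" across batches by buffering
// the remainder, so groups stay uniform until the final flush.
struct UniformSplitter {
    max: usize,     // target row group size
    pending: usize, // rows carried over from previous batches
}

impl UniformSplitter {
    // Returns the sizes of the full row groups that can be emitted
    // once this batch's rows are added to the buffered remainder.
    fn push(&mut self, num_rows: usize) -> Vec<usize> {
        self.pending += num_rows;
        let mut groups = Vec::new();
        while self.pending >= self.max {
            groups.push(self.max);
            self.pending -= self.max;
        }
        groups
    }

    // Flush any leftover rows as a final, possibly smaller group.
    fn finish(&mut self) -> Option<usize> {
        let rest = std::mem::take(&mut self.pending);
        (rest > 0).then(|| rest)
    }
}

fn main() {
    let mut s = UniformSplitter { max: 10, pending: 0 };
    println!("{:?}", s.push(25));  // [10, 10] — 5 rows buffered
    println!("{:?}", s.push(17));  // [10, 10] — 2 rows buffered
    println!("{:?}", s.finish()); // Some(2)
}
```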
Maybe just adding a note in the docstring would be sufficient at this time
Added a note, thanks :)
parquet/src/arrow/arrow_writer.rs (outdated diff)
```rust
let values = Arc::new(TimestampSecondArray::from_vec(raw_values, None));

one_column_roundtrip("timestamp_second_single_column", values, false);
one_column_roundtrip(
```
it might be worth at least one test that divides into "more than 2" batches as well.
I've changed the SMALL_SIZE to an odd number, and then changed the batch sizes of some tests.
```rust
array_mask: vec![true, true], // both lists defined
max_definition: 0,
level_type: LevelType::Root,
offset: 0,
```
It might be cool to add a test here where the offset was something other than 0 -- all the examples I see have offset: 0.
I've added a test to list_single_column with an offset.
I've addressed feedback, PTAL @alamb
```rust
level_type: LevelType::Root,
offset: 0,
length: 5,
offset: 2,
```
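To see why a nonzero offset matters for list levels, here is a stdlib-only sketch (the helper `child_range` is hypothetical) of how a list array's offsets buffer maps a parent slice like `offset: 2, length: 5` to a range of child values:

```rust
// Hypothetical helper: given a list array's offsets buffer, compute which
// child values a slice of the parent (offset, length) covers. This is the
// kind of bookkeeping LevelInfo has to do manually for nested arrays.
fn child_range(list_offsets: &[i32], offset: usize, length: usize) -> (i32, i32) {
    // Lists [offset, offset + length) cover child indices
    // [list_offsets[offset], list_offsets[offset + length]).
    (list_offsets[offset], list_offsets[offset + length])
}

fn main() {
    // Offsets for 7 lists into a 15-element child values array.
    let offsets = [0, 2, 5, 5, 9, 10, 13, 15];
    // Slicing lists 2..7 (offset = 2, length = 5) skips the first
    // 5 child values entirely.
    let (start, end) = child_range(&offsets, 2, 5);
    println!("child values {}..{}", start, end); // child values 5..15
}
```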
I could not automatically cherry pick this to active_release (what will become …). Edit: I will try again once I have cherry-picked #307.
* Respect max rowgroup size in Arrow writer
* simplify while loop
* address review feedback
Back ported in #430
* Respect max rowgroup size in Arrow writer
* simplify while loop
* address review feedback

Co-authored-by: Wakahisa <[email protected]>
Which issue does this PR close?
Closes #257.
Rationale for this change
Parquet splits batches into row groups, whose size is normally determined by a `max_row_group_size` setting. The Arrow writer could not respect this setting because we could not slice into structs and arrays correctly.
The issue is that when using `array.slice(offset: usize, len: usize)`, we don't propagate and calculate the slice of child data, leading to only the top-level data being sliced.

What changes are included in this PR?
We use the `LevelInfo` struct to keep track of its array's offset and length. This allows us to track nested arrays' offsets and calculate the correct list offsets and lengths.

We then use `arrow::array::slice` to perform zero-copy slices of a batch, to limit the row group size correctly.

I have changed all writer tests to use a max row group size, ensuring that we aren't introducing bugs when slicing.
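The per-batch slicing described above can be sketched as follows (stdlib-only; `split_into_slices` is an illustrative name, and in the writer each `(offset, len)` pair would feed the batch's slice call):

```rust
// Sketch: compute the (offset, len) arguments for the zero-copy slices
// that bound each row group at max_row_group_size rows.
fn split_into_slices(num_rows: usize, max_row_group_size: usize) -> Vec<(usize, usize)> {
    let mut slices = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        let len = usize::min(max_row_group_size, num_rows - offset);
        // The writer would slice the batch here and write one row group.
        slices.push((offset, len));
        offset += len;
    }
    slices
}

fn main() {
    // A 25-row batch bounded at 10 rows per group.
    println!("{:?}", split_into_slices(25, 10)); // [(0, 10), (10, 10), (20, 5)]
}
```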
Note that this is related to #225, but I don't think it quite covers all its use-cases.
If we have a sliced RecordBatch per #343, we would need to account for its individual array offsets, as there is never a guarantee that a record batch has all child arrays starting from the same offset.
Are there any user-facing changes?
No. All changes are crate-internal.