Fixed parquet path partitioning when only selecting partitioned columns #2000

Merged
alamb merged 14 commits into apache:master from pjmore:fix-parquet-hive-partitioning
Apr 4, 2022

Conversation

@pjmore (Contributor) commented Mar 12, 2022

Which issue does this PR close?

Partially closes #1999.

Rationale for this change

Fixes the behaviour when querying only partition columns for the Parquet file format. Such a query reads no columns from the Parquet files themselves, so the number of rows to produce for the partition columns must come from file metadata.

What changes are included in this PR?

Use row-group-level metadata to return the correct number of rows for the partition columns.
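
For context, a minimal sketch of how a Parquet footer's total row count can be read with the parquet crate. This is not the PR's code: the function name and the local-file path handling are illustrative only (the PR reads through an ObjectStore).

    use std::fs::File;
    use parquet::file::reader::{FileReader, SerializedFileReader};

    // Hypothetical helper: the footer metadata alone carries the total
    // row count, so no column data needs to be decoded.
    fn parquet_row_count(path: &str) -> parquet::errors::Result<i64> {
        let reader = SerializedFileReader::new(File::open(path)?)?;
        Ok(reader.metadata().file_metadata().num_rows())
    }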

Are there any user-facing changes?

No

@xudong963 added the bug (Something isn't working) label on Mar 17, 2022
@yjshen (Member) left a comment

LGTM.

@alamb (Contributor) commented Mar 20, 2022

cc @rdettai

@alamb (Contributor) commented Mar 23, 2022

I think this PR is waiting on responses to @rdettai's review comments.

Comment on lines 460 to 499
for partitioned_file in partition {
    let object_reader =
        object_store.file_reader(partitioned_file.file_meta.sized_file.clone())?;
    let file_reader = SerializedFileReader::new(ChunkObjectReader(object_reader))?;
    let mut file_rows: usize = file_reader
        .metadata()
        .file_metadata()
        .num_rows()
        .try_into()
        .expect("Row count should always be greater than or equal to 0");
    // Clamp this file's row count to whatever the limit still allows.
    let remaining_rows = limit.unwrap_or(usize::MAX);
    if file_rows >= remaining_rows {
        file_rows = remaining_rows;
        limit = Some(0);
    } else if let Some(remaining_limit) = &mut limit {
        *remaining_limit -= file_rows;
    }

    // Emit batches that carry only the projected partition columns.
    while file_rows > batch_size {
        send_result(
            &response_tx,
            partition_column_projector
                .project_empty(batch_size, &partitioned_file.partition_values),
        )?;
        file_rows -= batch_size;
    }
    // Residual rows smaller than a full batch.
    if file_rows != 0 {
        send_result(
            &response_tx,
            partition_column_projector
                .project_empty(file_rows, &partitioned_file.partition_values),
        )?;
    }

    if limit == Some(0) {
        break;
    }
}
Ok(())
}

Contributor

I still feel this could be simplified and made more readable by using more iterators:

  • iterate over the files
  • map them to their size
  • map each size to an iterator that repeats the batch size file_rows/batch_size times + residual
  • flat map the whole thing
  • apply limit with take(limit)
  • for_each(send)
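
A rough sketch of the suggested pipeline, using hypothetical plain row counts in place of file readers and leaving out the error handling that becomes the sticking point below. The function name batch_sizes is made up, and since the limit counts rows rather than batches, a scan stands in for the plain take(limit):

    // Hypothetical sketch: file row counts are flat-mapped into
    // per-batch sizes, then capped at `limit` total rows.
    fn batch_sizes(file_rows: &[usize], batch_size: usize, limit: usize) -> Vec<usize> {
        file_rows
            .iter()
            .flat_map(|&rows| {
                let full = std::iter::repeat(batch_size).take(rows / batch_size);
                let residual = std::iter::once(rows % batch_size).filter(|r| *r != 0);
                full.chain(residual)
            })
            .scan(limit, |remaining, n| {
                let take = n.min(*remaining);
                *remaining -= take;
                (take != 0).then_some(take)
            })
            .collect()
    }

For example, batch_sizes(&[10, 5], 4, 12) yields [4, 4, 2, 2]: three batches from the first file, then the second file truncated by the limit.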

@pjmore (Contributor, PR author)

I couldn't find a good way to implement what you suggested. The error handling when opening the file was the main issue I ran into: I couldn't figure out another way to short-circuit both when the limit was met and when any error occurred. If you're okay with scanning all of the partition files even on an error, I'm okay with it; I just figured that for remote object stores that might be a bad idea.

    let mut res: Result<()> = Ok(());
    let mut batch_size_partition_iter = partition
        .iter()
        .map(|partitioned_file| {
            let mut num_rows: usize = match object_store
                .file_reader(partitioned_file.file_meta.sized_file.clone())
            {
                Ok(object_reader) => {
                    match SerializedFileReader::new(ChunkObjectReader(object_reader)) {
                        Ok(file_reader) => file_reader
                            .metadata()
                            .file_metadata()
                            .num_rows()
                            .try_into()
                            .expect("Row count should always be greater than or equal to 0 and less than usize::MAX"),
                        Err(e) => {
                            res = Err(e.into());
                            0
                        }
                    }
                }
                Err(e) => {
                    res = Err(e);
                    0
                }
            };
            num_rows = limit.min(num_rows);
            limit -= num_rows;
            (num_rows, partitioned_file.partition_values.as_slice())
        })
        // A zero row count means either an error or an exhausted limit,
        // so stop scanning further files in both cases.
        .take_while(|(num_rows, _)| *num_rows != 0)
        .flat_map(|(num_rows, partition_values)| {
            BatchSizeIter::new(num_rows, batch_size).zip(std::iter::repeat(partition_values))
        });
    Iterator::try_for_each(&mut batch_size_partition_iter, |(batch_size, partition_values)| {
        send_result(
            &response_tx,
            partition_column_projector.project_empty(batch_size, partition_values),
        )
    })?;
    res?;
    Ok(())
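
The BatchSizeIter helper referenced above does not appear in the thread; a minimal hypothetical version, yielding full batches and then the residual, might look like:

    // Hypothetical helper: yields `batch_size` repeatedly, then the
    // residual, until `num_rows` rows have been handed out.
    struct BatchSizeIter {
        remaining: usize,
        batch_size: usize,
    }

    impl BatchSizeIter {
        fn new(num_rows: usize, batch_size: usize) -> Self {
            Self { remaining: num_rows, batch_size }
        }
    }

    impl Iterator for BatchSizeIter {
        type Item = usize;

        fn next(&mut self) -> Option<usize> {
            if self.remaining == 0 {
                return None;
            }
            let n = self.remaining.min(self.batch_size);
            self.remaining -= n;
            Some(n)
        }
    }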

Contributor

Right, error management in iterators can quickly become annoying! Then I think the version with loop is fine for now.

@alamb (Contributor) left a comment

@rdettai / @tustvold would you like to review this PR again?

@rdettai (Contributor) left a comment

I think the amount of repetition in read_partition_no_file_columns has now reached a very satisfying level. @alamb do you agree?

@alamb (Contributor) left a comment

Looks good to me. Thank you @pjmore and @rdettai

@alamb (Contributor) commented Apr 3, 2022

Looks like it just needs some updating to resolve conflicts. @pjmore, I am happy to do so; let me know if you would like me to.

@pjmore (Contributor, PR author) commented Apr 3, 2022

@alamb I had some extra test cases to add for the limit logic, so I fixed the conflicts at the same time. Should be good to go now!

@alamb (Contributor) commented Apr 4, 2022

Thanks @pjmore -- epic work 👍

@tustvold (Contributor)

I've created apache/arrow-rs#1537 to track pushing this functionality upstream, as I think it is generally useful. I will try to bash it out if I have some spare cycles.
