fix: Support partition values in feature branch comet-parquet-exec#1106

viirya merged 5 commits into apache:comet-parquet-exec
Conversation
val dataSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val partitionSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.partitionSchema)
#1103 discusses how the schemas have already lost necessary information at this point. Should we construct a new partition schema from the true Parquet schema rather than the partitionSchema that may have lost/converted type information already?
This is copied from existing code.
Actually, I can just convert the Spark schema to Arrow types in the JVM and serialize it to the native side. I did a similar thing in the shuffle writer. Then we won't lose any information.
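A toy sketch (not Comet code) of the idea above: converting a logical schema down to physical Parquet types can collapse distinct Spark types into one, while serializing the full logical schema (as with an Arrow schema sent from the JVM to the native side) round-trips without loss. The type names and mapping below are illustrative, not Spark's actual conversion rules.

```python
import json

# Hypothetical logical schema: two distinct Spark timestamp variants.
spark_schema = [("ts", "timestamp_ntz"), ("ts2", "timestamp")]

# Lossy path: both logical timestamp types map to the same physical type,
# so the distinction cannot be recovered afterwards.
to_physical = {"timestamp_ntz": "int64", "timestamp": "int64"}
physical = [(name, to_physical[t]) for name, t in spark_schema]
assert physical[0][1] == physical[1][1]  # distinction lost

# Lossless path: serialize the logical schema itself and restore it intact.
payload = json.dumps(spark_schema)
restored = [tuple(f) for f in json.loads(payload)]
assert restored == spark_schema  # nothing lost
```

JSON stands in here for the Arrow IPC serialization the comment describes; the point is only that shipping the full logical schema avoids the one-way mapping.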
int64 start = 2;
int64 length = 3;
int64 file_size = 4;
repeated spark.spark_expression.Expr partition_values = 5;
Aren't the partition values just strings?
No. For a Hive-partitioned table the partition values are directory names, which are strings, but once Spark reads these strings back, they are cast to the corresponding data types of the partition columns.
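A toy sketch (not Spark internals) of the casting described above: a Hive-style partition path carries values as strings, and the reader casts each one to its partition column's declared type. The schema, column names, and path below are made up for illustration.

```python
from datetime import date

# Hypothetical partition schema: column name -> cast function for its type.
partition_schema = {"event_date": date.fromisoformat, "bucket_id": int}

def cast_partition_values(path: str) -> dict:
    """Parse a path like 'event_date=2024-01-01/bucket_id=7' into typed values."""
    values = {}
    for segment in path.split("/"):
        name, raw = segment.split("=", 1)
        values[name] = partition_schema[name](raw)  # string -> typed value
    return values

print(cast_partition_values("event_date=2024-01-01/bucket_id=7"))
# → {'event_date': datetime.date(2024, 1, 1), 'bucket_id': 7}
```

This is why the protobuf field above is `repeated spark.spark_expression.Expr` rather than `repeated string`: the values arriving at the native side are already typed.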
Which issue does this PR close?
Closes #1102.
Rationale for this change
What changes are included in this PR?
How are these changes tested?