fix: Support partition values in feature branch comet-parquet-exec#1106

viirya merged 5 commits into apache:comet-parquet-exec
Conversation
val dataSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.dataSchema)
val partitionSchemaParquet =
  new SparkToParquetSchemaConverter(conf).convert(scan.relation.partitionSchema)
#1103 discusses how the schemas have already lost necessary information at this point. Should we construct a new partition schema from the true Parquet schema rather than the partitionSchema that may have lost/converted type information already?
This is copied from existing code.
Actually, I can just convert the Spark schema to Arrow types in the JVM and serialize it to the native side. I did a similar thing in the shuffle writer. Then we won't lose any information.
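A toy sketch (not Comet code) of the idea above: converting a logical schema down to physical Parquet types can collapse distinct Spark types into one, while serializing the full logical schema (as with an Arrow schema sent from the JVM to the native side) round-trips without loss. The type names and mapping below are illustrative, not Spark's actual conversion rules.

```python
import json

# Hypothetical logical schema: two distinct Spark timestamp variants.
spark_schema = [("ts", "timestamp_ntz"), ("ts2", "timestamp")]

# Lossy path: both logical timestamp types map to the same physical type,
# so the distinction cannot be recovered afterwards.
to_physical = {"timestamp_ntz": "int64", "timestamp": "int64"}
physical = [(name, to_physical[t]) for name, t in spark_schema]
assert physical[0][1] == physical[1][1]  # distinction lost

# Lossless path: serialize the logical schema itself and restore it intact.
payload = json.dumps(spark_schema)
restored = [tuple(f) for f in json.loads(payload)]
assert restored == spark_schema  # nothing lost
```

JSON stands in here for the Arrow IPC serialization the comment describes; the point is only that shipping the full logical schema avoids the one-way mapping.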
int64 start = 2;
int64 length = 3;
int64 file_size = 4;
repeated spark.spark_expression.Expr partition_values = 5;
Aren't the partition values just strings?
No. For a Hive-partitioned table the partition values are directory names, which are strings, but once Spark reads these strings back, they are cast to the corresponding data types of the partition columns.
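A toy sketch (not Spark internals) of the casting described above: a Hive-style partition path carries values as strings, and the reader casts each one to its partition column's declared type. The schema, column names, and path below are made up for illustration.

```python
from datetime import date

# Hypothetical partition schema: column name -> cast function for its type.
partition_schema = {"event_date": date.fromisoformat, "bucket_id": int}

def cast_partition_values(path: str) -> dict:
    """Parse a path like 'event_date=2024-01-01/bucket_id=7' into typed values."""
    values = {}
    for segment in path.split("/"):
        name, raw = segment.split("=", 1)
        values[name] = partition_schema[name](raw)  # string -> typed value
    return values

print(cast_partition_values("event_date=2024-01-01/bucket_id=7"))
# → {'event_date': datetime.date(2024, 1, 1), 'bucket_id': 7}
```

This is why the protobuf field above is `repeated spark.spark_expression.Expr` rather than `repeated string`: the values arriving at the native side are already typed.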
Which issue does this PR close?
Closes #1102.
Rationale for this change
What changes are included in this PR?
How are these changes tested?