chore: More refactoring of type checking logic #1744
Conversation
Pull Request Overview
This PR refactors the type checking logic to reduce technical debt and better support native scans and complex type handling. Key changes include updating import paths and function calls for data source execution checks, refactoring the fallback message and condition for ByteType/ShortType support in native scans, and simplifying the logic for determining supported types in Comet sink operators.
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| spark/src/test/scala/org/apache/spark/sql/comet/ParquetDatetimeRebaseSuite.scala | Adjusted imports and usage of usingDataSourceExec in tests |
| spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala | Moved usingDataSourceExec and usingDataSourceExecWithIncompatTypes into the test base |
| spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala | Refactored type-checking logic for ByteType/ShortType with new conditions and fallback messages |
| spark/src/main/scala/org/apache/spark/sql/comet/CometNativeScanExec.scala | Updated type-checking logic for ByteType/ShortType with adjusted configuration checks |
| spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala | Simplified support check for Comet sink operators by removing conditional conversion logic |
| spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala | Removed redundant definitions of usingDataSourceExec and usingDataSourceExecWithIncompatTypes |
Comments suppressed due to low confidence (2)
spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala:530
- There appears to be a logic mismatch when handling ByteType/ShortType: the condition checks if COMET_SCAN_ALLOW_INCOMPATIBLE is true while the fallback message indicates it should be false. Please verify the intended configuration for native scans.

```scala
case ByteType | ShortType
    if CometConf.COMET_NATIVE_SCAN_IMPL.get() == CometConf.SCAN_NATIVE_ICEBERG_COMPAT &&
      CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get() =>
```
spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:2761
- [nitpick] The updated logic for supported types in Comet sink operators now always assumes complex types are allowed. Confirm that removing the conversion flag checks (e.g. COMET_CONVERT_FROM_PARQUET_ENABLED) is the intended behavior.

```scala
case op if isCometSink(op) =>
```
```scala
def usingDataSourceExec(conf: SQLConf): Boolean =
  Seq(CometConf.SCAN_NATIVE_ICEBERG_COMPAT, CometConf.SCAN_NATIVE_DATAFUSION).contains(
    CometConf.COMET_NATIVE_SCAN_IMPL.get(conf))

def usingDataSourceExecWithIncompatTypes(conf: SQLConf): Boolean = {
  usingDataSourceExec(conf) &&
    !CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get(conf)
}
```
These methods moved to CometTestBase, which reduces the number of times we use these configs in implementation code.
```scala
if isCometSink(op) && op.output.forall(a =>
  supportedDataType(
    a.dataType,
    // Complex type supported if
    // - Native datafusion reader enabled (experimental) OR
    // - conversion from Parquet/JSON enabled
    allowComplex =
      usingDataSourceExec(conf) || CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED
        .get(conf) || CometConf.COMET_CONVERT_FROM_JSON_ENABLED.get(conf))) =>
```
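As a rough illustration of the gate being removed, here is a self-contained sketch (toy data types and hypothetical helper names only, not the actual Comet code) contrasting the old config-gated `allowComplex` value with the simplified behavior where Comet sink operators always allow complex types:

```scala
// Sketch only: toy types and hypothetical helpers, not Comet's real API.
object SinkTypeCheckSketch {
  sealed trait DataType
  case object IntType extends DataType
  case class ArrayType(element: DataType) extends DataType
  case class StructType(fields: Seq[DataType]) extends DataType

  // Old behavior (per the removed code): complex types were only allowed
  // when one of several configs was enabled.
  def allowComplexOld(usingDataSourceExec: Boolean,
                      convertFromParquet: Boolean,
                      convertFromJson: Boolean): Boolean =
    usingDataSourceExec || convertFromParquet || convertFromJson

  // New behavior (per this PR): Comet sinks always allow complex types.
  val allowComplexNew: Boolean = true

  // Recursive supported-type check gated by allowComplex.
  def supportedDataType(dt: DataType, allowComplex: Boolean): Boolean = dt match {
    case IntType        => true
    case ArrayType(e)   => allowComplex && supportedDataType(e, allowComplex)
    case StructType(fs) => allowComplex && fs.forall(supportedDataType(_, allowComplex))
  }
}
```

With the old gate and all configs off, a struct output would cause a fallback; with the new behavior the same struct passes the check.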
I don't remember why these checks were once necessary, but I don't think we need them now.
These checks were limiting our support for complex types and removing them exposed a new bug with grouping on maps, which is now fixed in this PR.
mbutrovich left a comment:
LGTM, and a new test for fun! Thanks @andygrove!
Codecov Report
Attention: Patch coverage is
Additional details and impacted files:

```
@@ Coverage Diff @@
##             main    #1744      +/-   ##
============================================
+ Coverage     56.12%   58.56%   +2.43%
- Complexity      976     1133     +157
============================================
  Files           119      130      +11
  Lines         11743    12681     +938
  Branches       2251     2369     +118
============================================
+ Hits           6591     7426     +835
- Misses         4012     4063      +51
- Partials       1140     1192      +52
```
I have a Spark SQL test failure to resolve. Edit: The failure is specific to Spark 4.0.0, which added support for grouping by map types.
parthchandra left a comment:
LGTM. Just one minor question.
```scala
  return None
}

if (groupingExpressions.exists(expr =>
```
Are there other checks we need to include here for structs/arrays?
Spark 3.x supports grouping by structs and arrays and we already have tests passing for those cases. Grouping by map is new in Spark 4.
hmm I suppose there could be structs containing maps 🤔
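The nested-map concern above could be handled with a recursive type walk. A minimal self-contained sketch (using a toy type model, not Spark's actual `org.apache.spark.sql.types.DataType` classes) of a check that also catches maps buried inside structs or arrays:

```scala
// Sketch only: toy data types standing in for Spark's DataType hierarchy.
object GroupByMapSketch {
  sealed trait DataType
  case object IntegerType extends DataType
  case class ArrayType(elementType: DataType) extends DataType
  case class MapType(keyType: DataType, valueType: DataType) extends DataType
  case class StructType(fieldTypes: Seq[DataType]) extends DataType

  // Recursively check whether a type is, or contains, a map.
  def containsMap(dt: DataType): Boolean = dt match {
    case MapType(_, _)  => true
    case ArrayType(e)   => containsMap(e)
    case StructType(fs) => fs.exists(containsMap)
    case _              => false
  }
}
```

A grouping-key check written this way would flag both a top-level `MapType` key and a struct field whose nested element type is a map.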
Thanks for the reviews @mbutrovich and @parthchandra
Which issue does this PR close?
N/A
Rationale for this change
Fixing technical debt in preparation for other improvements for native scans and complex type support
What changes are included in this PR?
- Moved `usingDataSourceExec` and `usingDataSourceExecWithIncompatTypes` into `CometTestBase`

How are these changes tested?