
chore: More refactoring of type checking logic #1744

Merged
andygrove merged 15 commits into apache:main from andygrove:scan-refactor-2
May 19, 2025
Conversation


@andygrove andygrove commented May 16, 2025

Which issue does this PR close?

N/A

Rationale for this change

Fixing technical debt in preparation for further improvements to native scans and complex type support

What changes are included in this PR?

  • Move some type checking logic into the native scan execs
  • Improve fallback message for native scans reading byte/short
  • Move usingDataSourceExec and usingDataSourceExecWithIncompatTypes into CometTestBase
  • Reimplement type-checking logic when determining if a Comet sink is supported
  • Add one more fuzz test for shuffle, for more coverage
  • Add a check that we were missing for falling back to Spark for GROUP BY on complex types

How are these changes tested?

@andygrove andygrove requested a review from Copilot May 16, 2025 15:02

Copilot AI left a comment


Pull Request Overview

This PR refactors the type checking logic to reduce technical debt and better support native scans and complex type handling. Key changes include updating import paths and function calls for data source execution checks, refactoring the fallback message and condition for ByteType/ShortType support in native scans, and simplifying the logic for determining supported types in Comet sink operators.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 1 comment.

Summary per file:
  • spark/src/test/scala/org/apache/spark/sql/comet/ParquetDatetimeRebaseSuite.scala: Adjusted imports and usage of usingDataSourceExec in tests
  • spark/src/test/scala/org/apache/spark/sql/CometTestBase.scala: Moved usingDataSourceExec and usingDataSourceExecWithIncompatTypes into the test base
  • spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala: Refactored type-checking logic for ByteType/ShortType with new conditions and fallback messages
  • spark/src/main/scala/org/apache/spark/sql/comet/CometNativeScanExec.scala: Updated type-checking logic for ByteType/ShortType with adjusted configuration checks
  • spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala: Simplified support check for Comet sink operators by removing conditional conversion logic
  • spark/src/main/scala/org/apache/comet/CometSparkSessionExtensions.scala: Removed redundant definitions of usingDataSourceExec and usingDataSourceExecWithIncompatTypes
Comments suppressed due to low confidence (2)

spark/src/main/scala/org/apache/spark/sql/comet/CometScanExec.scala:530

  • There appears to be a logic mismatch when handling ByteType/ShortType: the condition checks if COMET_SCAN_ALLOW_INCOMPATIBLE is true while the fallback message indicates it should be false. Please verify the intended configuration for native scans.
case ByteType | ShortType if CometConf.COMET_NATIVE_SCAN_IMPL.get() == CometConf.SCAN_NATIVE_ICEBERG_COMPAT && CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get() =>
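Copilot's concern is that the guard enables this case when COMET_SCAN_ALLOW_INCOMPATIBLE is true, while the fallback message reads as if the flag should be false. One way to keep the two from drifting apart is to derive both the decision and the message from the same flag. A minimal, hypothetical sketch in plain Scala (ByteShortScanCheck and its message are illustrative, not the actual Comet API; `allowIncompatible` stands in for CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE):

```scala
// Hypothetical sketch: the support decision and the fallback message
// both come from one flag, so they cannot contradict each other.
object ByteShortScanCheck {
  def check(allowIncompatible: Boolean): Either[String, Unit] =
    if (allowIncompatible) {
      // user explicitly opted in to non-spec-compliant byte/short reads
      Right(())
    } else {
      Left("native scan of byte/short requires COMET_SCAN_ALLOW_INCOMPATIBLE=true")
    }
}
```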

spark/src/main/scala/org/apache/comet/serde/QueryPlanSerde.scala:2761

  • [nitpick] The updated logic for supported types in Comet sink operators now always assumes complex types are allowed. Confirm that removing the conversion flag checks (e.g. COMET_CONVERT_FROM_PARQUET_ENABLED) is the intended behavior.
case op if isCometSink(op) =>

@andygrove andygrove marked this pull request as ready for review May 16, 2025 16:31
Comment on lines -245 to -252
def usingDataSourceExec(conf: SQLConf): Boolean =
Seq(CometConf.SCAN_NATIVE_ICEBERG_COMPAT, CometConf.SCAN_NATIVE_DATAFUSION).contains(
CometConf.COMET_NATIVE_SCAN_IMPL.get(conf))

def usingDataSourceExecWithIncompatTypes(conf: SQLConf): Boolean = {
usingDataSourceExec(conf) &&
!CometConf.COMET_SCAN_ALLOW_INCOMPATIBLE.get(conf)
}
andygrove (Member Author):

These methods moved to CometTestBase, which reduces the number of times these configs are used in implementation code.

Comment on lines -2762 to -2770
if isCometSink(op) && op.output.forall(a =>
supportedDataType(
a.dataType,
// Complex type supported if
// - Native datafusion reader enabled (experimental) OR
// - conversion from Parquet/JSON enabled
allowComplex =
usingDataSourceExec(conf) || CometConf.COMET_CONVERT_FROM_PARQUET_ENABLED
.get(conf) || CometConf.COMET_CONVERT_FROM_JSON_ENABLED.get(conf))) =>
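Per the PR description, the replacement presumably reduces this to an unconditional `allowComplex = true`. A self-contained sketch of that shape, using simplified stand-in types (Attr, Plan, and this supportedDataType are hypothetical stand-ins, not the real Spark/Comet classes):

```scala
// Simplified stand-ins for the real Spark/Comet types (hypothetical):
case class Attr(name: String, isComplex: Boolean)
case class Plan(output: Seq[Attr], isCometSink: Boolean)

object SinkCheck {
  // Stand-in for Comet's supportedDataType: with allowComplex = true,
  // complex output types no longer block the sink.
  def supportedDataType(a: Attr, allowComplex: Boolean): Boolean =
    allowComplex || !a.isComplex

  // The simplified check: no conversion-flag conditions, complex
  // types always allowed for a Comet sink's output.
  def isSupported(op: Plan): Boolean =
    op.isCometSink && op.output.forall(supportedDataType(_, allowComplex = true))
}
```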
andygrove (Member Author):

I don't remember why these checks were once necessary but I don't think we need them now.

andygrove (Member Author):

These checks were limiting our support for complex types and removing them exposed a new bug with grouping on maps, which is now fixed in this PR.


@mbutrovich mbutrovich left a comment


LGTM, and a new test for fun! Thanks @andygrove!


codecov-commenter commented May 16, 2025

Codecov Report

Attention: Patch coverage is 22.22222% with 14 lines in your changes missing coverage. Please review.

Project coverage is 58.56%. Comparing base (f09f8af) to head (3634fc1).
Report is 195 commits behind head on main.

Files with missing lines Patch % Lines
.../scala/org/apache/comet/serde/QueryPlanSerde.scala 30.00% 3 Missing and 4 partials ⚠️
...ala/org/apache/spark/sql/comet/CometScanExec.scala 20.00% 2 Missing and 2 partials ⚠️
...g/apache/spark/sql/comet/CometNativeScanExec.scala 0.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1744      +/-   ##
============================================
+ Coverage     56.12%   58.56%   +2.43%     
- Complexity      976     1133     +157     
============================================
  Files           119      130      +11     
  Lines         11743    12681     +938     
  Branches       2251     2369     +118     
============================================
+ Hits           6591     7426     +835     
- Misses         4012     4063      +51     
- Partials       1140     1192      +52     



andygrove commented May 16, 2025

I have a Spark SQL test failure to resolve:

SPARK-47430 Support GROUP BY MapType *** FAILED *** (124 milliseconds)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1711.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1711.0 (TID 1466) (a357d001add0 executor driver): org.apache.comet.CometNativeException: Not yet implemented: not yet implemented: Map(Field { name: "entries", data_type: Struct([Field { name: "key", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, Field { name: "value", data_type: Int32, nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }]), nullable: false, dict_id: 0, dict_is_ordered: false, metadata: {} }, false)

edit: The failure is specific to Spark 4.0.0, which added support for grouping by map types


@parthchandra parthchandra left a comment


lgtm. Just one minor question.

return None
}

if (groupingExpressions.exists(expr =>
parthchandra (Contributor):

Are there other checks we need to include here for structs/arrays?

andygrove (Member Author):

Spark 3.x supports grouping by structs and arrays and we already have tests passing for those cases. Grouping by map is new in Spark 4.

andygrove (Member Author):

hmm I suppose there could be structs containing maps 🤔
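A recursive walk over the type tree would catch maps nested inside structs and arrays as well as top-level maps. A sketch using a simplified type ADT (the real check would pattern-match on Spark's DataType hierarchy; these case classes are hypothetical stand-ins):

```scala
// Simplified stand-ins for Spark's DataType hierarchy (hypothetical):
sealed trait DType
case object IntT extends DType
case class ArrayT(element: DType) extends DType
case class StructT(fields: Seq[DType]) extends DType
case class MapT(key: DType, value: DType) extends DType

object GroupByGuard {
  // True if a map appears anywhere in the type, in which case
  // grouping on it should fall back to Spark.
  def containsMap(dt: DType): Boolean = dt match {
    case MapT(_, _)  => true
    case ArrayT(e)   => containsMap(e)
    case StructT(fs) => fs.exists(containsMap)
    case _           => false
  }
}
```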

andygrove (Member Author):

Thanks for the reviews @mbutrovich and @parthchandra

@andygrove andygrove merged commit 7717a25 into apache:main May 19, 2025
78 checks passed
@andygrove andygrove deleted the scan-refactor-2 branch May 19, 2025 16:32
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025