chore: Introduce ANSI support for remainder operation#1971

Merged
mbutrovich merged 5 commits into apache:main from rishvin:rem_ansi_support
Jul 11, 2025

Conversation


@rishvin (Contributor) commented Jul 1, 2025

Which issue does this PR close?

Closes #532

Rationale for this change

Spark in ANSI mode can throw a division-by-zero exception for the remainder operation; however, the remainder implementation in Comet is not ANSI compliant and returns NULL rather than throwing. This PR introduces a remainder implementation that is compliant with Spark behavior in both ANSI and non-ANSI mode.

Reference doc. As per the doc:
try_mod: identical to the remainder operator %, except that it returns a NULL result instead of throwing an exception when dividing by 0.
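As a rough sketch of the semantics above, reduced to scalar i64 values (the actual Comet code operates on Arrow arrays; `spark_rem` and its error string here are hypothetical stand-ins, with the flag mirroring the PR's fail_on_error):

```rust
// Illustrative sketch only: Spark-compliant remainder semantics on scalar
// i64 values. `spark_rem` and the error text are hypothetical stand-ins,
// not Comet's actual API.
fn spark_rem(dividend: i64, divisor: i64, fail_on_error: bool) -> Result<Option<i64>, String> {
    if divisor == 0 {
        return if fail_on_error {
            // ANSI mode: surface a divide-by-zero error
            Err("[DIVIDE_BY_ZERO] Division by zero.".to_string())
        } else {
            // non-ANSI mode (and try_mod): NULL instead of an error
            Ok(None)
        };
    }
    if divisor == -1 {
        // Guard: i64::MIN % -1 overflows in Rust; x % -1 is always 0 anyway.
        return Ok(Some(0));
    }
    Ok(Some(dividend % divisor))
}

fn main() {
    assert_eq!(spark_rem(10, 3, false), Ok(Some(1)));
    assert_eq!(spark_rem(10, 0, false), Ok(None)); // non-ANSI: NULL
    assert!(spark_rem(10, 0, true).is_err());      // ANSI: error
    assert_eq!(spark_rem(i64::MIN, -1, true), Ok(Some(0)));
    println!("ok");
}
```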

What changes are included in this PR?

  • Adds a new physical expression, ModuloExpr. This expression supports both ANSI and non-ANSI mode (via a fail_on_error flag). In ANSI mode, it translates Arrow's division-by-zero error into a Spark-compliant error message.
  • Removes the call to nullIfWhenPrimitive when serializing the remainder expression, and implements the equivalent functionality in ModuloExpr. This method sets null values for RHS record batches that contain zeros. This is required in non-ANSI mode; otherwise Arrow's remainder implementation would throw a divide-by-zero exception.
  • Adds Rust and Scala test cases to verify remainder behaviour in both ANSI and non-ANSI mode.
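The null-masking step described in the second bullet can be sketched roughly as follows, with plain Vec<Option<i64>> standing in for Arrow arrays (names here are illustrative, not Comet's actual API):

```rust
// Hypothetical sketch of the zero-divisor masking described above: in
// non-ANSI mode, zero divisors are replaced with null before the remainder
// is computed, so the kernel never divides by zero.
fn null_if_zero(divisors: &[Option<i64>]) -> Vec<Option<i64>> {
    divisors
        .iter()
        .map(|v| match v {
            Some(0) => None, // zero divisor becomes null
            other => *other,
        })
        .collect()
}

fn rem_non_ansi(lhs: &[Option<i64>], rhs: &[Option<i64>]) -> Vec<Option<i64>> {
    let masked = null_if_zero(rhs);
    lhs.iter()
        .zip(masked.iter())
        .map(|(l, r)| match (l, r) {
            (Some(a), Some(b)) => Some(a % b), // b is never 0 after masking
            _ => None,                         // any null input yields null
        })
        .collect()
}

fn main() {
    let lhs = vec![Some(10), Some(7), None];
    let rhs = vec![Some(3), Some(0), Some(2)];
    assert_eq!(rem_non_ansi(&lhs, &rhs), vec![Some(1), None, None]);
    println!("ok");
}
```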

How are these changes tested?

Added Rust and Scala tests to verify the behavior.
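The general shape of such tests might look like the table-driven sketch below (this is not the PR's actual test code; `spark_rem` is a hypothetical stand-in for the expression under test):

```rust
// Illustrative test-table shape only. `spark_rem` is a hypothetical
// stand-in for the remainder expression being verified.
fn spark_rem(dividend: i64, divisor: i64, ansi: bool) -> Result<Option<i64>, String> {
    if divisor == 0 {
        return if ansi { Err("DIVIDE_BY_ZERO".into()) } else { Ok(None) };
    }
    Ok(Some(dividend % divisor))
}

fn main() {
    // (dividend, divisor, ansi_mode, expected outcome)
    let cases: Vec<(i64, i64, bool, Result<Option<i64>, String>)> = vec![
        (10, 3, false, Ok(Some(1))),
        (10, 0, false, Ok(None)),                    // non-ANSI: NULL
        (10, 0, true, Err("DIVIDE_BY_ZERO".into())), // ANSI: error
        (-7, 3, true, Ok(Some(-1))),                 // sign follows the dividend
    ];
    for (d, v, ansi, expected) in &cases {
        assert_eq!(&spark_rem(*d, *v, *ansi), expected);
    }
    println!("all cases pass");
}
```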


Results from spark-shell:

Without Comet (ANSI/non-ANSI): [screenshot, 2025-07-01 5:11 PM]

With Comet (ANSI/non-ANSI): [screenshot, 2025-07-01 5:12 PM]

fn evaluate_statistics(&self, children: &[&Distribution]) -> Result<Distribution> {
    let (left, right) = (children[0], children[1]);

    if let (Gaussian(left), Gaussian(right)) = (left, right) {
@rishvin (Contributor, author) commented Jul 1, 2025:

Borrowed these changes from BinaryExpr expression, here.

(Ok(DataType::Decimal128(p1, s1)), Ok(DataType::Decimal128(p2, s2)))
    if max(s1, s2) as u8 + max(p1 - s1 as u8, p2 - s2 as u8) > DECIMAL128_MAX_PRECISION =>
{
    let left = Arc::new(Cast::new(
@rishvin (Contributor, author):

Borrowed this code from `let data_type = return_type.map(to_arrow_datatype).unwrap();`. We were doing this casting before creating the modulo operation.

@codecov-commenter commented Jul 1, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.77%. Comparing base (f09f8af) to head (5ff9643).
⚠️ Report is 344 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #1971      +/-   ##
============================================
+ Coverage     56.12%   58.77%   +2.64%     
- Complexity      976     1154     +178     
============================================
  Files           119      134      +15     
  Lines         11743    13054    +1311     
  Branches       2251     2417     +166     
============================================
+ Hits           6591     7672    +1081     
- Misses         4012     4155     +143     
- Partials       1140     1227      +87     


@rishvin (Contributor, author) commented Jul 2, 2025

Note that the vanilla Spark message is a bit misleading. It says "Use try_divide()"; however, it should say "Use try_mod()". I checked this message in Spark 4 as well, and it still says "Use try_divide()". This is a Spark-specific message problem, but I wanted to flag it. Here is a sample from Spark 4:

scala> spark.sql("SELECT mod(10, 0)").show()
     |
org.apache.spark.SparkArithmeticException: [DIVIDE_BY_ZERO] Division by zero. Use `try_divide` to tolerate divisor being 0 and return NULL instead. If necessary set "spark.sql.ansi.enabled" to "false" to bypass this error. SQLSTATE: 22012
== SQL (line 1, position 8) ==
SELECT mod(10, 0)

Filed a minor ticket for Spark: https://issues.apache.org/jira/browse/SPARK-52659

@rishvin changed the title from "chore: ANSI support for remainder operation." to "chore: Introduce ANSI support for remainder operation" on Jul 2, 2025
@rishvin (Contributor, author) commented Jul 3, 2025

@andygrove : Could you please review and check if this is how you envisioned these changes?

@andygrove (Member) commented:

> @andygrove : Could you please review and check if this is how you envisioned these changes?

Thanks @rishvin. I took a first pass through, and this looks great. Thanks for adding comprehensive tests both in Scala and Rust. I do wonder if we should implement the expression as a ScalarUDFImpl rather than PhysicalExpr though. What do you think?

@rishvin (Contributor, author) commented Jul 4, 2025:

> @andygrove : Could you please review and check if this is how you envisioned these changes?
>
> Thanks @rishvin. I took a first pass through, and this looks great. Thanks for adding comprehensive tests both in Scala and Rust. I do wonder if we should implement the expression as a ScalarUDFImpl rather than PhysicalExpr though. What do you think?

Thanks @andygrove, yes SGTM. Let me rewrite these changes as a UDF.

@rishvin (Contributor, author) commented Jul 8, 2025

@andygrove I have made the changes; could you please recheck.

@andygrove (Member) left a comment:
LGTM. Thanks @rishvin.

@andygrove (Member) commented:

@parthchandra @mbutrovich would you also like to review?

@mbutrovich self-requested a review on July 11, 2025 at 14:47
@mbutrovich (Contributor) left a comment:

Thanks @rishvin! LGTM

@mbutrovich merged commit 496cad9 into apache:main on Jul 11, 2025
91 checks passed
@rishvin deleted the rem_ansi_support branch on July 11, 2025 at 19:28
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025


Development

Successfully merging this pull request may close these issues.

Add ANSI support for Remainder

5 participants