perf: improve performance of update metrics #1329
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #1329 +/- ##
=============================================
- Coverage 56.12% 39.06% -17.07%
- Complexity 976 2071 +1095
=============================================
Files 119 263 +144
Lines 11743 60742 +48999
Branches 2251 12909 +10658
=============================================
+ Hits 6591 23729 +17138
- Misses 4012 32530 +28518
- Partials    1140     4483   +3343
Although the proportion of update_metrics in the CPU profile has been greatly reduced, the TPC-DS/TPC-H benchmarks on small data sets have not improved.
@mbutrovich may be interested in reviewing this as well
let poll_output = exec_context.runtime.block_on(async { poll!(next_item) });

// Update metrics
update_metrics(&mut env, exec_context)?;
I wonder if we should add a config so that we can choose between frequent metrics updates vs just updating once the query completes. It can sometimes be helpful to see live metrics.
Per-batch is probably always overkill. For long-running jobs is there a period that makes sense? It looks like Spark History defaults to 10s.
I do like the idea of updating metrics every N seconds
I think checking a coarse-grained clock (e.g., CLOCK_MONOTONIC_COARSE) to see whether N seconds have elapsed before producing updated metrics would be a reasonable compromise between performance impact and fresh metrics.
I also like the idea of updating every N seconds. One good reason for updating frequently is to keep updating the live UI.
@mbutrovich Thank you for your idea, sounds great to me, I will try to do that later.
Based on a single run of TPC-H @ 100GB, I see approximately 2% improvement (325s on main vs 318s with this PR)
@andygrove @mbutrovich @parthchandra Thank you for your review and sorry for the late reply. I have just finished my Chinese New Year holiday and will continue this work later.
spark_plan.children().iter().for_each(|child_plan| {
    let child_node = to_native_metric_node(child_plan).unwrap();
    native_metric_node.children.push(child_node);
});
If you change this to a for loop rather than using for_each, then we can use ? to handle any error condition.
Suggested change:

spark_plan.children().iter().for_each(|child_plan| {
    let child_node = to_native_metric_node(child_plan).unwrap();
    native_metric_node.children.push(child_node);
});

for child_plan in spark_plan.children() {
    let child_node = to_native_metric_node(child_plan)?;
    native_metric_node.children.push(child_node);
}
Thank you for your suggestion; changed. I am not familiar with Rust yet, and I hope to learn Rust by contributing to Comet. 😁
runtime,
metrics,
metrics_update_interval,
metrics_last_update_time: Instant::now(),
@andygrove thoughts on a coarse time crate? The overhead of the clock_gettime() calls underneath Instant::now() can really add up. Maybe it's a premature optimization, but I also don't want a "death by 1000 cuts" scenario with gettime() calls all over the place.
I ran coarsetime's benchmark on my laptop:
coarsetime_now(): 126.93 M/s
coarsetime_recent(): 340.32 M/s
coarsetime_elapsed(): 142.64 M/s
coarsetime_since_recent(): 340.34 M/s
stdlib_now(): 51.37 M/s
stdlib_elapsed(): 42.42 M/s
I'm a bit stunned that Rust's stdlib doesn't provide a nice way to get coarse time on its own, since the performance difference can be quite large and a lot of tasks don't need nanosecond precision.
@mbutrovich I don't know much about coarse time, but I have no objection to adding this.
I assume that we could make this change in a follow on PR
Which issue does this PR close?
Closes #1328.
Rationale for this change
Improve the performance of updating metrics.
What changes are included in this PR?
How are these changes tested?
After this change, SQL metrics are displayed correctly (screenshot) and the CPU profile shows the reduced time in update_metrics (screenshot).