perf: improve performance of update metrics #1329
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## main #1329 +/- ##
=============================================
- Coverage 56.12% 39.06% -17.07%
- Complexity 976 2071 +1095
=============================================
Files 119 263 +144
Lines 11743 60742 +48999
Branches 2251 12909 +10658
=============================================
+ Hits 6591 23729 +17138
- Misses 4012 32530 +28518
- Partials    1140     4483   +3343
Although the proportion of update_metrics in the CPU profile has been greatly reduced, the TPC-DS/TPC-H benchmarks on small data sets have not improved.
@mbutrovich may be interested in reviewing this as well
let poll_output = exec_context.runtime.block_on(async { poll!(next_item) });

// Update metrics
update_metrics(&mut env, exec_context)?;
I wonder if we should add a config so that we can choose between frequent metrics updates vs just updating once the query completes. It can sometimes be helpful to see live metrics.
Per-batch is probably always overkill. For long-running jobs is there a period that makes sense? It looks like Spark History defaults to 10s.
I do like the idea of updating metrics every N seconds
I think checking a coarse-grained clock (e.g., CLOCK_MONOTONIC_COARSE) to see whether N seconds have elapsed before producing updated metrics would be a reasonable compromise between performance impact and fresh metrics.
I also like the idea of updating every N seconds. One good reason for updating frequently is to keep updating the live UI.
@mbutrovich Thank you for your idea, sounds great to me, I will try to do that later.
Based on a single run of TPC-H @ 100GB, I see approximately 2% improvement (325s on main vs 318s with this PR)
@andygrove @mbutrovich @parthchandra Thank you for your review and sorry for the late reply. I have just finished my Chinese New Year holiday and will continue this work later.
spark_plan.children().iter().for_each(|child_plan| {
    let child_node = to_native_metric_node(child_plan).unwrap();
    native_metric_node.children.push(child_node);
});
If you change this to a for loop rather than using for_each, then we can use ? to handle any error condition.
Suggested change:

spark_plan.children().iter().for_each(|child_plan| {
    let child_node = to_native_metric_node(child_plan).unwrap();
    native_metric_node.children.push(child_node);
});

for child_plan in spark_plan.children() {
    let child_node = to_native_metric_node(child_plan)?;
    native_metric_node.children.push(child_node);
}
Thank you for your suggestion; changed. I am not familiar with Rust yet, and I hope to learn Rust by contributing to Comet. 😁
runtime,
metrics,
metrics_update_interval,
metrics_last_update_time: Instant::now(),
@andygrove thoughts on a coarse time crate? The overhead of the clock_gettime() calls underneath Instant::now() can really add up. Maybe it's a premature optimization, but I also don't want a "death by 1000 cuts" scenario with gettime() calls all over the place.
I ran coarsetime's benchmark on my laptop:
coarsetime_now(): 126.93 M/s
coarsetime_recent(): 340.32 M/s
coarsetime_elapsed(): 142.64 M/s
coarsetime_since_recent(): 340.34 M/s
stdlib_now(): 51.37 M/s
stdlib_elapsed(): 42.42 M/s
I'm a bit stunned that Rust's stdlib doesn't provide a nice way to get coarse time on its own, since the performance difference can be quite large and a lot of tasks don't need nanosecond precision.
@mbutrovich I don't know much about coarse time, but I have no objection to adding this.
I assume that we could make this change in a follow on PR
Which issue does this PR close?
Closes #1328.
Rationale for this change
Improve the performance of updating metrics.
What changes are included in this PR?
How are these changes tested?
After this change, SQL metrics are displayed correctly (screenshot) and the CPU profile shows the reduced time in update_metrics (screenshot).