Add basic eventloop latency measurement. #11963
Conversation
madolson left a comment:
Still strongly against this change. It feels like we're exposing metrics that are easily collectible instead of exposing information that is useful.
i see a few things i listed in the issue are missing, let's add them in a later PR, after we're happy with the ones you already added.
- eventloop_duration_cron_sum - total time spent in serverCron, and beforeSleep (excluding writes and aof)
- instantaneous_eventloop_cycles - cycles per second in the last 1.6 seconds
- instantaneous_eventloop_duration_avg - average duration in the last 1.6 seconds
- instantaneous_eventloop_duration_max - max duration in the last 1.6 seconds
Is read (in beforeSleep) considered a part of cron? And about the 3 instantaneous stats, are they updated every 1.6 seconds, or updated every loop? I think we can leave these to external monitoring tools, so the server can just provide basic stats.
@CharlesChen888 my idea for a "cron" metric was to measure the sum of active-expire, active-defrag and all other tasks done by cron and beforeSleep that are not counted by the other set of metrics (so we should exclude read, write and AOF from these). regarding the "instantaneous" metrics, i agree they're unnecessary for people who have a proper monitoring software connected to redis. they're there for easy consumption from the command prompt by using redis-cli.
agree, cron sum is useful, especially for the expire cycle and defrag cycle. moreover, we also have two other time events.
I'm not sure if they're useful, and maybe you wanna record how many eventloop cycles happened in the last xxx seconds, and the average/max duration in the last xxx seconds. But I think it should cover a longer window, like 1 min, and we should introduce another way to record them instead of using the current instantaneous metric.
i don't think we wanna break down cron to each and every task it contains. regarding the instantaneous metrics, IIRC last time we argued about these metrics (in #10062), i argued that we don't need them and you argued that we do... regarding the duration being 1 second or 1 minute, i don't wanna introduce another "infrastructure" for this, so i'd rather keep the 1.6 seconds thing we have now. bottom line, if you are ok with it, let's drop the max and add just two.
So do we have a final design for this? INFO or histogram? And what metrics to expose? If we just show the metrics in INFO, we may have a new section. If we want a histogram, we may have a new command (or perhaps a new subcommand). Or we show the sum in INFO, and details in a histogram?
i'm leaning towards keeping the simple metrics in INFO STATS (with the instantaneous ones but without the max). the histogram data itself in the event loop section should probably be plain (similar to the LATENCY HISTOGRAM output), but maybe put under a "histogram" field in a map, so we can extend that map in the future.
I need to clarify: I mean the instantaneous max metric is strange. IMHO instantaneous is always used for per-second metrics, like qps for queries per second and bandwidth for net bytes per second; I don't know how to interpret "duration time per second".
i view the "instantaneous" as a metric representing the very recent past (the last 1.6 seconds); it doesn't have to be a per-second metric.
seems you misunderstand the current instantaneous mechanism, or am I missing something? it is the average of the past 16 per-second samples; in other words it is indeed a per-second metric, just calculated as an average of 16 samples, nothing to do with 1.6 seconds.

```c
/* Add a sample to the operations per second array of samples. */
void trackInstantaneousMetric(int metric, long long current_reading) {
    long long now = mstime();
    long long t = now - server.inst_metric[metric].last_sample_time;
    long long ops = current_reading -
                    server.inst_metric[metric].last_sample_count;
    long long ops_sec;

    ops_sec = t > 0 ? (ops*1000/t) : 0;

    server.inst_metric[metric].samples[server.inst_metric[metric].idx] =
        ops_sec;
    server.inst_metric[metric].idx++;
    server.inst_metric[metric].idx %= STATS_METRIC_SAMPLES;
    server.inst_metric[metric].last_sample_time = now;
    server.inst_metric[metric].last_sample_count = current_reading;
}

/* Return the mean of all the samples. */
long long getInstantaneousMetric(int metric) {
    int j;
    long long sum = 0;

    for (j = 0; j < STATS_METRIC_SAMPLES; j++)
        sum += server.inst_metric[metric].samples[j];
    return sum / STATS_METRIC_SAMPLES;
}
```
i think the confusion is that you're referring to the way they're implemented and i'm referring to the way they're presented to the user. i.e. the current mechanism takes a rate sample (ops / sec) once in 100ms, and then computes an average of these over 1.6 seconds. i'm saying 1.6 seconds because there are 16 samples and we take one every 100ms.
since this was stale for a while, i'll try to promote a minimal version. the status from my perspective (putting aside the histogram feature) is that what's missing is:
IMO it could be nice to have two metrics for easy consumption from redis-cli (but not a must):
@CharlesChen888 can you handle these changes or maybe want me to try to step in and help?
@oranagra Sorry about the delay, I went on vacation last weekend and I will get back to this tomorrow.
tests/unit/info.tcl (outdated):
```
set value2 [s $metric]
assert {$value2 > $value1}
```
maybe we can add a resetstat, and then we can put some boundaries on the values?
i.e. we know we have roughly 2 commands, 3 cycles, and the duration is probably very low (in any case below 100).
it'll mean we can't use the foreach, but that's also a benefit, since we'll have just one after 110 for all 3 metrics, instead of having 3 sleeps.
"Allowing some tolerance" reminds me that such tests involving time measurement may fail due to performance jitter, and we may end up allowing a huge tolerance. I don't feel like doing so...
even a huge tolerance (100 times what we expect) can catch some bugs; in comparison, the >= seems a little bit useless.
but anyway, i'm also ok keeping what we have now.
i'll try to find some time for manual testing, and then i'll merge it.
or if you can acknowledge that you did that already (monitor these metrics on a server under traffic, and on an idle one), i'll skip it and merge sooner.
thank you.
I already did manual testing and the results look OK. Also added some upper boundaries in the test.
@CharlesChen888 please take a look at the test failures. some are thresholds that should be adjusted (take a safe distance, maybe twice what's printed), and there's also an odd fortify compilation error that i didn't look into..
tests are now passing, and i also verified that we don't see that fortification error in unstable. trying to figure out how it is related to this PR.. |
A new test added in #11963 fails on the freebsd CI, which is slow.
The test fails on freebsd CI:
```
*** [err]: stats: eventloop metrics in tests/unit/info.tcl
Expected '31777' to be less than '16183' (context: type eval line 17 cmd
{assert_lessthan $el_sum2 [expr $el_sum1+10000] } proc ::test)
```
The test added in redis#11963 fails on freebsd CI, which is slow.
Increase the tolerance and also add some verbose logs; now we can
see these logs in verbose mode (for better visibility):
```
eventloop metrics cycle1: 12, cycle2: 15
eventloop metrics el_sum1: 315, el_sum2: 411
eventloop metrics cmd_sum1: 126, cmd_sum2: 137
[ok]: stats: eventloop metrics (111 ms)
instantaneous metrics instantaneous_eventloop_cycles_per_sec: 8
instantaneous metrics instantaneous_eventloop_duration_usec: 55
[ok]: stats: instantaneous metrics (1603 ms)
[ok]: stats: debug metrics (112 ms)
```
(redis#12385)
To implement the basic eventloop duration measurement in #10719.
The measured latency (duration) includes the list below, which can be shown by
INFO STATS. Also added some experimental metrics, which are shown only when
INFO DEBUG is called. This section isn't included in the default INFO, or even in
INFO ALL, and the fields in this section can change in the future without considering backwards compatibility. All of these are reset by CONFIG RESETSTAT.