
Add basic eventloop latency measurement.#11963

Merged
oranagra merged 32 commits into redis:unstable from CharlesChen888:eventloop-latency
May 12, 2023

Conversation

@CharlesChen888
Contributor

@CharlesChen888 CharlesChen888 commented Mar 24, 2023

This implements the basic eventloop duration measurement discussed in #10719.

The measured latency (duration) metrics are listed below; they can be shown by INFO STATS.

eventloop_cycles  // ever increasing counter
eventloop_duration_sum // cumulative duration of eventloop in microseconds
eventloop_duration_cmd_sum  // cumulative duration of executing commands in microseconds
instantaneous_eventloop_cycles_per_sec  // average eventloop count per second in recent 1.6s
instantaneous_eventloop_duration_usec  // average single eventloop duration in recent 1.6s

Also added some experimental metrics, which are shown only when INFO DEBUG is called.
This section isn't included in the default INFO, or even in INFO ALL, and the fields in this section can change in the future without regard for backwards compatibility.

eventloop_duration_aof_sum  // cumulative duration of writing AOF
eventloop_duration_cron_sum  // cumulative duration cron jobs (serverCron, beforeSleep excluding IO and AOF)
eventloop_cmd_per_cycle_max  // max number of commands executed in one eventloop
eventloop_duration_max  // max duration of one eventloop

All of these are reset by CONFIG RESETSTAT.

@filipecosta90 filipecosta90 added the action:run-benchmark (Triggers the benchmark suite for this Pull Request) label Mar 24, 2023
Contributor

@madolson madolson left a comment

Still strongly against this change. It feels like we're exposing metrics that are easily collectible instead of exposing information that is useful.

@oranagra oranagra linked an issue Mar 26, 2023 that may be closed by this pull request
Member

@oranagra oranagra left a comment

i see a few things i listed in the issue are missing, let's add them in the PR later on, after we're happy with the ones you already added.

  • eventloop_duration_cron_sum - total time spent in serverCron, and beforeSleep (excluding writes and aof)
  • instantaneous_eventloop_cycles - cycles per second in the last 1.6 seconds
  • instantaneous_eventloop_duration_avg - average duration in the last 1.6 seconds
  • instantaneous_eventloop_duration_max - max duration in the last 1.6 seconds

@CharlesChen888
Contributor Author

i see a few things i listed in the issue are missing, let's add them in the PR later on, after we're happy with the ones you already added.

  • eventloop_duration_cron_sum - total time spent in serverCron, and beforeSleep (excluding writes and aof)
  • instantaneous_eventloop_cycles - cycles per second in the last 1.6 seconds
  • instantaneous_eventloop_duration_avg - average duration in the last 1.6 seconds
  • instantaneous_eventloop_duration_max - max duration in the last 1.6 seconds

@oranagra

Is read (in beforeSleep) considered a part of cron?

And about the 3 instantaneous stats, are they updated every 1.6 seconds, or updated every loop? I think we can leave these to external monitoring tools, so the server can just provide basic stats.

@oranagra
Member

@CharlesChen888 my idea for a "cron" metric was to measure the sum of active-expire, active-defrag and all other tasks done by cron and beforeSleep that are not counted by the other set of metrics (so we should exclude read, write and AOF from these).

regarding the "instantaneous" metrics, i agree they're unnecessary for people who have proper monitoring software connected to redis. they're there for easy consumption from the command prompt using redis-cli.
i meant to add them to the STATS_METRIC_COUNT mechanism.

@filipecosta90 filipecosta90 added and removed the action:run-benchmark (Triggers the benchmark suite for this Pull Request) label Mar 28, 2023
@soloestoy
Contributor

  • eventloop_duration_cron_sum - total time spent in serverCron, and beforeSleep (excluding writes and aof)

agree, cron sum is useful, especially regarding the expire cycle and defrag cycle in serverCron.

moreover, we also have two other time events, moduleTimerHandler and evictionTimeProc, do you think we should record these two? @oranagra

  • instantaneous_eventloop_cycles - cycles per second in the last 1.6 seconds
  • instantaneous_eventloop_duration_avg - average duration in the last 1.6 seconds
  • instantaneous_eventloop_duration_max - max duration in the last 1.6 seconds

I'm not sure if they're useful, and the words "instantaneous" and "max" are a bit strange together.

Maybe you want to record how many eventloop cycles happened in the last xxx seconds, and the average/max duration in the last xxx seconds. But I think it should cover a longer period, like 1 minute, and we should introduce another way to record them instead of using the current instantaneous metric.

@oranagra
Member

i don't think we wanna break down cron to each and every task it contains.
we already have latency monitor metrics for active expiry and defrag, so for the purpose of event loop monitoring i think it can be inclusive without a breakdown (other than IO which we exclude from it).
then if someone realizes the event loop takes long because of cron, they can dig into it with other ways.

regarding the instantaneous metrics, IIRC last time we argued about these metrics (in #10062), i argued that we don't need them and you argued that we do...
i do think we need to keep them to a minimum, maybe we should include only instantaneous_eventloop_cycles, or none of them.
having just duration_avg without duration_max feels a bit odd to me, and having just cycles per second without any duration metric also feels a bit odd. if the problem is the contradiction of "instantaneous" and "max" terms, maybe we can add "recent".

regarding the duration being 1 second or 1 minute, i don't wanna introduce another "infrastructure" for this, so i'd rather keep the 1.6 seconds thing we have now.
under constant load, it would be sufficient to have them represent 1 second (will show same values as 1 minute), so let's keep that. in that case maybe having an average duration, and no max is actually ok.

bottom line, if you are ok with it, let's drop the max and add just two.

@CharlesChen888
Contributor Author

So do we have a final design for this? INFO or histogram? And what metrics to expose?

If we just show the metrics in info, we may have a new section INFO EVENTLOOP, containing the list below (maybe dropping the instantaneous_eventloop_duration_max).

eventloop_cycles
eventloop_duration_sum
eventloop_duration_max
eventloop_duration_cmd_sum
eventloop_duration_io_read_sum
eventloop_duration_io_write_sum
eventloop_duration_aof_sum
eventloop_duration_cron_sum

instantaneous_eventloop_cycles
instantaneous_eventloop_duration_avg 
instantaneous_eventloop_duration_max

If we want a histogram, we may have a new command (or perhaps a new subcommand under LATENCY?), and it returns the information below.

1) "eventloop"
2) 1) "count"
   2) (integer) 24
   3) "histogram_usec"
   4) 1) (integer) 132
      2) (integer) 18
      3) (integer) 264
      4) (integer) 24
3) "eventloop|commands"
4) 1) "count"
   2) (integer) 1
   3) "histogram_usec"
   4) 1) (integer) 4227
      2) (integer) 1
5) "eventloop|ioread"
6) 1) "count"
   2) (integer) 1
   3) "histogram_usec"
   4) 1) (integer) 4227
      2) (integer) 1
......

Or we show sum in info, and details in histogram?
@oranagra @madolson @soloestoy

@oranagra
Member

i'm leaning towards keeping the simple metrics in INFO STATS (with the instantaneous but without the max), and exposing the histogram in LATENCY EVENTLOOP.

the histogram data itself in the event loop should probably be plain (similar to the LATENCY HISTOGRAM output), but maybe put under a "histogram" field in a map, so we can extend that map in the future.

@soloestoy
Contributor

regarding the instantaneous metrics, IIRC last time we argued about these metrics (in #10062), i argued that we don't need them and you argued that we do...

I need to clarify: I mean the instantaneous max metric is strange. IMHO instantaneous is always used on per-second metrics, like qps for queries per second and bandwidth for net bytes per second; I don't know how to understand "duration time per second".

@oranagra
Member

i view the "instantaneous" as a metric representing the very recent time (last 1.6 seconds), it doesn't have to be a per-second metric.

instantaneous_ops_per_sec is a per-second metric, but that doesn't mean there couldn't be an instantaneous_cpu_usage.
i.e. unlike used_memory_rss, which represents the current state, CPU usage and ops/sec represent the delta of some metric over time, and we choose to show the delta over the last 1.6 seconds. i don't see a problem showing instantaneous_eventloop_duration_avg.

@soloestoy
Contributor

i view the "instantaneous" as a metric representing the very recent time (last 1.6 seconds), it doesn't have to be a per-second metric.

it seems you misunderstand the current instantaneous mechanism, or I missed something? it is the average of the past 16 per-second metric samples; in other words, it is indeed a per-second metric, just calculated as an average of 16 samples, nothing to do with 1.6 seconds.

/* Add a sample to the operations per second array of samples. */
void trackInstantaneousMetric(int metric, long long current_reading) {
    long long now = mstime();
    long long t = now - server.inst_metric[metric].last_sample_time;
    long long ops = current_reading -
                    server.inst_metric[metric].last_sample_count;
    long long ops_sec;

    ops_sec = t > 0 ? (ops*1000/t) : 0;

    server.inst_metric[metric].samples[server.inst_metric[metric].idx] =
        ops_sec;
    server.inst_metric[metric].idx++;
    server.inst_metric[metric].idx %= STATS_METRIC_SAMPLES;
    server.inst_metric[metric].last_sample_time = now;
    server.inst_metric[metric].last_sample_count = current_reading;
}

/* Return the mean of all the samples. */
long long getInstantaneousMetric(int metric) {
    int j;
    long long sum = 0;

    for (j = 0; j < STATS_METRIC_SAMPLES; j++)
        sum += server.inst_metric[metric].samples[j];
    return sum / STATS_METRIC_SAMPLES;
}

@oranagra
Member

i think the confusion is that you're referring to the way they're implemented and i'm referring to the way they're presented to the user.
if we go my way, we can rename trackInstantaneousMetric to trackInstantaneousRateMetric, and add a similar trackInstantaneousAvgMetric.

i.e. the current mechanism takes a rate sample (ops / sec) once in 100ms, and then computes an average of these over 1.6 seconds.
but we can add another similar function that takes a sum and count samples every 100ms to compute an average (average duration in this case), and then compute an average for these over 1.6 seconds.

i'm saying 1.6 seconds because there are 16 samples and we take one every 100ms.

@oranagra
Member

oranagra commented Apr 4, 2023

since this was stale for a while, i'll try to promote a minimal version. the status from my perspective (if we put aside the histogram feature) is that what's missing is:

  • add eventloop_duration_cron_sum - total time spent in serverCron, and beforeSleep (excluding writes and aof)
    agree, cron sum is useful, especially about the expire cycle and defrag cycle in serverCron.
  • remove eventloop_duration_max for now.

IMO it could be nice to have two metrics for easy consumption from redis-cli (but not a must):

  • instantaneous_eventloop_cycles - cycles per second in the last 1.6 seconds
  • instantaneous_eventloop_duration_avg - average duration in the last 1.6 seconds
    to clarify (since i didn't see a response) we can rename trackInstantaneousMetric to trackInstantaneousRateMetric, and add a similar trackInstantaneousAvgMetric. getInstantaneousMetric can remain as is, and i don't think any of that contradicts the "instantaneous" concept from the user's perspective.

@CharlesChen888 can you handle these changes or maybe want me to try to step in and help?

@CharlesChen888
Contributor Author

@oranagra Sorry about the delay, I went on vacation last weekend and I will get back to this tomorrow.

Comment on lines 282 to 283
set value2 [s $metric]
assert {$value2 > $value1}
Member

maybe we can add a resetstat, and then we can put some boundaries on the values?
i.e. we know we have roughly 2 commands, 3 cycles, and duration is probably very low (in any case below 100)

it'll mean we can't use the foreach, but that's also a benefit, since we'll have just one `after 110` for all 3 metrics, instead of having 3 sleeps.

Contributor Author

@CharlesChen888 CharlesChen888 May 11, 2023

"Allowing some tolerance" reminds me that such tests involving time measurement may fail due to performance jitter, and we may end up allowing a huge tolerance. I don't feel like doing so...

Member

even a huge tolerance (100 times what we expect) can catch some bugs; in comparison, the >= seems a little bit useless.
but anyway, i'm also ok keeping what we have now.

i'll try to find some time for manual testing, and then i'll merge it.
or if you can acknowledge that you did that already (monitor these metrics on a server under traffic, and on an idle one), i'll skip it and merge sooner.

thank you.

Contributor Author

I already did manual testing and the results look OK. Also added some upper boundaries in the test.

@oranagra
Copy link
Member

oranagra commented May 12, 2023

Manual testing results, sampled before and after `sleep 1` (deltas where noted), on an idle server vs. `memtier_benchmark` vs. `memtier_benchmark --pipeline 100`:

| metric | idle | memtier_benchmark | --pipeline 100 |
| --- | ---: | ---: | ---: |
| eventloop_cycles (delta 1s apart) | 13 | 1052 | 133 |
| eventloop_duration_sum (delta 1s apart) | 1618 | 979175 | 1017352 |
| eventloop_duration_cmd_sum (delta 1s apart) | 69 | 237679 | 294548 |
| instantaneous_eventloop_cycles_per_sec | 10 | 1058 | 127 |
| instantaneous_eventloop_duration_usec | 108 | 932 | 7841 |
| eventloop_duration_aof_sum | 24 | 478 | 646 |
| eventloop_duration_cron_sum | 10043 | 209836 | 292085 |
| eventloop_duration_max | 247 | 23137 | 26621 |
| eventloop_cmd_per_cycle_max | 1 | 199 | 18400 |

@oranagra
Member

@CharlesChen888 please take a look at the test failures, some are thresholds that should be adjusted (take a safe distance, maybe twice what's printed), and there's also some odd fortify compilation error that i didn't look into.
thanks.

@oranagra
Member

tests are now passing, and i also verified that we don't see that fortification error in unstable. trying to figure out how it is related to this PR.

@oranagra oranagra merged commit 29ca879 into redis:unstable May 12, 2023
@oranagra oranagra added the state:major-decision (Requires core team consensus) and release-notes (indication that this issue needs to be mentioned in the release notes) labels May 12, 2023
oranagra added a commit that referenced this pull request May 14, 2023
new test added in #11963, fails on freebsd CI which is slow.
@oranagra oranagra mentioned this pull request May 15, 2023
oranagra pushed a commit that referenced this pull request Jun 11, 2023
In #11963, some new tests about eventloop duration were added, which includes time measurement in TCL scripts. This has caused some unexpected CI failures, such as #12169 and #12177, due to slow test servers or some performance jittering.
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Jul 5, 2023
The test fails on freebsd CI:
```
*** [err]: stats: eventloop metrics in tests/unit/info.tcl
Expected '31777' to be less than '16183' (context: type eval line 17 cmd
{assert_lessthan $el_sum2 [expr $el_sum1+10000] } proc ::test)
```

The test added in redis#11963 fails on freebsd CI, which is slow;
increase tolerance and also add some verbose logs. Now we can
see these logs in verbose mode (for better views):
```
eventloop metrics cycle1: 12, cycle2: 15
eventloop metrics el_sum1: 315, el_sum2: 411
eventloop metrics cmd_sum1: 126, cmd_sum2: 137
[ok]: stats: eventloop metrics (111 ms)
instantaneous metrics instantaneous_eventloop_cycles_per_sec: 8
instantaneous metrics instantaneous_eventloop_duration_usec: 55
[ok]: stats: instantaneous metrics (1603 ms)
[ok]: stats: debug metrics (112 ms)
```
CharlesChen888 pushed a commit to CharlesChen888/redis that referenced this pull request Jul 5, 2023
…is#12385)


Labels

action:run-benchmark (Triggers the benchmark suite for this Pull Request), release-notes (indication that this issue needs to be mentioned in the release notes), state:major-decision (Requires core team consensus)

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

Redis server-side latency

10 participants