Skip to content

Add INFO stat eviction_exceeded_time#9031

Merged
oranagra merged 9 commits intoredis:unstablefrom
huangzhw:evict-reached-info
Jul 26, 2021
Merged

Add INFO stat eviction_exceeded_time#9031
oranagra merged 9 commits intoredis:unstablefrom
huangzhw:evict-reached-info

Conversation

@huangzhw
Copy link
Contributor

@huangzhw huangzhw commented Jun 2, 2021

Add two INFO metrics:

total_eviction_exceeded_time:69734
current_eviction_exceeded_time:10230

current_eviction_exceeded_time if greater than 0, means how much time current used memory is greater than maxmemory. And we are still over the maxmemory. If used memory is below maxmemory, this metric is reset to 0.
total_eviction_exceeded_time means total time used memory is greater than maxmemory since server startup.
The units of these two metrics are ms.

Copy link
Contributor

@madolson madolson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree this is useful, just some english nitpicks.

madolson
madolson previously approved these changes Jun 3, 2021
@madolson madolson added state:major-decision Requires core team consensus state:needs-doc-pr requires a PR to redis-doc repository labels Jun 3, 2021
@huangzhw huangzhw changed the title Add INFO stat evicted_time_cap_reached_count Add INFO stat eviction_time_limit_exceeded_count Jun 3, 2021
yossigo
yossigo previously approved these changes Jun 6, 2021
@madolson madolson added the approval-needed Waiting for core team approval to be merged label Jun 11, 2021
@madolson
Copy link
Contributor

@redis/core-team Please approve this small new info field.

@oranagra
Copy link
Member

this metric can indeed shed some light on the matter, but maybe we want to add something even better?
i.e. documenting this metric may be a bit hard (would be coupled with understanding the implementation).
we can (either instead or in addition to) add something that measures the amount of time we were over the memory limit.

e.g. we start some timer as soon as we went over the memory limit (returning EVICT_RUNNING for the first time after a previous EVICT_OK), and then stop measuring when we change the state back to EVICT_OK.
we can show two metrics, one may be the current time since the last EVICT_OK, and one is the accumulated one of all of these since the server started.
such a metric can give the same benefits of the one in this PR, and would be easier to understand and document (it's units are more clear).

WDYT?

@huangzhw
Copy link
Contributor Author

This metric is more clear than count. For count we have to know implementation and maxmemory-eviction-tenacity config.
But there is still problem. With this metric, we can't know how many time we spent on evict.

@oranagra
Copy link
Member

do we wanna know the total time we spent in eviction? was that the purpose of the current counter?
i'm not sure it's needed, but if we want it, we can add a metric with the total time we spent in eviction (and active expiry too).
i thought the purpose of this metric was to report incomplete eviction (that aborted early despite still being over the limit).

@huangzhw
Copy link
Contributor Author

@oranagra Maybe we won't. After starting eviction, most time are spent.
I still have a problem. performEvictions is called every command. If the QPS is very high (Over 2000), there are many commands at the same event loop. First command takes 0.5ms delay, second command take 1ms and so forth. The comand execute will be very slow.

@oranagra
Copy link
Member

yes, as long a there's eviction pending, the commands will suffer latency, but that's actually desired in my eyes, it gives some back pressure to the commands to slow them down, otherwise, an aggressive workload will certainly win over eviction.
anyway, let's put aside the actual eviction and discuss the metrics in this PR.

did you intend to measure the relative percentage of the cpu / time spent on eviction? (i.e. the damage it does to commands)
or the effect of the early exit of the eviction on timeout affects the eviction? (i.e. the "damage" the desire for low latency causes to the eviction, preventing it's swift completion)?

if it is the later, then i think the two metrics i suggested are better, if it is the first, we can add a metric for the total time spent in eviction (and maybe active expire), and try to match it against the total cpu time spent in redis, or specifically in commands.

@huangzhw
Copy link
Contributor Author

At first I intend to measure the relative percentage of the cpu / time spent on eviction. But after a second thought, when evicting, Redis uses all of its CPU and time. So this maybe not so useful. This may just tell us whether something slow down the eviction (expire or slow commands).

I think two metrics you suggested are more useful. We can know whether eviction slow down the system and how much time it takes.

@madolson
Copy link
Contributor

I agree, I think the time spent is probably better.

@huangzhw huangzhw marked this pull request as draft July 4, 2021 10:41
@huangzhw huangzhw dismissed stale reviews from yossigo and madolson via d116d61 July 4, 2021 10:53
@huangzhw huangzhw force-pushed the evict-reached-info branch from 667c3ec to d116d61 Compare July 4, 2021 10:53
src/evict.c Outdated
int result = doPerformEvictions();

if (result == EVICT_RUNNING) {
if (server.stat_last_eviction_exceeded_time == 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required under single line {}.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm confused whther `{} needed. And space like

f(a,b,c)
f(a, b, c)
a+b
a + b

Some codes have space and some not.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the old code many of them are without spaces, the back style is basically with spaces, but if you modify the old code, I tend not to change the original style.

src/evict.c Outdated

if (result == EVICT_RUNNING) {
if (server.stat_last_eviction_exceeded_time == 0) {
server.stat_last_eviction_exceeded_time = ustime();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to use server.ustime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I intend to use server.ustime. This function may be called in serverCron, the cached server.ustime may be outdated.

src/evict.c Outdated
server.stat_last_eviction_exceeded_time = ustime();
}
} else if (server.stat_last_eviction_exceeded_time != 0) {
server.stat_eviction_exceeded_time += (ustime() - server.stat_last_eviction_exceeded_time);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think the use of ustime() is actually wrong. we now have a monotonic clock, and that's what we should use to track (and sum) execution time (same as we now do in call for commandstats.

maybe we can also use the monotonic clock for current_eviction_exceeded_time?

src/evict.c Outdated
if (server.stat_last_eviction_exceeded_time == 0) {
server.stat_last_eviction_exceeded_time = ustime();
}
exceeded_info:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe a better name would be update_metrics?

src/server.h Outdated
long long stat_evictedkeys; /* Number of evicted keys (maxmemory) */
long long stat_eviction_exceeded_time; /* Total time over the memory limit */
long long stat_last_eviction_exceeded_time; /* Timestamp of current eviction start */
monotime stat_total_eviction_exceeded_time; /* Total time over the memory limit */
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is that really a monotime type? i think just a long long with us units (better mention the units in the comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stat_last_eviction_exceeded_time is timestamp, so it is monotime.
stat_total_eviction_exceeded_time can be unsigned long long.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, that's what i meant. and if it's just a long long, then we need to document the units.

src/server.c Outdated
server.stat_expire_cycle_time_used/1000,
server.stat_evictedkeys,
server.stat_eviction_exceeded_time + current_eviction_exceeded_time,
(unsigned long long) server.stat_total_eviction_exceeded_time + current_eviction_exceeded_time,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i think maybe we better convert both info fields to ms, or even seconds (not very useful in microseconds)

@huangzhw huangzhw changed the title Add INFO stat eviction_time_limit_exceeded_count Add INFO stat eviction_exceeded_time Jul 6, 2021
@huangzhw huangzhw marked this pull request as ready for review July 6, 2021 14:25
oranagra
oranagra previously approved these changes Jul 6, 2021
Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.
@redis/core-team please approve the new info fields.
@huangzhw thank you! please update the top comment.

@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Jul 6, 2021
@madolson
Copy link
Contributor

madolson commented Jul 7, 2021

Naming looks good to me, I didn't get a chance to look through the code again yet.

Copy link
Member

@itamarhaber itamarhaber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@madolson
Copy link
Contributor

@huangzhw Want to go ahead and create the doc PR now?

@madolson madolson removed the approval-needed Waiting for core team approval to be merged label Jul 20, 2021
@huangzhw
Copy link
Contributor Author

@madolson OK. I will do it.

huangzhw added a commit to huangzhw/redis-doc that referenced this pull request Jul 21, 2021
@oranagra oranagra merged commit 17511df into redis:unstable Jul 26, 2021
@huangzhw huangzhw deleted the evict-reached-info branch July 27, 2021 00:39
@huangzhw
Copy link
Contributor Author

@oranagra As this one was merged, we have metrics in expire and eviction. Do we need samiliar metrics in defrag?

@oranagra
Copy link
Member

@huangzhw like what? total time we were above the defrag thresholds and active defrag was "active"?
i suppose it can be useful..

@huangzhw
Copy link
Contributor Author

@oranagra Yes. Exactly what I mean. I will try to do it.

JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Sep 8, 2021
…ded_time (redis#9031)

Add two INFO metrics:
```
total_eviction_exceeded_time:69734
current_eviction_exceeded_time:10230
```
`current_eviction_exceeded_time` if greater than 0, means how much time current used memory is greater than `maxmemory`. And we are still over the maxmemory. If used memory is below `maxmemory`, this metric is reset to 0.
`total_eviction_exceeded_time` means total time used memory is greater than `maxmemory` since server startup. 
The units of these two metrics are ms.

Co-authored-by: Oran Agra <[email protected]>
oranagra pushed a commit that referenced this pull request Sep 9, 2021
Add two INFO metrics:
```
total_active_defrag_time:12345
current_active_defrag_time:456
```
`current_active_defrag_time` if greater than 0, means how much time has
passed since active defrag started running. If active defrag stops, this metric is reset to 0.
`total_active_defrag_time` means total time the fragmentation
was over the defrag threshold since the server started.

This is a followup PR for #9031
@enjoy-binbin
Copy link
Contributor

do we want samiliar metrics for activerehashing? i wanted to look for metrics (something like expire_cycle_cpu_milliseconds) to see how much time we generally spend in activerehashing, but i didn’t see it.

@oranagra
Copy link
Member

oranagra commented Mar 5, 2024

i don't these are comparable to the metrics in this PR.
this PR deals with wall-clock duration in which redis was in a certain state.
i.e. how much time passed since it went over the memory limit, or the total time it was in that state.
if we want to measure the CPU effort we invest in expire or rehashing, then that's more in the area of #11963
but anyway, since we do have it for expiry, i suppose we can add it for rehashing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes state:major-decision Requires core team consensus state:needs-doc-pr requires a PR to redis-doc repository

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

8 participants