Adds isolated netstats for replication.#10062
Conversation
|
Oh I think I'd better count flows for base/incr sync too. |
|
I recently built some grafana dashboard that plots such graphs.
The instantaneous metrics are nice if you just look at a single INFO output with redis-cli, so you have a point here. p.s. i don't think i like to change the meaning of the existing fields (backwards compatibility), so they should keep including the replication data too, and users will have to deduct on their own |
8a2ffaf to
0207914
Compare
|
Hi @oranagra , I just made some updates, now This PR:
|
|
Also here is a test example with a 2-node redis(1 master and 1 slave). Steps:
Results:
|
4fc6455 to
89fe913
Compare
|
Also I think one thing about the second commit could be discussed here.
Not sure if these two things are elegant here, just to keep consistency. |
oranagra
left a comment
There was a problem hiding this comment.
sorry for the delay.. i've been busy with redis 7.0, now finally free to look into this..
src/server.h
Outdated
There was a problem hiding this comment.
i'm quite certain we don't want instantaneous metrics for all of these.
not for full-sync vs command-stream, and not even sure we need a separate replication input vs replication output.
i.e. these INFO fields are mainly for redis-cli users, not machines, and if they see high output traffic, and they know it's a master, they can deduct the replication bytes.
similarly if they see high input traffic on a replica, they'll deduct the replication bytes from that.
There was a problem hiding this comment.
If we only use single variable like net_repl_bytes, I think it might be ambiguous for structure like 'master->replica->replica'.
The middle redis node could both be a replica and master at the meantime.
There was a problem hiding this comment.
yes, but for the "instantaneous" metrics, i'm not sure it's that important.
they're meant for quick manual check via redis-cli, since a proper metrics system will take the raw counters and plot their changes over time.
There was a problem hiding this comment.
i see you removed all 4 instantaneous metrics you added. i think maybe we can keep one.
i.e. one that's about replication (sum of input and output)
There was a problem hiding this comment.
@oranagra I think it's better to track and display the instantaneous replication input and output separately, we built read-write split structure as @DarrenJiang13 said: master->replica->replica, the middle replica has both input and output replication flow, and it's important to tell users how many input/output flow comes from themselves and others are replication flow.
So we need instantaneous_input_kbps - instantaneous_repl_input_kbps and instantaneous_output_kbps - instantaneous_repl_output_kbps
There was a problem hiding this comment.
@soloestoy the way i see it (please tell me if you see it differently) these instantaneous metrics are just for human eyes.
real monitoring systems, take a snapshot of the INFO every second, and can measure the increase of the raw metrics (non-instantaneous) over time.
so for instance if you have a grafana dashboard showing redis metrics, it'll still be able to show the difference between input and output, and it'll also be accurate, since it doesn't depend on these (inaccurate) "instantaneous" metrics (corresponding to 1.6 seconds or so).
so considering that these metrics are just for a person taking one snapshot of INFO via redis-cli to examine it in the console, do we really need another metric for this case?
c6fa714 to
e37c68c
Compare
e37c68c to
158c286
Compare
2ade40a to
0801f5f
Compare
|
Thank you for the review. @oranagra
output of
|
58d99a0 to
e2593a6
Compare
e2593a6 to
bf970db
Compare
|
Updated. @oranagra
|
src/server.h
Outdated
There was a problem hiding this comment.
i see you removed all 4 instantaneous metrics you added. i think maybe we can keep one.
i.e. one that's about replication (sum of input and output)
|
@DarrenJiang13 can you please also make a redis-doc PR? |
| trackInstantaneousMetric(STATS_METRIC_NET_OUTPUT, | ||
| stat_net_output_bytes); | ||
| stat_net_output_bytes + stat_net_repl_output_bytes); | ||
| trackInstantaneousMetric(STATS_METRIC_NET_TOTAL_REPLICATION, |
There was a problem hiding this comment.
@oranagra I think it's better to track and display the instantaneous replication input and output separately, we built read-write split structure as @DarrenJiang13 said: master->replica->replica, the middle replica has both input and output replication flow, and it's important to tell users how many input/output flow comes from themselves and others are replication flow.
So we need instantaneous_input_kbps - instantaneous_repl_input_kbps and instantaneous_output_kbps - instantaneous_repl_output_kbps
@oranagra we don't use grafana, our monitor system depends the "instantaneous" metrics, and as I know some users in our platform use the redis' "instantaneous" metrics too. |
|
grafana or no grafana, i think a monitoring system should not use these. p.s. i'm ok with that mainly because it's already there.. but if someone would suggest new metrics like come to think of it, someone could suggest that since we have |
|
it's ok if we don't support new instantaneous metrics, but now for BTW, the instantaneous metrics help me a lot when troubleshooting problems, our monitor system is not good enough, especially it's hard to add a new item 😓 |
|
@oranagra I am not against the instantaneous metrics, I think they're useful for basic troubleshooting if you don't have a better monitoring system so I'm OK with adding that. But if we decide we need a new metric, the first choice should always be a raw value for proper monitoring systems to use. |
|
I'm not against them either. They're useful from command prompt, but I wouldn't want to add dozens of them, one for every other raw metric we have. I'm not sure this specific one will help a lot for troubleshooting, e.g. more than like forks per second, or error reply rate, new client connections per second, etc. If I had to add one new instantaneous metric, I'd probably go for new client connection rate rather than add breakdown for replication in/out. |
|
I agree that I think the instantaneous metrics should really only be used for debugging. AWS relies on incremental metrics for computing metrics we expose to end users so that we can compute deltas if we miss data points if the engine is busy or something. |
|
@yossigo @madolson I'm not sure how to interpret your comments. Are you in favor of merging #10810 or do you think one new instantaneous replication bandwidth metric is enough to realize that replication is or isn't the cause of high traffic. @soloestoy did we convince you that these are no good for real monitoring anyway? |
|
@oranagra I would prefer to not add the instantaneous metrics. I think just looking at the deltas should be enough to determine where the resource utilization is. |
Nope : ) because I cannot convince our users, since they prefer instantaneous metrics. I agree we don't add new instantaneous metrics from now on. But about |
|
@soloestoy Do you know how they are ingesting these instantaneous metrics? It seems weird that folks would be looking at this instead of periodically polling and doing math, but if users have a reason for it then I'm okay splitting it. It just seems like an anti-pattern to me. |
|
because these instantaneous metrics are very easy to collect and display, no need to do math. And sometimes users don't collect metrics every second (maybe collect every minute), so they can only rely on instantaneous metrics. |
The way to convince them is that they won't have any other choice 8-). but besides, this is a new metrics that they didn't have till now anyway.
i think it's the other way around. if you take two samples 1 minute apart, the instantaneous metric won't tell you what happened in between (just in the last 1.6 seconds), but computing deltas would!. but anyway, i think we've invested enough energy on this discussion, i think it's clear where we're going with other metrics, but let's merge that PR and put it behind us, maybe the excuse for me would be symmetry (i.e. the fact that all other metrics count input and output separately). |
…and instantaneous_output_repl_kbps. (#10810) A supplement to #10062 Split `instantaneous_repl_total_kbps` to `instantaneous_input_repl_kbps` and `instantaneous_output_repl_kbps`. ## Work: This PR: - delete 1 info field: - `instantaneous_repl_total_kbps` - add 2 info fields: - `instantaneous_input_repl_kbps / instantaneous_output_repl_kbps` ## Result: - master ``` total_net_input_bytes:26633673 total_net_output_bytes:21716596 total_net_repl_input_bytes:0 total_net_repl_output_bytes:18433052 instantaneous_input_kbps:0.02 instantaneous_output_kbps:0.00 instantaneous_input_repl_kbps:0.00 instantaneous_output_repl_kbps:0.00 ``` - slave ``` total_net_input_bytes:18433212 total_net_output_bytes:94790 total_net_repl_input_bytes:18433052 total_net_repl_output_bytes:0 instantaneous_input_kbps:0.00 instantaneous_output_kbps:0.05 instantaneous_input_repl_kbps:0.00 instantaneous_output_repl_kbps:0.00 ```
The amount of `server.stat_net_output_bytes/server.stat_net_input_bytes`
is actually the sum of replication flow and users' data flow.
It may cause confusions like this:
"Why does my server get such a large output_bytes while I am doing nothing? ".
After discussions and revisions, now here is the change about what this
PR brings (final version before merge):
- 2 server variables to count the network bytes during replication,
including fullsync and propagate bytes.
- `server.stat_net_repl_output_bytes`/`server.stat_net_repl_input_bytes`
- 3 info fields to print the input and output of repl bytes and instantaneous
value of total repl bytes.
- `total_net_repl_input_bytes` / `total_net_repl_output_bytes`
- `instantaneous_repl_total_kbps`
- 1 new API `rioCheckType()` to check the type of rio. So we can use this
to distinguish between diskless and diskbased replication
- 2 new counting items to keep network statistics consistent between master
and slave
- rdb portion during diskless replica. in `rdbLoadProgressCallback()`
- first line of the full sync payload. in `readSyncBulkPayload()`
Co-authored-by: Oran Agra <[email protected]>
…and instantaneous_output_repl_kbps. (redis#10810) A supplement to redis#10062 Split `instantaneous_repl_total_kbps` to `instantaneous_input_repl_kbps` and `instantaneous_output_repl_kbps`. ## Work: This PR: - delete 1 info field: - `instantaneous_repl_total_kbps` - add 2 info fields: - `instantaneous_input_repl_kbps / instantaneous_output_repl_kbps` ## Result: - master ``` total_net_input_bytes:26633673 total_net_output_bytes:21716596 total_net_repl_input_bytes:0 total_net_repl_output_bytes:18433052 instantaneous_input_kbps:0.02 instantaneous_output_kbps:0.00 instantaneous_input_repl_kbps:0.00 instantaneous_output_repl_kbps:0.00 ``` - slave ``` total_net_input_bytes:18433212 total_net_output_bytes:94790 total_net_repl_input_bytes:18433052 total_net_repl_output_bytes:0 instantaneous_input_kbps:0.00 instantaneous_output_kbps:0.05 instantaneous_input_repl_kbps:0.00 instantaneous_output_repl_kbps:0.00 ```
Hi guys,
I find that the amount of
server.stat_net_output_bytes/server.stat_net_input_bytesis actually the sum of replication flow and users' data flow.It may cause confusions like this, "Why does my server get such a large output_bytes while I am doing nothing? ".
After discussions and revisions, now here is the change about what this PR brings (final version before merge):
server.stat_net_repl_output_bytes/server.stat_net_repl_input_bytestotal_net_repl_input_bytes/total_net_repl_output_bytesinstantaneous_repl_total_kbpsrioCheckType()to check the type of rio. So we can use this to distinguish between diskless and diskbased replicationrdbLoadProgressCallback()readSyncBulkPayload()Here is the result
Steps
redis-benchmarkto write to 30001./src/redis-benchmark -t set -r 10000000 -d 1 -n 100000000 -p 30001slaveof 127.0.0.1 30001info statsto check the flow count of the master and slaveResult
for disk-based replication
for diskless loading (by setting
repl-diskless-loadtoswapdbinredis.conf)