Fix and improve module error reply statistics by oranagra · Pull Request #10278 · redis/redis

oranagra · 2022-02-10T20:18:13Z

This PR handles several aspects

Calls to RM_ReplyWithError from thread safe contexts don't violate thread safety.
Errors returning from RM_Call to the module aren't counted in the statistics (they might be handled silently by the module)
When a module propagates a reply it got from RM_Call to it's client, then the error statistics are counted.

This is done by:

When appending an error reply to the output buffer, we avoid updating the global error statistics, instead we cache that error in a deferred list in the client struct.
When creating a RedisModuleCallReply object, the deferred error list is moved from the client into that object.
when a module calls RM_ReplyWithCallReply we copy the deferred replies to the dest client (if that's a real client, then that's when the error statistics are updated to the server)

Note about RM_ReplyWithCallReply: if the original reply had an array with errors, and the module
replied with just a portion of the original reply, and not the entire reply, the errors are currently not
propagated and the errors stats will not get propagated.

Fix #10180

MeirShpilraien

👍

src/module.c

oranagra · 2022-02-13T16:33:09Z

@yossigo do you think this fix (thread safety and statistics) be on the release notes?

yossigo · 2022-02-13T18:12:54Z

@oranagra Yes, very unlikely it has been affecting anyone but in case they come look for the problem and a solution.

This is a followup work for #10278, and a discussion about #10279 The changes: - fix failed_calls in command stats for blocked clients that got error. including CLIENT UNBLOCK, and module replying an error from a thread. - fix latency stats for XREADGROUP that filed with -NOGROUP Theory behind which errors should be counted: - error stats represents errors returned to the user, so an error handled by a module should not be counted. - total error counter should be the same. - command stats represents execution of commands (even with RM_Call, and if they fail or get rejected it counts these calls in commandstats, so it should also count failed_calls) Some thoughts about Scripts: for scripts it could be different since they're part of user code, not the infra (not an extension to redis) we certainly want commandstats to contain all calls and errors a simple script is like mult-exec transaction so an error inside it should be counted in error stats a script that replies with an error to the user (using redis.error_reply) should also be counted in error stats but then the problem is that a plain `return redis.call("SET")` should not be counted twice (once for the SET and once for EVAL) so that's something left to be resolved in #10279

This PR fix 2 issues on Lua scripting: * Server error reply statistics (some errors were counted twice). * Error code and error strings returning from scripts (error code was missing / misplaced). ## Statistics a Lua script user is considered part of the user application, a sophisticated transaction, so we want to count an error even if handled silently by the script, but when it is propagated outwards from the script we don't wanna count it twice. on the other hand, if the script decides to throw an error on its own (using `redis.error_reply`), we wanna count that too. Besides, we do count the `calls` in command statistics for the commands the script calls, we we should certainly also count `failed_calls`. So when a simple `eval "return redis.call('set','x','y')" 0` fails, it should count the failed call to both SET and EVAL, but the `errorstats` and `total_error_replies` should be counted only once. The PR changes the error object that is raised on errors. Instead of raising a simple Lua string, Redis will raise a Lua table in the following format: ``` { err='<error message (including error code)>', source='<User source file name>', line='<line where the error happned>', ignore_error_stats_update=true/false, } ``` The `luaPushError` function was modified to construct the new error table as describe above. The `luaRaiseError` was renamed to `luaError` and is now simply called `lua_error` to raise the table on the top of the Lua stack as the error object. The reason is that since its functionality is changed, in case some Redis branch / fork uses it, it's better to have a compilation error than a bug. The `source` and `line` fields are enriched by the error handler (if possible) and the `ignore_error_stats_update` is optional and if its not present then the default value is `false`. If `ignore_error_stats_update` is true, the error will not be counted on the error stats. When parsing Redis call reply, each error is translated to a Lua table on the format describe above and the `ignore_error_stats_update` field is set to `true` so we will not count errors twice (we counted this error when we invoke the command). The changes in this PR might have been considered as a breaking change for users that used Lua `pcall` function. Before, the error was a string and now its a table. To keep backward comparability the PR override the `pcall` implementation and extract the error message from the error table and return it. Example of the error stats update: ``` 127.0.0.1:6379> lpush l 1 (integer) 2 127.0.0.1:6379> eval "return redis.call('get', 'l')" 0 (error) WRONGTYPE Operation against a key holding the wrong kind of value. script: e471b73f1ef44774987ab00bdf51f21fd9f7974a, on @user_script:1. 127.0.0.1:6379> info Errorstats # Errorstats errorstat_WRONGTYPE:count=1 127.0.0.1:6379> info commandstats # Commandstats cmdstat_eval:calls=1,usec=341,usec_per_call=341.00,rejected_calls=0,failed_calls=1 cmdstat_info:calls=1,usec=35,usec_per_call=35.00,rejected_calls=0,failed_calls=0 cmdstat_lpush:calls=1,usec=14,usec_per_call=14.00,rejected_calls=0,failed_calls=0 cmdstat_get:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=1 ``` ## error message We can now construct the error message (sent as a reply to the user) from the error table, so this solves issues where the error message was malformed and the error code appeared in the middle of the error message: ```diff 127.0.0.1:6379> eval "return redis.call('set','x','y')" 0 -(error) ERR Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479): @user_script:1: OOM command not allowed when used memory > 'maxmemory'. +(error) OOM command not allowed when used memory > 'maxmemory' @user_script:1. Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479) ``` ```diff 127.0.0.1:6379> eval "redis.call('get', 'l')" 0 -(error) ERR Error running script (call to f_8a705cfb9fb09515bfe57ca2bd84a5caee2cbbd1): @user_script:1: WRONGTYPE Operation against a key holding the wrong kind of value +(error) WRONGTYPE Operation against a key holding the wrong kind of value script: 8a705cfb9fb09515bfe57ca2bd84a5caee2cbbd1, on @user_script:1. ``` Notica that `redis.pcall` was not change: ``` 127.0.0.1:6379> eval "return redis.pcall('get', 'l')" 0 (error) WRONGTYPE Operation against a key holding the wrong kind of value ``` ## other notes Notice that Some commands (like GEOADD) changes the cmd variable on the client stats so we can not count on it to update the command stats. In order to be able to update those stats correctly we needed to promote `realcmd` variable to be located on the client struct. Tests was added and modified to verify the changes. Related PR's: #10279, #10218, #10278, #10309 Co-authored-by: Oran Agra <[email protected]>

This PR handles several aspects 1. Calls to RM_ReplyWithError from thread safe contexts don't violate thread safety. 2. Errors returning from RM_Call to the module aren't counted in the statistics (they might be handled silently by the module) 3. When a module propagates a reply it got from RM_Call to it's client, then the error statistics are counted. This is done by: 1. When appending an error reply to the output buffer, we avoid updating the global error statistics, instead we cache that error in a deferred list in the client struct. 2. When creating a RedisModuleCallReply object, the deferred error list is moved from the client into that object. 3. when a module calls RM_ReplyWithCallReply we copy the deferred replies to the dest client (if that's a real client, then that's when the error statistics are updated to the server) Note about RM_ReplyWithCallReply: if the original reply had an array with errors, and the module replied with just a portion of the original reply, and not the entire reply, the errors are currently not propagated and the errors stats will not get propagated. Fix #10180 (cherry picked from commit b099889)

module reply error stats

d71e8d5

oranagra requested a review from MeirShpilraien February 10, 2022 20:18

oranagra added the 7.0-must-have label Feb 10, 2022

typo

4a23bb5

oranagra mentioned this pull request Feb 13, 2022

Script reply errors cleanup and stats #10279

Closed

oranagra requested a review from yossigo February 13, 2022 11:32

yossigo approved these changes Feb 13, 2022

View reviewed changes

MeirShpilraien approved these changes Feb 13, 2022

View reviewed changes

oranagra commented Feb 13, 2022

View reviewed changes

src/module.c Show resolved Hide resolved

Update src/module.c

3412de2

oranagra merged commit b099889 into redis:unstable Feb 13, 2022

oranagra deleted the module_error_reply_stats branch February 13, 2022 16:37

oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Feb 13, 2022

oranagra mentioned this pull request Feb 17, 2022

Fix error stats and failed command stats for blocked clients #10309

Merged

oranagra mentioned this pull request Feb 22, 2022

Sort out the mess around Lua error messages and error stats #10329

Merged

oranagra mentioned this pull request Feb 28, 2022

Release 7.0 RC2 #10355

Merged

This was referenced Apr 27, 2022

Redis 6.2.7 #10653

Closed

Redis 6.2.7 #10654

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix and improve module error reply statistics#10278

Fix and improve module error reply statistics#10278
oranagra merged 3 commits intoredis:unstablefrom
oranagra:module_error_reply_stats

oranagra commented Feb 10, 2022 •

edited

Loading

Uh oh!

MeirShpilraien left a comment

Uh oh!

Uh oh!

oranagra commented Feb 13, 2022

Uh oh!

yossigo commented Feb 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

oranagra commented Feb 10, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MeirShpilraien left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

oranagra commented Feb 13, 2022

Uh oh!

yossigo commented Feb 13, 2022

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

oranagra commented Feb 10, 2022 •

edited

Loading