Skip to content

Script reply errors cleanup and stats#10279

Closed
oranagra wants to merge 1 commit intoredis:unstablefrom
oranagra:script_error_reply_stats
Closed

Script reply errors cleanup and stats#10279
oranagra wants to merge 1 commit intoredis:unstablefrom
oranagra:script_error_reply_stats

Conversation

@oranagra
Copy link
Member

@oranagra oranagra commented Feb 10, 2022

This PR messes with two things

  1. Server error reply statistics in script calls.
  2. Error code and error strings returning from scripts.

For error statistics, this PR does the following:

  • When a script gets an error from redis.pcall and handles it, the error should not be counted (used to be counted).
  • If the error from redis.pcall is returned by the script to the client, then it is counted (was already counted, so with the above it means it was counted twice).
  • Error that is thrown by redis.call or by any other mechanism is obviously counted (not changed by this PR)

For error strings and error codes, this PR attempts to make sure the error replies that are generated by the redis command calls and wrapped with additional Lua context details retain the original error code.
This is an additional improvement on top of what was recently done in #10218

 127.0.0.1:6379> eval "return redis.call('set','x','y')" 0
-(error) ERR Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479): @user_script:1: OOM command not allowed when used memory > 'maxmemory'.
+(error) OOM command not allowed when used memory > 'maxmemory'.. @user_script:1. Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479)

Other changes:

  1. remove excessive newlines when calling addReplyErrorFormat in luaCreateFunction and luaCallFunction

Unresolved issues:

  1. Many parts of the code simply push an error string to Lua without an error code, e.g lua_pushstring(lua, "Invalid command passed to redis.acl_check_cmd()");, now that this is the initial part of the error string (not appended after a generic Lua context), it lacks an error code (return will return -Invalid)
  2. errors that are returned from the global protection code are starting like this: user_script:1: Script attempted to create global and i'm not sure how to properly prepend an ERR error code to them.

Note somewhat related change done for modules: #10278

Copy link
Contributor

@yoav-steinberg yoav-steinberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This has to resolve the 2 remaining issues in the top comment.
Added some minor notes suggestions.

" end\n"
" if i then\n"
" return i.source .. ':' .. i.currentline .. ': ' .. err\n"
" return err .. '. ' .. i.source .. ':' .. i.currentline\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lets try to use this PR to unify this with the errh_func script in scriptingInit().

" end\n"
" if i then\n"
" return i.source .. ':' .. i.currentline .. ': ' .. err\n"
" return err .. '. ' .. i.source .. ':' .. i.currentline\n"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is a bit nicer:

Suggested change
" return err .. '. ' .. i.source .. ':' .. i.currentline\n"
" return err .. ' (' .. i.source .. ':' .. i.currentline .. ')'\n"

oranagra added a commit that referenced this pull request Feb 21, 2022
This is a followup work for #10278, and a discussion about #10279

The changes:
- fix failed_calls in command stats for blocked clients that got error.
  including CLIENT UNBLOCK, and module replying an error from a thread.
- fix latency stats for XREADGROUP that filed with -NOGROUP

Theory behind which errors should be counted:
- error stats represents errors returned to the user, so an error handled by a
  module should not be counted.
- total error counter should be the same.
- command stats represents execution of commands (even with RM_Call, and if
  they fail or get rejected it counts these calls in commandstats, so it should
  also count failed_calls)

Some thoughts about Scripts:
for scripts it could be different since they're part of user code, not the infra (not an extension to redis)
we certainly want commandstats to contain all calls and errors
a simple script is like mult-exec transaction so an error inside it should be counted in error stats
a script that replies with an error to the user (using redis.error_reply) should also be counted in error stats
but then the problem is that a plain `return redis.call("SET")` should not be counted twice (once for the SET
and once for EVAL)
so that's something left to be resolved in #10279
@oranagra
Copy link
Member Author

closing in favor of #10329

if (c != NULL) {
addReplyErrorFormat(c,
"Error compiling script (new function): %s\n",
"Error compiling script (new function): %s",
Copy link
Member Author

@oranagra oranagra Feb 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MeirShpilraien why are the newlines here? (and in the other error below, and similar one in script_lua.c)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i mean that i think you should mirror these 3 changes (trimming excessive newline in 3 places) to your PR (the one that replaces this one)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, will add it.

oranagra added a commit that referenced this pull request Feb 27, 2022
This PR fix 2 issues on Lua scripting:
* Server error reply statistics (some errors were counted twice).
* Error code and error strings returning from scripts (error code was missing / misplaced).

## Statistics
a Lua script user is considered part of the user application, a sophisticated transaction,
so we want to count an error even if handled silently by the script, but when it is
propagated outwards from the script we don't wanna count it twice. on the other hand,
if the script decides to throw an error on its own (using `redis.error_reply`), we wanna
count that too.
Besides, we do count the `calls` in command statistics for the commands the script calls,
we we should certainly also count `failed_calls`.
So when a simple `eval "return redis.call('set','x','y')" 0` fails, it should count the failed call
to both SET and EVAL, but the `errorstats` and `total_error_replies` should be counted only once.

The PR changes the error object that is raised on errors. Instead of raising a simple Lua
string, Redis will raise a Lua table in the following format:

```
{
    err='<error message (including error code)>',
    source='<User source file name>',
    line='<line where the error happned>',
    ignore_error_stats_update=true/false,
}
```

The `luaPushError` function was modified to construct the new error table as describe above.
The `luaRaiseError` was renamed to `luaError` and is now simply called `lua_error` to raise
the table on the top of the Lua stack as the error object.
The reason is that since its functionality is changed, in case some Redis branch / fork uses it,
it's better to have a compilation error than a bug.

The `source` and `line` fields are enriched by the error handler (if possible) and the
`ignore_error_stats_update` is optional and if its not present then the default value is `false`.
If `ignore_error_stats_update` is true, the error will not be counted on the error stats.

When parsing Redis call reply, each error is translated to a Lua table on the format describe
above and the `ignore_error_stats_update` field is set to `true` so we will not count errors
twice (we counted this error when we invoke the command).

The changes in this PR might have been considered as a breaking change for users that used
Lua `pcall` function. Before, the error was a string and now its a table. To keep backward
comparability the PR override the `pcall` implementation and extract the error message from
the error table and return it.

Example of the error stats update:

```
127.0.0.1:6379> lpush l 1
(integer) 2
127.0.0.1:6379> eval "return redis.call('get', 'l')" 0
(error) WRONGTYPE Operation against a key holding the wrong kind of value. script: e471b73f1ef44774987ab00bdf51f21fd9f7974a, on @user_script:1.

127.0.0.1:6379> info Errorstats
# Errorstats
errorstat_WRONGTYPE:count=1

127.0.0.1:6379> info commandstats
# Commandstats
cmdstat_eval:calls=1,usec=341,usec_per_call=341.00,rejected_calls=0,failed_calls=1
cmdstat_info:calls=1,usec=35,usec_per_call=35.00,rejected_calls=0,failed_calls=0
cmdstat_lpush:calls=1,usec=14,usec_per_call=14.00,rejected_calls=0,failed_calls=0
cmdstat_get:calls=1,usec=10,usec_per_call=10.00,rejected_calls=0,failed_calls=1
```

## error message
We can now construct the error message (sent as a reply to the user) from the error table,
so this solves issues where the error message was malformed and the error code appeared
in the middle of the error message:

```diff
127.0.0.1:6379> eval "return redis.call('set','x','y')" 0
-(error) ERR Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479): @user_script:1: OOM command not allowed when used memory > 'maxmemory'.
+(error) OOM command not allowed when used memory > 'maxmemory' @user_script:1. Error running script (call to 71e6319f97b0fe8bdfa1c5df3ce4489946dda479)
```

```diff
127.0.0.1:6379> eval "redis.call('get', 'l')" 0
-(error) ERR Error running script (call to f_8a705cfb9fb09515bfe57ca2bd84a5caee2cbbd1): @user_script:1: WRONGTYPE Operation against a key holding the wrong kind of value
+(error) WRONGTYPE Operation against a key holding the wrong kind of value script: 8a705cfb9fb09515bfe57ca2bd84a5caee2cbbd1, on @user_script:1.
```

Notica that `redis.pcall` was not change:
```
127.0.0.1:6379> eval "return redis.pcall('get', 'l')" 0
(error) WRONGTYPE Operation against a key holding the wrong kind of value
```


## other notes
Notice that Some commands (like GEOADD) changes the cmd variable on the client stats so we
can not count on it to update the command stats. In order to be able to update those stats correctly
we needed to promote `realcmd` variable to be located on the client struct.

Tests was added and modified to verify the changes.

Related PR's: #10279, #10218, #10278, #10309

Co-authored-by: Oran Agra <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants