
Reintroduce lua argument cache in luaRedisGenericCommand removed in v7.0 #11541

Merged
oranagra merged 17 commits into redis:unstable from filipecosta90:lua.cmd.object.cache
Dec 5, 2022

Conversation

@filipecosta90
Contributor

@filipecosta90 filipecosta90 commented Nov 24, 2022

As discussed in #10981 (comment) this PR re-enables argv argument caching.

This mechanism aims to reduce calls to malloc and free when preparing the arguments the script sends to redis commands.
This mechanism was originally implemented in 48c49c4 and 4f68655, and was removed in #10220 (thinking it was not needed and had no impact), but it now turns out that was wrong, and it indeed provides some 5% performance improvement.

The implementation is a little bit too simplistic, it assumes consecutive calls use the same size in the same arg index, but that's arguably sufficient since it's only aimed at caching very small things.

We could even consider always pre-allocating args to the full LUA_CMD_OBJCACHE_MAX_LEN (64 bytes) rather than the right size for the argument; that would increase the chance they'll be able to be re-used.
But in some way this is already happening since we're using sdsalloc, which in turn uses s_malloc_usable and takes ownership of the full size of the allocation, so we are padded to the allocator bucket size.
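To make the mechanism concrete, here is a minimal, hypothetical C sketch of this kind of per-index small-argument cache. The names (`arg_alloc`, `arg_release`, `CACHE_MAX_LEN`) are illustrative, not the actual Redis identifiers: each argv slot remembers its last allocation, and a consecutive call at the same index reuses it only if the cached buffer is large enough; only small buffers are ever cached.

```c
/* Hypothetical sketch of a per-index small-argument cache, in the spirit
 * of the mechanism described above (not the actual Redis code). */
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_MAX_ARGS 8
#define CACHE_MAX_LEN  64   /* mirrors the idea of LUA_CMD_OBJCACHE_MAX_LEN */

static char  *cached_obj[CACHE_MAX_ARGS];
static size_t cached_len[CACHE_MAX_ARGS];

/* Get a buffer for argument index j of length len, reusing the cached
 * buffer at that index when it is big enough. */
char *arg_alloc(int j, size_t len, int *reused) {
    if (j < CACHE_MAX_ARGS && cached_obj[j] && cached_len[j] >= len) {
        char *buf = cached_obj[j];
        cached_obj[j] = NULL;   /* take ownership out of the cache */
        *reused = 1;
        return buf;
    }
    *reused = 0;
    return malloc(len);
}

/* Return a buffer to the cache instead of freeing it, but only if it is
 * small; big buffers are freed so the cache never pins large memory. */
void arg_release(int j, char *buf, size_t len) {
    if (j < CACHE_MAX_ARGS && cached_obj[j] == NULL && len <= CACHE_MAX_LEN) {
        cached_obj[j] = buf;
        cached_len[j] = len;
    } else {
        free(buf);
    }
}
```

The same-index/same-size assumption is visible here: a call pattern that alternates sizes at a given index keeps missing the cache, which is exactly the weakness debated later in this thread.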

To easily test it:

tclsh tests/test_helper.tcl --single unit/scripting

Here's the impact using the following benchmark command:

memtier_benchmark --command="eval \"redis.call('hset', 'h1', 'k', 'v');redis.call('hset', 'h2', 'k', 'v');return redis.call('ping')\" 0" --hide-histogram --test-time 60 --pipeline 10 -x 1

v6.2.6 ~ 222K ops/sec -- 4930d19

ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Evals      223707.29         8.93841         8.83100        17.53500        21.37500     28400.34 
Totals     223707.29         8.93841         8.83100        17.53500        21.37500     28400.34

Unstable ~ 164K ops/sec -- abc345a

ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Evals      163821.59        12.20666        12.03100        24.06300        24.31900     20797.66 
Totals     163821.59        12.20666        12.03100        24.06300        24.31900     20797.66 

This PR ~ 184K ops/sec -- 6310ea8

ALL STATS
==================================================================================================
Type         Ops/sec    Avg. Latency     p50 Latency     p99 Latency   p99.9 Latency       KB/sec 
--------------------------------------------------------------------------------------------------
Evals      183052.96        10.92378        10.81500        21.37500        25.21500     23239.15 
Totals     183052.96        10.92378        10.81500        21.37500        25.21500     23239.15 

@filipecosta90 filipecosta90 added the action:run-benchmark Triggers the benchmark suite for this Pull Request label Nov 24, 2022
lua_args_cached_objects_len[j] >= obj_len)
{
sds s = lua_args_cached_objects[j]->ptr;
lua_argv[j] = lua_args_cached_objects[j];
Member

i now realize that this is affecting trimStringObjectIfNeeded.
i.e. if we run a SET command from a lua script, it'll get to tryObjectEncoding and possibly reallocate (memcpy) that string.
@uriyage this may mean that the assumption in #11508 that 32kb (BIG_ARG) is the only case where strings have excess space is wrong.
on the other hand, it means that the argv caching in this PR is sometimes not really achieving its purpose (for the SET command).

not sure what to do.. i don't particularly like that lame argument caching mechanism.
i wasn't actually aware of it until recently, and i see it only works efficiently if you re-use the same arguments sizes in the exact same argv indexes.

on the other hand, i'm not entirely comfortable with the BIG_ARG condition in trimStringObjectIfNeeded either.

Contributor


This is a bit ugly, but we can pass an argument to tryObjectEncoding to indicate whether we should check the BIG_ARG condition, depending on the client's CLIENT_SCRIPT flag.

@oranagra
Member

@filipecosta90 after reviewing this caching mechanism for the first time, i must say i don't like it.
it relies on the arguments to repeated calls from (various) scripts being of the same size in the same argv index.
i find it odd that Yoav wasn't able to detect that this optimization was useful, and we now see it caused a regression on a very specific change.

maybe we should attack this from a different angle, and instead of blindly re-introducing the cache, try to profile it and maybe come up with a different conclusion and design?

@filipecosta90
Contributor Author

maybe we should attack this from a different angle, and instead of blindly re-introducing the cache, try to profile it and maybe come up with a different conclusion and design?

Agree. Let's try to see what the profile gives us. I'll profile already with the change in #11521

@filipecosta90
Contributor Author

closing this one based on the above discussion

@oranagra
Member

i'd rather keep it open until we have another lead.

@oranagra oranagra reopened this Nov 29, 2022
@sundb
Collaborator

sundb commented Dec 3, 2022

@filipecosta90 The reason for the crash in CI:

    robj **argv = luaArgsToRedisArgv(lua, &argc);
    replaceClientCommandVector(...) <- c->argv will be freed

This will indirectly cause lua_argv to be freed, but lua_argv_size is not updated.
You can comment out zfree(c->argv); in freeClientArgv to confirm that it no longer crashes.

It seems that the same issue also occurs in 6.2.7.

@filipecosta90
Contributor Author

replaceClientCommandVector

@sundb have you managed to replicate it on 6.2.7? I can easily replicate the issue on this branch by running the following command twice, but not on 6.2:

eval 'redis.call("SET", "k1", "value")  local encoded = redis.call("DUMP", "k1")  redis.call("RESTORE", "k2", 1, encoded, "REPLACE")  redis.call("RESTORE", "k3", 1, encoded, "REPLACE") redis.call("EXPIREAT", "k2", 1)' 0

@oranagra
Member

oranagra commented Dec 3, 2022

i don't see any problem in 6.2 around this area.
the elements themselves are removed from the cache, and are later re-cached if needed (we don't keep the old ones).
and the argv array itself has a piece of code that notices that it was replaced, and works around that (if (c->argv != argv))

the problem with this PR is that it forgets to correctly set c->argv_len (which is a new thing in 7.0); that's what causes the crash here. it's not a use-after-free or double-free issue, it's just an array out-of-bounds issue.
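As a hypothetical illustration of the bug class being diagnosed here (not the actual Redis code), consider a struct that, like a 7.0 client, tracks both the argv pointer and its allocated capacity. Replacing the vector without also updating the capacity field leaves later code trusting a stale bound; the sketch shows the correct pattern where pointer and capacity change together.

```c
/* Illustrative sketch: a client-like struct whose argv capacity
 * (argv_len, new in 7.0) must be kept in sync with the pointer. */
#include <assert.h>
#include <stdlib.h>

typedef struct {
    char **argv;
    int argc;
    int argv_len;   /* allocated capacity of argv */
} fakeclient;

/* Correct: update the capacity together with the pointer.
 * Forgetting the argv_len assignment is the bug that crashed CI. */
void set_argv(fakeclient *c, char **argv, int argc) {
    free(c->argv);
    c->argv = argv;
    c->argc = argc;
    c->argv_len = argc;
}
```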

@oranagra
Member

oranagra commented Dec 3, 2022

i still don't like this mechanism. the approach seems silly and overly simplistic; it assumes arguments at the same indexes will be the same size in sequential commands.
i.e. it will not give the desired effect in many cases (e.g. SET key bigvalue and then SETRANGE key offset bigvalue)
also, it seems to violate the assumptions in #11508

i'd like to check why the allocations make such a big difference, and learn why scripting requires that and MULTI-EXEC doesn't.
if we need that, i'd like to create a mechanism that's a lot smarter, and doesn't rely on the same indexes being used.

@filipecosta90
Contributor Author

i still don't like this mechanism. the approach seems silly and overly simplistic; it assumes arguments at the same indexes will be the same size in sequential commands. i.e. it will not give the desired effect in many cases (e.g. SET key bigvalue and then SETRANGE key offset bigvalue) also, it seems to violate the assumptions in #11508

i'd like to check why the allocations make such a big difference, and learn why scripting requires that and MULTI-EXEC doesn't. if we need that, i'd like to create a mechanism that's a lot smarter, and doesn't rely on the same indexes being used.

@oranagra agree on the simplicity of the caching vs the added code complexity.
WRT "check why the allocations make such a big difference": within the luaRedisGenericCommand logic, 7.04% comes from luaArgsToRedisArgv (of which 5.5% is from zmalloc and zcalloc) + 4.55% from freeClientArgv.

[profiling image]

@sundb
Collaborator

sundb commented Dec 4, 2022

As i mentioned in this comment #10981 (comment)

  1. Fix invalid access in lpFind on corrupted listpack #9819 -1.5%
    We can force the hash to dict to check it.

  2. Redis Function #9780 -3
    I remember last time I found that luaSaveOnRegistry() was a relatively large overhead.

@oranagra
Member

oranagra commented Dec 4, 2022

  1. Fix invalid access in lpFind on corrupted listpack #9819 -1.5%
    We can force the hash to dict to check it.

@sundb i don't understand what you suggest?
to tune the redis config of our benchmark to use dict rather than listpack? (i.e. just for the sake of better comparison with 6.2)

  2. Redis Function #9780 -3
    I remember last time I found that luaSaveOnRegistry() was a relatively large overhead.

@MeirShpilraien maybe you can think of another, more performant alternative?

@sundb
Collaborator

sundb commented Dec 4, 2022

@sundb i don't understand what you suggest? to tune the redis config of our benchmark to use dict rather than listpack? (i.e. just for the sake of better comparison with 6.2)

Yes.

* scriptCall takes a client, and the caller is anyway messing with it,
  so no need to pass argv and argc; instead the caller can set
  them and other vars.
* properly set c->argv_len before the call
* no need to set c->argv_len_sum (only needed for real clients), and if
  we did need it, we need to update it during the building of argv.
* add some comments
* properly release the argv back to the cache on args parsing error
* add test coverage for the above
* add a call to freeClientArgv (who knows what it'll do one day)
@oranagra
Member

oranagra commented Dec 4, 2022

i thought about this mechanism a bit more, and performed some benchmarks.
eventually i concluded that it's not that bad that this mechanism is silly (assumes consecutive calls use the same size in the same arg index), since it's only aimed at caching very small things.

i'd even consider always pre-allocating args to the full LUA_CMD_OBJCACHE_MAX_LEN (64) rather than the right size; that would increase the chance they'll be able to be re-used.
I suppose that in some way this is already happening since we're using sdsalloc, which in turn uses s_malloc_usable and takes ownership of the full size of the allocation, so we are padded to the allocator bucket size.

i edited the code and changed a few things, see the commit comment.
let me know what you think and let's merge this.

@oranagra oranagra merged commit 2d80cd7 into redis:unstable Dec 5, 2022
@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Dec 5, 2022
@oranagra oranagra mentioned this pull request Dec 11, 2022
oranagra added a commit that referenced this pull request Dec 12, 2022
…7.0 (#11541)

This mechanism aims to reduce calls to malloc and free when
preparing the arguments the script sends to redis commands.
This mechanism was originally implemented in 48c49c4
and 4f68655, and was removed in #10220 (thinking it was not needed
and had no impact), but it now turns out that was wrong, and it
indeed provides some 5% performance improvement.

The implementation is a little bit too simplistic, it assumes consecutive
calls use the same size in the same arg index, but that's arguably
sufficient since it's only aimed at caching very small things.

We could even consider always pre-allocating args to the full
LUA_CMD_OBJCACHE_MAX_LEN (64 bytes) rather than the right size for the argument,
that would increase the chance they'll be able to be re-used.
But in some way this is already happening since we're using
sdsalloc, which in turn uses s_malloc_usable and takes ownership
of the full size of the allocation, so we are padded to the allocator
bucket size.

Co-authored-by: Oran Agra <[email protected]>
Co-authored-by: sundb <[email protected]>
(cherry picked from commit 2d80cd7)
oranagra added a commit that referenced this pull request Jan 4, 2023
…libc realloc (#11652)

TLDR: solve a problem introduced in Redis 7.0.6 (#11541) with
RM_CommandFilterArgInsert being called from scripts, which can
lead to memory corruption.

Libc realloc can return the same pointer even if the size was changed. The code in
freeLuaRedisArgv had an assumption that if the pointer didn't change, then the
allocation didn't change, and the cache can still be reused.
However, if rewriteClientCommandArgument or RM_CommandFilterArgInsert were
used, it could be that we realloced the argv array, and the pointer didn't change, then
a consecutive command being executed from Lua can use that argv cache reaching
beyond its size.
This was actually only possible with modules, since the decision to realloc was based
on argc, rather than argv_len.
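A small hypothetical sketch of the realloc point above (names are illustrative, not the Redis code): realloc may grow an allocation in place and hand back the same pointer, so pointer equality says nothing about whether the capacity changed. The safe pattern is to track the capacity explicitly and compare that, never a saved pointer.

```c
/* Illustrative sketch: track capacity alongside the pointer, because
 * realloc returning the same pointer does not mean the size is unchanged. */
#include <assert.h>
#include <stdlib.h>

typedef struct {
    void  *ptr;
    size_t len;   /* capacity we handed out; compare this, not pointers */
} tracked_buf;

/* Grow a tracked buffer; any cached view of the buffer must be
 * validated against buf->len, never against a remembered pointer. */
void tracked_grow(tracked_buf *buf, size_t newlen) {
    buf->ptr = realloc(buf->ptr, newlen);
    buf->len = newlen;   /* capacity changes even if ptr doesn't */
}
```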
oranagra added a commit that referenced this pull request Jan 16, 2023
…libc realloc (#11652)

(cherry picked from commit c805212)
madolson pushed a commit to madolson/redis that referenced this pull request Apr 19, 2023
…7.0 (redis#11541)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…7.0 (redis#11541)
enjoy-binbin pushed a commit to enjoy-binbin/redis that referenced this pull request Jul 31, 2023
…libc realloc (redis#11652)

Labels

action:run-benchmark Triggers the benchmark suite for this Pull Request release-notes indication that this issue needs to be mentioned in the release notes

Projects

Status: Done

4 participants