-
Notifications
You must be signed in to change notification settings - Fork 23.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid used_memory contention when update from multiple threads. #13431
Avoid used_memory contention when update from multiple threads. #13431
Conversation
I don't know the tests content but I'm surprised that little change makes difference as much as %15 on some tests. I'd say this shouldn't be possible :) Isn't looping for 16 atomic ops slow down main thread btw? (As we call |
src/zmalloc.c
Outdated
atomicGet(used_memory[i].used_memory, thread_used_mem); | ||
totol_mem += thread_used_mem; | ||
} | ||
return totol_mem; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MeirShpilraien what about adding a assertion assert(total_mem > 0)
to run a fully CI, and then finally delete it? or keep it.
EDIT: i forgot it wan't long long.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to long long
and added the assert, but I think we should remove it once we see the tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like it pass successfully
I was also surprised :)
Still running some tests with @filipecosta90 , but initial results look good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
the benchmark text at the top doesn't mention how many threads there were? did it use io-threads
Tests run on vanila Redis without IO threads, but its a good point, will run it with IO threads. |
|
||
static void update_zmalloc_stat_alloc(long long num) { | ||
init_my_thread_index(); | ||
atomicIncr(used_memory[my_thread_index].used_memory, num); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like the atomic is not necessary here, the thread already has its own slot, only one writer exists.
Ref: valkey-io/valkey#674
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There might be more than one writer, we reuse slots if we have more than 16 threads.
Automated performance analysis summaryThis comment was automatically generated given there is performance data available. Using platform named: intel64-ubuntu22.04-redis-icx1 to do the comparison. In summary:
You can check a comparison in detail via the grafana link Comparison between unstable and test_used_memory_per_thread_array.Time Period from 5 months ago. (environment used: oss-standalone) Improvements Table
Full Results table:
|
ASAN thread for this PR passed in my local. |
@fcostaoliveira can you update the results for the io-threads benchmark? I guess this is the only thing left to merge this PR. |
updated #13431 (comment) 👍 |
src/zmalloc.c
Outdated
atomicGet(used_memory[i].used_memory, thread_used_mem); | ||
total_mem += thread_used_mem; | ||
} | ||
assert(total_mem >= 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can remove this line now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
CE Performance Automation : step 1 of 2 (build) STARTING...This comment was automatically generated given a benchmark was triggered.
|
CE Performance Automation : step 2 of 2 (benchmark) FINISHED.This comment was automatically generated given a benchmark was triggered. Started benchmark suite at 2024-10-19 09:50:58.823901 and took 5111.718101 seconds to finish. In total will run 135 benchmarks. |
…s#13431) The PR attempt to avoid contention on the `used_memory` global variable when allocate or free memory from multiple threads at the same time. Each time a thread is allocating or releasing a memory, it needs to update the `used_memory` global variable. This update might cause a contention when done aggressively from multiple threads. ### The solution Instead of having a single global variable that need to be updated from multiple thread. We create an array of used_memory, each entry in the array is updated by a single thread and the main thread summarizes all the values to accumulate the memory usage. This solution, though reduces the contention between threads on updating the `used_memory` global variable, it adds work to the main thread that need to summarize all the entries at the `used_memory` array. To avoid increasing the work done by the main thread by too much, we limit the size of the used memory array to 16. This means that up to 16 threads can run without any contention between them. If there are more than 16 threads, we will reuse entries on the used_memory array, in this case we might still have contention between threads, but it will be much less significant. Notice, that in order to really avoid contention, the entries in the `used_memory` array must reside on different cache lines. To achieve that we create a struct with padding such that its size will be exactly cache_line size. In addition we make sure the address of the `used_memory` array will be aligned to cache_line size. ### Benchmark Some benchmark shows improvement (up to 15%): | Test Case |Baseline unstable (median obs. +- std.dev)|Comparison test_used_memory_per_thread_array (median obs. +- std.dev)|% change (higher-better)| Note | |-------------------------------------------------------------------------------|------------------------------------------|--------------------------------------------------------------------:|------------------------|------------------------------------| |memtier_benchmark-1key-list-100-elements-lrange-all-elements | 92657 +- 2.0% (2 datapoints) | 101445|9.5% |IMPROVEMENT | |memtier_benchmark-1key-list-1K-elements-lrange-all-elements | 14965 +- 1.3% (2 datapoints) | 16296|8.9% |IMPROVEMENT | |memtier_benchmark-1key-set-10-elements-smembers-pipeline-10 | 431019 +- 5.2% (2 datapoints) | 461039|7.0% |waterline=5.2%. IMPROVEMENT | |memtier_benchmark-1key-set-100-elements-smembers | 74367 +- 0.0% (2 datapoints) | 80190|7.8% |IMPROVEMENT | |memtier_benchmark-1key-set-1K-elements-smembers | 11730 +- 0.4% (2 datapoints) | 13519|15.3% |IMPROVEMENT | Full results: | Test Case |Baseline unstable (median obs. +- std.dev)|Comparison test_used_memory_per_thread_array (median obs. +- std.dev)|% change (higher-better)| Note | |-------------------------------------------------------------------------------|------------------------------------------|--------------------------------------------------------------------:|------------------------|------------------------------------| |memtier_benchmark-10Mkeys-load-hash-5-fields-with-1000B-values | 88613 +- 1.0% (2 datapoints) | 88688|0.1% |No Change | |memtier_benchmark-10Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10 | 124786 +- 1.2% (2 datapoints) | 123671|-0.9% |No Change | |memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values | 122460 +- 1.4% (2 datapoints) | 122990|0.4% |No Change | |memtier_benchmark-10Mkeys-load-hash-5-fields-with-100B-values-pipeline-10 | 333384 +- 5.1% (2 datapoints) | 319221|-4.2% |waterline=5.1%. potential REGRESSION| |memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values | 137354 +- 0.3% (2 datapoints) | 138759|1.0% |No Change | |memtier_benchmark-10Mkeys-load-hash-5-fields-with-10B-values-pipeline-10 | 401261 +- 4.3% (2 datapoints) | 398524|-0.7% |No Change | |memtier_benchmark-1Mkeys-100B-expire-use-case | 179058 +- 0.4% (2 datapoints) | 180114|0.6% |No Change | |memtier_benchmark-1Mkeys-10B-expire-use-case | 180390 +- 0.2% (2 datapoints) | 180401|0.0% |No Change | |memtier_benchmark-1Mkeys-1KiB-expire-use-case | 175993 +- 0.7% (2 datapoints) | 175147|-0.5% |No Change | |memtier_benchmark-1Mkeys-4KiB-expire-use-case | 165771 +- 0.0% (2 datapoints) | 164434|-0.8% |No Change | |memtier_benchmark-1Mkeys-bitmap-getbit-pipeline-10 | 931339 +- 2.1% (2 datapoints) | 929487|-0.2% |No Change | |memtier_benchmark-1Mkeys-generic-exists-pipeline-10 | 999462 +- 0.4% (2 datapoints) | 963226|-3.6% |potential REGRESSION | |memtier_benchmark-1Mkeys-generic-expire-pipeline-10 | 905333 +- 1.4% (2 datapoints) | 896673|-1.0% |No Change | |memtier_benchmark-1Mkeys-generic-expireat-pipeline-10 | 885015 +- 1.0% (2 datapoints) | 865010|-2.3% |No Change | |memtier_benchmark-1Mkeys-generic-pexpire-pipeline-10 | 897115 +- 1.2% (2 datapoints) | 887544|-1.1% |No Change | |memtier_benchmark-1Mkeys-generic-scan-pipeline-10 | 451103 +- 3.2% (2 datapoints) | 465571|3.2% |potential IMPROVEMENT | |memtier_benchmark-1Mkeys-generic-touch-pipeline-10 | 996809 +- 0.6% (2 datapoints) | 984478|-1.2% |No Change | |memtier_benchmark-1Mkeys-generic-ttl-pipeline-10 | 979570 +- 1.7% (2 datapoints) | 958752|-2.1% |No Change | |memtier_benchmark-1Mkeys-hash-hget-hgetall-hkeys-hvals-with-100B-values | 180888 +- 0.5% (2 datapoints) | 182295|0.8% |No Change | |memtier_benchmark-1Mkeys-hash-hmget-5-fields-with-100B-values-pipeline-10 | 717881 +- 1.0% (2 datapoints) | 724814|1.0% |No Change | |memtier_benchmark-1Mkeys-hash-transactions-multi-exec-pipeline-20 | 1055447 +- 0.4% (2 datapoints) | 1065836|1.0% |No Change | |memtier_benchmark-1Mkeys-lhash-hexists | 164332 +- 0.1% (2 datapoints) | 163636|-0.4% |No Change | |memtier_benchmark-1Mkeys-lhash-hincbry | 171674 +- 0.3% (2 datapoints) | 172737|0.6% |No Change | |memtier_benchmark-1Mkeys-list-lpop-rpop-with-100B-values | 180904 +- 1.1% (2 datapoints) | 179467|-0.8% |No Change | |memtier_benchmark-1Mkeys-list-lpop-rpop-with-10B-values | 181746 +- 0.8% (2 datapoints) | 182416|0.4% |No Change | |memtier_benchmark-1Mkeys-list-lpop-rpop-with-1KiB-values | 182004 +- 0.7% (2 datapoints) | 180237|-1.0% |No Change | |memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values | 105191 +- 0.9% (2 datapoints) | 105058|-0.1% |No Change | |memtier_benchmark-1Mkeys-load-hash-5-fields-with-1000B-values-pipeline-10 | 150683 +- 0.9% (2 datapoints) | 153597|1.9% |No Change | |memtier_benchmark-1Mkeys-load-hash-hmset-5-fields-with-1000B-values | 104122 +- 0.7% (2 datapoints) | 105236|1.1% |No Change | |memtier_benchmark-1Mkeys-load-list-with-100B-values | 149770 +- 0.9% (2 datapoints) | 150510|0.5% |No Change | |memtier_benchmark-1Mkeys-load-list-with-10B-values | 165537 +- 1.9% (2 datapoints) | 164329|-0.7% |No Change | |memtier_benchmark-1Mkeys-load-list-with-1KiB-values | 113315 +- 0.5% (2 datapoints) | 114110|0.7% |No Change | |memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values | 131201 +- 0.7% (2 datapoints) | 129545|-1.3% |No Change | |memtier_benchmark-1Mkeys-load-stream-1-fields-with-100B-values-pipeline-10 | 352891 +- 2.8% (2 datapoints) | 348338|-1.3% |No Change | |memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values | 104386 +- 0.7% (2 datapoints) | 105796|1.4% |No Change | |memtier_benchmark-1Mkeys-load-stream-5-fields-with-100B-values-pipeline-10 | 227593 +- 5.5% (2 datapoints) | 218783|-3.9% |waterline=5.5%. potential REGRESSION| |memtier_benchmark-1Mkeys-load-string-with-100B-values | 167552 +- 0.2% (2 datapoints) | 170282|1.6% |No Change | |memtier_benchmark-1Mkeys-load-string-with-100B-values-pipeline-10 | 646888 +- 0.5% (2 datapoints) | 639680|-1.1% |No Change | |memtier_benchmark-1Mkeys-load-string-with-10B-values | 174891 +- 0.7% (2 datapoints) | 174382|-0.3% |No Change | |memtier_benchmark-1Mkeys-load-string-with-10B-values-pipeline-10 | 749988 +- 5.1% (2 datapoints) | 769986|2.7% |waterline=5.1%. No Change | |memtier_benchmark-1Mkeys-load-string-with-1KiB-values | 155929 +- 0.1% (2 datapoints) | 156387|0.3% |No Change | |memtier_benchmark-1Mkeys-load-zset-with-10-elements-double-score | 92241 +- 0.2% (2 datapoints) | 92189|-0.1% |No Change | |memtier_benchmark-1Mkeys-load-zset-with-10-elements-int-score | 114328 +- 1.3% (2 datapoints) | 113154|-1.0% |No Change | |memtier_benchmark-1Mkeys-string-get-100B | 180685 +- 0.2% (2 datapoints) | 180359|-0.2% |No Change | |memtier_benchmark-1Mkeys-string-get-100B-pipeline-10 | 991291 +- 3.1% (2 datapoints) | 1020086|2.9% |No Change | |memtier_benchmark-1Mkeys-string-get-10B | 181183 +- 0.3% (2 datapoints) | 177868|-1.8% |No Change | |memtier_benchmark-1Mkeys-string-get-10B-pipeline-10 | 1032554 +- 0.8% (2 datapoints) | 1023120|-0.9% |No Change | |memtier_benchmark-1Mkeys-string-get-1KiB | 180479 +- 0.9% (2 datapoints) | 182215|1.0% |No Change | |memtier_benchmark-1Mkeys-string-get-1KiB-pipeline-10 | 979286 +- 0.9% (2 datapoints) | 989888|1.1% |No Change | |memtier_benchmark-1Mkeys-string-mget-1KiB | 121950 +- 0.4% (2 datapoints) | 120996|-0.8% |No Change | |memtier_benchmark-1key-geo-60M-elements-geodist | 179404 +- 1.0% (2 datapoints) | 181232|1.0% |No Change | |memtier_benchmark-1key-geo-60M-elements-geodist-pipeline-10 | 1023797 +- 0.5% (2 datapoints) | 1014980|-0.9% |No Change | |memtier_benchmark-1key-geo-60M-elements-geohash | 180808 +- 1.2% (2 datapoints) | 180606|-0.1% |No Change | |memtier_benchmark-1key-geo-60M-elements-geohash-pipeline-10 | 1056458 +- 1.6% (2 datapoints) | 1040050|-1.6% |No Change | |memtier_benchmark-1key-geo-60M-elements-geopos | 181808 +- 0.2% (2 datapoints) | 175945|-3.2% |potential REGRESSION | |memtier_benchmark-1key-geo-60M-elements-geopos-pipeline-10 | 1038180 +- 3.4% (2 datapoints) | 1033005|-0.5% |No Change | |memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat | 142614 +- 0.3% (2 datapoints) | 144259|1.2% |No Change | |memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-bybox | 141008 +- 0.4% (2 datapoints) | 139602|-1.0% |No Change | |memtier_benchmark-1key-geo-60M-elements-geosearch-fromlonlat-pipeline-10 | 560698 +- 0.8% (2 datapoints) | 548806|-2.1% |No Change | |memtier_benchmark-1key-list-10-elements-lrange-all-elements | 166132 +- 0.9% (2 datapoints) | 170259|2.5% |No Change | |memtier_benchmark-1key-list-100-elements-lrange-all-elements | 92657 +- 2.0% (2 datapoints) | 101445|9.5% |IMPROVEMENT | |memtier_benchmark-1key-list-1K-elements-lrange-all-elements | 14965 +- 1.3% (2 datapoints) | 16296|8.9% |IMPROVEMENT | |memtier_benchmark-1key-pfadd-4KB-values-pipeline-10 | 264156 +- 0.2% (2 datapoints) | 262582|-0.6% |No Change | |memtier_benchmark-1key-set-10-elements-smembers | 138916 +- 1.7% (2 datapoints) | 138016|-0.6% |No Change | |memtier_benchmark-1key-set-10-elements-smembers-pipeline-10 | 431019 +- 5.2% (2 datapoints) | 461039|7.0% |waterline=5.2%. IMPROVEMENT | |memtier_benchmark-1key-set-10-elements-smismember | 173545 +- 1.1% (2 datapoints) | 173488|-0.0% |No Change | |memtier_benchmark-1key-set-100-elements-smembers | 74367 +- 0.0% (2 datapoints) | 80190|7.8% |IMPROVEMENT | |memtier_benchmark-1key-set-100-elements-smismember | 155682 +- 1.6% (2 datapoints) | 151367|-2.8% |No Change | |memtier_benchmark-1key-set-1K-elements-smembers | 11730 +- 0.4% (2 datapoints) | 13519|15.3% |IMPROVEMENT | |memtier_benchmark-1key-set-200K-elements-sadd-constant | 181070 +- 1.1% (2 datapoints) | 180214|-0.5% |No Change | |memtier_benchmark-1key-set-2M-elements-sadd-increasing | 166364 +- 0.1% (2 datapoints) | 166944|0.3% |No Change | |memtier_benchmark-1key-zincrby-1M-elements-pipeline-1 | 46071 +- 0.6% (2 datapoints) | 44979|-2.4% |No Change | |memtier_benchmark-1key-zrank-1M-elements-pipeline-1 | 48429 +- 0.4% (2 datapoints) | 49265|1.7% |No Change | |memtier_benchmark-1key-zrem-5M-elements-pipeline-1 | 48528 +- 0.4% (2 datapoints) | 48869|0.7% |No Change | |memtier_benchmark-1key-zrevrangebyscore-256K-elements-pipeline-1 | 100580 +- 1.5% (2 datapoints) | 101782|1.2% |No Change | |memtier_benchmark-1key-zrevrank-1M-elements-pipeline-1 | 48621 +- 2.0% (2 datapoints) | 48473|-0.3% |No Change | |memtier_benchmark-1key-zset-10-elements-zrange-all-elements | 83485 +- 0.6% (2 datapoints) | 83095|-0.5% |No Change | |memtier_benchmark-1key-zset-10-elements-zrange-all-elements-long-scores | 118673 +- 0.8% (2 datapoints) | 118006|-0.6% |No Change | |memtier_benchmark-1key-zset-100-elements-zrange-all-elements | 19009 +- 1.1% (2 datapoints) | 19293|1.5% |No Change | |memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements | 18957 +- 0.5% (2 datapoints) | 19419|2.4% |No Change | |memtier_benchmark-1key-zset-100-elements-zrangebyscore-all-elements-long-scores| 171693 +- 0.5% (2 datapoints) | 172432|0.4% |No Change | |memtier_benchmark-1key-zset-1K-elements-zrange-all-elements | 3566 +- 0.6% (2 datapoints) | 3672|3.0% |No Change | |memtier_benchmark-1key-zset-1M-elements-zcard-pipeline-10 | 1067713 +- 0.4% (2 datapoints) | 1071550|0.4% |No Change | |memtier_benchmark-1key-zset-1M-elements-zrevrange-5-elements | 169195 +- 0.7% (2 datapoints) | 169620|0.3% |No Change | |memtier_benchmark-1key-zset-1M-elements-zscore-pipeline-10 | 914338 +- 0.2% (2 datapoints) | 905540|-1.0% |No Change | |memtier_benchmark-2keys-lua-eval-hset-expire | 88346 +- 1.7% (2 datapoints) | 87259|-1.2% |No Change | |memtier_benchmark-2keys-lua-evalsha-hset-expire | 103273 +- 1.2% (2 datapoints) | 102393|-0.9% |No Change | |memtier_benchmark-2keys-set-10-100-elements-sdiff | 15418 +- 10.9% UNSTABLE (2 datapoints) | 14369|-6.8% |UNSTABLE (very high variance) | |memtier_benchmark-2keys-set-10-100-elements-sinter | 83601 +- 3.6% (2 datapoints) | 82508|-1.3% |No Change | |memtier_benchmark-2keys-set-10-100-elements-sunion | 14942 +- 11.2% UNSTABLE (2 datapoints) | 14001|-6.3% |UNSTABLE (very high variance) | |memtier_benchmark-2keys-stream-5-entries-xread-all-entries | 75938 +- 0.4% (2 datapoints) | 76565|0.8% |No Change | |memtier_benchmark-2keys-stream-5-entries-xread-all-entries-pipeline-10 | 120781 +- 1.1% (2 datapoints) | 119142|-1.4% |No Change |
### New Features in binary distributions - 7 new data structures: JSON, Time series, Bloom filter, Cuckoo filter, Count-min sketch, Top-k, t-digest - Redis scalable query engine (including vector search) ### Potentially breaking changes - #12272 `GETRANGE` returns an empty bulk when the negative end index is out of range - #12395 Optimize `SCAN` command when matching data type ### Bug fixes - #13510 Fix `RM_RdbLoad` to enable AOF after RDB loading is completed - #13489 `ACL CAT` - return module commands - #13476 Fix a race condition in the `cache_memory` of `functionsLibCtx` - #13473 Fix incorrect lag due to trimming stream via `XTRIM` command - #13338 Fix incorrect lag field in `XINFO` when tombstone is after the `last_id` of the consume group - #13470 On `HDEL` of last field - update the global hash field expiration data structure - #13465 Cluster: Pass extensions to node if extension processing is handled by it - #13443 Cluster: Ensure validity of myself when loading cluster config - #13422 Cluster: Fix `CLUSTER SHARDS` command returns empty array ### Modules API - #13509 New API calls: `RM_DefragAllocRaw`, `RM_DefragFreeRaw`, and `RM_RegisterDefragCallbacks` - defrag API to allocate and free raw memory ### Performance and resource utilization improvements - #13503 Avoid overhead of comparison function pointer calls in listpack `lpFind` - #13505 Optimize `STRING` datatype write commands - #13499 Optimize `SMEMBERS` command - #13494 Optimize `GEO*` commands reply - #13490 Optimize `HELLO` command - #13488 Optimize client query buffer - #12395 Optimize `SCAN` command when matching data type - #13529 Optimize `LREM`, `LPOS`, `LINSERT`, and `LINDEX` commands - #13516 Optimize `LRANGE` and other commands that perform several writes to client buffers per call - #13431 Avoid `used_memory` contention when updating from multiple threads ### Other general improvements - #13495 Reply `-LOADING` on replica while flushing the db ### CLI tools - #13411 redis-cli: Fix wrong `dbnum` showed after the client reconnected ### Notes - No backward compatibility for replication or persistence. - Additional distributions, upgrade paths, features, and improvements will be introduced in upcoming pre-releases. - With the GA release of 8.0 we will deprecate Redis Stack.
The PR attempt to avoid contention on the
used_memory
global variable when allocate or free memory from multiple threads at the same time.Each time a thread is allocating or releasing a memory, it needs to update the
used_memory
global variable. This update might cause a contention when done aggressively from multiple threads.The solution
Instead of having a single global variable that need to be updated from multiple thread. We create an array of used_memory, each entry in the array is updated by a single thread and the main thread summarizes all the values to accumulate the memory usage.
This solution, though reduces the contention between threads on updating the
used_memory
global variable, it adds work to the main thread that need to summarize all the entries at theused_memory
array. To avoid increasing the work done by the main thread by too much, we limit the size of the used memory array to 16. This means that up to 16 threads can run without any contention between them. If there are more than 16 threads, we will reuse entries on the used_memory array, in this case we might still have contention between threads, but it will be much less significant.Notice, that in order to really avoid contention, the entries in the
used_memory
array must reside on different cache lines. To achieve that we create a struct with padding such that its size will be exactly cache_line size. In addition we make sure the address of theused_memory
array will be aligned to cache_line size.todo:
Benchmark
Some benchmark shows improvement (up to 15%):
Full results: