Add sanitizer support and clean up sanitizer findings by tezc · Pull Request #9601 · redis/redis

tezc · 2021-10-05T12:23:59Z

Added sanitizer support. address, undefined and thread sanitizers are available.
To build Redis with desired sanitizer : make SANITIZER=undefined
There were some sanitizer findings, cleaned up codebase
Added tests with address and undefined behavior sanitizers to daily CI.
Added tests with address sanitizer to the per-PR CI (smoke out mem leaks sooner).

Basically, there are three types of issues :

1- Unaligned load/store : Most probably, this issue may cause a crash on a platform that does not support unaligned access. Redis does unaligned access only on supported platforms.

2- Signed integer overflow. Although, signed overflow issue can be problematic time to time and change how compiler generates code, current findings mostly about signed shift or simple addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue as far as I can tell (checked generated code on godbolt.org).

3 -Minor leak (redis-cli), use-after-free(just before calling exit());

UB means nothing guaranteed and risky to reason about program behavior but I don't think any of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues will be the real benefit.

[EDIT] It seems that gcc 12.2.1 present on Alpine produces a bug in BITFIELD overflow detection.
looks like in this case, the fix here is not just to silence a warning, it actually fixes a bug.

src/bitops.c

src/ziplist.c

src/siphash.c

src/sha256.c

src/rdb.c

src/lzf_c.c

src/config.c

src/bitops.c

src/dict.c

src/expire.c

src/hyperloglog.c

src/rax.c

src/sentinel.c

src/cli_common.c

src/debug.c

bjosv

Nice!

src/Makefile

oranagra · 2021-10-06T11:49:20Z

https://godbolt.org/z/o7fej75xM

WOW, i'm always impressed by what compilers do (in this case both collapsing lots of reads and shifts into one, and also deleting a libc function call and replacing it with assembly).

Also, that's a great demonstration of how much the UB is in many cases BS, in all 3 cases the compiler still generates code that's still doing unaligned access. 😄

of course, had this code is used on an architecture that can't handle unaligned access, the compiler generate different results: https://godbolt.org/z/r58nTMMhP

oranagra

@tezc thank you for that PR.
This is very useful.
please don't take my comments personally, i guess i'm emotional about the subject.

the 3 main topics we need to discuss to continue are:

if there were actual bugs this uncovered.
if the changes could impact performance (even on non-modern, realistic system)
just discuss compilers and interesting edge cases for the fun of it.

.github/workflows/daily.yml

src/bitops.c

src/sentinel.c

src/sha256.c

src/siphash.c

src/ziplist.c

tezc · 2021-10-07T10:23:53Z

@oranagra Just here my understanding overall

I'm not a compiler dev by any means but I think undefined behavior something like "almost" impossible to reason about. Anything can happen. At some point, questions like if integer shift really causes an issue or if overflow would be fine in a specific case loses its meaning because it may not cause a problem with the current compiler version but next version. (or on different architecture). It may work for funcA() calls generally but for one of funcA() calls, compiler decides to inline the function and then decides to apply some optimization which takes advantage of undefined behavior. So, not easy to answer if any of these issues are actual problems :)

More reasons to hate UBs: A jaw dropping example

Good news, I believe almost all of the reasonable size C projects (if not all) have undefined behaviors (what a good news :) ). Redis has more than these sanitizer findings probably (A lot of things are UB if you follow the standard, makes you depressed each time you learn another thing is also UB). So, no need to be a paranoid and fix each issue. Question is whether we want help from sanitizers or not. If yes, we can follow a pragmatic approach, we can fix some of the issues, suppress ones we don't like.

oranagra · 2021-10-07T11:31:56Z

More reasons to hate UBs: A jaw dropping example

WOW, i'm not a compiler dev either, but IMHO that optimization shouldn't exist. (instead if C++ devs want speed, they should switch to C!).

anyway, i do agree we wanna avoid using undefined behavior, and that we want to use the sanitizer to help catch them.
i just think that if we can make sure to define the behavior, or only use it when we know it's defined (and fall back to less efficient implementation otherwise), that's better in some of these cases (the ones which uglify the code or make it less efficient).

but i don't want to rely on specific flags that exist only in some compilers, and now i'm also paranoid about the fact that a compiler will decide to erase all the files on my disk [see below].
so at this point i'm not sure what to do, it seems my two desires collide.

[below]
i.e. because if we do something that's undefined, then erasing all the files on the disk is a possible outcome. (if on some theoretical platform that would be the behavior, why not apply it for all).
i know that's not the reason in the above link, but who knows what other interpretations of undefined behaviors exist...

tezc · 2021-10-20T22:14:05Z

@oranagra

I tried to minimize the changes, please take another look at the PR. Also, realized that latest clang gives more warnings, so I fixed them as well and added clang build to the daily. Let me know if you're okay with that. (We can use GCC and/or Clang with UBSAN and ASAN).

I suppressed unaligned load issues. To be honest, I don't think these would cause any performance issues in any environment but it's impossible to benchmark it anyway. So, suppressed them in a few place.

src/debug.c

src/redisassert.c

oranagra

i've reviewed the new (smaller) diff from unstable, and also the changes in the last commit, they're mostly LGTM.

One last concern is if we conclude any of the fixes here really fixes a bug, we may want to backport it to older releases,
there was a small insignificant leak in the CLI that i don't think we care for.
and some issue in sentinel for which i'm waiting for an analysis.
other than that i think all the changes will not have any actual impact on redis.
i would be happy if you can go over them and add notes to the top comment dividing them to groups according to type and impact.

tests/support/util.tcl

src/bitops.c

.github/workflows/ci.yml

tests/instances.tcl

tests/test_helper.tcl

tests/support/server.tcl

oranagra · 2021-11-09T13:59:53Z

@tezc i see all comments are addressed.. so are we good to merge this?

triggered a full CI to see there are no complaints by some esoteric platform: https://github.com/redis/redis/actions/runs/1439775340

tezc · 2021-11-09T14:18:53Z

@oranagra I was waiting Daily to be green to rebase and run full CI for this branch. You triggered that action for "unstable" right? Anyway, I also triggered CI for these PR + rebase here : https://github.com/tezc/redis/actions/runs/1439837027

Edit: Ok I see, you used github action from unstable. Sorry, I was confused for a moment. It's better to use github action from this branch so it runs sanitizer builds as well. CI build I triggered does that. Let's wait and see if any new issue is introduced after rebase. I'll ping you when it ends.

oranagra · 2021-11-09T14:30:40Z

ohh, right. the action i triggered is silly:

it uses the old yaml files.
the branch wasn't rebased lately and unstable was a mess so it may contain bugs that are now fixed.

…), buffer size should be at least 256 bytes.

tezc · 2021-11-11T11:50:50Z

@oranagra Rebased the Pr and added two new commits, please take a look

In redis#8287, some overflow checks have been added. But when `when *= 1000` overflows, it will become a positive number. And the check not able to catch it. The key will be added with a short expiration time and will deleted a few seconds later. In redis#9601, will check the overflow after `*=` and return an error first, and avoiding this situation. In this commit, added some tests to cover those code paths. Found it in redis#9825, and close it.

In #8287, some overflow checks have been added. But when `when *= 1000` overflows, it will become a positive number. And the check not able to catch it. The key will be added with a short expiration time and will deleted a few seconds later. In #9601, will check the overflow after `*=` and return an error first, and avoiding this situation. In this commit, added some tests to cover those code paths. Found it in #9825, and close it.

In redis#8287, some overflow checks have been added. But when `when *= 1000` overflows, it will become a positive number. And the check not able to catch it. The key will be added with a short expiration time and will deleted a few seconds later. In redis#9601, will check the overflow after `*=` and return an error first, and avoiding this situation. In this commit, added some tests to cover those code paths. Found it in redis#9825, and close it.

In #8287, some overflow checks have been added. But when `when *= 1000` overflows, it will become a positive number. And the check not able to catch it. The key will be added with a short expiration time and will deleted a few seconds later. In #9601, will check the overflow after `*=` and return an error first, and avoiding this situation. In this commit, added some tests to cover those code paths. Found it in #9825, and close it. (cherry picked from commit 9273d09)

Partial cherry pick from #9601 in order for the tests in #9601 to pass (cherry picked from commit b91d8b2)

oranagra · 2022-08-02T12:47:23Z

Gents, I don't remember if we discussed this, but how do we feel about changing the Makefile to default to -O3 after this set of changes?

tezc · 2022-08-03T10:22:28Z

@oranagra Sure, we can do it. In my experience, PGO is a bit more risky for UB. So, I think we can just switch to -O3 now.

Also, what about -flto? Usually, it’s more important. -O3 might cause some more inlining but I guess -flto will inline functions a lot. (between compilation units). So, one downside is we might see less accurate stacktraces on a crash. Are we okay with that? Also, -flto will make build a bit slower.

Another concern, these flags might make some scenarios slower. We need to run some benchmarks for it, hopefully we won’t see regressions.

oranagra · 2022-08-03T10:43:08Z

@filipecosta90 already has some benchmarks indicating 5% improvement (with both -O3 and -flto are used). but i wonder they cover a wide range of cases, and if there are any regressions too.
i'm still paranoid about bugs, but considering we're far away from the next release, i guess we can try it.
I also wonder if there were previous discussions on that subject and why we stayed with O2.
@yossigo are you aware of anything from the past?

yossigo · 2022-08-03T15:00:50Z

@oranagra I'm not aware of anything, this is probably a good timing to try this.

tezc · 2022-08-08T09:17:35Z

@oranagra I was off last week, catching up.

i'm still paranoid about bugs, but considering we're far away from the next release, i guess we can try it.

We won't know without trying and I guess sanitizer will help us with a small part of the possible bugs. Still, we might do something extra for it if you want. I think our best bet would be running tests with sanitizers on different architectures manually just to check if we see any failures/warnings. I'd pick these three builds: 32 bit (we can run on our local), armv8 (maybe we can ask someone with new Apple laptop or we run on the cloud) and s390x (we can run on qemu and try to ignore timing related failures. If there are many timing issues, maybe we just check if there is a sanitizer warning and ignore the rest of the failures).

Edit: I'll try these, let's see if we get any warnings.

oranagra · 2022-12-11T10:08:22Z

in the process of making a new 6.2.x release, the CI on Alpine fails due to bad BITFIELD overflow detection.
https://github.com/redis/redis/actions/runs/3667947707/jobs/6200741331
looks like in this case, the fix here is not just to silence a warning, it actually fixes a bug.
trying to decide between cherry picking this whole PR over there, ignoring the error and letting 6.2 remain as is, or just fix this specific issue with BITFIELD.

oranagra · 2022-12-11T10:51:07Z

I took parts of this PR to 6.2.8 see 4418cf1
and mentioned the BITFIELD issue in the release notes.
i'll also update the top comment here with a note about that issue.
let me know if you see something wrong or have more info.

**Signed integer overflow.** Although, signed overflow issue can be problematic time to time and change how compiler generates code, current findings mostly about signed shift or simple addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue as far as I can tell (checked generated code on godbolt.org). UB means nothing guaranteed and risky to reason about program behavior but I don't think any of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues will be the real benefit. partial cherry pick from commit b91d8b2 The bug in BITFIELD seems to affect 12.2.1 used on Alpine

**Signed integer overflow.** Although, signed overflow issue can be problematic time to time and change how compiler generates code, current findings mostly about signed shift or simple addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue as far as I can tell (checked generated code on godbolt.org). UB means nothing guaranteed and risky to reason about program behavior but I don't think any of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues will be the real benefit. partial cherry pick from commit b91d8b2 The bug in BITFIELD seems to affect 12.2.1 used on Alpine (cherry picked from commit 4418cf1)