Skip to content

Add sanitizer support and clean up sanitizer findings#9601

Merged
oranagra merged 14 commits intoredis:unstablefrom
tezc:sanitizer-build
Nov 11, 2021
Merged

Add sanitizer support and clean up sanitizer findings#9601
oranagra merged 14 commits intoredis:unstablefrom
tezc:sanitizer-build

Conversation

@tezc
Copy link
Collaborator

@tezc tezc commented Oct 5, 2021

  • Added sanitizer support. address, undefined and thread sanitizers are available.
  • To build Redis with desired sanitizer : make SANITIZER=undefined
  • There were some sanitizer findings, cleaned up codebase
  • Added tests with address and undefined behavior sanitizers to daily CI.
  • Added tests with address sanitizer to the per-PR CI (smoke out mem leaks sooner).

Basically, there are three types of issues :

1- Unaligned load/store : Most probably, this issue may cause a crash on a platform that does not support unaligned access. Redis does unaligned access only on supported platforms.

2- Signed integer overflow. Although, signed overflow issue can be problematic time to time and change how compiler generates code, current findings mostly about signed shift or simple addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue as far as I can tell (checked generated code on godbolt.org).

3 -Minor leak (redis-cli), use-after-free(just before calling exit());

UB means nothing guaranteed and risky to reason about program behavior but I don't think any of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues will be the real benefit.

[EDIT] It seems that gcc 12.2.1 present on Alpine produces a bug in BITFIELD overflow detection.
looks like in this case, the fix here is not just to silence a warning, it actually fixes a bug.

@oranagra oranagra added the state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten label Oct 5, 2021
Copy link
Contributor

@bjosv bjosv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@oranagra
Copy link
Member

oranagra commented Oct 6, 2021

https://godbolt.org/z/o7fej75xM

WOW, i'm always impressed by what compilers do (in this case both collapsing lots of reads and shifts into one, and also deleting a libc function call and replacing it with assembly).

Also, that's a great demonstration of how much the UB is in many cases BS, in all 3 cases the compiler still generates code that's still doing unaligned access. 😄

of course, had this code is used on an architecture that can't handle unaligned access, the compiler generate different results: https://godbolt.org/z/r58nTMMhP

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tezc thank you for that PR.
This is very useful.
please don't take my comments personally, i guess i'm emotional about the subject.

the 3 main topics we need to discuss to continue are:

  1. if there were actual bugs this uncovered.
  2. if the changes could impact performance (even on non-modern, realistic system)
  3. just discuss compilers and interesting edge cases for the fun of it.

@tezc
Copy link
Collaborator Author

tezc commented Oct 7, 2021

@oranagra Just here my understanding overall

I'm not a compiler dev by any means but I think undefined behavior something like "almost" impossible to reason about. Anything can happen. At some point, questions like if integer shift really causes an issue or if overflow would be fine in a specific case loses its meaning because it may not cause a problem with the current compiler version but next version. (or on different architecture). It may work for funcA() calls generally but for one of funcA() calls, compiler decides to inline the function and then decides to apply some optimization which takes advantage of undefined behavior. So, not easy to answer if any of these issues are actual problems :)

More reasons to hate UBs: A jaw dropping example

Good news, I believe almost all of the reasonable size C projects (if not all) have undefined behaviors (what a good news :) ). Redis has more than these sanitizer findings probably (A lot of things are UB if you follow the standard, makes you depressed each time you learn another thing is also UB). So, no need to be a paranoid and fix each issue. Question is whether we want help from sanitizers or not. If yes, we can follow a pragmatic approach, we can fix some of the issues, suppress ones we don't like.

@oranagra
Copy link
Member

oranagra commented Oct 7, 2021

More reasons to hate UBs: A jaw dropping example

WOW, i'm not a compiler dev either, but IMHO that optimization shouldn't exist. (instead if C++ devs want speed, they should switch to C!).

anyway, i do agree we wanna avoid using undefined behavior, and that we want to use the sanitizer to help catch them.
i just think that if we can make sure to define the behavior, or only use it when we know it's defined (and fall back to less efficient implementation otherwise), that's better in some of these cases (the ones which uglify the code or make it less efficient).

but i don't want to rely on specific flags that exist only in some compilers, and now i'm also paranoid about the fact that a compiler will decide to erase all the files on my disk [see below].
so at this point i'm not sure what to do, it seems my two desires collide.

[below]
i.e. because if we do something that's undefined, then erasing all the files on the disk is a possible outcome. (if on some theoretical platform that would be the behavior, why not apply it for all).
i know that's not the reason in the above link, but who knows what other interpretations of undefined behaviors exist...

@tezc
Copy link
Collaborator Author

tezc commented Oct 20, 2021

@oranagra

I tried to minimize the changes, please take another look at the PR. Also, realized that latest clang gives more warnings, so I fixed them as well and added clang build to the daily. Let me know if you're okay with that. (We can use GCC and/or Clang with UBSAN and ASAN).

I suppressed unaligned load issues. To be honest, I don't think these would cause any performance issues in any environment but it's impossible to benchmark it anyway. So, suppressed them in a few place.

Copy link
Member

@oranagra oranagra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i've reviewed the new (smaller) diff from unstable, and also the changes in the last commit, they're mostly LGTM.

One last concern is if we conclude any of the fixes here really fixes a bug, we may want to backport it to older releases,
there was a small insignificant leak in the CLI that i don't think we care for.
and some issue in sentinel for which i'm waiting for an analysis.
other than that i think all the changes will not have any actual impact on redis.
i would be happy if you can go over them and add notes to the top comment dividing them to groups according to type and impact.

@oranagra
Copy link
Member

oranagra commented Nov 9, 2021

@tezc i see all comments are addressed.. so are we good to merge this?

triggered a full CI to see there are no complaints by some esoteric platform: https://github.com/redis/redis/actions/runs/1439775340

@tezc
Copy link
Collaborator Author

tezc commented Nov 9, 2021

@oranagra I was waiting Daily to be green to rebase and run full CI for this branch. You triggered that action for "unstable" right? Anyway, I also triggered CI for these PR + rebase here : https://github.com/tezc/redis/actions/runs/1439837027

Edit: Ok I see, you used github action from unstable. Sorry, I was confused for a moment. It's better to use github action from this branch so it runs sanitizer builds as well. CI build I triggered does that. Let's wait and see if any new issue is introduced after rebase. I'll ping you when it ends.

@oranagra
Copy link
Member

oranagra commented Nov 9, 2021

ohh, right. the action i triggered is silly:

  1. it uses the old yaml files.
  2. the branch wasn't rebased lately and unstable was a mess so it may contain bugs that are now fixed.

@tezc
Copy link
Collaborator Author

tezc commented Nov 11, 2021

@oranagra Rebased the Pr and added two new commits, please take a look

@oranagra oranagra merged commit b91d8b2 into redis:unstable Nov 11, 2021
@tezc tezc deleted the sanitizer-build branch November 11, 2021 11:53
enjoy-binbin added a commit to enjoy-binbin/redis that referenced this pull request Nov 24, 2021
In redis#8287, some overflow checks have been added. But when
`when *= 1000` overflows, it will become a positive number.
And the check not able to catch it. The key will be added with
a short expiration time and will deleted a few seconds later.

In redis#9601, will check the overflow after `*=` and return an
error first, and avoiding this situation.

In this commit, added some tests to cover those code paths.
Found it in redis#9825, and close it.
oranagra pushed a commit that referenced this pull request Nov 24, 2021
In #8287, some overflow checks have been added. But when
`when *= 1000` overflows, it will become a positive number.
And the check not able to catch it. The key will be added with
a short expiration time and will deleted a few seconds later.

In #9601, will check the overflow after `*=` and return an
error first, and avoiding this situation.

In this commit, added some tests to cover those code paths.
Found it in #9825, and close it.
hwware pushed a commit to hwware/redis that referenced this pull request Dec 20, 2021
In redis#8287, some overflow checks have been added. But when
`when *= 1000` overflows, it will become a positive number.
And the check not able to catch it. The key will be added with
a short expiration time and will deleted a few seconds later.

In redis#9601, will check the overflow after `*=` and return an
error first, and avoiding this situation.

In this commit, added some tests to cover those code paths.
Found it in redis#9825, and close it.
oranagra pushed a commit that referenced this pull request Apr 27, 2022
In #8287, some overflow checks have been added. But when
`when *= 1000` overflows, it will become a positive number.
And the check not able to catch it. The key will be added with
a short expiration time and will deleted a few seconds later.

In #9601, will check the overflow after `*=` and return an
error first, and avoiding this situation.

In this commit, added some tests to cover those code paths.
Found it in #9825, and close it.

(cherry picked from commit 9273d09)
oranagra pushed a commit that referenced this pull request Apr 27, 2022
Partial cherry pick from #9601 in order for the tests in #9601 to pass

(cherry picked from commit b91d8b2)
@oranagra
Copy link
Member

oranagra commented Aug 2, 2022

Gents, I don't remember if we discussed this, but how do we feel about changing the Makefile to default to -O3 after this set of changes?

@tezc
Copy link
Collaborator Author

tezc commented Aug 3, 2022

@oranagra Sure, we can do it. In my experience, PGO is a bit more risky for UB. So, I think we can just switch to -O3 now.

Also, what about -flto? Usually, it’s more important. -O3 might cause some more inlining but I guess -flto will inline functions a lot. (between compilation units). So, one downside is we might see less accurate stacktraces on a crash. Are we okay with that? Also, -flto will make build a bit slower.

Another concern, these flags might make some scenarios slower. We need to run some benchmarks for it, hopefully we won’t see regressions.

@oranagra
Copy link
Member

oranagra commented Aug 3, 2022

@filipecosta90 already has some benchmarks indicating 5% improvement (with both -O3 and -flto are used). but i wonder they cover a wide range of cases, and if there are any regressions too.
i'm still paranoid about bugs, but considering we're far away from the next release, i guess we can try it.
I also wonder if there were previous discussions on that subject and why we stayed with O2.
@yossigo are you aware of anything from the past?

@yossigo
Copy link
Collaborator

yossigo commented Aug 3, 2022

@oranagra I'm not aware of anything, this is probably a good timing to try this.

@tezc
Copy link
Collaborator Author

tezc commented Aug 8, 2022

@oranagra I was off last week, catching up.

i'm still paranoid about bugs, but considering we're far away from the next release, i guess we can try it.

We won't know without trying and I guess sanitizer will help us with a small part of the possible bugs. Still, we might do something extra for it if you want. I think our best bet would be running tests with sanitizers on different architectures manually just to check if we see any failures/warnings. I'd pick these three builds: 32 bit (we can run on our local), armv8 (maybe we can ask someone with new Apple laptop or we run on the cloud) and s390x (we can run on qemu and try to ignore timing related failures. If there are many timing issues, maybe we just check if there is a sanitizer warning and ignore the rest of the failures).

Edit: I'll try these, let's see if we get any warnings.

@oranagra
Copy link
Member

in the process of making a new 6.2.x release, the CI on Alpine fails due to bad BITFIELD overflow detection.
https://github.com/redis/redis/actions/runs/3667947707/jobs/6200741331
looks like in this case, the fix here is not just to silence a warning, it actually fixes a bug.
trying to decide between cherry picking this whole PR over there, ignoring the error and letting 6.2 remain as is, or just fix this specific issue with BITFIELD.

@oranagra
Copy link
Member

oranagra commented Dec 11, 2022

I took parts of this PR to 6.2.8 see 4418cf1
and mentioned the BITFIELD issue in the release notes.
i'll also update the top comment here with a note about that issue.
let me know if you see something wrong or have more info.

@oranagra oranagra added the release-notes indication that this issue needs to be mentioned in the release notes label Dec 11, 2022
@oranagra oranagra mentioned this pull request Dec 11, 2022
oranagra pushed a commit that referenced this pull request Dec 12, 2022
**Signed integer overflow.** Although, signed overflow issue can be problematic time to time
and change how compiler generates code, current findings mostly about signed shift or simple
addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue
as far as I can tell (checked generated code on godbolt.org).

UB means nothing guaranteed and risky to reason about program behavior but I don't think any
of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues
will be the real benefit.

partial cherry pick from commit b91d8b2
The bug in BITFIELD seems to affect 12.2.1 used on Alpine
@oranagra oranagra mentioned this pull request Jan 16, 2023
oranagra pushed a commit that referenced this pull request Jan 17, 2023
**Signed integer overflow.** Although, signed overflow issue can be problematic time to time
and change how compiler generates code, current findings mostly about signed shift or simple
addition overflow. For most platforms Redis can be compiled for, this wouldn't cause any issue
as far as I can tell (checked generated code on godbolt.org).

UB means nothing guaranteed and risky to reason about program behavior but I don't think any
of the fixes here worth backporting. As sanitizers are now part of the CI, preventing new issues
will be the real benefit.

partial cherry pick from commit b91d8b2
The bug in BITFIELD seems to affect 12.2.1 used on Alpine

(cherry picked from commit 4418cf1)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes indication that this issue needs to be mentioned in the release notes state:to-be-merged The PR should be merged soon, even if not yet ready, this is used so that it won't be forgotten

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants