
fix broken PEXPIREAT test #7791

Merged
oranagra merged 1 commit into redis:unstable from oranagra:fix_pexpireat_test
Sep 13, 2020

Conversation

@oranagra
Member

This test was nearly always failing on MacOS github actions.
This is because of bugs in the test that caused it to nearly always run
all 3 attempts and just look at the last one as the pass/fail criteria.

i.e. the test was nearly always running all 3 attempts and still sometimes
succeeded. This is because the break condition was different from the test
completion condition.

The reason the test succeeded is that the break condition tested the
results of all 3 tests (PSETEX/PEXPIRE/PEXPIREAT), but the success check
at the end only tested the result of PSETEX.
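
The mismatch can be sketched like this (an illustrative Python sketch of the broken retry pattern, not the actual Tcl test code; all names are assumptions):

```python
# Minimal sketch (assumed names, not the real Tcl test) of the broken
# retry pattern: the loop's break condition and the final pass/fail
# check disagree.
def run_test(results_per_attempt):
    """results_per_attempt: list of (psetex_ok, pexpire_ok, pexpireat_ok)."""
    attempts = 0
    for psetex_ok, pexpire_ok, pexpireat_ok in results_per_attempt:
        attempts += 1
        # Break only when all three sub-tests pass...
        if psetex_ok and pexpire_ok and pexpireat_ok:
            break
    # ...but success is judged by PSETEX alone, so the test can "pass"
    # after exhausting every attempt with PEXPIREAT still failing.
    return attempts, psetex_ok

# PEXPIREAT fails on every attempt, yet the test reports success:
attempts, passed = run_test([(True, True, False)] * 3)
print(attempts, passed)  # 3 True
```

This shows why the test "nearly always ran all 3 attempts and still sometimes succeeded".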

The reason the PEXPIREAT test nearly always failed is that it was
getting the current time wrong: it took the current second and lost
the sub-second part, so its only chance of succeeding was to run
right when a new second started.
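
The timing mistake can be illustrated as follows (a sketch of the general error, not the test's actual code):

```python
import time

# Taking only the whole-second clock and converting to milliseconds
# discards the sub-second fraction, so an absolute deadline computed
# from it lands up to ~999ms earlier than intended.
now = time.time()                 # e.g. 1600000000.750
now_ms = now * 1000.0             # real current time, in ms
truncated_ms = int(now) * 1000    # whole seconds only, in ms

# The error is exactly the lost fraction of the current second:
error_ms = now_ms - truncated_ms
assert 0 <= error_ms < 1000
```

Only when the test happened to run at the very start of a second was `error_ms` small enough for the deadline to be correct.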

Because I now get the time from redis, adding another round trip, I
added another 100ms to the PEXPIRE test to make it less fragile, and
also added many more attempts.
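
The fix's approach of asking the server for its own clock can be sketched with a hypothetical helper (not the actual test code; Redis's TIME command does reply with two items, unix seconds and microseconds):

```python
# Hypothetical helper: convert a reply from the Redis TIME command,
# [unix_seconds, microseconds], into a millisecond timestamp suitable
# for PEXPIREAT, keeping the sub-second part instead of discarding it.
def server_now_ms(time_reply):
    secs, usecs = map(int, time_reply)
    return secs * 1000 + usecs // 1000

print(server_now_ms(["1600000000", "500000"]))  # 1600000000500
```

Using the server's clock also sidesteps any client/server clock skew, at the cost of the extra round trip mentioned above.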

Adding many more attempts before failure to account for slow platforms,
github actions, and valgrind.

@oranagra oranagra requested a review from yossigo September 13, 2020 07:05
@oranagra oranagra merged commit ed9bfe2 into redis:unstable Sep 13, 2020
@oranagra oranagra deleted the fix_pexpireat_test branch September 13, 2020 10:50
oranagra added a commit that referenced this pull request Oct 27, 2020
(cherry picked from commit ed9bfe2)
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 4, 2020
jschmieg pushed a commit to memKeyDB/memKeyDB that referenced this pull request Nov 6, 2020
(cherry picked from commit ed9bfe2)
@enjoy-binbin
Contributor

@oranagra
Member Author

@enjoy-binbin what do you mean? These thresholds in the commit you referred to were already applied. Do you mean to take them further apart?

Please note that before this PR the test was actually broken and sometimes passed anyway.
Now it nearly always passes, and in verbose mode it should print the number of iterations it took.
Maybe you can gather some statistics on how many iterations it takes on average / max on a few platforms?
Or maybe we should just add a few more iterations?

@enjoy-binbin
Contributor

enjoy-binbin commented Oct 24, 2021

@oranagra I ran a loop and cycled the test more than 200 times in my daily CI (test-macos-latest); here are the results.
The results are a bit strange: many of them are over 30 iterations.

  • number of executions: 217
  • min: 0
  • max: 417
  • avg: 57

arrays: [130, 14, 99, 42, 15, 57, 45, 21, 40, 1, 20, 149, 159, 21, 10, 18, 43, 65, 64, 0, 54, 25, 134, 2, 58, 23, 15, 97, 4, 155, 13, 22, 0, 16, 109, 51, 64, 77, 181, 58, 138, 14, 51, 112, 64, 0, 47, 2, 7, 91, 26, 16, 48, 4, 15, 76, 43, 138, 142, 91, 14, 33, 83, 123, 74, 41, 44, 119, 47, 1, 0, 44, 7, 6, 123, 5, 84, 19, 30, 19, 271, 89, 23, 73, 170, 41, 88, 113, 57, 417, 19, 50, 68, 17, 175, 44, 86, 15, 13, 17, 31, 1, 13, 31, 6, 8, 19, 30, 5, 137, 86, 99, 150, 89, 30, 29, 2, 26, 5, 42, 32, 4, 41, 10, 68, 106, 25, 86, 70, 90, 125, 29, 11, 117, 4, 91, 13, 74, 275, 95, 62, 73, 27, 15, 81, 57, 35, 119, 62, 11, 14, 94, 10, 69, 20, 18, 42, 27, 14, 18, 65, 93, 136, 116, 19, 111, 160, 2, 102, 35, 95, 7, 134, 3, 2, 108, 36, 33, 68, 42, 11, 54, 33, 20, 76, 48, 31, 103, 81, 20, 12, 1, 48, 2, 137, 57, 81, 58, 72, 31, 117, 157, 76, 231, 25, 53, 51, 49, 55, 26, 40, 46, 20, 36, 26, 24, 0, 3, 37]

[image: histogram of iteration counts]

@oranagra
Member Author

that's odd, i downloaded the last daily results and got this:

oran@Oran-laptop:~/temp/daily_action_logs$ grep -r "sub-second expire test attempts"
1_test-centos7-tls.txt:2021-10-25T00:08:26.7658733Z sub-second expire test attempts: 0
1_test-centos7-tls.txt:2021-10-25T00:30:41.3665369Z sub-second expire test attempts: 0
1_test-alpine-libc-malloc.txt:2021-10-25T00:04:50.9915561Z sub-second expire test attempts: 0
test-ubuntu-libc-malloc/6_test.txt:2021-10-25T00:04:45.4892803Z sub-second expire test attempts: 0
test-ubuntu-tls/6_test.txt:2021-10-25T00:07:50.2645533Z sub-second expire test attempts: 0
test-ubuntu-tls/6_test.txt:2021-10-25T00:30:18.1831327Z sub-second expire test attempts: 0
test-freebsd/4_test.txt:2021-10-25T00:08:26.4350520Z sub-second expire test attempts: 0
1_test-freebsd.txt:2021-10-25T00:08:26.4350560Z sub-second expire test attempts: 0
1_test-ubuntu-no-malloc-usable-size.txt:2021-10-25T00:04:33.2134574Z sub-second expire test attempts: 0
1_test-ubuntu-jemalloc (1).txt:2021-10-25T00:05:41.4787388Z Waiting for background AOF rewrite to finish... sub-second expire test attempts: 0
1_test-ubuntu-32bit.txt:2021-10-25T00:06:09.0814038Z sub-second expire test attempts: 0
test-ubuntu-jemalloc/6_test.txt:2021-10-25T00:05:41.4787348Z Waiting for background AOF rewrite to finish... sub-second expire test attempts: 0
1_test-valgrind.txt:2021-10-25T00:20:03.6155656Z sub-second expire test attempts: 0
test-ubuntu-no-malloc-usable-size/6_test.txt:2021-10-25T00:04:33.2134235Z sub-second expire test attempts: 0
test-valgrind/6_test.txt:2021-10-25T00:20:03.6155606Z sub-second expire test attempts: 0
1_test-centos7-jemalloc.txt:2021-10-25T00:05:29.2406416Z sub-second expire test attempts: 0
test-centos7-jemalloc/7_test.txt:2021-10-25T00:05:29.2406384Z sub-second expire test attempts: 0
test-alpine-jemalloc/7_test.txt:2021-10-25T00:06:14.9224101Z sub-second expire test attempts: 0
test-macos-latest/5_test.txt:2021-10-25T00:09:52.3552550Z Waiting for background AOF rewrite to finish... sub-second expire test attempts: 1
test-valgrind-no-malloc-usable-size/6_test.txt:2021-10-25T00:22:20.9702583Z sub-second expire test attempts: 0
1_test-macos-latest.txt:2021-10-25T00:09:52.3553140Z Waiting for background AOF rewrite to finish... sub-second expire test attempts: 1
1_test-ubuntu-tls.txt:2021-10-25T00:07:50.2645630Z sub-second expire test attempts: 0
1_test-ubuntu-tls.txt:2021-10-25T00:30:18.1831332Z sub-second expire test attempts: 0
1_test-valgrind-no-malloc-usable-size.txt:2021-10-25T00:22:20.9702617Z sub-second expire test attempts: 0
1_test-ubuntu-libc-malloc.txt:2021-10-25T00:04:45.4892855Z sub-second expire test attempts: 0
test-alpine-libc-malloc/7_test.txt:2021-10-25T00:04:50.9915542Z sub-second expire test attempts: 0
1_test-alpine-jemalloc.txt:2021-10-25T00:06:14.9224124Z sub-second expire test attempts: 0
test-centos7-tls/7_test.txt:2021-10-25T00:08:26.7658692Z sub-second expire test attempts: 0
test-centos7-tls/7_test.txt:2021-10-25T00:30:41.3662987Z sub-second expire test attempts: 0
test-ubuntu-32bit/6_test.txt:2021-10-25T00:06:09.0814020Z sub-second expire test attempts: 0

i.e. most passed on the first attempt and some on the second (not even close to the threshold of 30).
if what you said is true (that on average it takes 57 attempts), i would expect to see many more failures, and so far we've seen only one.

anyway, what do you suggest?
i think we can leave it as is, or just change the threshold from 30 to 50, but if you see some other improvement opportunity please explain it or open a PR.

@enjoy-binbin
Contributor

@oranagra odd.. I did it again in my Daily CI

// 1. This one executed very quickly, max: 15 attempts
diff: enjoy-binbin@972ac61
ci logs: https://github.com/enjoy-binbin/redis/runs/3996247138?check_suite_focus=true#step:5:39

// 2. This one pretty slow, max: 44 attempts
ci logs: https://github.com/enjoy-binbin/redis/runs/3996357109?check_suite_focus=true#step:5:106

// 3. This one slow too, max: more than 50 attempts
ci logs: https://github.com/enjoy-binbin/redis/runs/3996452839?check_suite_focus=true#step:5:40

// 4. This one max: 23 attempts
ci logs: https://github.com/enjoy-binbin/redis/runs/3996530373?check_suite_focus=true#step:5:98

I suppose it may be related to github actions,
e.g. whether multiple runner jobs run at the same time: runs 2 and 3 overlapped with other jobs, while runs 1 and 4 ran separately.

The cycle results show that it is not particularly stable... so I suppose we can keep the threshold.
