Skip to content

Fix crash in script timeout during AOF loading#7870

Merged
oranagra merged 1 commit intoredis:unstablefrom
oranagra:aof_script_timeout
Oct 1, 2020
Merged

Fix crash in script timeout during AOF loading#7870
oranagra merged 1 commit intoredis:unstablefrom
oranagra:aof_script_timeout

Conversation

@oranagra
Copy link
Member

This is a regression in 6.0 (connection abstraction)
it seems it can be triggered only when setting script command
replication to no (or loading old AOF files)

@oranagra oranagra requested a review from yossigo September 30, 2020 14:46
@oranagra
Copy link
Member Author

oranagra commented Sep 30, 2020

Closes #7869

@oranagra oranagra linked an issue Sep 30, 2020 that may be closed by this pull request
@yossigo
Copy link
Collaborator

yossigo commented Sep 30, 2020

@oranagra Perhaps it's time to step up the deprecation process of script replication and produce a warning message if it's enabled.

yossigo
yossigo previously approved these changes Sep 30, 2020
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit concerned about the flakiness of this, did you consider other options?
(I don't have a better suggestion that doesn't involve creating new synthetic delay commands that are script-accessible)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i can't think of any other option.
scripts are by definition immune from being affected by external factors.
DEBUG can't be used in scripts.
and SCRIPT KILL can't be used since this one has to have a write command in it.
the only thing i can think of is find some command that has a deterministic timing.
but obviously blocked commands with timeout are not allowed either.

maybe we can settle on the fact i was able to reproduce it when i fixed it, and we can drop the test now.
or maybe keep the test, but remove the assertion that makes sure it was able to trigger the BUSY and LOADING state. (so that it won't be flaky)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i guess i was missing the rd flush and after 100 to make it work.
added some timing to see what's going on, and added more time so that on my machine it took 900ms for each eval.
on actions it took:

script took 20125 milliseconds
loading took 19310 milliseconds

i guess we can cut it by half?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the change it takes some 400ms on my machine, and 10s on github actions

script took 10532 milliseconds
loading took 11399 milliseconds

@oranagra
Copy link
Member Author

@yossigo perhaps.. i'm not sure if i wanna blame that bug on the connection abstraction and script timeout, or on the lua-replicate-commands no feature.
i'm also not sure if we wanna backport that fix urgently to 6.0.x or not.

@yossigo
Copy link
Collaborator

yossigo commented Sep 30, 2020

@oranagra probably both but how can it reproduce without lua-replicate-commands no?

@oranagra
Copy link
Member Author

maybe loading an AOF file from v3.2?
let's wait to hear from the reporter before we declare panic.

This is a regression in 6.0 (connection abstraction)
it seems it can be triggered only when setting script command
replication to no (or loading old AOF files)
@oranagra oranagra merged commit dc803d2 into redis:unstable Oct 1, 2020
@oranagra oranagra deleted the aof_script_timeout branch October 1, 2020 08:27
@oranagra
Copy link
Member Author

oranagra commented Oct 1, 2020

@redis/core-team please let me know if you think this fix should be released as 6.0.9 ASAP.
considering it requires

  1. long running script
  2. either AOF from old version or manually disabling lua-replicate-commands
    and considering there's a work around.
    i feel it's not urgent.

oranagra added a commit that referenced this pull request Oct 27, 2020
JackieXie168 pushed a commit to JackieXie168/redis that referenced this pull request Nov 4, 2020
jschmieg pushed a commit to memKeyDB/memKeyDB that referenced this pull request Nov 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

redis 6.0.8 crashing when reloading from aof file

2 participants