snapshots/devmapper: fix race window causing IO hangup #4235
Merged
estesp merged 1 commit into containerd:master on May 7, 2020
Build succeeded.
The issue below happened several times before the root
cause was found:
1. An `fdisk -l` process has been hung for a long time;
2. An image layer snapshot device is visible to dmsetup, which
should *not* happen because it should be deactivated after
`Commit()`;
The backtrace of `fdisk` is always the same over time:
```text
[<ffffffff810bbc6a>] io_schedule+0x2a/0x80
[<ffffffff81295a3f>] do_blockdev_direct_IO+0x1e9f/0x2f10
[<ffffffff81296aea>] __blockdev_direct_IO+0x3a/0x40
[<ffffffff81290e43>] blkdev_direct_IO+0x43/0x50
[<ffffffff811b8a14>] generic_file_read_iter+0x374/0x960
[<ffffffff81291ad5>] blkdev_read_iter+0x35/0x40
[<ffffffff8125229b>] new_sync_read+0xfb/0x240
[<ffffffff81252406>] __vfs_read+0x26/0x40
[<ffffffff81252b96>] vfs_read+0x96/0x130
[<ffffffff812540e5>] SyS_read+0x55/0xc0
[<ffffffff81003c04>] do_syscall_64+0x74/0x180
```
The root cause is that, in `Commit()`, there is a race window between
`SuspendDevice()` and `DeactivateDevice()`. IO issued in that window by
a process or command like `fdisk` on the "suspended" device may
hang forever. Two conditions combine:
1. The IOs are suspended on the device;
2. The device stays in the `Suspended` state, because it is deactivated with
the `deferred` flag and without the `force` flag;
So the IOs cannot make progress.
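The hang can be illustrated with a toy model. This is only a sketch and is not containerd or device-mapper code: `toyDevice`, `Suspend`, `Resume`, `Read`, and `readFinishesWithin` are hypothetical names, and the only behavior modeled is that a read issued while the device is suspended waits until the device is resumed.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// toyDevice is a hypothetical stand-in for a device-mapper device.
// It models a single property: IO issued while suspended has to wait.
type toyDevice struct {
	mu        sync.Mutex
	cond      *sync.Cond
	suspended bool
}

func newToyDevice() *toyDevice {
	d := &toyDevice{}
	d.cond = sync.NewCond(&d.mu)
	return d
}

// Suspend marks the device suspended; reads issued afterwards block.
func (d *toyDevice) Suspend() {
	d.mu.Lock()
	d.suspended = true
	d.mu.Unlock()
}

// Resume clears the suspended state and wakes every blocked reader.
func (d *toyDevice) Resume() {
	d.mu.Lock()
	d.suspended = false
	d.mu.Unlock()
	d.cond.Broadcast()
}

// Read blocks for as long as the device is suspended.
func (d *toyDevice) Read() {
	d.mu.Lock()
	for d.suspended {
		d.cond.Wait()
	}
	d.mu.Unlock()
}

// readFinishesWithin starts a Read and reports whether it completes
// within the given number of milliseconds.
func readFinishesWithin(d *toyDevice, millis int) bool {
	done := make(chan struct{})
	go func() {
		d.Read()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(time.Duration(millis) * time.Millisecond):
		return false
	}
}

func main() {
	d := newToyDevice()
	d.Suspend()
	// A reader arriving inside the window is stuck, like fdisk in the trace.
	fmt.Println("read while suspended finishes:", readFinishesWithin(d, 100))
	d.Resume()
	fmt.Println("read after resume finishes:", readFinishesWithin(d, 1000))
}
```

In the model, nothing ever calls `Resume` if the device is left suspended and only scheduled for deferred removal, which corresponds to the two conditions listed above.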
One reproducer is:
1. enlarge the race window by adding a sleep of a few seconds between the two calls;
2. run `while true; do sudo fdisk -l; sleep 0.5; done` in one terminal;
3. pull an image in another terminal;
Fix it by:
1. Resuming the device again after the suspend has flushed the IO;
2. Removing the device without the `deferred` flag;
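The ordering of the two fix steps can be sketched as follows. This is an illustration under assumptions, not the code changed by this PR: the `deviceOps` interface, `flushAndRemove`, and the `recorder` fake are hypothetical names introduced here, and the real snapshotter methods may take different arguments.

```go
package main

import "fmt"

// deviceOps is a hypothetical interface used only for this sketch;
// it is not the containerd devmapper API.
type deviceOps interface {
	Suspend(name string) error
	Resume(name string) error
	Remove(name string, deferred bool) error
}

// flushAndRemove suspends the device to flush in-flight IO, resumes it
// so that no IO stays queued, then removes it without deferring.
func flushAndRemove(ops deviceOps, name string) error {
	if err := ops.Suspend(name); err != nil {
		return fmt.Errorf("suspend %s: %w", name, err)
	}
	if err := ops.Resume(name); err != nil {
		return fmt.Errorf("resume %s: %w", name, err)
	}
	if err := ops.Remove(name, false); err != nil {
		return fmt.Errorf("remove %s: %w", name, err)
	}
	return nil
}

// recorder is a fake deviceOps that records the calls it receives.
type recorder struct {
	calls []string
}

func (r *recorder) Suspend(name string) error {
	r.calls = append(r.calls, "suspend "+name)
	return nil
}

func (r *recorder) Resume(name string) error {
	r.calls = append(r.calls, "resume "+name)
	return nil
}

func (r *recorder) Remove(name string, deferred bool) error {
	r.calls = append(r.calls, fmt.Sprintf("remove %s deferred=%t", name, deferred))
	return nil
}

func main() {
	r := &recorder{}
	if err := flushAndRemove(r, "snap-1"); err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range r.calls {
		fmt.Println(c)
	}
}
```

Because the device is resumed before removal, a reader that slips in between the steps is served instead of being queued on a suspended device.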
Fix: containerd#4234
Signed-off-by: Eric Ren <[email protected]>
Codecov Report
@@ Coverage Diff @@
## master #4235 +/- ##
=======================================
Coverage 38.34% 38.34%
=======================================
Files 90 90
Lines 12728 12728
=======================================
Hits 4881 4881
Misses 7181 7181
Partials 666 666