Skip to content

snapshots/devmapper: fix race windown causing IO hangup#4235

Merged
estesp merged 1 commit intocontainerd:masterfrom
renzhengeek:renzhen/fix-iohang
May 7, 2020
Merged

snapshots/devmapper: fix race windown causing IO hangup#4235
estesp merged 1 commit intocontainerd:masterfrom
renzhengeek:renzhen/fix-iohang

Conversation

@renzhengeek
Copy link
Copy Markdown

The issue beblow happens several times beforing the root
cause found:

  1. A fdisk -l process has being hung up for a long time;
  2. A image layer snapshot device is visiable to dmsetup, which
    should not happen because it should be deactivated after
    Commit();

The backtrace of fdisk is always the same over time:

[<ffffffff810bbc6a>] io_schedule+0x2a/0x80
[<ffffffff81295a3f>] do_blockdev_direct_IO+0x1e9f/0x2f10
[<ffffffff81296aea>] __blockdev_direct_IO+0x3a/0x40
[<ffffffff81290e43>] blkdev_direct_IO+0x43/0x50
[<ffffffff811b8a14>] generic_file_read_iter+0x374/0x960
[<ffffffff81291ad5>] blkdev_read_iter+0x35/0x40
[<ffffffff8125229b>] new_sync_read+0xfb/0x240
[<ffffffff81252406>] __vfs_read+0x26/0x40
[<ffffffff81252b96>] vfs_read+0x96/0x130
[<ffffffff812540e5>] SyS_read+0x55/0xc0
[<ffffffff81003c04>] do_syscall_64+0x74/0x180

The root cause is, in Commit(), there's a race window between
SuspendDevice() and DeactivateDevice(), which may cause the
IOs of a process or command like fdisk on the "suspended" device
hang up forever. It has twofold:

  1. The IOs suspends on the devices;
  2. The device is in Suspended state, because it's deactivated with
    deferred flag and without force flag;

So they cannot make progress.

One reproducer is:

  1. enlarge the race window by putting sleep seconds there;
  2. run while true; do sudo fdisk -l; sleep 0.5; done on one terminal;
  3. and pull image on another terminal;

Fixes it by:

  1. Resume the devices again after flushing IO by suspend;
  2. Remove device without deferred flag;

Fix: #4234
Signed-off-by: Eric Ren [email protected]

@renzhengeek
Copy link
Copy Markdown
Author

@mxpv @fuweid Hi, please take a look at this fix, thanks~

@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented May 6, 2020

Build succeeded.

Copy link
Copy Markdown
Member

@mxpv mxpv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, one very minor nit.

Comment thread snapshots/devmapper/snapshotter.go Outdated
The issue beblow happens several times beforing the root
cause found:

  1. A `fdisk -l` process has being hung up for a long time;
  2. A image layer snapshot device is visiable to dmsetup, which
       should *not* happen because it should be deactivated after
       `Commit()`;

The backtrace of `fdisk` is always the same over time:

```bash
[<ffffffff810bbc6a>] io_schedule+0x2a/0x80
[<ffffffff81295a3f>] do_blockdev_direct_IO+0x1e9f/0x2f10
[<ffffffff81296aea>] __blockdev_direct_IO+0x3a/0x40
[<ffffffff81290e43>] blkdev_direct_IO+0x43/0x50
[<ffffffff811b8a14>] generic_file_read_iter+0x374/0x960
[<ffffffff81291ad5>] blkdev_read_iter+0x35/0x40
[<ffffffff8125229b>] new_sync_read+0xfb/0x240
[<ffffffff81252406>] __vfs_read+0x26/0x40
[<ffffffff81252b96>] vfs_read+0x96/0x130
[<ffffffff812540e5>] SyS_read+0x55/0xc0
[<ffffffff81003c04>] do_syscall_64+0x74/0x180
```

The root cause is, in Commit(), there's a race window between
`SuspendDevice()` and `DeactivateDevice()`, which may cause the
IOs of a process or command like `fdisk` on the "suspended" device
hang up forever. It has twofold:

  1. The IOs suspends on the devices;
  2. The device is in `Suspended` state, because it's deactivated with
     `deferred` flag and without `force` flag;

So they cannot make progress.

One reproducer is:
 1. enlarge the race window by putting sleep seconds there;
 2. run `while true; do sudo fdisk -l; sleep 0.5; done` on one terminal;
 3. and pull image on another terminal;

Fixes it by:
 1. Resume the devices again after flushing IO by suspend;
 2. Remove device without `deferred` flag;

Fix: containerd#4234
Signed-off-by: Eric Ren <[email protected]>
@renzhengeek renzhengeek force-pushed the renzhen/fix-iohang branch from 28e7393 to 63b7587 Compare May 6, 2020 23:47
@theopenlab-ci
Copy link
Copy Markdown

theopenlab-ci Bot commented May 6, 2020

Build succeeded.

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #4235 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master    #4235   +/-   ##
=======================================
  Coverage   38.34%   38.34%           
=======================================
  Files          90       90           
  Lines       12728    12728           
=======================================
  Hits         4881     4881           
  Misses       7181     7181           
  Partials      666      666           
Flag Coverage Δ
#windows 38.34% <ø> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b1f5146...63b7587. Read the comment docs.

Copy link
Copy Markdown
Member

@fuweid fuweid left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Member

@estesp estesp left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@estesp estesp merged commit 80859e8 into containerd:master May 7, 2020
@mxpv mxpv mentioned this pull request Dec 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[snapshots/devmapper] race window causing process IO hang

5 participants