
layer: fix same rw layer races #39209

Merged
thaJeztah merged 2 commits into moby:master from kolyshkin:mountedLayer.Lock
May 25, 2019

Conversation

@kolyshkin
Contributor

@kolyshkin kolyshkin commented May 13, 2019

(this is a fix to a bug introduced in #39135, and also a followup to #38265)

Consider the following scenario:

----- goroutine 1 -----               ----- goroutine 2 -----
ReleaseRWLayer()
  m := ls.mounts[l.Name()]
  ...
  m.deleteReference(l)
  m.hasReferences()
  ...                                 GetRWLayer()
  ...                                   mount := ls.mounts[id]
  ls.driver.Remove(m.mountID)
  ls.store.RemoveMount(m.name)          return mount.getReference()
  delete(ls.mounts, m.Name())
-----------------------               -----------------------

When something like this happens, GetRWLayer will return
an RWLayer without backing storage. Oops.

There might be more races like this, and it seems the best
solution is to lock by layer id/name by using pkg/locker.

With this in place, a name collision can not happen, so remove
the part of the previous commit that protected against it in
CreateRWLayer (the temporary nil assignment and associated rollback).

So, now we have
* layerStore.mountL sync.Mutex to protect the layerStore.mounts map[]
(against concurrent access);
* mountedLayer's embedded sync.Mutex to protect its references map[];
* layerStore.layerL (which I haven't touched);
* per-id locker, to avoid name conflicts and concurrent operations
on the same rw layer.

The whole rig seems to look more readable now (mutex use is
straightforward, no nested locks).

@kolyshkin
Contributor Author

@dmcgowan @tonistiigi PTAL

@kolyshkin
Contributor Author

CI failures are definitely unrelated

@codecov

codecov bot commented May 16, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@cae3c91). Click here to learn what that means.
The diff coverage is 86.66%.

@@            Coverage Diff            @@
##             master   #39209   +/-   ##
=========================================
  Coverage          ?   37.05%           
=========================================
  Files             ?      612           
  Lines             ?    45457           
  Branches          ?        0           
=========================================
  Hits              ?    16846           
  Misses            ?    26327           
  Partials          ?     2284

@kolyshkin kolyshkin force-pushed the mountedLayer.Lock branch from 7cb2411 to fad76ac May 16, 2019 20:56
@kolyshkin
Contributor Author

@AkihiroSuda PTAL (first commit)

@kolyshkin kolyshkin force-pushed the mountedLayer.Lock branch from fad76ac to 9a775d9 May 16, 2019 21:09
Member

@dmcgowan dmcgowan left a comment

LGTM

Member

Should RWLayer be called one too many times (e.g. the refcount gets confused from error handling or live-restore) compared to the ref count, then this seems like a deadlock. The first request will be stuck on driver.Remove; the second one will come in and block on m.Lock. After driver.Remove finishes, the first one can't continue because the second one is holding ls.mountL.

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: on contention for the same layer (e.g. the last refs being released together, which is quite normal) this will still block the whole mountL. Better than before, but not as good as what pkg/locker and similar packages do.

Contributor Author

@kolyshkin kolyshkin May 21, 2019

I wish my built-in race detector would be as good as yours ;) Thanks!

I've replaced the ugly hack of mis-using mountedLayer.Lock() with a per-id lock. Now we have

  • layerStore.mountL sync.Mutex which is used solely to protect the layerStore.mounts map[] from concurrent access;
  • mountedLayer's embedded sync.Mutex which is used solely to protect mountedLayer.references map[] from concurrent access;
  • layerStore.layerL sync.Mutex which I haven't touched;
  • per-id locker, to avoid name conflicts and concurrent ops on the same layer.

The whole rig seems to look more readable now (mutex use is straightforward, no nested locks).

@tonistiigi PTAL

@kolyshkin
Contributor Author

@tonistiigi @dmcgowan PTAL (this one should be way easier to review)

@kolyshkin kolyshkin changed the title from "layer: fix GetRWLayer/ReleaseRWLayer race" to "layer: fix same rw layer races" May 21, 2019
@kolyshkin
Contributor Author

  1. experimental CI fails on TestPortExposeHostBinding; I'm seeing this for the first time.
00:42:23.397 ----------------------------------------------------------------------
00:42:23.397 FAIL: docker_cli_port_test.go:308: DockerSuite.TestPortExposeHostBinding
00:42:23.397 
00:42:23.397 docker_cli_port_test.go:327:
00:42:23.397     // Port is still bound after the Container is removed
00:42:23.397     c.Assert(err, checker.NotNil, check.Commentf("out: %s", out))
00:42:23.397 ... value = nil
00:42:23.397 ... out: 

Not sure why it is failing.

  2. janky and powerpc fail on
00:52:10.813 FAIL: docker_cli_search_test.go:47: DockerSuite.TestSearchCmdOptions
00:52:10.813 
00:52:10.815 assertion failed: expression is false: strings.Count(outSearchCmdStars1, "[OK]") <= strings.Count(outSearchCmd, "[OK]"): The quantity of images with stars should be less than that of all images: NAME                        DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
00:52:10.815 busybox                     Busybox base image.                             1584      [OK]       
00:52:10.815 progrium/busybox                                                            69                   [OK]
....

According to #26633, it could mean problems with Docker Hub. Still, the test can be improved, see #39243

@cpuguy83
Member

Haven't seen this one before

11:08:39 --- FAIL: TestCreateServiceSecretFileMode (8.89s)
11:08:39     create_test.go:224: Creating a new daemon
11:08:39     daemon.go:300: [d73753f11a294] waiting for daemon to start
11:08:39     daemon.go:332: [d73753f11a294] daemon started
11:08:39     create_test.go:265: assertion failed: 2 (int) != 1 (int)
11:08:39     daemon.go:284: [d73753f11a294] exiting daemon

@kolyshkin
Contributor Author

TestCreateServiceSecretFileMode

another flaky test :( #37132

@kolyshkin
Contributor Author

experimental failure

00:21:52.292 === RUN   TestServiceRemoveKeepsIngressNetwork
00:22:24.155 --- FAIL: TestServiceRemoveKeepsIngressNetwork (31.86s)
00:22:24.155     service_test.go:232: Creating a new daemon
00:22:24.155     daemon.go:300: [d36f5659ef00c] waiting for daemon to start
00:22:24.156     daemon.go:332: [d36f5659ef00c] daemon started
00:22:24.156     service_test.go:265: timeout hit after 30s: waiting for all tasks to be removed: task count at 1
00:22:24.156     daemon.go:284: [d36f5659ef00c] exiting daemon

@thaJeztah
Member

@kolyshkin @cpuguy83 looks green now

@kolyshkin
Contributor Author

Still need another review from @tonistiigi

@kolyshkin
Contributor Author

...and @dmcgowan

@kolyshkin kolyshkin requested review from dmcgowan and tonistiigi and removed request for tonistiigi May 23, 2019 18:59
Member

@dmcgowan dmcgowan left a comment

LGTM

Member

@thaJeztah thaJeztah left a comment

I'll follow @dmcgowan and @tonistiigi here; LGTM
