
layer: fix same rw layer races #39209

Merged
thaJeztah merged 2 commits into moby:master from kolyshkin:mountedLayer.Lock
May 25, 2019

Conversation

@kolyshkin
Contributor

@kolyshkin kolyshkin commented May 13, 2019

(this is a fix to a bug introduced in #39135, and also a followup to #38265)

Consider the following scenario:

----- goroutine 1 -----               ----- goroutine 2 -----
ReleaseRWLayer()
  m := ls.mounts[l.Name()]
  ...
  m.deleteReference(l)
  m.hasReferences()
  ...                                 GetRWLayer()
  ...                                   mount := ls.mounts[id]
  ls.driver.Remove(m.mountID)
  ls.store.RemoveMount(m.name)          return mount.getReference()
  delete(ls.mounts, m.Name())
-----------------------               -----------------------

When something like this happens, GetRWLayer will return
an RWLayer without backing storage. Oops.

There might be more races like this, and it seems the best
solution is to lock by layer id/name by using pkg/locker.

With this in place, a name collision can not happen, so remove
the part of the previous commit that protected against it in
CreateRWLayer (the temporary nil assignment and associated rollback).

So, now we have
* layerStore.mountL sync.Mutex to protect the layerStore.mounts map[]
(against concurrent access);
* mountedLayer's embedded sync.Mutex to protect its references map[];
* layerStore.layerL (which I haven't touched);
* per-id locker, to avoid name conflicts and concurrent operations
on the same rw layer.

The whole rig seems to look more readable now (mutex use is
straightforward, no nested locks).

@kolyshkin
Contributor Author

@dmcgowan @tonistiigi PTAL

@kolyshkin
Contributor Author

CI failures are definitely unrelated

@codecov

codecov bot commented May 16, 2019

Codecov Report

❗ No coverage uploaded for pull request base (master@cae3c91). Click here to learn what that means.
The diff coverage is 86.66%.

@@            Coverage Diff            @@
##             master   #39209   +/-   ##
=========================================
  Coverage          ?   37.05%           
=========================================
  Files             ?      612           
  Lines             ?    45457           
  Branches          ?        0           
=========================================
  Hits              ?    16846           
  Misses            ?    26327           
  Partials          ?     2284

@kolyshkin kolyshkin force-pushed the mountedLayer.Lock branch from 7cb2411 to fad76ac May 16, 2019 20:56
@kolyshkin
Contributor Author

@AkihiroSuda PTAL (first commit)

@kolyshkin kolyshkin force-pushed the mountedLayer.Lock branch from fad76ac to 9a775d9 May 16, 2019 21:09
Member

@dmcgowan dmcgowan left a comment

LGTM

Member

Should RWLayer be called one too many times (e.g. the refcount gets confused from error handling or live-restore) compared to the ref count, then this seems like a deadlock. The first request will be stuck on driver.Remove; the second one will come in and block on m.Lock. After driver.Remove finishes, the first one can't continue because the second one is holding ls.mountL.

Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: on contention for the same layer (e.g. the last refs being released together, which is quite normal) this will still block the whole mountL. Better than before, but not as good as what pkg/locker and similar packages do.

Contributor Author

@kolyshkin kolyshkin May 21, 2019

I wish my built-in race detector would be as good as yours ;) Thanks!

I've replaced the ugly hack of mis-using mountedLayer.Lock() with a per-id lock. Now we have

  • layerStore.mountL sync.Mutex which is used solely to protect the layerStore.mounts map[] from concurrent access;
  • mountedLayer's embedded sync.Mutex which is used solely to protect mountedLayer.references map[] from concurrent access;
  • layerStore.layerL sync.Mutex which I haven't touched;
  • per-id locker, to avoid name conflicts and concurrent ops on the same layer.

The whole rig seems to look more readable now (mutex use is straightforward, no nested locks).

@tonistiigi PTAL

@kolyshkin
Contributor Author

@tonistiigi @dmcgowan PTAL (this one should be way easier to review)

@kolyshkin kolyshkin changed the title from "layer: fix GetRWLayer/ReleaseRWLayer race" to "layer: fix same rw layer races" May 21, 2019
@kolyshkin
Contributor Author

  1. experimental CI fails on TestPortExposeHostBinding; I'm seeing this for the first time.
00:42:23.397 ----------------------------------------------------------------------
00:42:23.397 FAIL: docker_cli_port_test.go:308: DockerSuite.TestPortExposeHostBinding
00:42:23.397 
00:42:23.397 docker_cli_port_test.go:327:
00:42:23.397     // Port is still bound after the Container is removed
00:42:23.397     c.Assert(err, checker.NotNil, check.Commentf("out: %s", out))
00:42:23.397 ... value = nil
00:42:23.397 ... out: 

Not sure why it is failing.

  2. janky and powerpc fail on
00:52:10.813 FAIL: docker_cli_search_test.go:47: DockerSuite.TestSearchCmdOptions
00:52:10.813 
00:52:10.815 assertion failed: expression is false: strings.Count(outSearchCmdStars1, "[OK]") <= strings.Count(outSearchCmd, "[OK]"): The quantity of images with stars should be less than that of all images: NAME                        DESCRIPTION                                     STARS     OFFICIAL   AUTOMATED
00:52:10.815 busybox                     Busybox base image.                             1584      [OK]       
00:52:10.815 progrium/busybox                                                            69                   [OK]
....

According to #26633, it could mean problems with Docker Hub. Still, the test can be improved, see #39243

@cpuguy83
Member

Haven't seen this one before

11:08:39 --- FAIL: TestCreateServiceSecretFileMode (8.89s)
11:08:39     create_test.go:224: Creating a new daemon
11:08:39     daemon.go:300: [d73753f11a294] waiting for daemon to start
11:08:39     daemon.go:332: [d73753f11a294] daemon started
11:08:39     create_test.go:265: assertion failed: 2 (int) != 1 (int)
11:08:39     daemon.go:284: [d73753f11a294] exiting daemon

@kolyshkin
Contributor Author

TestCreateServiceSecretFileMode

another flaky test :( #37132

@kolyshkin
Contributor Author

experimental failure

00:21:52.292 === RUN   TestServiceRemoveKeepsIngressNetwork
00:22:24.155 --- FAIL: TestServiceRemoveKeepsIngressNetwork (31.86s)
00:22:24.155     service_test.go:232: Creating a new daemon
00:22:24.155     daemon.go:300: [d36f5659ef00c] waiting for daemon to start
00:22:24.156     daemon.go:332: [d36f5659ef00c] daemon started
00:22:24.156     service_test.go:265: timeout hit after 30s: waiting for all tasks to be removed: task count at 1
00:22:24.156     daemon.go:284: [d36f5659ef00c] exiting daemon

@thaJeztah
Member

@kolyshkin @cpuguy83 looks green now

@kolyshkin
Contributor Author

Still need another review from @tonistiigi

@kolyshkin
Contributor Author

...and @dmcgowan

@kolyshkin kolyshkin requested review from dmcgowan and tonistiigi and removed request for tonistiigi May 23, 2019 18:59
Member

@dmcgowan dmcgowan left a comment

LGTM

Member

@thaJeztah thaJeztah left a comment

I'll follow @dmcgowan and @tonistiigi here; LGTM
