layer: fix same rw layer races #39209
Conversation
@dmcgowan @tonistiigi PTAL
CI failures are definitely unrelated.
Force-pushed ac46745 to 7cb2411
Codecov Report

@@           Coverage Diff           @@
##           master   #39209   +/-  ##
========================================
  Coverage        ?   37.05%
========================================
  Files           ?      612
  Lines           ?    45457
  Branches        ?        0
========================================
  Hits            ?    16846
  Misses          ?    26327
  Partials        ?     2284
Force-pushed 7cb2411 to fad76ac
@AkihiroSuda PTAL (first commit)
Force-pushed fad76ac to 9a775d9
layer/layer_store.go (outdated)
Should RWLayer be called one too many times (e.g. the refcount gets confused by error handling or live-restore) compared to the ref count, then this looks like a deadlock. The first request will be stuck on driver.Remove; the second one will come in and block on m.Lock. After driver.Remove finishes, the first can't continue because the second one is holding ls.mountL.
nit: on contention for the same layer (e.g. the last refs being released together, which is quite normal) this will still block the whole mountL. Better than before, but not as good as what pkg/locker and similar packages do.
I wish my built-in race detector would be as good as yours ;) Thanks!
I've replaced the ugly hack of mis-using mountedLayer.Lock() with a per-id lock. Now we have:
- layerStore.mountL sync.Mutex, used solely to protect layerStore.mount map[] from concurrent access;
- mountedLayer's embedded sync.Mutex, used solely to protect mountedLayer.references map[] from concurrent access;
- layerStore.layerL sync.Mutex, which I haven't touched;
- a per-id locker, to avoid name conflicts and concurrent ops on the same layer.
The whole rig seems to look more readable now (mutex use is straightforward, no nested locks).
@tonistiigi PTAL
Force-pushed 9a775d9 to 91793e0
@tonistiigi @dmcgowan PTAL (this one should be way easier to review)
Not sure why it is failing. According to #26633, it could mean problems with Docker Hub. Still, the test can be improved; see #39243.
Haven't seen this one before
another flaky test :( #37132
experimental failure
@kolyshkin @cpuguy83 looks green now
Still need another review from @tonistiigi
...and @dmcgowan
thaJeztah
left a comment
I'll follow @dmcgowan and @tonistiigi here; LGTM
(this is a fix to a bug introduced in #39135, and also a followup to #38265)
Consider the following scenario:
When something like this happens, GetRWLayer will return an
RWLayer without a storage. Oops.

There might be more races like this, and it seems the best
solution is to lock by layer id/name by using pkg/locker.
With this in place, name collision could not happen, so remove
the part of previous commit that protected against it in
CreateRWLayer (temporary nil assignment and associated rollback).
So, now we have
* layerStore.mountL sync.Mutex to protect layerStore.mount map[]
(against concurrent access);
* mountedLayer's embedded sync.Mutex to protect its references map[];
* layerStore.layerL (which I haven't touched);
* per-id locker, to avoid name conflicts and concurrent operations
on the same rw layer.
The whole rig seems to look more readable now (mutexes use is
straightforward, no nested locks).