libnet: don't lock `*Endpoint` in the void by akerouanton · Pull Request #47216 · moby/moby

akerouanton · 2024-01-25T11:52:25Z

- What I did

(*Endpoint).sbJoin(), (*Endpoint).sbLeave() start by calling (*Network).getEndpointFromStore() before doing anything else. In turn this method calls (*Store).GetObject() with a freshly created *Endpoint instance, meaning the *Endpoint instance returned by getEndpointFromStore is a totally different reference than the one already living in memory. Thus, any mutex locks on these new instances actually lock nothing at all.

Another property of GetObject() is to fully hydrate a given object based on the "canonical" reference that lives in its cache.

Since both sbJoin() and sbLeave() are always called on an *Endpoint receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated.

- How I did it

By conducting a static analysis of sbJoin(), sbLeave() and their callsites.

- How to verify it

CI

thaJeztah · 2024-01-25T15:18:01Z

@akerouanton I think the failure may be related (not a flaky);

=== Failed
=== FAIL: libnetwork TestSandboxAddMultiPrio (0.29s)
time="2024-01-25T14:34:34Z" level=error msg="Could not add route to IPv6 network fe90::1/64 via device test_nw_0: network is down"
    sandbox_unix_test.go:179: another container is attached to the same network endpoint

=== FAIL: libnetwork TestResolvConf (0.10s)
    libnetwork_linux_test.go:2049: another container is attached to the same network endpoint
    libnetwork_linux_test.go:1977: network testnetwork id cdc1acb28253c687c212affa09a2183ef96a76907577fc23701f6ee2e63cba13 has active endpoints

akerouanton · 2024-01-25T17:49:56Z

Yep, they're. I'll convert this one as a draft for now.

`(*Endpoint).sbJoin()` `(*Endpoint).sbLeave()` start by calling `(*Network).getEndpointFromStore()` before doing anything else. In turn this method calls `(*Store).GetObject()` with a freshly created `*Endpoint` instance, meaning the `*Endpoint` instance returned by `getEndpointFromStore` is a totally different reference than the one already living in memory. Thus, any mutex locks on these new instances actually lock *nothing* at all. Another property of `GetObject()` is to fully hydrate a given object based on the "canonical" reference that lives in its cache. Since both `sbJoin()` and `sbLeave()` are always called on an `*Endpoint` receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated. Signed-off-by: Albin Kerouanton <[email protected]>

akerouanton · 2024-01-26T10:21:16Z

@thaJeztah Turns out, the same fix had to be applied to sbLeave() too. This one's ready now.

thaJeztah · 2024-01-26T12:53:03Z

Since both sbJoin() and sbLeave() are always called on an *Endpoint receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated.

Silly question; are there other code-paths that could have been executed concurrently that could have modified the endpoint (so the hydrated copy we have being a stale / outdated copy)? (as there's no lock before we retrieved the hydrated copy, so curious if there's any synchronisation to prevent that from happening).

corhere · 2024-01-26T17:34:56Z

Since both sbJoin() and sbLeave() are always called on an *Endpoint receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated.

Consider the call chain (*Endpoint).sbJoin <- (*Endpoint).Join <- (*Sandbox).Refresh <- (*Daemon).updateNetwork. The sandbox is fetched from the controller's c.sandboxes map, the endpoints are fetched from the sandbox's sb.endpoints map, and ep.Join(sb) is called on all of those endpoints which were neither just created nor just retrieved from the datastore.

The endpoint receiver being partially hydrated is not the most important concern; if it is out of date w.r.t. the KVObject in the datastore, the n.getController().updateToStore(ep) call in sbJoin will fail and there is no rollback-retry loop. So long as func (*Network) getEndpointFromStore exists in its current form, I cannot trust that Endpoint objects in memory are unique.

akerouanton added status/2-code-review area/networking Networking kind/refactor PR's that refactor, or clean-up code labels Jan 25, 2024

akerouanton added this to the 26.0.0 milestone Jan 25, 2024

akerouanton requested review from corhere, robmry and thaJeztah January 25, 2024 11:52

akerouanton self-assigned this Jan 25, 2024

akerouanton mentioned this pull request Jan 25, 2024

libnet: remove the EndpointInfo interface #47217

Draft

akerouanton marked this pull request as draft January 25, 2024 17:50

akerouanton force-pushed the libnet-sbJoin-lock-for-real branch from 66455fb to ada1cb5 Compare January 26, 2024 09:18

akerouanton force-pushed the libnet-sbJoin-lock-for-real branch from ada1cb5 to 69513a3 Compare January 26, 2024 09:27

akerouanton marked this pull request as ready for review January 26, 2024 10:20

akerouanton modified the milestones: 26.0.0, v-future Feb 21, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

libnet: don't lock `*Endpoint` in the void#47216

libnet: don't lock `*Endpoint` in the void#47216
akerouanton wants to merge 1 commit intomoby:masterfrom
akerouanton:libnet-sbJoin-lock-for-real

akerouanton commented Jan 25, 2024 •

edited

Loading

Uh oh!

thaJeztah commented Jan 25, 2024

Uh oh!

akerouanton commented Jan 25, 2024

Uh oh!

akerouanton commented Jan 26, 2024

Uh oh!

thaJeztah commented Jan 26, 2024

Uh oh!

corhere commented Jan 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

akerouanton commented Jan 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

thaJeztah commented Jan 25, 2024

Uh oh!

akerouanton commented Jan 25, 2024

Uh oh!

akerouanton commented Jan 26, 2024

Uh oh!

thaJeztah commented Jan 26, 2024

Uh oh!

corhere commented Jan 26, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

akerouanton commented Jan 25, 2024 •

edited

Loading