Skip to content

libnet: don't lock *Endpoint in the void#47216

Open
akerouanton wants to merge 1 commit intomoby:masterfrom
akerouanton:libnet-sbJoin-lock-for-real
Open

libnet: don't lock *Endpoint in the void#47216
akerouanton wants to merge 1 commit intomoby:masterfrom
akerouanton:libnet-sbJoin-lock-for-real

Conversation

@akerouanton
Copy link
Copy Markdown
Member

@akerouanton akerouanton commented Jan 25, 2024

- What I did

(*Endpoint).sbJoin(), (*Endpoint).sbLeave() start by calling (*Network).getEndpointFromStore() before doing anything else. In turn this method calls (*Store).GetObject() with a freshly created *Endpoint instance, meaning the *Endpoint instance returned by getEndpointFromStore is a totally different reference than the one already living in memory. Thus, any mutex locks on these new instances actually lock nothing at all.

Another property of GetObject() is to fully hydrate a given object based on the "canonical" reference that lives in its cache.

Since both sbJoin() and sbLeave() are always called on an *Endpoint receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated.

- How I did it

By conducting a static analysis of sbJoin(), sbLeave() and their callsites.

- How to verify it

CI

@akerouanton akerouanton added status/2-code-review area/networking Networking kind/refactor PR's that refactor, or clean-up code labels Jan 25, 2024
@akerouanton akerouanton added this to the 26.0.0 milestone Jan 25, 2024
@akerouanton akerouanton self-assigned this Jan 25, 2024
@thaJeztah
Copy link
Copy Markdown
Member

@akerouanton I think the failure may be related (not a flaky);

=== Failed
=== FAIL: libnetwork TestSandboxAddMultiPrio (0.29s)
time="2024-01-25T14:34:34Z" level=error msg="Could not add route to IPv6 network fe90::1/64 via device test_nw_0: network is down"
    sandbox_unix_test.go:179: another container is attached to the same network endpoint

=== FAIL: libnetwork TestResolvConf (0.10s)
    libnetwork_linux_test.go:2049: another container is attached to the same network endpoint
    libnetwork_linux_test.go:1977: network testnetwork id cdc1acb28253c687c212affa09a2183ef96a76907577fc23701f6ee2e63cba13 has active endpoints

@akerouanton
Copy link
Copy Markdown
Member Author

Yep, they're. I'll convert this one as a draft for now.

@akerouanton akerouanton marked this pull request as draft January 25, 2024 17:50
@akerouanton akerouanton force-pushed the libnet-sbJoin-lock-for-real branch from 66455fb to ada1cb5 Compare January 26, 2024 09:18
`(*Endpoint).sbJoin()` `(*Endpoint).sbLeave()` start by calling
`(*Network).getEndpointFromStore()` before doing anything else. In
turn this method calls `(*Store).GetObject()` with a freshly created
`*Endpoint` instance, meaning the `*Endpoint` instance returned by
`getEndpointFromStore` is a totally different reference than the one
already living in memory. Thus, any mutex locks on these new instances
actually lock *nothing* at all.

Another property of `GetObject()` is to fully hydrate a given object
based on the "canonical" reference that lives in its cache.

Since both `sbJoin()` and `sbLeave()` are always called on an
`*Endpoint` receiver which were either just created, or just retrieved
from the datastore, there's no chance that the receiver was partially
hydrated.

Signed-off-by: Albin Kerouanton <[email protected]>
@akerouanton akerouanton force-pushed the libnet-sbJoin-lock-for-real branch from ada1cb5 to 69513a3 Compare January 26, 2024 09:27
@akerouanton akerouanton marked this pull request as ready for review January 26, 2024 10:20
@akerouanton
Copy link
Copy Markdown
Member Author

@thaJeztah Turns out, the same fix had to be applied to sbLeave() too. This one's ready now.

@thaJeztah
Copy link
Copy Markdown
Member

Since both sbJoin() and sbLeave() are always called on an *Endpoint receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated.

Silly question; are there other code-paths that could have been executed concurrently that could have modified the endpoint (so the hydrated copy we have being a stale / outdated copy)? (as there's no lock before we retrieved the hydrated copy, so curious if there's any synchronisation to prevent that from happening).

@corhere
Copy link
Copy Markdown
Contributor

corhere commented Jan 26, 2024

Since both sbJoin() and sbLeave() are always called on an *Endpoint receiver which were either just created, or just retrieved from the datastore, there's no chance that the receiver was partially hydrated.

Consider the call chain (*Endpoint).sbJoin <- (*Endpoint).Join <- (*Sandbox).Refresh <- (*Daemon).updateNetwork. The sandbox is fetched from the controller's c.sandboxes map, the endpoints are fetched from the sandbox's sb.endpoints map, and ep.Join(sb) is called on all of those endpoints which were neither just created nor just retrieved from the datastore.

The endpoint receiver being partially hydrated is not the most important concern; if it is out of date w.r.t. the KVObject in the datastore, the n.getController().updateToStore(ep) call in sbJoin will fail and there is no rollback-retry loop. So long as func (*Network) getEndpointFromStore exists in its current form, I cannot trust that Endpoint objects in memory are unique.

@akerouanton akerouanton modified the milestones: 26.0.0, v-future Feb 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/networking Networking kind/refactor PR's that refactor, or clean-up code status/2-code-review

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants