Skip to content

Add generic reconciler and replace L2 responder reconciler#27884

Closed
dylandreimerink wants to merge 3 commits intocilium:mainfrom
dylandreimerink:feature/generic-map-reconciler
Closed

Add generic reconciler and replace L2 responder reconciler#27884
dylandreimerink wants to merge 3 commits intocilium:mainfrom
dylandreimerink:feature/generic-map-reconciler

Conversation

@dylandreimerink
Copy link
Copy Markdown
Member

This PR adds generic map reconciliation logic. The idea behind this is that in order to add resilience, we want to move most maps to a pattern where the desired state lives in a stateDB table and reconciliation logic will keep the BPF map in sync with this desired state.

Since most BPF maps have the same requirements, we can reuse the same reconciliation logic. The only map to currently use a similar reconciler is the L2 responder map, so this PR also changes it out for this generic reconciler.

Lastly, this PR changes the L2 announcer to use the devices table instead of the legacy method of getting devices info and the GARP capability is now provided via hive so it can be mocked out. See commit messages for details.

Add generic reconciler and replace L2 responder reconciler

@maintainer-s-little-helper maintainer-s-little-helper bot added the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 1, 2023
@dylandreimerink dylandreimerink added release-note/misc This PR makes changes that have no direct user impact. feature/l2-announcement resiliency Tracks resiliency issues labels Sep 1, 2023
@maintainer-s-little-helper maintainer-s-little-helper bot removed the dont-merge/needs-release-note-label The author needs to describe the release impact of these changes. label Sep 1, 2023
@dylandreimerink dylandreimerink force-pushed the feature/generic-map-reconciler branch 2 times, most recently from 9ad7527 to 4a53399 Compare September 1, 2023 12:07
Added a generic map reconciler that can be used to reconcile stateDB
tables to BPF maps. This reconciler only work for maps that flow in one
direction, from the agent to the datapath. This reconciler doesn't read
any information back into the agent from the datapath.

The reconciler watches for changes in the stateDB table and applies
any changes directly (partial reconciliation) which is fast. It also
periodically does a full reconciliation to make sure that the stateDB
table and the BPF map are in sync.

The reconciler will retry individual entries and will not block the
reconciliation of other entries if one fails.

Signed-off-by: Dylan Reimerink <[email protected]>
So far, the l2 responder map reconciler which is part of the datapath
package has been responsible for sending GARP packets. It used the
global function to do so. However, in this commit series we will be
moving the GARP sending logic to the L2 announcer package which is
part of the control plane. We therefor need the ability to mock it out.

This commit adds a new interface to the GARP package which is
implemented by the existing "send garp packet on a given interface"
function. We then implement the interface with a small wrapper around
the global function and provide it via hive.

Signed-off-by: Dylan Reimerink <[email protected]>
@dylandreimerink dylandreimerink force-pushed the feature/generic-map-reconciler branch 2 times, most recently from 425eb61 to d8e10fe Compare September 1, 2023 13:50
This commit removes the `datapath/l2responder` package which did
reconciliation from a stateDB map to the `l2respondermap` and replaces
this logic with the generic map reconciler that was added in an earlier
commit in this series.

The custom reconciler was responsible for the interface name to
interface index conversion and sending GARP messages when a new entry
was added to the map. The generic reconciler does not do this, so this
functionality is moved to the `l2announcer` package. The devices table
is used to provide both the interface name to interface index without
the need for manual conversion. And the GARP messaging is now done via
a interface provided via hive.

Signed-off-by: Dylan Reimerink <[email protected]>
@dylandreimerink dylandreimerink force-pushed the feature/generic-map-reconciler branch from d8e10fe to 8c9db11 Compare September 1, 2023 13:51
if shutdown {
return nil
}

Copy link
Copy Markdown
Contributor

@joamaki joamaki Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is m.retryQueue.Done called? Forget does nothing except clear the rate-limiting data. Without call to Done(key) the key will never be processed again. Tests probably lacking coverage here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, that is unfortunate naming, but then again I did not check the API docs. You are likely right about the tests. This is challenging since the retry logic is all time based which makes for really flaky tests.

err := m.m.Delete(entry.Key())
// If we error, retry this key later, don't hold up the delete tracker.
if err != nil && !errors.Is(err, ebpf.ErrKeyNotExist) {
m.retryQueue.AddRateLimited(entry.Key())
Copy link
Copy Markdown
Contributor

@joamaki joamaki Sep 13, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this lead to races? E.g. it might be we'll retry the delete while it was added back? Forget doesn't remove anything from the queue. It would probably be better to not use workqueue here at all, but rather implement it directly and drive it from the same loop.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea is that we only store the key for retries, just like Resource[T] so it doesn't matter if something happened in the table between us adding the key and the actual retry action. Not sure this warrants a custom retry queue, although it might make testing and verification easier 🤔

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more that the snapshot that the retry gets might be older or newer than what we have here, e.g. that's why it might travel back in time.

return rev
}

func (m *mapReconciler[E, K, V]) Retirer(ctx context.Context) error {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be Retrier? Is there a reason this is exported?
BTW, would retryLoop and reconcileLoop work better as names?

@github-actions
Copy link
Copy Markdown

This pull request has been automatically marked as stale because it
has not had recent activity. It will be closed if no further activity
occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale. label Oct 14, 2023
@github-actions
Copy link
Copy Markdown

This pull request has not seen any activity since it was marked stale.
Closing.

@github-actions github-actions bot closed this Oct 29, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

feature/l2-announcement release-note/misc This PR makes changes that have no direct user impact. resiliency Tracks resiliency issues stale The stale bot thinks this issue is old. Add "pinned" label to prevent this from becoming stale.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants