client-go/tools/cache: add APIs with context parameter #126387

k8s-ci-robot merged 3 commits into kubernetes:master
Conversation
Force-pushed from 8e98f6b to a5ffb2c (Compare)
Force-pushed from a5ffb2c to 215861c (Compare)
/triage accepted

/test pull-kubernetes-apidiff
```go
}

func NewNamingConditionController(
	ctx context.Context,
```
- ./pkg/controller/status.NewNamingConditionController: changed from func(k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/apiextensions/v1.CustomResourceDefinitionInformer, k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset/typed/apiextensions/v1.CustomResourceDefinitionsGetter) *NamingConditionController to func(context.Context, k8s.io/apiextensions-apiserver/pkg/client/informers/externalversions/apiextensions/v1.CustomResourceDefinitionInformer, k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset/typed/apiextensions/v1.CustomResourceDefinitionsGetter) *NamingConditionController
This seems to be unused outside of Kubernetes, so breaking the API seems acceptable.
I think all controllers can be considered internal and breaking consumers is ok.
```go
	//
	// It's an error to call Run more than once.
	// RunWithContext blocks; call via go.
	RunWithContext(ctx context.Context)
```
## ./staging/src/k8s.io/client-go
Incompatible changes:
- ./tools/cache.Controller.RunWithContext: added
Adding a new method to an interface is a breaking change because out-of-tree implementations of that interface no longer work. I think this is an interface only because there is one test mocking it, not because there are supposed to be other implementations, so this API change is okay?
Would rather change the existing. This interface seems to be very low-level. Would not be surprised if it is not used externally.
Works for me - changed.
Actually, this Run method is what everyone calls who uses an informer. After changing it, lots of code no longer compiled. I think we have to keep RunWithContext and transition to it gradually.
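For illustration, keeping both entry points can look roughly like this. This is a minimal sketch, assuming `wait.ContextForChannel` from k8s.io/apimachinery in its current single-return form; the `controller` type and run loop are placeholders, not the actual client-go code:

```go
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/util/wait"
)

// controller is a placeholder type for the sketch.
type controller struct{}

// Run is kept for compatibility and delegates to RunWithContext,
// converting the legacy stop channel into a context.
func (c *controller) Run(stopCh <-chan struct{}) {
	c.RunWithContext(wait.ContextForChannel(stopCh))
}

// RunWithContext is the new context-aware entry point.
func (c *controller) RunWithContext(ctx context.Context) {
	<-ctx.Done() // placeholder for the real run loop
}

func main() {
	stopCh := make(chan struct{})
	close(stopCh)
	(&controller{}).Run(stopCh) // returns once stopCh is closed
}
```

This way existing callers keep compiling while new code can migrate to the context variant at its own pace.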
```go
type WatchErrorHandlerWithContext func(ctx context.Context, r *Reflector, err error)

// DefaultWatchErrorHandler is the default implementation of WatchErrorHandlerWithContext.
func DefaultWatchErrorHandler(ctx context.Context, r *Reflector, err error) {
```
Incompatible changes:
- ./tools/cache.DefaultWatchErrorHandler: changed from func(*Reflector, error) to func(context.Context, *Reflector, error)
Some copies seem to exist, but not used elsewhere?
Users of cache.Config can still set a watch handler without the extra context parameter.
Both OPA Gatekeeper and KubeVirt aren't libraries, so version skew shouldn't be an issue.
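For downstream code that still has a handler with the old shape, a small adapter bridges to the new type. This is a hypothetical helper, assuming only the `WatchErrorHandlerWithContext` type shown above:

```go
package cacheadapter

import (
	"context"

	"k8s.io/client-go/tools/cache"
)

// withContext adapts a legacy watch error handler to the new
// context-aware type by simply dropping the context.
func withContext(h func(r *cache.Reflector, err error)) cache.WatchErrorHandlerWithContext {
	return func(_ context.Context, r *cache.Reflector, err error) {
		h(r, err)
	}
}
```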
```diff
 // in the underlying store. This is only safe if your use of the cache can handle mutation entries
 // remaining in the cache for up to ttl when mutations and deletes occur very closely in time.
-func NewIntegerResourceVersionMutationCache(backingCache Store, indexer Indexer, ttl time.Duration, includeAdds bool) MutationCache {
+func NewIntegerResourceVersionMutationCache(ctx context.Context, backingCache Store, indexer Indexer, ttl time.Duration, includeAdds bool) MutationCache {
```
Incompatible changes:
- ./tools/cache.NewIntegerResourceVersionMutationCache: changed from func(Store, Indexer, time.Duration, bool) MutationCache to func(context.Context, Store, Indexer, time.Duration, bool) MutationCache
Not used outside of Kubernetes?
This seems like overkill. Based on the discussion we had with Jordan during KubeCon, I believe we agreed to apply context where it makes sense. This cache seems like one that should not have it; at most I'd expect it to accept only a logger. That would also resolve my other comment about creating a controller without requiring a context.
Ack. This is one of those places that I hadn't updated based on the "it's okay to put logr.Logger into our APIs" conclusion.
What could potentially throw a monkey wrench into this "no context during construction" approach is when other components don't follow it. We've seen that with the event recorder. Yesterday I found that constructing a delaying queue also already spawns a goroutine during New*.
I still think "no goroutines during construction" is a sound principle, sometimes it just might be a bit harder to follow.
I took your ctx -> logger replacement for NewIntegerResourceVersionMutationCache from #129148 into this PR.
I did not take the other changes for controller construction. That can be a follow-up PR?
> I still think "no goroutines during construction" is a sound principle, sometimes it just might be a bit harder to follow.
I don't want to clean those, yet. Probably in a separate PR, but in the long run we should definitely ensure that.
```go
// where additional optional parameters are passed in a struct. If no resync period
// is specified there, AddEventHandlerWithConfig behaves like AddEventHandler.
// The context is used for contextual logging.
AddEventHandlerWithConfig(ctx context.Context, handler ResourceEventHandler, config HandlerConfig) (ResourceEventHandlerRegistration, error)
```
This WithConfig is used instead of adding a WithContext because it enables adding other optional parameters besides resync period without having to add another method.
Incompatible changes:
- ./tools/cache.SharedInformer.AddEventHandlerWithConfig: added
- ./tools/cache.SharedInformer.RunWithContext: added
- ./tools/cache.SharedInformer.SetWatchErrorHandlerWithContext: added
Same as with Controller: probably shouldn't be implemented outside of Kubernetes?
This is implemented outside of kube, especially for mocking. Controller-runtime for example has an instance.
Is it acceptable to break that other code? Probably not.
The alternative is to leave cache.SharedInformer as it is and add a second cache.SharedInformerCtx with the new methods. The code which returns a cache.SharedInformer can return a cache.SharedInformerCtx (assignable to variables or fields of type cache.SharedInformer). The only API break for such producers is that the function signature changes.
The downside of this approach is that code accepting a cache.SharedInformer must continue to do so. It can use the new methods only after a checked type assertion; see the sketch below.
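To make the trade-off concrete, the rejected alternative would have looked roughly like this. `SharedInformerCtx` is hypothetical; it never landed:

```go
package informerx

import (
	"context"

	"k8s.io/client-go/tools/cache"
)

// SharedInformerCtx is the hypothetical second interface: the old
// interface stays untouched and the new method lives here.
type SharedInformerCtx interface {
	cache.SharedInformer
	RunWithContext(ctx context.Context)
}

// run shows the downside: consumers of the old interface can reach the
// new method only via a checked type assertion.
func run(ctx context.Context, informer cache.SharedInformer) {
	if i, ok := informer.(SharedInformerCtx); ok {
		i.RunWithContext(ctx)
		return
	}
	informer.Run(ctx.Done())
}
```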
Isn't such a struct usually called Options in Kube, not Config?
And in that case, I would rather not mention Config/Options in the method name, but call it AddEventHandlerWithContext.
I think that satisfies what consumers of controller-runtime want
... unless those consumers import those packages:
https://grep.app/search?q=sigs.k8s.io/controller-runtime/pkg/controller/controllertest
😢
Sounds okay from my side. We intentionally state in our docs that every CR minor version is only compatible with a specific client-go minor version:
> Every minor version of controller-runtime has been tested with a specific minor version of client-go. A controller-runtime minor version may be compatible with other client-go minor versions, but this is by chance and neither supported nor tested. In general, we create one minor version of controller-runtime for each minor version of client-go and other k8s.io/* dependencies.
https://github.com/kubernetes-sigs/controller-runtime?tab=readme-ov-file#compatibility
Also happened in the past that we weren't entirely compatible with other client-go versions.
So seems okay to me to have a breaking change here.
Thanks @sbueringer for the heads-up. I've pushed a single commit with the original proposal (extending SharedInformer) and with the renames suggested by @sttts above.
Please take another look, it might be ready for merging now.
Looks good to me from a CR perspective
```go
// the next notification will be attempted. This is usually better than the alternative of never
// delivering again.
stopCh := make(chan struct{})
wait.Until(func() {
```
The only purpose of wait.Until here is to deal with panics. The way it was implemented prevented shutting down the goroutine immediately when the last item in the channel caused a panic because wait.Until was stuck in a sleep.
I found that when adding more unit tests and checking for proper termination of goroutines. Instead of allowing goleak to wait longer, I changed this code so that it sleeps only if really needed (= before processing the next item after a panic).
It took me a while to grok it, but it does make sense.
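A hypothetical sketch of that pattern (not the actual processorListener code): recover from a panic in the handler, but sleep only before processing the next item, so closing the channel still stops the goroutine promptly.

```go
package sketch

import (
	"time"

	"k8s.io/klog/v2"
)

// processLoop drains nextCh and survives handler panics. Unlike wrapping the
// whole loop in wait.Until, it never sleeps on the shutdown path: when nextCh
// is closed, the range loop ends immediately.
func processLoop(logger klog.Logger, nextCh <-chan interface{}, handle func(interface{})) {
	var backoffNeeded bool
	for next := range nextCh {
		func() {
			defer func() {
				if r := recover(); r != nil {
					logger.Info("event handler panicked", "panic", r)
					backoffNeeded = true
				}
			}()
			if backoffNeeded {
				backoffNeeded = false
				time.Sleep(time.Second) // back off only after a previous panic
			}
			handle(next)
		}()
	}
}
```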
Force-pushed from 460930a to 11dffab (Compare)
/retest

Some jobs didn't start.
```diff
 // It replaces its internal store with the collected items and
 // reuses the current watch requests for getting further events.
-func (r *Reflector) watchList(stopCh <-chan struct{}) (watch.Interface, error) {
+func (r *Reflector) watchList(ctx context.Context) (watch.Interface, error) {
```
The elephant in the room which does not get addressed by this PR is the ListWatcher API. The implementations all use context.TODO. They get called here:
In this PR, that code has a proper context, but cannot pass it on. Let's address that in a follow-up PR, similar to #129109?
I've addressed this in #129125. But that's a massive PR that I need to split up properly. The main benefit of doing the full exercise is that there will be no more surprises later on. Decisions like "this code does not need a context" could turn out to be incorrect when later it is found that the code directly or indirectly calls code which does need a context after all.
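For context, this is the shape of the problem: a typical ListWatch implementation (standard client-go pattern; the `client` and `ns` parameters here are illustrative) receives no caller context in its callbacks, so it falls back to context.TODO:

```go
package sketch

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

func podListWatch(client kubernetes.Interface, ns string) *cache.ListWatch {
	return &cache.ListWatch{
		ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
			// No context reaches this callback, hence context.TODO.
			return client.CoreV1().Pods(ns).List(context.TODO(), options)
		},
		WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
			return client.CoreV1().Pods(ns).Watch(context.TODO(), options)
		},
	}
}
```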
```go
	return nil, false, fmt.Errorf("failed to build token generator: %v", err)
}
tokenController, err := serviceaccountcontroller.NewTokensController(
	ctx,
```
I believe we discussed that we should not require a context for controller creation, only for the actual run method. If you change this to pass only a logger, similar to what the change below looked like before your modification, we could also get rid of the changes in the controller to use .AddEventHandler for now. This would also address part of the concerns Jordan raised in this thread.
@soltysh: so do you also want to avoid adding AddEventHandlerWithContext? Or add AddEventHandlerWithLogger instead?
The problem is that there's a bunch of log calls potentially triggered by AddEventHandler. Without something passed by the caller, those cannot be properly attributed to the caller. In particular when those log calls point out things like "Warning: resync period is smaller than resync check period and the informer has already started" it would be good to get more information into the log entry from a contextual logger.
Coming back to *WithContext: part of my motivation for adding it was that it's easier for developers to cancel one context and have everything created with it stopped, rather than having to remember to call all the individual Stop or Remove methods. Is that goal worth the extra complexity?
Perhaps not. Resource leaks are annoying, but perhaps developers simply have to be more careful if they care.
> In particular when those log calls point out things like "Warning: resync period is smaller than resync check period and the informer has already started" it would be good to get more information into the log entry from a contextual logger.
That may be possible without an API change. The code underneath AddEventHandler can mark itself as helper code, then stack unwinding will skip it and the log message will be recorded as coming from the code which made the broken AddEventHandler call. How does that sound?
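The helper-marking idea could look like this; logr.Logger.WithCallStackHelper is a real API, while the surrounding function is hypothetical:

```go
package sketch

import (
	"k8s.io/klog/v2"
)

// warnOnMisuse logs a warning but attributes it to its caller: the
// helper() callback marks this frame as helper code, so logr-aware
// backends skip it when computing the call site.
func warnOnMisuse(logger klog.Logger, msg string, keysAndValues ...interface{}) {
	helper, logger := logger.WithCallStackHelper()
	helper()
	logger.Info(msg, keysAndValues...)
}
```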
I removed AddEventHandlerWithContext and then my own new TestEventPanics reminded me why a per-event-handler logger is useful: when it panics, we want to know what panicked. I've replaced AddEventHandlerWithContext with AddEventHandlerWithOptions, with a logger as one optional option.
> I've replaced AddEventHandlerWithContext with AddEventHandlerWithOptions, with a logger as one optional option.
This will probably serve us much better in the long run, w/o the need to change the API.
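Usage would then look roughly like this. This sketch assumes a cache.HandlerOptions struct with an optional Logger field, inferred from this thread rather than confirmed against the final API:

```go
package sketch

import (
	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

func addHandler(informer cache.SharedIndexInformer) (cache.ResourceEventHandlerRegistration, error) {
	logger := klog.LoggerWithName(klog.Background(), "pod-handler")
	return informer.AddEventHandlerWithOptions(
		cache.ResourceEventHandlerFuncs{
			AddFunc: func(obj interface{}) { /* handle add */ },
		},
		cache.HandlerOptions{
			// The per-handler logger identifies this handler, e.g. when
			// one of its callbacks panics.
			Logger: &logger,
		},
	)
}
```

The options struct is what keeps the method count stable: future optional parameters become new fields instead of new method variants.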
I'm planning to go through the client-go changes once more tomorrow, most likely.
There are situations where it makes more sense to pass a logger through a call chain, primarily because passing a context would imply that the call chain should honor cancellation even though there is a different shutdown mechanism. Using the *WithContext variants would cause additional overhead for klog.NewContext, which hurts in particular for HandleCrash because that function is typically a nop that doesn't actually need to log anything. HandleCrashWithLogger avoids that overhead. For HandleError that is less relevant because it always logs, but for the sake of symmetry it also gets added.

Putting klog.Logger (= logr.Logger) into the public Kubernetes Go API is okay because it's no longer realistic that these packages can ever drop the klog dependency. Callers using slog as the logger in their binary can use https://github.com/veqryn/slog-context to store a slog.Logger in a context and then call the *WithContext variants; klog.FromContext will be able to use it. This is probably very rare, so there's no need for *WithSlog variants.

While at it, unit testing gets enhanced, and logging in panic handlers gets improved such that they are guaranteed to get a saner location when not doing any caller skipping. Previously, this was undefined.
Several tests leaked goroutines. All of those get fixed where possible without API changes. Goleak is used to prevent regressions. One new test specifically covers shutdown of an informer and its event handlers.
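To make the overhead argument concrete, here is a sketch contrasting the two variants named above; the exact signatures of the new Handle* helpers are assumptions based on this description, not verified API:

```go
package sketch

import (
	"context"

	utilruntime "k8s.io/apimachinery/pkg/util/runtime"
	"k8s.io/klog/v2"
)

func workerWithContext(ctx context.Context, logger klog.Logger) {
	// The context variant has to allocate a context carrying the logger,
	// even though HandleCrash is a no-op unless a panic actually happens.
	defer utilruntime.HandleCrashWithContext(klog.NewContext(ctx, logger))
	// ... work ...
}

func workerWithLogger(logger klog.Logger) {
	// The logger variant skips the klog.NewContext allocation entirely.
	defer utilruntime.HandleCrashWithLogger(logger)
	// ... work ...
}
```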
I force-pushed to sync with #129125. That includes the …
/test pull-kubernetes-apidiff
```diff
 }
 discoveryController := NewDiscoveryController(s.Informers.Apiextensions().V1().CustomResourceDefinitions(), versionDiscoveryHandler, groupDiscoveryHandler, aggregatedDiscoveryManager)
-namingController := status.NewNamingConditionController(s.Informers.Apiextensions().V1().CustomResourceDefinitions(), crdClient.ApiextensionsV1())
+namingController := status.NewNamingConditionController(context.TODO() /* for contextual logging */, s.Informers.Apiextensions().V1().CustomResourceDefinitions(), crdClient.ApiextensionsV1())
```
This should be just the logger, like we agreed for all controllers. Especially in the New* methods.
```diff
-	}, r.backoffManager, true, stopCh)
-	klog.V(3).Infof("Stopping reflector %s (%s) from %s", r.typeDescription, r.resyncPeriod, r.name)
+	}, r.backoffManager, true, ctx.Done())
+	klog.FromContext(ctx).V(3).Info("Stopping reflector", "type", r.typeDescription, "resyncPeriod", r.resyncPeriod, "reflector", r.name)
```
Nit:

```diff
-klog.FromContext(ctx).V(3).Info("Stopping reflector", "type", r.typeDescription, "resyncPeriod", r.resyncPeriod, "reflector", r.name)
+logger.V(3).Info("Stopping reflector", "type", r.typeDescription, "resyncPeriod", r.resyncPeriod, "reflector", r.name)
```

You already have logger there.
```diff
 // In practice, the reflector exits the watchHandler as soon as the bookmark event is received and calls the t.C() method.
-func newInitialEventsEndBookmarkTicker(name string, c clock.Clock, watchStart time.Time, exitOnWatchListBookmarkReceived bool) *initialEventsEndBookmarkTicker {
-	return newInitialEventsEndBookmarkTickerInternal(name, c, watchStart, 10*time.Second, exitOnWatchListBookmarkReceived)
+func newInitialEventsEndBookmarkTicker(ctx context.Context, name string, c clock.Clock, watchStart time.Time, exitOnWatchListBookmarkReceived bool) *initialEventsEndBookmarkTicker {
```
Context seems unnecessary here, based on my understanding. Logger should be sufficient.
```diff
 }
 if err != nil {
-	klog.Warningf("The watchlist request ended with an error, falling back to the standard LIST/WATCH semantics because making progress is better than deadlocking, err = %v", err)
+	logger.Error(err, "The watchlist request ended with an error, falling back to the standard LIST/WATCH semantics because making progress is better than deadlocking")
```
Below in this method we have:

```go
if fallbackToList {
	err = r.list(stopCh)
	if err != nil {
		return err
	}
}
```

which seems to be the only consumer of stopCh. I tested locally and switching that bit to context should be an easy change, especially since later in the tests you're already passing ctx.Done() to the list invocation.
Looks like I already made that change in my full PR (#129125).
I've included it here instead.
```diff
 if s.HasStarted() {
-	klog.Warningf("The sharedIndexInformer has started, run more than once is not allowed")
+	logger.Info("Warning: the sharedIndexInformer has started, run more than once is not allowed")
```
There's a bit of inconsistency: here you're using logger.Info("Warning: ..."), but in staging/src/k8s.io/client-go/tools/cache/listers.go#ListAllByNamespace you switch to klog.TODO().Error. I'd do both consistently, either as info with a "Warning:" prefix or as error.
This one here is correct. Warnings are not errors, just info messages that "need more attention" - whatever that means.
I've changed ListAllByNamespace to do the same.
```diff
-processorStopCh := make(chan struct{})
+// Separate stop context because Processor should be stopped strictly after controller.
+// Cancelation in the parent context is ignored.
+processorStopCtx, stopProcessor := context.WithCancelCause(context.WithoutCancel(ctx))
```
The double context wrapping is to:
- drop parent cancel
- maintain the logger from context
Do I get that correct? The former is written down, but the second point is something I had to figure out. I'd just put it in the comment, because the first thing that came to my mind was: why not just start with a fresh context.Background() rather than double wrapping?
Correct. It's not just logger (we could add that back explicitly) but also all other values which we don't know about here. I added a comment.
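A minimal sketch of what the double wrapping buys (context.WithoutCancel and context.WithCancelCause require Go >= 1.21; the function name is illustrative):

```go
package sketch

import (
	"context"

	"k8s.io/klog/v2"
)

func startProcessor(ctx context.Context) (context.Context, context.CancelCauseFunc) {
	// WithoutCancel drops the parent's cancellation but keeps every value
	// stored in ctx, the contextual logger included. WithCancelCause then
	// adds an independent cancel so the processor can outlive the controller.
	processorStopCtx, stopProcessor := context.WithCancelCause(context.WithoutCancel(ctx))

	// Even after ctx is canceled, klog.FromContext(processorStopCtx) still
	// returns the logger that was stored in the parent context.
	return processorStopCtx, stopProcessor
}
```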
```diff
 if resyncPeriod < s.resyncCheckPeriod {
 	if s.started {
-		klog.Warningf("resyncPeriod %v is smaller than resyncCheckPeriod %v and the informer has already started. Changing it to %v", resyncPeriod, s.resyncCheckPeriod, s.resyncCheckPeriod)
+		logger.Info("Warning: resync period is smaller than resync check period and the informer has already started. Changing it to the resync check period", "resyncPeriod", resyncPeriod, "resyncCheckPeriod", s.resyncCheckPeriod)
```
Another instance of info+warning. I'm leaning towards doing that also in the one place where you've used error. I don't see a reason why it should be an error there.
Agreed, and changed there.
The context is used for cancellation and to support contextual logging. In most cases, alternative *WithContext APIs get added, except for NewIntegerResourceVersionMutationCache where code searches indicate that the API is not used downstream.

An API break around SharedInformer couldn't be avoided because the alternative (keeping the interface unchanged and adding a second one with the new method) would have been worse. controller-runtime needs to be updated because it implements that interface in a test package. Downstream consumers of controller-runtime will work unless they use that test package.

Converting Kubernetes to use the other new alternatives will follow. In the meantime, usage of the new alternatives cannot be enforced via logcheck yet (see kubernetes#126379 for the process).

Passing context through and checking it for cancellation is tricky for event handlers. A better approach is to map the context cancellation to the normal removal of an event handler via a helper goroutine (see the sketch below). Thanks to the new HandleErrorWithLogger and HandleCrashWithLogger, remembering the logger is sufficient for handling problems at runtime.
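The helper-goroutine idea from the last paragraph could look roughly like this; it is hypothetical glue code around the existing AddEventHandler/RemoveEventHandler API, not the actual client-go implementation:

```go
package sketch

import (
	"context"

	"k8s.io/client-go/tools/cache"
	"k8s.io/klog/v2"
)

// addHandlerUntilDone registers a handler and removes it again when ctx is
// canceled. The handler itself never has to poll the context; removal is the
// normal shutdown path.
func addHandlerUntilDone(ctx context.Context, informer cache.SharedInformer, handler cache.ResourceEventHandler) error {
	reg, err := informer.AddEventHandler(handler)
	if err != nil {
		return err
	}
	go func() {
		<-ctx.Done()
		if err := informer.RemoveEventHandler(reg); err != nil {
			klog.FromContext(ctx).Error(err, "removing event handler")
		}
	}()
	return nil
}
```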
@pohly: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
LGTM label has been added.

Git tree hash: 73c2ccaf794c3ac71eb3ff0dc4c67337c3eab90b
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: pohly, soltysh, sttts

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
What type of PR is this?
/kind cleanup
What this PR does / why we need it:
The context is used for cancellation and to support contextual logging.
In most cases, alternative *WithContext APIs get added, except for NewIntegerResourceVersionMutationCache where code searches indicate that the API is not used downstream.
An API break around SharedInformer couldn't be avoided because the
alternative (keeping the interface unchanged and adding a second one with
the new method) would have been worse. controller-runtime needs to be updated
because it implements that interface in a test package. Downstream consumers of
controller-runtime will work unless they use that test package.
Converting Kubernetes to use the other new alternatives will follow. In the meantime, usage of the new alternatives cannot be enforced via logcheck yet (see #126379 for the process).
Which issue(s) this PR fixes:
Related-to: #126379
Special notes for your reviewer:
#121989 contains this change and the related changes in the rest of k/k.
Does this PR introduce a user-facing change?