Skip to content

Conversation

@pmorie
Copy link
Contributor

@pmorie pmorie commented Jan 29, 2020

For #1007; this PR adds WIP support for leader election to sharedmain.

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jan 29, 2020
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 29, 2020
Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lint

logger.Fatalw("Version check failed", zap.Error(err))
}
// run is the main controller flow
run := func(ctx context.Context) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is quite big. Why not extract it to a separate function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the constructions in k8s controller manager, decided to save any cosmetics or discretionary factoring for after it worked. Since that function captures many variables from the enclosing context, extracting a function would be more laborious than it might appear - since you would have to introduce a type or function factory to inject those variables.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't look closely, but I noticed logger. If it's just 2-3, then passing via args should be alright.
But, sure, we can brush it up later on.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Frankly, I don't think it's worth it, but you're welcome to do it as a follow-on PR.

run(ctx)
panic("unreachable")
} else {
// create a unique identifier so that two controllers on the same host don't
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// create a unique identifier so that two controllers on the same host don't
// Create a unique identifier so that two controllers on the same host don't

if err != nil {
logger.Fatalw("Failed to get hostname for leader election", zap.Error(err))
}
id = id + "_" + string(uuid.NewUUID())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
id = id + "_" + string(uuid.NewUUID())
id += "_" + string(uuid.NewUUID())

logger.Fatalw("Failed to get hostname for leader election", zap.Error(err))
}
id = id + "_" + string(uuid.NewUUID())
log.Printf("%v will run in leader-elected mode with id %v", component, id)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rm? or logger.Infof().

})
if err != nil {
logger.Fatalw("Failed to create admission controller", zap.Error(err))
logger.Fatalw("error creating lock: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Fatalw("error creating lock: %v", err)
logger.Fatalw("Error creating lock: %v", zap.Error(err))

Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
logger.Fatalw("leaderelection lost")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Fatalw("leaderelection lost")
logger.Fatal("leaderelection lost")

should work?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this code correctly, the controller will start up, and try to acquire the leader lock. If it participates in the election and is not elected, it will sit and continue participating in the leader election. If it wins the election, it will serve (run the specified controllers) and will exit if it ever loses an election after that.

Is that correct?

if !leConfig.LeaderElect {
log.Printf("%v will not run in leader-elected mode", component)
run(ctx)
panic("unreachable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go with logger.Fatal() instead, rather than panic. Same below.

Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vagababov: 11 warnings.

Details

In response to this:

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

corev1 "k8s.io/api/core/v1"
)

const ConfigMapNameEnv = "CONFIG_LEADERELECTION_NAME"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported const ConfigMapNameEnv should have comment or be unexported. More info.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably unexported.

EnabledComponents map[string]bool
}

func (c *Config) GetComponentConfig(name string) ComponentConfig {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint comments: exported method Config.GetComponentConfig should have comment or be unexported. More info.

},
}

for i, _ := range cases {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i, _ := range cases {
for i := range cases {

Golint range-loop: should omit 2nd value from range; this loop is equivalent to for i := range ....

},
}

for i, _ := range cases {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for i, _ := range cases {
for i := range cases {

Golint range-loop: should omit 2nd value from range; this loop is equivalent to for i := range ....

if err != nil {
if apierrors.IsNotFound(err) {
return kle.NewConfigFromMap(nil)
} else {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Golint indent: if block ends with a return statement, so drop this else and outdent its block. More info.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to this comment.

@@ -0,0 +1,4 @@
# See the OWNERS docs at https://go.k8s.io/owners
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought ./hack/update-deps.sh deletes OWNERS files in vendor/.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you run it ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, if you run it applies :)

@pmorie pmorie force-pushed the leader-election branch 2 times, most recently from f78a2c2 to e28f996 Compare February 4, 2020 19:12
Copy link
Contributor

@vaikas vaikas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for adding this!

logger.Fatalw("leaderelection lost")
},
},
// TODO: investigate using watchdog
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add an issue to track this and add a pointer to that here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened #1048

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

}

if leaseDurationStr, ok := data["leaseDuration"]; ok {
if leaseDuration, err := time.ParseDuration(leaseDurationStr); err == nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless I'm totes reading this wrong, if a user specifies an invalid value here, it will not take effect and they will have no visibility that what they wanted didn't happen?
Should we at the very least log this error?
Just worrying about the lack of visibility here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point; I think that in the presence of an invalid value, you should receive the default value and have a log message. WDYT

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it gets the default value now, my worry was the lack of feedback, which the log message would provide.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three choices here:

  1. If you supply an invalid value, it fails loudly (i.e. logger.Fatal) This is most appropriate for startup conditions, where it might block something like a rolling update.

  2. If an invalid config is supplied, the entire config should be considered invalid and rejected. In this case, this would be adding an else to return nil, err

  3. If a partially-invalid config is supplied, do the best you can, and log errors about the parts that didn't work.

I'd most prefer option 1, but given that we're driving this from dynamically-updated configmaps, I think option 2 would be better than option 3, because it makes the application of the ConfigMap much more deterministic (all or nothing, vs all-except-my-typos).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So... @nimakaviani added a validating webhook that allows us to register configmaps for validation and reject them outright if it returns an error. So the right answer here is multi-part:

  1. This should return an error
  2. Each downstream webhook should add this to the list of configmaps to validate.

Then the configmap update will be outright rejected and never saved to etcd.

} else if !apierrors.IsNotFound(err) {
logger.With(zap.Error(err)).Fatalf("Error reading ConfigMap %q", logging.ConfigMapName())
}
// Watch the logging config map and dynamically update logging levels.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want this and the observability configmap to be subject to leader election, or do we want them to apply even when the controller is not leader-elected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these are configured as part of the leader-elected code, I think that there is no notion where these configs apply when the controller is not the leader. If the controller is running and not the leader it is waiting to be the leader. Note, there is an exceptional case where the leader election lock may be lost but the process still running due to deadlock which will be addressed by #1048, but I don't think it changes the equation, and I think it would be fine to have logging and observability config watches stay within the leader-elected section.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These watches are used to report information like Go resource usage and where to write structured logs (which may be written for background work even when not leader elected).

Similarly, I worry that the client version check on line 172 could lead to N leader candidates which will exit as soon as they are elected leader because they don't actually like the version of Kubernetes apiserver that they are attached to.

I think I'd prefer to see the leader election more narrowly wrapped around the controller start on 234-235 (and maybe subsequent).

logger.Info("Starting controllers...")
go controller.StartAll(ctx.Done(), controllers...)

profilingServer := profiling.NewServer(profilingHandler)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto on profiling. Do we want the profiler to only run on the elected leader, or do we want to be able to check profiling on all the participants?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I think on all

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, fixed


// If we have one or more admission controllers, then start the webhook
// and pass them in.
if len(webhooks) > 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question on the webhooks. In particular, one could imagine in leader-election mode having the webhooks run only only on the non-elected nodes, to offload some CPU work from the leader. I could also see just running the webhook on all of them so that the webhook has higher availability.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expect that webhooks will not be run leader-elected, as they should be horizontally scalable, they are already accessed behind a service, and they do not directly mutate the API server state.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with all your statements about webhooks -- my suggestion was to take this code out from inside run() and put it back in the general shared main.

I think @mattmoor 's "mink" distribution combines the webhooks and controllers into a single deployment, for example. It seems surprising to have leader election control this code.

// Register webhook metrics
webhook.RegisterMetrics()

// possible bug? egCtx
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like a new comment. Should @vagababov or @mattmoor comment on whether this ctx is correct?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure that this was intentional because the ctx here is being passed into work that is enqueued on the errgroup. So it should key off of the SIGTERM not the first failure in the errgroup

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it, it just looked a little fishy to me superficially, hence the note to myself.

Callbacks: leaderelection.LeaderCallbacks{
OnStartedLeading: run,
OnStoppedLeading: func() {
logger.Fatalw("leaderelection lost")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand this code correctly, the controller will start up, and try to acquire the leader lock. If it participates in the election and is not elected, it will sit and continue participating in the leader election. If it wins the election, it will serve (run the specified controllers) and will exit if it ever loses an election after that.

Is that correct?

}

if leaseDurationStr, ok := data["leaseDuration"]; ok {
if leaseDuration, err := time.ParseDuration(leaseDurationStr); err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Three choices here:

  1. If you supply an invalid value, it fails loudly (i.e. logger.Fatal) This is most appropriate for startup conditions, where it might block something like a rolling update.

  2. If an invalid config is supplied, the entire config should be considered invalid and rejected. In this case, this would be adding an else to return nil, err

  3. If a partially-invalid config is supplied, do the best you can, and log errors about the parts that didn't work.

I'd most prefer option 1, but given that we're driving this from dynamically-updated configmaps, I think option 2 would be better than option 3, because it makes the application of the ConfigMap much more deterministic (all or nothing, vs all-except-my-typos).


if enabledComponents, ok := data["enabledComponents"]; ok {
tokens := strings.Split(enabledComponents, ",")
for i, _ := range tokens {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^

for i, _ := range tokens {
str := tokens[i]

if str == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this check?

if err != nil {
logger.Fatalw("Failed to get hostname for leader election", zap.Error(err))
}
id += "_" + string(uuid.NewUUID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this a helper in the leaderelection package? Seems like you use it in serving as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes

RenewDeadline time.Duration
RetryPeriod time.Duration

EnabledComponents map[string]bool
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use sets.String from apimachinery

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack


if enabledComponents, ok := data["enabledComponents"]; ok {
tokens := strings.Split(enabledComponents, ",")
for i, _ := range tokens {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use sets.String as I suggest below, then the entire loop can be sets.NewString(tokens...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ack

ResourceLock string `json:"resourceLock"`
LeaseDuration time.Duration `json:"leaseDuration"`
RenewDeadline time.Duration `json:"renewDeadline"`
RetryPeriod time.Duration `json:"retryPeriod"`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need the json encodings here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct, i'll remove them

@pmorie
Copy link
Contributor Author

pmorie commented Feb 4, 2020

@evankanderson

If I understand this code correctly, the controller will start up, and try to acquire the leader lock. If it participates in the election and is not elected, it will sit and continue participating in the leader election. If it wins the election, it will serve (run the specified controllers) and will exit if it ever loses an election after that.

Is that correct?

yes, if leader election is enabled for the controller by name

}

// GetLeaderElectionConfig gets the leader election config.
func GetLeaderElectionConfig(ctx context.Context) (*kle.Config, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this method need to be public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is used in controllers that do not use sharedmain, as are other methods from this file that are exported.

logger.Info("Starting controllers...")
go controller.StartAll(ctx.Done(), controllers...)

profilingServer := profiling.NewServer(profilingHandler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here I think on all

EventRecorder: recorder,
})
if err != nil {
logger.Fatalw("Error creating lock: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Fatalw("Error creating lock: %v", err)
logger.Fatalw("Error creating lock", zap.Error(err))

corev1 "k8s.io/api/core/v1"
)

const ConfigMapNameEnv = "CONFIG_LEADERELECTION_NAME"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably unexported.

return defaultComponentConfig()
}

func defaultConfig() Config {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
func defaultConfig() Config {
func defaultConfig() *Config {

}

func defaultConfig() Config {
return Config{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return Config{
return &Config{

This way you don't have to take address of the returned object every time

@pmorie
Copy link
Contributor Author

pmorie commented Feb 4, 2020

I'd most prefer option 1, but given that we're driving this from dynamically-updated configmaps, I think option 2 would be better than option 3, because it makes the application of the ConfigMap much more deterministic (all or nothing, vs all-except-my-typos).

After some thought, I am leaning toward option 1, to error loudly if there is an invalid value and refuse to start. The configmap will be protected by a validating webhook. In most cases this will prevent the configmap from being mutated into an invalid configuration, but it's still possible to get into a state whether the configmap contains invalid data. Since we believe it will be an exceptional condition for this to happen, I think it is okay to refuse to start. If the configmap contains invalid data, and the system is otherwise healthy, the configmap will only ever be able to be updated to a valid state. Similarly, we exit the controller during startup if there's an error parsing the log config.

Note, currently this code does not reload a running process' configuration based on changes to the configmap. I spent a bit of time investigating whether the leader election code would tolerate being reconfigured at runtime (the fields of the config are exported and referenced within the code instead of unpacked from the config). I believe it would be possible to investigate whether this would work, but I do not believe the code is designed for that and it is not tested for it.

I think we have a few options:

  • Attempt to make the controller reload itself without exiting of the leader election config changes
  • Make the controller exit if the leader election config changes
  • Retain the existing behavior (read once and only once)

Many changes of configuration would likely be benign in isolation (ie, no other changes to deployed resources for knative controllers), but some could result in controller downtime or other negative effects (like disabling leader election without changing number of controller replicas).

In general I expect that operators of knative will rarely change this configuration, and in my experience leader elected controllers frequently move between being the leader, to waiting for the lock, and back again. Since that is the case, i'm leaning toward retaining the existing behavior and reading only once OR making the controller exit (since they frequently exit and are restarted by the kubelet anyway due to losing the leader election lock).

@pmorie
Copy link
Contributor Author

pmorie commented Feb 21, 2020

/retest

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if I left the same comments before, but those are mostly stylistical changes.

<-egCtx.Done()

profilingServer.Shutdown(context.Background())
// Don't forward ErrServerClosed as that indicates we're already shutting down.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: comment here doesn't make much sense, since we're not forwarding anything, logging at best.

if !leConfig.LeaderElect {
logger.Infof("%v will not run in leader-elected mode", component)
run(ctx)
logger.Fatal("unreachable")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

won't this be reachable when <-ctx.Done() triggers?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this no longer applies, I don't think. I'll remove it.


if resourceLock, ok := data["resourceLock"]; ok {
if !validResourceLocks.Has(resourceLock) {
return nil, fmt.Errorf("resourceLock: invalid value %q: valid values are \"leases\",\"configmaps\",\"endpoints\"", resourceLock)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return nil, fmt.Errorf("resourceLock: invalid value %q: valid values are \"leases\",\"configmaps\",\"endpoints\"", resourceLock)
return nil, fmt.Errorf(`resourceLock: invalid value %q: valid values are "leases", "configmaps", "endpoints"`, resourceLock)

EnabledComponents: sets.NewString(),
}

if resourceLock, ok := data["resourceLock"]; ok {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here and for the checks below, I'd totally support simplifying this to:

if x := data["resourceLock"]; !validResourceLocks(x) {
  return fmt.Errorf(....
}

"" is not a valid value is just as good as must not be empty, imo. But will shorten the file in half.

return defaultComponentConfig()
}

func defaultConfig() Config {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why wouldn't we return pointer from the getgo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can actually delete this method entirely now.

@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-pkg-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
leaderelection/config.go Do not exist 61.8%

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 26, 2020
Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few small comments; the only big one is that I still think the logging and metrics setup should be outside the leader election (so that we can monitor non-elected controllers).

profilingServer := profiling.NewServer(profilingHandler)
eg, egCtx := errgroup.WithContext(ctx)
eg.Go(profilingServer.ListenAndServe)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering why this is a goroutine, rather than a defer?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The method won't exit until the server is done or crashes. Is there a downside to using a goroutine here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a bit unusual, but it looks like the only leader election option is RunOrDie, so this is fine.

run := func(ctx context.Context) {
cmw := SetupConfigMapWatchOrDie(ctx, logger)
controllers, _ := ControllersAndWebhooksFromCtors(ctx, cmw, ctors...)
WatchLoggingConfigOrDie(ctx, cmw, logger, atomicLevel, component)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still wondering why we only want to configure logging and monitoring after being elected leader, rather than applying the logging and monitoring configs continuously.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was following the pattern in the k8s controller manager, which is to basically do absolutely nothing until you're the leader. I don't see an advantage to watching these configs until you're the leader, but if you want me to change it, I will. Do you want me to change it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea would be to have logs and metrics potentially exporting to a remote destination even for non-leader controllers (so you could count the number of standby working controllers, for example).

But this can be done later, so I'll approve as-is.

}

RunLeaderElected(ctx, logger, run, component, leConfig)
logger.Fatal("unreachable")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why logger.Fatal here, rather than allowing RunLeaderElected to exit at shutdown if necessary?


// Create a unique identifier so that two controllers on the same host don't
// race.
id, err := kle.UniqueID()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not for this PR

It seems like we might want to include a metric here indicating the leader-election status.

Can you drop a // TODO: add monitoring for leader election status to this bit of the code (and maybe file a good-first-issue for it)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

}

if leaseDurationStr, ok := data["leaseDuration"]; ok {
if leaseDuration, err := time.ParseDuration(leaseDurationStr); err == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Combining with Victor's comment above, what about:

var err error

if config.LeaseDuration, err = time.ParseDuration(data["leaseDuration"]); err != nil {
  return nil, fmt.Errorf("leaseDuration: %+v", err)
}

return &config, nil
}

return NewConfigFromMap(configMap.Data)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like we should pass the config from defaultConfig through NewConfigFromMap to make sure that it meets the expected invariants. WDYT?


func defaultComponentConfig() ComponentConfig {
return ComponentConfig{
LeaderElect: false,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the default value. I assume you're just trying to make the default very clear?

Comment on lines +158 to +162
cm := os.Getenv(ConfigMapNameEnv)
if cm == "" {
return "config-leader-election"
}
return cm
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
cm := os.Getenv(ConfigMapNameEnv)
if cm == "" {
return "config-leader-election"
}
return cm
if cm := os.Getenv(ConfigMapNameEnv); cm != "" {
return cm
}
return "config-leader-election"

Copy link
Member

@evankanderson evankanderson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve

run := func(ctx context.Context) {
cmw := SetupConfigMapWatchOrDie(ctx, logger)
controllers, _ := ControllersAndWebhooksFromCtors(ctx, cmw, ctors...)
WatchLoggingConfigOrDie(ctx, cmw, logger, atomicLevel, component)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The idea would be to have logs and metrics potentially exporting to a remote destination even for non-leader controllers (so you could count the number of standby working controllers, for example).

But this can be done later, so I'll approve as-is.

profilingServer := profiling.NewServer(profilingHandler)
eg, egCtx := errgroup.WithContext(ctx)
eg.Go(profilingServer.ListenAndServe)
go func() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was just a bit unusual, but it looks like the only leader election option is RunOrDie, so this is fine.

@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: evankanderson, pmorie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 26, 2020
@knative-prow-robot knative-prow-robot merged commit ca35cb8 into knative:master Feb 26, 2020
pmorie added a commit to pmorie/pkg that referenced this pull request Feb 26, 2020
* Add leader election config and to sharedmain

* Add new dependencies

* Extract method for RunLeaderElected

* Make leader election config constructor validate

* Rename leader election files

* Always start profiling server whether component has LE lock or not

* Fix entering unreachable section when leader election is disabled

* Address PR feedback
knative-prow-robot pushed a commit that referenced this pull request Feb 26, 2020
* add leader election support to sharedmain (#1019)

* Add leader election config and to sharedmain

* Add new dependencies

* Extract method for RunLeaderElected

* Make leader election config constructor validate

* Rename leader election files

* Always start profiling server whether component has LE lock or not

* Fix entering unreachable section when leader election is disabled

* Address PR feedback

* Fix missing import
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/test-and-release cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants