Skip to content

Conversation

@pmorie
Copy link
Contributor

@pmorie pmorie commented Jan 29, 2020

Fixes knative/pkg#1007

Proposed Changes

  • depends on add leader election support to sharedmain pkg#1019 for changes to sharedmain, which is used by:
    • controller
    • autoscaler-hpa
    • networking-certmanager
    • networking-istio
    • networking-ns-cert
    • webhook
  • /config is changed so that it is simple to enable leader election, but it is not enabled by default

Release Note

Add support for leader election

@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 29, 2020
@googlebot googlebot added the cla: yes Indicates the PR's author has signed the CLA. label Jan 29, 2020
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jan 29, 2020
Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll TAL at the pkg PR next.

Gopkg.toml Outdated
[[override]]
name = "knative.dev/pkg"
branch = "master"
source = "github.com/pmorie/pkg"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pro tip: I usually throw a // TODO(mattmoor): DO NOT SUBMIT around this, so that the bot catches it and leaves a comment.

if !leConfig.LeaderElect {
log.Printf("autoscaler will not run in leader-elected mode")
run(ctx)
panic("unreachable")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this not hit when the context is cancelled?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(which happens when K8s sends SIGTERM)

return err == nil, nil
}); perr != nil {
log.Fatal("Timed out attempting to get k8s version: ", err)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you'd left this as is, then you could save some kubeclient.Get calls below.

if err != nil {
logger.Fatalw("Failed to get hostname for leader election", zap.Error(err))
}
id = id + "_" + string(uuid.NewUUID())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think hoisting this little bit into a helper would document it in the most straightforward way. At first I read the comment above and was like: "That's not what os.Hostname() does...?" :)

# giving up; 10 seconds is the value used by core kubernetes controllers.
renewDeadline: "10s"
# retryPeriod is how long the leader election client waits between tries of
# actions; 2 seconds is the value used by core kuberntes controllers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# actions; 2 seconds is the value used by core kuberntes controllers.
# actions; 2 seconds is the value used by core kubernetes controllers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems to be a problem

logger.Fatalw("leaderelection lost")
},
},
// WatchDog: electionChecker,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this?

if err := eg.Wait(); err != nil && err != http.ErrServerClosed {
logger.Errorw("Error while running server", zap.Error(err))
}
panic("unreachable")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd use logger.Fatal in place of panic

eg, egCtx := errgroup.WithContext(ctx)
eg.Go(func() error {
return customMetricsAdapter.Run(ctx.Done())
leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC we get all of this for free in the other binaries due to their use of sharedmain?

})
if err != nil {
logger.Fatalw("Failed to create admission controller", zap.Error(err))
logger.Fatalw("error creating lock: %v", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we leaking resources on uninstall?

@knative-prow-robot knative-prow-robot added the area/API API objects and controllers label Feb 6, 2020
@pmorie
Copy link
Contributor Author

pmorie commented Feb 24, 2020

/retest

@pmorie pmorie changed the title WIP: Add leader election Add leader election Feb 24, 2020
@knative-prow-robot knative-prow-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Feb 24, 2020
@pmorie
Copy link
Contributor Author

pmorie commented Feb 25, 2020

/retest

@pmorie pmorie closed this Feb 25, 2020
@pmorie pmorie reopened this Feb 25, 2020
Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lint

Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vagababov: 1 warning.

Details

In response to this:

/lint

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link
Contributor

@vagababov vagababov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2020
Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve
/hold

Mostly nits. If we want to get this baking, feel free to /unhold and follow-up, or I can restamp if you want to knock these off now.

// config validation constructors
tracingconfig "knative.dev/pkg/tracing/config"
defaultconfig "knative.dev/serving/pkg/apis/config"
configvalidation "knative.dev/serving/pkg/apis/config/validation"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is sort of an odd place to put this. Can we make it knative.dev/serving/pkg/leaderelection?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: the defaulting config is here because defaulting happens as part of the pkg/apis stuff

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fair. I had something similar in eventing that already went in, mind if we do this in a follow-up?

@@ -0,0 +1,69 @@
# Copyright 2018 The Knative Authors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# Copyright 2018 The Knative Authors
# Copyright 2020 The Knative Authors

resourceLock: "leases"
leaseDuration: "15s"
renewDeadline: "10s"
retryPeriod: "2s"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of the above should come from defaults, right? If so we generally let the defaulting provide them so that folks can safely tweak the knobs via kubectl edit and not upset the three-way merge.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ack, I will do that in a follow-up.

# giving up; 10 seconds is the value used by core kubernetes controllers.
renewDeadline: "10s"
# retryPeriod is how long the leader election client waits between tries of
# actions; 2 seconds is the value used by core kuberntes controllers.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This still seems to be a problem


// ValidateLeaderElectionConfig enriches the leader election config validation
// with extra validations specific to serving.
func ValidateLeaderElectionConfig(configMap *corev1.ConfigMap) (*kle.Config, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we go with pkg/leaderelection (comment above) then we should rename ValidateConfig to avoid stutter.

@knative-prow-robot knative-prow-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 27, 2020
@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mattmoor, pmorie

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow-robot knative-prow-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 27, 2020
@pmorie
Copy link
Contributor Author

pmorie commented Feb 27, 2020

/unhold

@knative-prow-robot knative-prow-robot removed do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm Indicates that a PR is ready to be merged. labels Feb 27, 2020
@pmorie
Copy link
Contributor Author

pmorie commented Feb 27, 2020

/retest

Looks like the failure in auto tls tests may be related to #7018

@knative-prow-robot
Copy link
Contributor

knative-prow-robot commented Feb 27, 2020

@pmorie: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
pull-knative-serving-integration-tests-go114 9799bb1 link /test pull-knative-serving-integration-tests-go114

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@knative-metrics-robot
Copy link

The following is the coverage report on the affected files.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/leaderelection/config.go Do not exist 85.7%

Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@knative-prow-robot knative-prow-robot added the lgtm Indicates that a PR is ready to be merged. label Feb 27, 2020
@pmorie
Copy link
Contributor Author

pmorie commented Feb 27, 2020

/retest

@knative-prow-robot knative-prow-robot merged commit 37852e6 into knative:master Feb 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. area/API API objects and controllers area/autoscale cla: yes Indicates the PR's author has signed the CLA. lgtm Indicates that a PR is ready to be merged. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature Track] Active/passive HA

7 participants