Skip to content

Conversation

@linkvt
Copy link
Contributor

@linkvt linkvt commented Oct 22, 2025

Hi,

this PR fixes a chicken-and-egg bootstrap issue where the operator would get stuck during KnativeServing installation.

Problem:

  • ValidatingWebhookConfiguration with failurePolicy=Fail intercepts KCertificate resource creation
  • If KCertificate resources are created before the webhook pod is ready, the API server rejects them
  • The activator deployment depends on the routing-serving-certs secret (generated from a KCertificate resource which creates a cert-manager Certificate) at runtime
  • Previous stage ordering would check all deployments (including activator) before creating KCertificate (and thus cert-manager Certificate) resources, causing a deadlock

Solution:

  1. Added CheckWebhookDeployment() that waits specifically for the webhook deployment to be ready before proceeding
  2. Reordered reconciliation stages:
    • manifests.Install (creates all deployments and more) - unchanged
    • CheckWebhookDeployment (waits for webhook to be ready) - new
    • InstallWebhookDependentResources (creates Certificate resources) - unchanged
    • CheckDeployments (checks all deployments including activator) - unchanged

This ensures:

  • Webhook is ready before Certificate creation (avoids admission rejection)
  • Certificate resources exist before checking activator
    • activator might restart very few times before CM-Certificate was created by webhook and Secret with certificate data was created by cert-manager

Related functions:

  • pkg/reconciler/common/deployments.go: Added CheckWebhookDeployment()
  • pkg/reconciler/knativeserving/knativeserving.go: Reordered stages
  • pkg/reconciler/common/install.go: Added logging for consistency

Tested locally in my kind cluster with a KnativeServing with and without system-internal-tls enabled.

Fixes #2178

Release Note

Fix deadlock during creation of KnativeServing with system-internal-tls enabled

/kind bug

@knative-prow knative-prow bot added the kind/bug Categorizes issue or PR as related to a bug. label Oct 22, 2025
@knative-prow knative-prow bot requested review from houshengbo and matzew October 22, 2025 22:11
@knative-prow knative-prow bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Oct 22, 2025
@codecov
Copy link

codecov bot commented Oct 22, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 63.31%. Comparing base (6dfe1de) to head (4ff3e18).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2179   +/-   ##
=======================================
  Coverage   63.31%   63.31%           
=======================================
  Files          49       49           
  Lines        1878     1878           
=======================================
  Hits         1189     1189           
  Misses        597      597           
  Partials       92       92           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@linkvt linkvt force-pushed the fix-webhook-certificate-deadlock branch 2 times, most recently from 0f0389a to f6b3c9c Compare October 22, 2025 22:33
@linkvt
Copy link
Contributor Author

linkvt commented Oct 22, 2025

/ok-to-test

@knative-prow knative-prow bot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label Oct 22, 2025
@linkvt
Copy link
Contributor Author

linkvt commented Oct 23, 2025

/cc @Fedosin

@knative-prow knative-prow bot requested a review from Fedosin October 23, 2025 12:13
@linkvt
Copy link
Contributor Author

linkvt commented Oct 23, 2025

/remove-approve

Was added automatically as I'm co-release lead.

@knative-prow knative-prow bot removed the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 23, 2025
@Fedosin
Copy link

Fedosin commented Oct 23, 2025

/lgtm

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 23, 2025
This fixes a chicken-and-egg bootstrap issue where the operator would
get stuck during KnativeServing installation.

Problem:
- ValidatingWebhookConfiguration with failurePolicy=Fail intercepts
  Certificate resource creation
- If Certificate resources are created before the webhook pod is ready,
  the API server rejects them
- The activator deployment depends on the routing-serving-certs secret
  (generated from a Certificate resource) at runtime
- Previous stage ordering would check all deployments (including activator)
  before creating Certificate resources, causing a deadlock

Solution:
1. Added CheckWebhookDeployment() function that waits specifically for
   the webhook deployment to be ready before proceeding
2. Reordered reconciliation stages:
   - manifests.Install (creates all deployments)
   - CheckWebhookDeployment (waits for webhook to be ready)
   - InstallWebhookDependentResources (creates Certificate resources)
   - CheckDeployments (checks all deployments including activator)

This ensures:
- Webhook is ready before Certificate creation (avoids admission rejection)
- Certificate resources exist before checking activator (avoids missing secret)
- Clear error message if webhook deployment is missing from manifest

Related functions:
- pkg/reconciler/common/deployments.go: Added CheckWebhookDeployment()
- pkg/reconciler/knativeserving/knativeserving.go: Reordered stages
- pkg/reconciler/common/install.go: Added logging for consistency
@linkvt linkvt force-pushed the fix-webhook-certificate-deadlock branch from f6b3c9c to 4ff3e18 Compare October 27, 2025 07:54
@knative-prow knative-prow bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 27, 2025
@linkvt
Copy link
Contributor Author

linkvt commented Oct 29, 2025

PTAL @houshengbo

We're currently preparing release 1.20 so it might be good to get this into this release if you're interested.

manifests.SetManifestPaths, // setting path right after applying manifests to populate paths
common.CheckDeployments,
manifests.SetManifestPaths, // setting path right after applying manifests to populate paths
common.CheckWebhookDeployment, // Wait for webhook to be ready before creating Certificate resources
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

knativeeventing.go also need this function, please add it there as well.
The name of the deployment resource for eventing is "eventing-webhook", you probably need to make the name as a param to the function "CheckWebhookDeployment".

Copy link
Contributor Author

@linkvt linkvt Oct 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you sure about eventing needing this as well?

Serving needed it as there are two install steps where the first one blocked the second one (see line 126 and 129):

manifests.Install,
manifests.SetManifestPaths, // setting path right after applying manifests to populate paths
common.CheckDeployments,
common.InstallWebhookDependentResources,
common.MarkStatusSuccess,
common.DeleteObsoleteResources(ctx, ks, r.installed),
}

Eventing has only one install step which means nothing is broken (see line 143):

manifests.Install,
manifests.SetManifestPaths, // setting path right after applying manifests to populate paths
common.CheckDeployments,
common.MarkStatusSuccess,
common.DeleteObsoleteResources(ctx, ke, r.installed),
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just deployed an KnativeEventing resource with the operator and a the feature transport-encryption: Permissive causing the Certificates to be created without any issues, I guess it works without further changes.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the explanation!

@houshengbo
Copy link
Contributor

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Oct 31, 2025
@knative-prow
Copy link

knative-prow bot commented Oct 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: houshengbo

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 31, 2025
@linkvt
Copy link
Contributor Author

linkvt commented Oct 31, 2025

/retest

@knative-prow knative-prow bot merged commit 62f800b into knative:main Oct 31, 2025
25 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. kind/bug Categorizes issue or PR as related to a bug. lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

System TLS Bootsrap: system-internal-tls bootstrap fails with operator: activator crashes before routing-serving-certs secret exists

3 participants