bugfix(port-forward): Correctly handle known errors #117493

sxllwx · 2023-04-20T03:01:23Z

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Special notes for your reviewer:

For detailed fault location process, please refer to #74551 (comment)

containerd PR: containerd/containerd#8418

This PR mainly aims to handle EPIPE errors correctly. To do that, it relies on the error returned by h.forwarder.PortForward being either undecorated or wrapped in a way that still lets the underlying error be detected with errors.Is(err, syscall.EPIPE).

Does this PR introduce a user-facing change?

Fixed port-forward to correctly handle a TCP connection that receives an RST, preventing `kubectl port-forward` from receiving a `broken pipe` error.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

k8s-ci-robot · 2023-04-20T03:01:25Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

pkg/kubelet/cri/streaming/portforward/httpstream.go

bart0sh · 2023-04-21T15:34:50Z

@sxllwx is this a user-visible issue? If it is, please provide a release note.

/triage accepted
/priority important-longterm
/assign

- structured error types added
- use Close instead of Reset close data and error conn

k8s-ci-robot · 2024-05-06T07:07:01Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sxllwx
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

sxllwx · 2024-05-06T08:41:49Z

/test pull-kubernetes-unit

sxllwx · 2024-06-27T02:05:25Z

PTAL thx~. @liggitt

k8s-triage-robot · 2024-09-25T02:26:24Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

dhirschfeld · 2024-09-25T02:32:16Z

Ugh, a stale bot.

/remove-lifecycle stale

aojea · 2024-10-02T08:29:28Z

staging/src/k8s.io/kubelet/pkg/cri/streaming/portforward/httpstream.go

 	ctx := context.Background()
-	defer p.dataStream.Close()
-	defer p.errorStream.Close()
+	defer p.dataStream.Close()  //nolint: errcheck


I wonder if we have the same problem here as in the comment linked below, and whether we should Reset() instead of Close():

#126718 (comment)

I see, this was discussed here: #117493 (comment)

soltysh · 2024-10-15T11:31:27Z

A note for everyone who might be looking into this PR and trying it out with the latest kind: kind uses containerd internally. The version currently used there includes the fix from containerd/containerd#8418, which means you won't get failures from the e2e tests in this PR.

Newer versions of containerd, on the other hand, have switched to relying on the streaming implementation provided by k8s; this will be available in versions 1.8 and 2.0. So to get the error back, and thus verify the functionality of this PR, one needs to rebuild the base image with a newer containerd and use that when building a node image from this PR (kind build node-image --base-image=...).

liggitt · 2024-10-15T15:41:08Z

synced offline with @aojea and @soltysh

I still don't think plumbing up specific error details from the kubelet / containerd and skipping alerting the client that there was an error is the correct approach.

As I mentioned in #117493 (comment), the loop in the client that currently unconditionally tears down the overall streamConn when an error is seen handling a single portforward request seems like the place we should be making changes.

git diff
diff --git a/staging/src/k8s.io/client-go/tools/portforward/portforward.go b/staging/src/k8s.io/client-go/tools/portforward/portforward.go
index 83ef3e929b3..365b7dd1603 100644
--- a/staging/src/k8s.io/client-go/tools/portforward/portforward.go
+++ b/staging/src/k8s.io/client-go/tools/portforward/portforward.go
@@ -407,12 +407,22 @@ func (pf *PortForwarder) handleConnection(conn net.Conn, port ForwardedPort) {
        case <-localError:
        }
 
+       // This is from https://github.com/kubernetes/kubernetes/pull/126718
+       /*
+               reset dataStream to discard any unsent data, preventing port forwarding from being blocked.
+               we must reset dataStream before waiting on errorChan, otherwise, the blocking data will affect errorStream and cause <-errorChan to block indefinitely.
+       */
+       _ = dataStream.Reset()
+
        // always expect something on errorChan (it may be nil)
        err = <-errorChan
        if err != nil {
                runtime.HandleError(err)
-               pf.streamConn.Close()
+               // Don't tear down the whole parent port-forward pf.streamConn when there's an error handling a single request
+               // TODO: *something* has to notice and tear down the whole parent port-forward pf.streamConn when the backend is completely gone... what / where?
        }
+       // This also forces drain of the errorStream, similar to dataStream above
+       _ = errorStream.Reset()
 }
 
 // Close stops all listeners of PortForwarder.
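
The ordering proposed in the diff above can be sketched in isolation as follows. This is only an illustration under stated assumptions, not client-go code: `resettable`, `fakeStream`, `finishRequest`, and `run` are hypothetical names, and the fake stream only records calls. It shows the data stream being reset before the wait on the error channel, the error being reported without closing the parent connection, and the error stream being reset last.

```go
package main

import (
	"errors"
	"fmt"
)

// resettable is a hypothetical stand-in for the stream type in the diff;
// only Reset is modelled.
type resettable interface {
	Reset() error
}

// fakeStream records each Reset call so the ordering can be observed.
type fakeStream struct {
	name  string
	steps *[]string
}

func (s *fakeStream) Reset() error {
	*s.steps = append(*s.steps, "reset "+s.name)
	return nil
}

// finishRequest follows the order from the diff: reset the data stream,
// wait for the remote result, report any error without tearing down the
// parent connection, then reset the error stream.
func finishRequest(data, errStream resettable, errorChan <-chan error, report func(error)) {
	_ = data.Reset()
	if err := <-errorChan; err != nil {
		report(err)
	}
	_ = errStream.Reset()
}

// run drives finishRequest with a pre-filled error channel and returns
// the recorded sequence of steps.
func run() []string {
	var steps []string
	errorChan := make(chan error, 1)
	errorChan <- errors.New("remote: broken pipe")
	finishRequest(
		&fakeStream{name: "data", steps: &steps},
		&fakeStream{name: "error", steps: &steps},
		errorChan,
		func(err error) { steps = append(steps, "reported "+err.Error()) },
	)
	return steps
}

func main() {
	for _, step := range run() {
		fmt.Println(step)
	}
}
```

The sketch does not address the open TODO in the diff: something still has to notice when the backend is completely gone and tear down the parent connection.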

liggitt · 2024-10-15T15:41:32Z

@soltysh was going to dig into #117493 (comment) more as well

soltysh · 2024-10-21T16:49:00Z

To keep everyone updated, here are the findings so far:

Adding the resets as outlined by Jordan in #117493 (comment) only partially addresses the issue. It ensures we receive all the messages, but unfortunately neither our spdy.connection nor the upstream spdy.Connection offers an option to reset the connection without actually closing it, which would seem the cleanest approach at the moment. Without closing the channel, we hang when trying to create a new error channel in the next iteration and eventually time out after 30s.
Interestingly enough, when we create our connection we also register a ping which fires every 5 seconds; sadly, it reports success even when we're stuck trying to create a new stream.

(EDIT): The test I primarily focused on is this test from this PR, which sends big chunks of data continuously but, in my tests, gets stuck after sending ~1-2 MB of data. Surprisingly, I can keep the port-forward open and retry sending the data after the failures, and it will again manage to send a similar amount of data in subsequent requests.

soltysh · 2024-10-24T15:46:50Z

/hold
see #128318 and #128319, which include changes from this PR and #126718, plus additional fixes that should allow us to resolve the port-forward issue

k8s-ci-robot · 2024-10-26T02:21:50Z

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-triage-robot · 2025-02-02T15:41:09Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2025-02-02T15:41:15Z

@k8s-triage-robot: Closed this PR.

k8s-ci-robot requested review from mtaufen and sjenning April 20, 2023 03:01

k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 20, 2023

sxllwx force-pushed the fix/issue-74551 branch from 909d43c to 49c6406 Compare April 20, 2023 03:27

k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 20, 2023

sxllwx marked this pull request as ready for review April 20, 2023 08:52

k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2023

k8s-ci-robot requested a review from pacoxu April 20, 2023 08:52

sxllwx force-pushed the fix/issue-74551 branch from 49c6406 to 2398b06 Compare April 20, 2023 10:22

bart0sh reviewed Apr 21, 2023

k8s-ci-robot assigned bart0sh Apr 21, 2023

sxllwx force-pushed the fix/issue-74551 branch from 2398b06 to aa8d9b6 Compare April 24, 2023 02:34

k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 24, 2023

sxllwx added 2 commits May 6, 2024 10:47

bugfix(port-forward): Correctly handle known errors

551ef3c

- structured error types added
- use Close instead of Reset close data and error conn

use unified method to handle errors encountered during portforward

72b1371

k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 6, 2024

sxllwx force-pushed the fix/issue-74551 branch from 57e4e4d to 8354e18 Compare May 6, 2024 07:14

sxllwx added 2 commits May 6, 2024 15:42

fix client-side and add unit tests

54d28b6

Add comments to the portForwardErrResponse

a68cf3e

sxllwx force-pushed the fix/issue-74551 branch from 8354e18 to a68cf3e Compare May 6, 2024 07:42

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2024

k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2024

aojea reviewed Oct 2, 2024

soltysh mentioned this pull request Oct 15, 2024

fix: draining remote stream after port-forward connection broken #126718

Closed

This was referenced Oct 24, 2024

Reset streams when an error happens during port-forward (part 1/2) #128318

Merged

Reset streams when an error happens during port-forward (part 2/2) #128319

Closed

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 24, 2024

dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 3, 2025

k8s-ci-robot closed this Feb 2, 2025

14 participants
