@sxllwx
Member

@sxllwx sxllwx commented Apr 20, 2023

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #74551 #107203

Special notes for your reviewer:

For the detailed fault-location process, please refer to #74551 (comment)

containerd PR: containerd/containerd#8418

This PR mainly aims to handle EPIPE errors correctly. To do so, we rely on the error returned by h.forwarder.PortForward being either undecorated or wrapped in a way that lets the underlying error be detected with errors.Is(err, syscall.EPIPE).

Does this PR introduce a user-facing change?

Fixed port-forward to correctly handle a TCP connection that receives RST, preventing `kubectl port-forward` from failing with a `broken pipe` error.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot
Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/S Denotes a PR that changes 10-29 lines, ignoring generated files. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/needs-kind Indicates a PR lacks a `kind/foo` label and requires one. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 20, 2023
@k8s-ci-robot k8s-ci-robot added area/kubelet sig/node Categorizes an issue or PR as relevant to SIG Node. and removed do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Apr 20, 2023
@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 20, 2023
@sxllwx sxllwx marked this pull request as ready for review April 20, 2023 08:52
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 20, 2023
@k8s-ci-robot k8s-ci-robot requested a review from pacoxu April 20, 2023 08:52
@bart0sh
Contributor

bart0sh commented Apr 21, 2023

@sxllwx is this a user-visible issue? If it is, please provide a release note.

/triage accepted
/priority important-longterm
/assign

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. labels Apr 21, 2023
@k8s-ci-robot k8s-ci-robot added the release-note Denotes a PR that will be considered when it comes time to generate release notes. label Apr 24, 2023
sxllwx added 2 commits May 6, 2024 10:47
- structured error types added
- use Close instead of Reset close data and error conn
@k8s-ci-robot k8s-ci-robot added the sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. label May 6, 2024
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sxllwx
Once this PR has been reviewed and has the lgtm label, please ask for approval from liggitt. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sxllwx sxllwx force-pushed the fix/issue-74551 branch from 57e4e4d to 8354e18 Compare May 6, 2024 07:14
@sxllwx sxllwx force-pushed the fix/issue-74551 branch from 8354e18 to a68cf3e Compare May 6, 2024 07:42
@sxllwx
Member Author

sxllwx commented May 6, 2024

/test pull-kubernetes-unit

@sxllwx
Member Author

sxllwx commented Jun 27, 2024

PTAL thx~. @liggitt

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Mark this PR as fresh with /remove-lifecycle stale
  • Close this PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2024
@dhirschfeld

Ugh, a stale bot.

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Sep 25, 2024
ctx := context.Background()
defer p.dataStream.Close()
defer p.errorStream.Close()
defer p.dataStream.Close() //nolint: errcheck
Member

I wonder if we have the same problem as here, and whether we should Reset() instead of Close()

#126718 (comment)

Member

I see, this was discussed here: #117493 (comment)

@soltysh
Contributor

soltysh commented Oct 15, 2024

A note for everyone who might be looking into this PR and trying it out with the latest kind: kind uses containerd internally. The version currently used there includes the fix from containerd/containerd#8418, which means you won't get failures from the e2e tests in this PR.

Newer versions of containerd, on the other hand, switched to relying on the streaming implementation provided by k8s; this will be available in versions 1.8 and 2.0. So to get the error back, and thus verify the functionality of this PR, one needs to rebuild the base image with a newer containerd and use that when building a node-image from this PR (kind build node-image --base-image=...).

@liggitt
Member

liggitt commented Oct 15, 2024

synced offline with @aojea and @soltysh

I still don't think plumbing up specific error details from the kubelet / containerd and skipping alerting the client that there was an error is the correct approach.

As I mentioned in #117493 (comment), the loop in the client that currently unconditionally tears down the overall streamConn when an error is seen handling a single portforward request seems like the place we should be making changes.

git diff
diff --git a/staging/src/k8s.io/client-go/tools/portforward/portforward.go b/staging/src/k8s.io/client-go/tools/portforward/portforward.go
index 83ef3e929b3..365b7dd1603 100644
--- a/staging/src/k8s.io/client-go/tools/portforward/portforward.go
+++ b/staging/src/k8s.io/client-go/tools/portforward/portforward.go
@@ -407,12 +407,22 @@ func (pf *PortForwarder) handleConnection(conn net.Conn, port ForwardedPort) {
        case <-localError:
        }
 
+       // This is from https://github.com/kubernetes/kubernetes/pull/126718
+       /*
+               reset dataStream to discard any unsent data, preventing port forwarding from being blocked.
+               we must reset dataStream before waiting on errorChan, otherwise, the blocking data will affect errorStream and cause <-errorChan to block indefinitely.
+       */
+       _ = dataStream.Reset()
+
        // always expect something on errorChan (it may be nil)
        err = <-errorChan
        if err != nil {
                runtime.HandleError(err)
-               pf.streamConn.Close()
+               // Don't tear down the whole parent port-forward pf.streamConn when there's an error handling a single request
+               // TODO: *something* has to notice and tear down the whole parent port-forward pf.streamConn when the backend is completely gone... what / where?
        }
+       // This also forces drain of the errorStream, similar to dataStream above
+       _ = errorStream.Reset()
 }
 
 // Close stops all listeners of PortForwarder.

@liggitt
Member

liggitt commented Oct 15, 2024

@soltysh was going to dig into #117493 (comment) more as well

@soltysh
Contributor

soltysh commented Oct 21, 2024

To keep everyone updated, here are the findings so far:

Adding the resets as outlined by Jordan in #117493 (comment) only partially addresses the issue. It ensures we receive all the messages, but unfortunately neither our spdy.connection nor the upstream spdy.Connection offers us an option to reset the connection without actually closing it, which seems like the cleanest approach at the moment. Without closing the channel, we will hang when trying to create a new error channel in the next iteration, and will eventually time out after 30s.
Interestingly enough, when we create our connection we also register a ping which fires every 5 seconds. Sadly, it reports success even when we're stuck trying to create a new stream.

(EDIT): The test I primarily focused on is this test from this PR, which sends big chunks of data continuously but, in my testing, gets stuck after sending ~1-2 MB of data. Surprisingly, I can keep the port-forward open and retry sending the data after the failures, and it will again manage to send a similar amount of data in subsequent requests.

@soltysh
Contributor

soltysh commented Oct 24, 2024

/hold
See #128318 and #128319, which include changes from this PR and #126718, plus additional fixes that should allow us to resolve the port-forward issue.

@k8s-ci-robot k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. labels Oct 24, 2024
@k8s-ci-robot
Contributor

PR needs rebase.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@dims dims added the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jan 3, 2025
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

  • Reopen this PR with /reopen
  • Mark this PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closed this PR.

Labels

area/kubelet area/test cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. release-note Denotes a PR that will be considered when it comes time to generate release notes. sig/api-machinery Categorizes an issue or PR as relevant to SIG API Machinery. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/testing Categorizes an issue or PR as relevant to SIG Testing. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Projects

Archived in project

Development

Successfully merging this pull request may close these issues.

kubectl port-forward broken pipe