Skip to content

Cleanly shutdown tap stream to data plane proxies#1624

Merged
dadjeibaah merged 8 commits intomasterfrom
dad/fix-the-tap
Sep 12, 2018
Merged

Cleanly shutdown tap stream to data plane proxies#1624
dadjeibaah merged 8 commits intomasterfrom
dad/fix-the-tap

Conversation

@dadjeibaah
Copy link
Contributor

@dadjeibaah dadjeibaah commented Sep 11, 2018

This work is part of a #1504. Sometimes, the tap server causes the controller pod to restart after it receives this error:

time="2018-08-22T05:57:00Z" level=error msg="rpc error: code = Canceled desc = context canceled"
panic: send on closed channel

goroutine 139 [running]:
github.com/linkerd/linkerd2/controller/tap.(*server).tapProxy(0xc420442c20, 0x14843a0, 0xc420966fc0, 0x3f800000, 0xc420981510, 0xc4205fe740, 0xa, 0xc420a0c180)
	/go/src/github.com/linkerd/linkerd2/controller/tap/server.go:278 +0x3b6
created by github.com/linkerd/linkerd2/controller/tap.(*server).TapByResource
	/go/src/github.com/linkerd/linkerd2/controller/tap/server.go:94 +0x50b

This error arises when the Tap server does not close gRPC tap streams to proxies before the tap server terminates its streams to its upstream clients and causes the controller pod to restart.

This PR uses the request context from the initial TapByReource to help shutdown tap streams to the data plane proxies gracefully.

fixes #1504

Signed-off-by: Dennis Adjei-Baah [email protected]

Dennis Adjei-Baah added 2 commits September 11, 2018 08:18
Signed-off-by: Dennis Adjei-Baah <[email protected]>
func (s *server) tapProxy(maxRps float32, match *proxy.ObserveRequest_Match, addr string, events chan<- *public.TapEvent, wait *sync.WaitGroup) {
tapAddr := fmt.Sprintf("%s:%d", addr, s.tapPort)
log.Infof("Establishing tap on %s", tapAddr)
ctx := context.Background()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without propagating the context from the client request, how do the proxy tap requests ever get cancelled?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my reading of the gRPC GoDocs DialContext usesctx the set up of the connection and the ctx struct doesn't manage the lifetime of the connection. IIUC, that means there is no need for the proxy tap requests to use the parent ctx provided by the TapByResource. The thought process is that each connection to a data plane tap should have its own context and when the connection is closed the context is canceled with it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure what the difference is between Dial and DialContext but we could probably get away with just using Dial when creating a connection since it provides its own context.

events := make(chan *public.TapEvent)
var wg sync.WaitGroup

go func() { // Stop sending back events if the request is cancelled
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if a better solution would be to get rid of this goroutine altogether and instead add a switch when reading from the events channel so that we read from both the events channel and the context's Done channel and return when Done becomes closed. I think this has the intended behavior without needing to worry about closing the events channel.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good point. We would need to wrap the select in a infinite loop so that we don't exit the function in the event where we don't receive any events and the stream context isn't marked as Done()

@dadjeibaah dadjeibaah self-assigned this Sep 11, 2018
Dennis Adjei-Baah and others added 4 commits September 11, 2018 11:29
Copy link
Member

@adleong adleong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes to tap server LGTM.

Are the websocket changes related?

@dadjeibaah
Copy link
Contributor Author

@adleong, from the debugging session I had with @klingerf, it looks like there aren't related. I have created the issue #1630 to track that issue.

Dennis Adjei-Baah added 2 commits September 12, 2018 10:07
Signed-off-by: Dennis Adjei-Baah <[email protected]>
Signed-off-by: Dennis Adjei-Baah <[email protected]>
Copy link
Contributor

@klingerf klingerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐️ This looks good to me. Thanks for making those updates!

@dadjeibaah dadjeibaah merged commit 00d0a26 into master Sep 12, 2018
@dadjeibaah dadjeibaah deleted the dad/fix-the-tap branch September 12, 2018 22:00
zachalbert added a commit to zachalbert/linkerd2 that referenced this pull request Sep 13, 2018
* master:
  Update CHANGES.md for v18.9.1 release (linkerd#1631)
  Cleanly shutdown tap stream to data plane proxies (linkerd#1624)
  Change breadcrumb header to default font in styles.css (linkerd#1633)
  Improve top table to better cope with high RPS traffic (linkerd#1634)
  Add small success rate chart to table, misc web tweaks (linkerd#1628)
  Consolidate the source and destination columns in the Tap and Top tables (linkerd#1620)
  remove extraneous calc function in sidebar.css (linkerd#1632)
  Display more helpful websocket errors (linkerd#1626)
  Add breadcrumb navigation at the top of linkerd dashboard (linkerd#1613)
  Introduce inject check for known sidecars (linkerd#1619)
  Bump Prometheus to v2.4.0, Grafana to 5.2.4 (linkerd#1625)
  Improve performance of tap table by throttling updates (linkerd#1623)
  Add with-source flag to top (linkerd#1614)

Conflicts:
	web/app/css/styles.css
	web/app/js/components/ResourceDetail.jsx
	web/app/js/index.js
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Tap from web sometimes causes tap server panic

3 participants