Cleanly shutdown tap stream to data plane proxies#1624
Conversation
Signed-off-by: Dennis Adjei-Baah <[email protected]>
Signed-off-by: Dennis Adjei-Baah <[email protected]>
controller/tap/server.go
Outdated
| func (s *server) tapProxy(maxRps float32, match *proxy.ObserveRequest_Match, addr string, events chan<- *public.TapEvent, wait *sync.WaitGroup) { | ||
| tapAddr := fmt.Sprintf("%s:%d", addr, s.tapPort) | ||
| log.Infof("Establishing tap on %s", tapAddr) | ||
| ctx := context.Background() |
There was a problem hiding this comment.
Without propagating the context from the client request, how do the proxy tap requests ever get cancelled?
There was a problem hiding this comment.
From my reading of the gRPC GoDocs DialContext usesctx the set up of the connection and the ctx struct doesn't manage the lifetime of the connection. IIUC, that means there is no need for the proxy tap requests to use the parent ctx provided by the TapByResource. The thought process is that each connection to a data plane tap should have its own context and when the connection is closed the context is canceled with it.
There was a problem hiding this comment.
I am not sure what the difference is between Dial and DialContext but we could probably get away with just using Dial when creating a connection since it provides its own context.
controller/tap/server.go
Outdated
| events := make(chan *public.TapEvent) | ||
| var wg sync.WaitGroup | ||
|
|
||
| go func() { // Stop sending back events if the request is cancelled |
There was a problem hiding this comment.
I'm wondering if a better solution would be to get rid of this goroutine altogether and instead add a switch when reading from the events channel so that we read from both the events channel and the context's Done channel and return when Done becomes closed. I think this has the intended behavior without needing to worry about closing the events channel.
There was a problem hiding this comment.
That is a good point. We would need to wrap the select in a infinite loop so that we don't exit the function in the event where we don't receive any events and the stream context isn't marked as Done()
Signed-off-by: Dennis Adjei-Baah <[email protected]>
Signed-off-by: Kevin Lingerfelt <[email protected]>
# Conflicts: # controller/tap/server.go
Signed-off-by: Dennis Adjei-Baah <[email protected]>
adleong
left a comment
There was a problem hiding this comment.
Changes to tap server LGTM.
Are the websocket changes related?
Signed-off-by: Dennis Adjei-Baah <[email protected]>
Signed-off-by: Dennis Adjei-Baah <[email protected]>
klingerf
left a comment
There was a problem hiding this comment.
⭐️ This looks good to me. Thanks for making those updates!
* master: Update CHANGES.md for v18.9.1 release (linkerd#1631) Cleanly shutdown tap stream to data plane proxies (linkerd#1624) Change breadcrumb header to default font in styles.css (linkerd#1633) Improve top table to better cope with high RPS traffic (linkerd#1634) Add small success rate chart to table, misc web tweaks (linkerd#1628) Consolidate the source and destination columns in the Tap and Top tables (linkerd#1620) remove extraneous calc function in sidebar.css (linkerd#1632) Display more helpful websocket errors (linkerd#1626) Add breadcrumb navigation at the top of linkerd dashboard (linkerd#1613) Introduce inject check for known sidecars (linkerd#1619) Bump Prometheus to v2.4.0, Grafana to 5.2.4 (linkerd#1625) Improve performance of tap table by throttling updates (linkerd#1623) Add with-source flag to top (linkerd#1614) Conflicts: web/app/css/styles.css web/app/js/components/ResourceDetail.jsx web/app/js/index.js
This work is part of a #1504. Sometimes, the tap server causes the controller pod to restart after it receives this error:
This error arises when the Tap server does not close gRPC tap streams to proxies before the tap server terminates its streams to its upstream clients and causes the controller pod to restart.
This PR uses the request context from the initial
TapByReourceto help shutdown tap streams to the data plane proxies gracefully.fixes #1504
Signed-off-by: Dennis Adjei-Baah [email protected]