Fix error handling for Tap CLI and public<>tap server#177
Conversation
22e042f to
6b5b5f2
Compare
|
FWIW, tests arent passing on travis and I am debugging it so this isnt super ready for review yet :( |
05c897c to
7ff2fa8
Compare
|
Alright, adding |
7ff2fa8 to
665f034
Compare
siggy
left a comment
There was a problem hiding this comment.
These are all great changes (especially the tests!). I've left a few comments.
For next time, I think this could have been broken up into smaller PRs, particularly the code reorgs, some examples:
- renaming
server.gotohttp_server.go - reorg in
client.go - splitting
proto_over_http.goout ofclient.go - reimplementing
serverMarshal()aswriteProtoToHttpResponse()
controller/api/public/client.go
Outdated
| return err | ||
| } | ||
|
|
||
| defer httpRsp.Body.Close() |
There was a problem hiding this comment.
should this defer close go immediately after c.post(... ?
There was a problem hiding this comment.
I'm not sure. there's a lot of error handling before we get the Body, I would assume that we should "group" the lines that handle that object. Maybe this is a code smell indicating that this should be extracted in a func?
There was a problem hiding this comment.
my concern was that if we return prior to this, we'll resource leak. reading this code a bit more i'm realizing checkIfResponseHasConduitError also calls close. maybe the safest thing to do is to httpRsp.Body.Close() immediately after the c.post(... error check, and then checkIfResponseHasConduitError() need not be concerned with closing?
There was a problem hiding this comment.
👍 to @siggy's comment. We should defer the the .Close call immediately after checking the error that's returned by the call to c.post, and then remove all of the other .Close calls.
There was a problem hiding this comment.
I think I know what makes me feel funny about this: the post call "owns" the Response, but not the Body. The Body is only used in thos few lines at the bottom. I think that a good tool to think about this is that if we were to extract those few lines into a convertBodyIntoProto or something the defer would be extracted as well.
Nevertheless I thing the suggestions are good enough and just went ahead with them 👍
controller/api/public/client.go
Outdated
| if err != nil { | ||
| log.Debugf("Error invoking [%s]: %v", url.String(), err) | ||
| } else { | ||
| log.Debugf("Response from [%s] had hesaders: %v", url.String(), rsp.Header) |
controller/api/public/client.go
Outdated
|
|
||
| serverUrl := apiURL.ResolveReference(&url.URL{Path: ApiPrefix}) | ||
|
|
||
| log.Debugf("Expecting Conduit Public API to be server over [%s]", serverUrl) |
| rsp, err := s.tapClient.Tap(tapStream.Context(), req) | ||
| tapClient, err := s.tapClient.Tap(tapStream.Context(), req) | ||
| if err != nil { | ||
| //TODO: why not return the error? |
3630821 to
5651fe2
Compare
|
@siggy you are right, sorry about the PR size. I think that I need to build better intuition for this. Both coming from a trunk-based development background and being a little worried about review turnaround time bias me towards larger and more end-to-end pieces, but I'll change that 💪 |
siggy
left a comment
There was a problem hiding this comment.
one comment about body.close and then lgtm.
understood re: code review styles. in case this hasn't been shared, an old friend wrote this, one of my favorite posts:
https://medium.com/@9len/on-code-review-16ea85f7c585
controller/api/public/client.go
Outdated
| return err | ||
| } | ||
|
|
||
| defer httpRsp.Body.Close() |
There was a problem hiding this comment.
my concern was that if we return prior to this, we'll resource leak. reading this code a bit more i'm realizing checkIfResponseHasConduitError also calls close. maybe the safest thing to do is to httpRsp.Body.Close() immediately after the c.post(... error check, and then checkIfResponseHasConduitError() need not be concerned with closing?
|
@klingerf do you still want to review this? |
See #163 Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
70a1920 to
0941d6b
Compare
| } | ||
|
|
||
| if totalBytesRead != messageLength { | ||
| return nil, fmt.Errorf("message declared length [%d] but could only read [%d] bytes") |
There was a problem hiding this comment.
Just to wrap my head around this error case, What scenarios would cause the code to reach this line when deserializing the HTTP? Proto? body.
0951b34 to
5e6684f
Compare
if totalBytesRead != messageLength {
return nil, fmt.Errorf("message declared length [%d] but could only read [%d] bytes")
}
The protocol we use here is custom, and it relies on declaring the message length using the first four bytes of the HTTP body. If everything goes well, the server code probably uses Unfortunately the world is a sad place. It could be that:
EDIT-- |
Signed-off-by: Phil Calcado <[email protected]>
Signed-off-by: Phil Calcado <[email protected]>
5e6684f to
74d85a2
Compare
klingerf
left a comment
There was a problem hiding this comment.
⭐️ This looks great! Thanks for addressing my review feedback and adding those additional tests.
| } | ||
| }) | ||
|
|
||
| t.Run("Can multiple messages in the same stream", func(t *testing.T) { |
There was a problem hiding this comment.
Hmm, maybe:
Can read multiple messages in the same stream
Signed-off-by: Phil Calcado <[email protected]>
74d85a2 to
0d634ed
Compare
proxy: update pinned version to 5b507a9 This picks up the following proxy commits: * eaabc48 Update tower-grpc * e9561de Update h2 to 0.1.16 * 28fd5e7 Add Route timeouts (linkerd/linkerd2-proxy#165) * 5637372 Re-flag tcp_duration tests as flaky * 20cbd18 Revise several log levels and messages (linkerd/linkerd2-proxy##177) * ae16978 Remove flakiness from 'profiles' tests * 49c29cd canonicalize: Only log errors at the WARN level when falling back (linkerd/linkerd2-proxy#174) * 486dd13 Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173) * 7adc50d Make timeouts for canonicalization DNS queries tuneable (linkerd/linkerd2-proxy#175) * 3188179 Try reducing CI flakiness by reducing RUST_TEST_THREADS to 1 Some of these changes will probably need changelog entries: * Improve logging when rejecting malformed HTTP/2 pseudo-headers (hyperium/h2#347) * Improve logging for gRPC errors (tower-rs/tower-grpc#111) * Add Route timeouts (linkerd/linkerd2-proxy#165) * Downgrade several of the noisiest log messages to TRACE (linkerd/linkerd2-proxy##177) * Add an environment variable for configuring the DNS canonicalization timeout (linkerd/linkerd2-proxy#175) * Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173) Perhaps all the logging related changes can be grouped into one changelog entry, though... Signed-off-by: Eliza Weisman <[email protected]>
This picks up the following proxy commits: * eaabc48 Update tower-grpc * e9561de Update h2 to 0.1.16 * 28fd5e7 Add Route timeouts (linkerd/linkerd2-proxy#165) * 5637372 Re-flag tcp_duration tests as flaky * 20cbd18 Revise several log levels and messages (linkerd/linkerd2-proxy##177) * ae16978 Remove flakiness from 'profiles' tests * 49c29cd canonicalize: Only log errors at the WARN level when falling back (linkerd/linkerd2-proxy#174) * 486dd13 Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173) * 7adc50d Make timeouts for canonicalization DNS queries tuneable (linkerd/linkerd2-proxy#175) * 3188179 Try reducing CI flakiness by reducing RUST_TEST_THREADS to 1 Some of these changes will probably need changelog entries: * Improve logging when rejecting malformed HTTP/2 pseudo-headers (hyperium/h2#347) * Improve logging for gRPC errors (tower-rs/tower-grpc#111) * Add Route timeouts (linkerd/linkerd2-proxy#165) * Downgrade several of the noisiest log messages to TRACE (linkerd/linkerd2-proxy##177) * Add an environment variable for configuring the DNS canonicalization timeout (linkerd/linkerd2-proxy#175) * Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173) Perhaps all the logging related changes can be grouped into one changelog entry, though... Signed-off-by: Eliza Weisman <[email protected]>
This started as an investigation of why
$ conduit tapwould abort withunexpected EOFat times.To debug, I've read all the protobuf-to-http code and, while at that I've cleaned up it a bit and wrote tests for a lot of it.
I finally identified that this problem was due to us not checking when the gRPC
Tap()returned an error (in this case because the pods didnt exist). Commit c27f400 adds this check.The EOF goes away in the client, but the error message still wasn't great:
So I did some digging and realized that we haven't been using gRPC errors properly. Commit a3954e6 adds that to the tap server, and after it here is the message displayed: