Skip to content

Fix error handling for Tap CLI and public<>tap server#177

Merged
pcalcado merged 16 commits intomasterfrom
phil/tap-leak
Jan 25, 2018
Merged

Fix error handling for Tap CLI and public<>tap server#177
pcalcado merged 16 commits intomasterfrom
phil/tap-leak

Conversation

@pcalcado
Copy link
Contributor

@pcalcado pcalcado commented Jan 19, 2018

This started as an investigation of why $ conduit tap would abort with unexpected EOF at times.

To debug, I've read all the protobuf-to-http code and, while at that I've cleaned up it a bit and wrote tests for a lot of it.

I finally identified that this problem was due to us not checking when the gRPC Tap() returned an error (in this case because the pods didnt exist). Commit c27f400 adds this check.

The EOF goes away in the client, but the error message still wasn't great:

[email protected]:~/code/src/github.com/runconduit/conduit  $ conduit tap pod default/pod-with-no-rules  --api-addr localhost:8085
rpc error: code = Unknown desc = no pod exists for key default/pod-with-no-rules%

So I did some digging and realized that we haven't been using gRPC errors properly. Commit a3954e6 adds that to the tap server, and after it here is the message displayed:

[email protected]:~/code/src/github.com/runconduit/conduit  $ conduit tap pod default/pod-with-no-rules  --api-addr localhost:8085
no pod exists for key default/pod-with-no-rules%

@pcalcado pcalcado added the review/ready Issue has a reviewable PR label Jan 19, 2018
@pcalcado pcalcado self-assigned this Jan 19, 2018
@pcalcado pcalcado force-pushed the phil/tap-leak branch 8 times, most recently from 22e042f to 6b5b5f2 Compare January 19, 2018 20:53
@pcalcado
Copy link
Contributor Author

FWIW, tests arent passing on travis and I am debugging it so this isnt super ready for review yet :(

@pcalcado pcalcado removed the review/ready Issue has a reviewable PR label Jan 19, 2018
@pcalcado pcalcado force-pushed the phil/tap-leak branch 2 times, most recently from 05c897c to 7ff2fa8 Compare January 19, 2018 22:48
@pcalcado pcalcado added review/ready Issue has a reviewable PR reviewable labels Jan 19, 2018
@pcalcado
Copy link
Contributor Author

Alright, adding review & reviewable to this, it should be good to go!

Copy link
Member

@siggy siggy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are all great changes (especially the tests!). I've left a few comments.

For next time, I think this could have been broken up into smaller PRs, particularly the code reorgs, some examples:

  • renaming server.go to http_server.go
  • reorg in client.go
  • splitting proto_over_http.go out of client.go
  • reimplementing serverMarshal() as writeProtoToHttpResponse()

return err
}

defer httpRsp.Body.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should this defer close go immediately after c.post(... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure. there's a lot of error handling before we get the Body, I would assume that we should "group" the lines that handle that object. Maybe this is a code smell indicating that this should be extracted in a func?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concern was that if we return prior to this, we'll resource leak. reading this code a bit more i'm realizing checkIfResponseHasConduitError also calls close. maybe the safest thing to do is to httpRsp.Body.Close() immediately after the c.post(... error check, and then checkIfResponseHasConduitError() need not be concerned with closing?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 to @siggy's comment. We should defer the the .Close call immediately after checking the error that's returned by the call to c.post, and then remove all of the other .Close calls.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I know what makes me feel funny about this: the post call "owns" the Response, but not the Body. The Body is only used in thos few lines at the bottom. I think that a good tool to think about this is that if we were to extract those few lines into a convertBodyIntoProto or something the defer would be extracted as well.

Nevertheless I thing the suggestions are good enough and just went ahead with them 👍

if err != nil {
log.Debugf("Error invoking [%s]: %v", url.String(), err)
} else {
log.Debugf("Response from [%s] had hesaders: %v", url.String(), rsp.Header)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hesaders


serverUrl := apiURL.ResolveReference(&url.URL{Path: ApiPrefix})

log.Debugf("Expecting Conduit Public API to be server over [%s]", serverUrl)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server

rsp, err := s.tapClient.Tap(tapStream.Context(), req)
tapClient, err := s.tapClient.Tap(tapStream.Context(), req)
if err != nil {
//TODO: why not return the error?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe log something here?

@pcalcado pcalcado force-pushed the phil/tap-leak branch 2 times, most recently from 3630821 to 5651fe2 Compare January 23, 2018 16:12
@pcalcado
Copy link
Contributor Author

@siggy you are right, sorry about the PR size.

I think that I need to build better intuition for this. Both coming from a trunk-based development background and being a little worried about review turnaround time bias me towards larger and more end-to-end pieces, but I'll change that 💪

Copy link
Member

@siggy siggy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment about body.close and then lgtm.

understood re: code review styles. in case this hasn't been shared, an old friend wrote this, one of my favorite posts:
https://medium.com/@9len/on-code-review-16ea85f7c585

return err
}

defer httpRsp.Body.Close()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

my concern was that if we return prior to this, we'll resource leak. reading this code a bit more i'm realizing checkIfResponseHasConduitError also calls close. maybe the safest thing to do is to httpRsp.Body.Close() immediately after the c.post(... error check, and then checkIfResponseHasConduitError() need not be concerned with closing?

@pcalcado
Copy link
Contributor Author

@klingerf do you still want to review this?

}

if totalBytesRead != messageLength {
return nil, fmt.Errorf("message declared length [%d] but could only read [%d] bytes")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to wrap my head around this error case, What scenarios would cause the code to reach this line when deserializing the HTTP? Proto? body.

@pcalcado
Copy link
Contributor Author

pcalcado commented Jan 24, 2018

@deebo91 said:

if totalBytesRead != messageLength {
 return nil, fmt.Errorf("message declared length [%d] but could only read [%d] bytes")
}

Just to wrap my head around this error case, What scenarios would cause the code to reach this line when deserializing the HTTP? Proto? body.

The protocol we use here is custom, and it relies on declaring the message length using the first four bytes of the HTTP body. If everything goes well, the server code probably uses serializeAsPayload to encode the message and everybody will be happy.

Unfortunately the world is a sad place. It could be that:

  • we accidentally introduce a code path on the server that doesn't call serializeAsPayload but rather use a copy adn paste'd version that isn't broken
  • we might want to change the protocol in an upcoming version (say using more or fewer bytes than 4) and we might have a situation where the user has a CLI version that is a bit older and doesn't know about the protocol change
  • When writing distributed systems, even if you have absolute control of the from and to components you still have to deal with the unreliable network. There are many pieces of middleware between the CLI and the actual server, from the k8s API to k8s itself to maybe even a web accelerator proxy thing or some corporate spyware that the user might have installed in their network. These things, especially some weird proprietary "security" tools corporations sometimes use, can do all sort of transformations with they payload. It might be the case that you hit a defective one and they strip out some bytes from your message. If we don't check this here, we will have a very confusing error when this gets unmarshalled. Instead, we should detect the error as soon as possible and inform this function's caller that something isn't cool.

EDIT--
OTOH, @peczenyj just reminded me that ReadFull returns an error if fewer bytes were read, so this is useless 😎

Phil Calcado added 2 commits January 24, 2018 17:23
Copy link
Contributor

@klingerf klingerf left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⭐️ This looks great! Thanks for addressing my review feedback and adding those additional tests.

}
})

t.Run("Can multiple messages in the same stream", func(t *testing.T) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, maybe:

Can read multiple messages in the same stream

@pcalcado pcalcado merged commit 9410da4 into master Jan 25, 2018
@pcalcado pcalcado removed the review/ready Issue has a reviewable PR label Jan 25, 2018
@pcalcado pcalcado deleted the phil/tap-leak branch January 26, 2018 18:30
hawkw added a commit that referenced this pull request Jan 24, 2019
proxy: update pinned version to 5b507a9

This picks up the following proxy commits:

* eaabc48 Update tower-grpc
* e9561de Update h2 to 0.1.16
* 28fd5e7 Add Route timeouts (linkerd/linkerd2-proxy#165)
* 5637372 Re-flag tcp_duration tests as flaky
* 20cbd18 Revise several log levels and messages (linkerd/linkerd2-proxy##177)
* ae16978 Remove flakiness from 'profiles' tests
* 49c29cd canonicalize: Only log errors at the WARN level when falling back (linkerd/linkerd2-proxy#174)
* 486dd13 Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173)
* 7adc50d Make timeouts for canonicalization DNS queries tuneable (linkerd/linkerd2-proxy#175)
* 3188179 Try reducing CI flakiness by reducing RUST_TEST_THREADS to 1

Some of these changes will probably need changelog entries:

* Improve logging when rejecting malformed HTTP/2 pseudo-headers
  (hyperium/h2#347)
* Improve logging for gRPC errors (tower-rs/tower-grpc#111)
* Add Route timeouts (linkerd/linkerd2-proxy#165)
* Downgrade several of the noisiest log messages to TRACE
  (linkerd/linkerd2-proxy##177)
* Add an environment variable for configuring the DNS canonicalization
  timeout (linkerd/linkerd2-proxy#175)
* Make outbound router honor `l5d-dst-override` header
  (linkerd/linkerd2-proxy#173)

Perhaps all the logging related changes can be grouped into one
changelog entry, though...

Signed-off-by: Eliza Weisman <[email protected]>
hawkw added a commit that referenced this pull request Jan 24, 2019
This picks up the following proxy commits:

* eaabc48 Update tower-grpc
* e9561de Update h2 to 0.1.16
* 28fd5e7 Add Route timeouts (linkerd/linkerd2-proxy#165)
* 5637372 Re-flag tcp_duration tests as flaky
* 20cbd18 Revise several log levels and messages (linkerd/linkerd2-proxy##177)
* ae16978 Remove flakiness from 'profiles' tests
* 49c29cd canonicalize: Only log errors at the WARN level when falling back (linkerd/linkerd2-proxy#174)
* 486dd13 Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173)
* 7adc50d Make timeouts for canonicalization DNS queries tuneable (linkerd/linkerd2-proxy#175)
* 3188179 Try reducing CI flakiness by reducing RUST_TEST_THREADS to 1

Some of these changes will probably need changelog entries:

* Improve logging when rejecting malformed HTTP/2 pseudo-headers
  (hyperium/h2#347)
* Improve logging for gRPC errors (tower-rs/tower-grpc#111)
* Add Route timeouts (linkerd/linkerd2-proxy#165)
* Downgrade several of the noisiest log messages to TRACE
  (linkerd/linkerd2-proxy##177)
* Add an environment variable for configuring the DNS canonicalization
  timeout (linkerd/linkerd2-proxy#175)
* Make outbound router honor `l5d-dst-override` header
  (linkerd/linkerd2-proxy#173)

Perhaps all the logging related changes can be grouped into one
changelog entry, though...

Signed-off-by: Eliza Weisman <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Extract and test protobuf-over-http code

4 participants