Fix error handling for Tap CLI and public<>tap server by pcalcado · Pull Request #177 · linkerd/linkerd2

pcalcado · 2018-01-19T17:43:45Z

first step towards CLI doesn't complain when given unknown deployments, pods, or paths #49
closes Extract and test protobuf-over-http code #106

This started as an investigation of why $ conduit tap would abort with unexpected EOF at times.

To debug, I read all the protobuf-to-http code and, while at it, cleaned it up a bit and wrote tests for a lot of it.

I finally identified that this problem was due to us not checking when the gRPC Tap() returned an error (in this case because the pods didn't exist). Commit c27f400 adds this check.

The EOF goes away in the client, but the error message still wasn't great:

[email protected]:~/code/src/github.com/runconduit/conduit  $ conduit tap pod default/pod-with-no-rules  --api-addr localhost:8085
rpc error: code = Unknown desc = no pod exists for key default/pod-with-no-rules%

So I did some digging and realized that we haven't been using gRPC errors properly. Commit a3954e6 adds that to the tap server, and after it here is the message displayed:

[email protected]:~/code/src/github.com/runconduit/conduit  $ conduit tap pod default/pod-with-no-rules  --api-addr localhost:8085
no pod exists for key default/pod-with-no-rules%
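
A rough sketch of the two fixes described above: propagate the error from the Tap call instead of dropping it, and attach a code to the error so the client can print only the description. This is a stdlib-only stand-in, not the PR's code and not the gRPC `status` package; `statusError`, `openTapStream`, `tap` and `userMessage` are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// code is a hypothetical stand-in for a gRPC status code.
type code int

const (
	unknown code = iota
	notFound
)

// statusError is a hypothetical error type carrying a code plus a message.
type statusError struct {
	code code
	msg  string
}

func (e *statusError) Error() string {
	return fmt.Sprintf("rpc error: code = %d desc = %s", e.code, e.msg)
}

// openTapStream pretends to be the remote Tap call: it fails with a
// coded error when the requested pod is not known.
func openTapStream(pods map[string]bool, key string) error {
	if !pods[key] {
		return &statusError{code: notFound, msg: "no pod exists for key " + key}
	}
	return nil
}

// tap checks the error from openTapStream and returns it to the caller
// instead of carrying on with a stream that was never opened.
func tap(pods map[string]bool, key string) error {
	if err := openTapStream(pods, key); err != nil {
		return err
	}
	return nil
}

// userMessage extracts only the description when the error carries a code,
// and falls back to the raw error text otherwise.
func userMessage(err error) string {
	var se *statusError
	if errors.As(err, &se) {
		return se.msg
	}
	return err.Error()
}

func main() {
	pods := map[string]bool{"default/web": true}
	if err := tap(pods, "default/pod-with-no-rules"); err != nil {
		fmt.Println(userMessage(err))
	}
}
```

The point of the typed error is that the CLI can decide how to present the failure without parsing a formatted string.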

pcalcado · 2018-01-19T20:53:55Z

FWIW, tests aren't passing on Travis and I am debugging it, so this isn't super ready for review yet :(

pcalcado · 2018-01-19T22:49:53Z

Alright, adding review & reviewable to this, it should be good to go!

siggy

These are all great changes (especially the tests!). I've left a few comments.

For next time, I think this could have been broken up into smaller PRs, particularly the code reorgs, some examples:

renaming server.go to http_server.go
reorg in client.go
splitting proto_over_http.go out of client.go
reimplementing serverMarshal() as writeProtoToHttpResponse()

siggy · 2018-01-22T23:13:30Z

controller/api/public/client.go

+		return err
+	}

+	defer httpRsp.Body.Close()


should this defer close go immediately after c.post(... ?

I'm not sure. there's a lot of error handling before we get the Body, I would assume that we should "group" the lines that handle that object. Maybe this is a code smell indicating that this should be extracted in a func?

my concern was that if we return prior to this, we'll resource leak. reading this code a bit more i'm realizing checkIfResponseHasConduitError also calls close. maybe the safest thing to do is to httpRsp.Body.Close() immediately after the c.post(... error check, and then checkIfResponseHasConduitError() need not be concerned with closing?

👍 to @siggy's comment. We should defer the .Close call immediately after checking the error that's returned by the call to c.post, and then remove all of the other .Close calls.

I think I know what makes me feel funny about this: the post call "owns" the Response, but not the Body. The Body is only used in those few lines at the bottom. I think that a good tool to think about this is that if we were to extract those few lines into a convertBodyIntoProto or something the defer would be extracted as well.

Nevertheless I think the suggestions are good enough and just went ahead with them 👍
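
For readers following the thread, here is a minimal sketch of the pattern the reviewers converged on: defer the Close right after the error check on the request, so every later return path releases the body. It is not the PR's client code; `fetchBody` and `startServer` are hypothetical helpers, and the in-process test server exists only to make the example runnable.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// fetchBody posts to url and returns the response body.
func fetchBody(client *http.Client, url string) ([]byte, error) {
	rsp, err := client.Post(url, "application/octet-stream", nil)
	if err != nil {
		// When the request fails there is no body to close.
		return nil, err
	}
	// Deferred immediately after the error check: every return below,
	// including the early one on a bad status, closes the body.
	defer rsp.Body.Close()

	if rsp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status: %s", rsp.Status)
	}
	return io.ReadAll(rsp.Body)
}

// startServer is a hypothetical helper that serves a fixed status and body.
func startServer(status int, body string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(status)
		io.WriteString(w, body)
	}))
}

func main() {
	srv := startServer(http.StatusOK, "hello")
	defer srv.Close()

	body, err := fetchBody(srv.Client(), srv.URL)
	fmt.Println(string(body), err)
}
```

With the defer in one place, helpers that inspect the response no longer need to close the body themselves.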

siggy · 2018-01-22T23:15:25Z

controller/api/public/client.go

+	if err != nil {
+		log.Debugf("Error invoking [%s]: %v", url.String(), err)
+	} else {
+		log.Debugf("Response from [%s] had hesaders: %v", url.String(), rsp.Header)


siggy · 2018-01-22T23:30:07Z

controller/api/public/client.go


+	serverUrl := apiURL.ResolveReference(&url.URL{Path: ApiPrefix})
+
+	log.Debugf("Expecting Conduit Public API to be server over [%s]", serverUrl)


siggy · 2018-01-22T23:34:36Z

controller/api/public/grpc_server.go

-	rsp, err := s.tapClient.Tap(tapStream.Context(), req)
+	tapClient, err := s.tapClient.Tap(tapStream.Context(), req)
 	if err != nil {
+		//TODO: why not return the error?


maybe log something here?

pcalcado · 2018-01-23T16:16:16Z

@siggy you are right, sorry about the PR size.

I think that I need to build better intuition for this. Both coming from a trunk-based development background and being a little worried about review turnaround time bias me towards larger and more end-to-end pieces, but I'll change that 💪

siggy

one comment about body.close and then lgtm.

understood re: code review styles. in case this hasn't been shared, an old friend wrote this, one of my favorite posts:
https://medium.com/@9len/on-code-review-16ea85f7c585

siggy · 2018-01-23T19:01:30Z

controller/api/public/client.go

+		return err
+	}

+	defer httpRsp.Body.Close()


my concern was that if we return prior to this, we'll resource leak. reading this code a bit more i'm realizing checkIfResponseHasConduitError also calls close. maybe the safest thing to do is to httpRsp.Body.Close() immediately after the c.post(... error check, and then checkIfResponseHasConduitError() need not be concerned with closing?

pcalcado · 2018-01-23T22:38:13Z

@klingerf do you still want to review this?


dadjeibaah · 2018-01-24T21:46:27Z

controller/api/public/proto_over_http.go

+	}
+
+	if totalBytesRead != messageLength {
+		return nil, fmt.Errorf("message declared length [%d] but could only read [%d] bytes")


Just to wrap my head around this error case, What scenarios would cause the code to reach this line when deserializing the HTTP? Proto? body.

pcalcado · 2018-01-24T22:06:54Z

@deebo91 said:

if totalBytesRead != messageLength {
 return nil, fmt.Errorf("message declared length [%d] but could only read [%d] bytes")
}

Just to wrap my head around this error case, What scenarios would cause the code to reach this line when deserializing the HTTP? Proto? body.

The protocol we use here is custom, and it relies on declaring the message length using the first four bytes of the HTTP body. If everything goes well, the server code probably uses serializeAsPayload to encode the message and everybody will be happy.

Unfortunately the world is a sad place. It could be that:

we accidentally introduce a code path on the server that doesn't call serializeAsPayload but rather uses a copy-and-pasted version that is broken
we might want to change the protocol in an upcoming version (say using more or fewer bytes than 4) and we might have a situation where the user has a CLI version that is a bit older and doesn't know about the protocol change
When writing distributed systems, even if you have absolute control of the from and to components you still have to deal with the unreliable network. There are many pieces of middleware between the CLI and the actual server, from the k8s API to k8s itself to maybe even a web accelerator proxy thing or some corporate spyware that the user might have installed in their network. These things, especially some weird proprietary "security" tools corporations sometimes use, can do all sorts of transformations to the payload. It might be the case that you hit a defective one and it strips out some bytes from your message. If we don't check this here, we will have a very confusing error when this gets unmarshalled. Instead, we should detect the error as soon as possible and inform this function's caller that something isn't cool.

EDIT--
OTOH, @peczenyj just reminded me that ReadFull returns an error if fewer bytes were read, so this is useless 😎
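
To make the framing discussion concrete, here is a small sketch of a length-prefixed reader built on io.ReadFull. It assumes a 4-byte big-endian prefix; the thread only says "the first four bytes", so the byte order is a guess, and `frame`, `readFrame` and `decodeAll` are hypothetical names rather than the PR's functions.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

const lengthPrefixSize = 4 // assumed prefix width, per the discussion above

// frame prepends the message length (big-endian, an assumption) to msg.
func frame(msg []byte) []byte {
	out := make([]byte, lengthPrefixSize+len(msg))
	binary.BigEndian.PutUint32(out, uint32(len(msg)))
	copy(out[lengthPrefixSize:], msg)
	return out
}

// readFrame reads one length-prefixed message from r.
// It returns io.EOF only when the stream ends cleanly between messages.
func readFrame(r io.Reader) ([]byte, error) {
	var prefix [lengthPrefixSize]byte
	if _, err := io.ReadFull(r, prefix[:]); err != nil {
		return nil, err
	}
	n := binary.BigEndian.Uint32(prefix[:])

	msg := make([]byte, n)
	// io.ReadFull already fails when fewer than n bytes arrive, so no
	// separate "bytes read != declared length" comparison is needed.
	if _, err := io.ReadFull(r, msg); err != nil {
		if err == io.EOF {
			// A declared but entirely missing body is still a truncation.
			err = io.ErrUnexpectedEOF
		}
		return nil, fmt.Errorf("message declared length [%d] but body was truncated: %w", n, err)
	}
	return msg, nil
}

// decodeAll reads every message in data, stopping at a clean end of stream.
func decodeAll(data []byte) ([][]byte, error) {
	r := bytes.NewReader(data)
	var msgs [][]byte
	for {
		msg, err := readFrame(r)
		if err == io.EOF {
			return msgs, nil
		}
		if err != nil {
			return msgs, err
		}
		msgs = append(msgs, msg)
	}
}

func main() {
	stream := append(frame([]byte("first")), frame([]byte("second"))...)
	msgs, err := decodeAll(stream)
	fmt.Println(len(msgs), err)

	// Drop the last bytes to simulate middleware truncating the payload.
	_, err = decodeAll(stream[:len(stream)-3])
	fmt.Println(err)
}
```

A production reader would likely also cap the declared length before allocating, since the prefix comes from the network.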


klingerf

⭐️ This looks great! Thanks for addressing my review feedback and adding those additional tests.

klingerf · 2018-01-24T23:30:44Z

controller/api/public/proto_over_http_test.go

+		}
+	})
+
+	t.Run("Can multiple messages in the same stream", func(t *testing.T) {


Hmm, maybe:

Can read multiple messages in the same stream


proxy: update pinned version to 5b507a9

This picks up the following proxy commits:
* eaabc48 Update tower-grpc
* e9561de Update h2 to 0.1.16
* 28fd5e7 Add Route timeouts (linkerd/linkerd2-proxy#165)
* 5637372 Re-flag tcp_duration tests as flaky
* 20cbd18 Revise several log levels and messages (linkerd/linkerd2-proxy##177)
* ae16978 Remove flakiness from 'profiles' tests
* 49c29cd canonicalize: Only log errors at the WARN level when falling back (linkerd/linkerd2-proxy#174)
* 486dd13 Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173)
* 7adc50d Make timeouts for canonicalization DNS queries tuneable (linkerd/linkerd2-proxy#175)
* 3188179 Try reducing CI flakiness by reducing RUST_TEST_THREADS to 1

Some of these changes will probably need changelog entries:
* Improve logging when rejecting malformed HTTP/2 pseudo-headers (hyperium/h2#347)
* Improve logging for gRPC errors (tower-rs/tower-grpc#111)
* Add Route timeouts (linkerd/linkerd2-proxy#165)
* Downgrade several of the noisiest log messages to TRACE (linkerd/linkerd2-proxy##177)
* Add an environment variable for configuring the DNS canonicalization timeout (linkerd/linkerd2-proxy#175)
* Make outbound router honor `l5d-dst-override` header (linkerd/linkerd2-proxy#173)

Perhaps all the logging related changes can be grouped into one changelog entry, though...

Signed-off-by: Eliza Weisman <[email protected]>

pcalcado added the review/ready Issue has a reviewable PR label Jan 19, 2018

pcalcado requested review from adleong, klingerf and siggy January 19, 2018 17:46

pcalcado self-assigned this Jan 19, 2018

pcalcado added enhancement area/controller area/cli labels Jan 19, 2018

pcalcado force-pushed the phil/tap-leak branch 8 times, most recently from 22e042f to 6b5b5f2 Compare January 19, 2018 20:53

pcalcado removed the review/ready Issue has a reviewable PR label Jan 19, 2018

pcalcado force-pushed the phil/tap-leak branch 2 times, most recently from 05c897c to 7ff2fa8 Compare January 19, 2018 22:48

pcalcado added review/ready Issue has a reviewable PR reviewable labels Jan 19, 2018

pcalcado force-pushed the phil/tap-leak branch from 7ff2fa8 to 665f034 Compare January 19, 2018 22:52

siggy reviewed Jan 23, 2018

pcalcado force-pushed the phil/tap-leak branch 2 times, most recently from 3630821 to 5651fe2 Compare January 23, 2018 16:12

siggy approved these changes Jan 23, 2018

Phil Calcado added 13 commits January 24, 2018 14:51

Remove support for JSON on public API

e20871e

See #163 Signed-off-by: Phil Calcado <[email protected]>

Extract HTTP to protobuf conversion

6661847

Signed-off-by: Phil Calcado <[email protected]>

Change server to use new http to proto implementation

ea20893

Signed-off-by: Phil Calcado <[email protected]>

Add test using both client and server

5279b78

Signed-off-by: Phil Calcado <[email protected]>

Consolidate serialization in single place

f5afdb2

Signed-off-by: Phil Calcado <[email protected]>

Extract http stream writer logic and remove casts

bd7fe13

Signed-off-by: Phil Calcado <[email protected]>

Add test for Tap over HTTP

0c1a054

Signed-off-by: Phil Calcado <[email protected]>

Add more debug to public api and CLI

2273330

Signed-off-by: Phil Calcado <[email protected]>

Reorganize file

f11acd0

Signed-off-by: Phil Calcado <[email protected]>

Extract error detection from message reading

ee81df7

Signed-off-by: Phil Calcado <[email protected]>

Add proper gRPC error handling for tap

2434c91

Signed-off-by: Phil Calcado <[email protected]>

Fail test if cant start/stop server

d7937dd

Signed-off-by: Phil Calcado <[email protected]>

Make tests work on Travis

a4dd5af

Signed-off-by: Phil Calcado <[email protected]>

pcalcado force-pushed the phil/tap-leak branch from 70a1920 to 0941d6b Compare January 24, 2018 20:10

dadjeibaah reviewed Jan 24, 2018

pcalcado force-pushed the phil/tap-leak branch from 0951b34 to 5e6684f Compare January 24, 2018 21:50

Phil Calcado added 2 commits January 24, 2018 17:23

Act on feedback

1efbbaa

Signed-off-by: Phil Calcado <[email protected]>

Make protobuf http deserialization read the whole message

0d80c8f

Signed-off-by: Phil Calcado <[email protected]>

pcalcado force-pushed the phil/tap-leak branch from 5e6684f to 74d85a2 Compare January 24, 2018 22:23

klingerf approved these changes Jan 24, 2018

Use io.ReadFull instead of handwritten code to read http body

0d634ed

Signed-off-by: Phil Calcado <[email protected]>

pcalcado force-pushed the phil/tap-leak branch from 74d85a2 to 0d634ed Compare January 25, 2018 16:31

pcalcado merged commit 9410da4 into master Jan 25, 2018

pcalcado removed the review/ready Issue has a reviewable PR label Jan 25, 2018

pcalcado deleted the phil/tap-leak branch January 26, 2018 18:30

klingerf mentioned this pull request Jan 29, 2018

Update cli subcommands to print errors when encountered #221

Merged

hawkw mentioned this pull request Jan 24, 2019

proxy: update pinned version to 5b507a9 #2147

Merged


4 participants
