Skip to content

Conversation

@thaJeztah
Copy link
Member

Trying to get these the same as containerd, but looks like we need to bump protobuf to 1.1.0, to make it build;

# github.com/docker/swarmkit/vendor/github.com/google/certificate-transparency-go/client/configpb
vendor/github.com/google/certificate-transparency-go/client/configpb/multilog.pb.go:53:39: undefined: proto.InternalMessageInfo
vendor/github.com/google/certificate-transparency-go/client/configpb/multilog.pb.go:103:36: undefined: proto.InternalMessageInfo

It builds now, but CI probably fails

@thaJeztah
Copy link
Member Author

WARNING: DATA RACE
Write at 0x00c4202c3860 by goroutine 73:
  github.com/docker/swarmkit/ca.NewMutableTLS()
      github.com/docker/swarmkit/ca/_test/_obj_test/transport.go:205 +0x33e
  github.com/docker/swarmkit/ca.(*MutableTLSCreds).Clone()
      github.com/docker/swarmkit/ca/_test/_obj_test/transport.go:52 +0x7a
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.DialContext()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/clientconn.go:529 +0xe9c
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.Dial()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/clientconn.go:393 +0xbc
  github.com/docker/swarmkit/connectionbroker.(*Broker).SelectRemote()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/connectionbroker/broker.go:75 +0x278
  github.com/docker/swarmkit/ca.getGRPCConnection()
      github.com/docker/swarmkit/ca/_test/_obj_test/certificates.go:738 +0x17a
  github.com/docker/swarmkit/ca.GetRemoteSignedCertificate()
      github.com/docker/swarmkit/ca/_test/_obj_test/certificates.go:833 +0x137
  github.com/docker/swarmkit/ca.(*RootCA).RequestAndSaveNewCertificates()
      github.com/docker/swarmkit/ca/_test/_obj_test/certificates.go:182 +0x213
  github.com/docker/swarmkit/ca.RenewTLSConfigNow()
      github.com/docker/swarmkit/ca/_test/_obj_test/config.go:602 +0x4d5
  github.com/docker/swarmkit/ca.(*TLSRenewer).Start.func1()
      github.com/docker/swarmkit/ca/_test/_obj_test/renewer.go:147 +0xe7f

Previous read at 0x00c4202c3860 by goroutine 38:
  crypto/tls.(*Conn).clientHandshake()
      /usr/local/go/src/crypto/tls/handshake_client.go:47 +0x111
  crypto/tls.(*Conn).Handshake()
      /usr/local/go/src/crypto/tls/conn.go:1307 +0x2f0
  github.com/docker/swarmkit/ca.(*MutableTLSCreds).ClientHandshake.func1()
      github.com/docker/swarmkit/ca/_test/_obj_test/transport.go:103 +0x50

Goroutine 73 (running) created at:
  github.com/docker/swarmkit/ca.(*TLSRenewer).Start()
      github.com/docker/swarmkit/ca/_test/_obj_test/renewer.go:63 +0xd2
  github.com/docker/swarmkit/ca_test.TestRenewTLSConfigWithNoNode()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/ca/config_test.go:947 +0x412
  testing.tRunner()
      /usr/local/go/src/testing/testing.go:746 +0x16c

Goroutine 38 (finished) created at:
  github.com/docker/swarmkit/ca.(*MutableTLSCreds).ClientHandshake()
      github.com/docker/swarmkit/ca/_test/_obj_test/transport.go:101 +0x22c
  github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport.newHTTP2Client()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport/http2_client.go:182 +0x2901
  github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport.NewClientTransport()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/transport/transport.go:531 +0xd7
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*addrConn).createTransport()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/clientconn.go:1147 +0x3f9
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*addrConn).resetTransport()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/clientconn.go:1108 +0x7b2
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*addrConn).transportMonitor()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/clientconn.go:1281 +0x619
  github.com/docker/swarmkit/vendor/google.golang.org/grpc.(*addrConn).connect.func1()
      /home/ubuntu/.go_workspace/src/github.com/docker/swarmkit/vendor/google.golang.org/grpc/clientconn.go:845 +0x268

--- FAIL: TestRenewTLSConfigWithNoNode (0.11s)
	testing.go:699: race detected during execution of test

...

--- FAIL: TestExternalCASignRequestTimesOut (1.00s)
	testing.go:699: race detected during execution of test

...

--- FAIL: TestSecurityConfigUpdateRootCA (10.06s)
	Error Trace:	config_test.go:439
	Error:		Object expected to be of type x509.UnknownAuthorityError, but was context.deadlineExceededError

@dperny
Copy link
Collaborator

dperny commented May 18, 2018

Race detector appears to be triggered because GRPC previously did not call the Clone method on MutableTLSCreds in our configuration, but now does, and the Clone method is evidently not concurrency safe in this context.

@thaJeztah
Copy link
Member Author

ping @cyli ^^ feels like your area 🤗

@dperny
Copy link
Collaborator

dperny commented May 18, 2018

The other failures are the same damned bug as #2452, which is that the returned error from GRPC changed again.

@dperny
Copy link
Collaborator

dperny commented May 18, 2018

Yeah, the problem is once again that we try to fail on an error that otherwise is transient, and GRPC no longer as of that version exposes the error we need. Is there anyone in Docker with more experience with GRPC that might know the approach to fix this?

@thaJeztah
Copy link
Member Author

The other failures are the same damned bug as #2452, which is that the returned error from GRPC changed again.

Ugh.. I saw the failures and actually thought it could be (but it was late and didn't look into that)

@dperny
Copy link
Collaborator

dperny commented May 18, 2018

As far as I can tell, there is no longer any way to get the root error out, which means we can't differentiate cert errors from other errors.

@thaJeztah
Copy link
Member Author

Boo! We should open a ticket for that upstream (breaking change.. again); do you think we should open a PR upstream? While looking at the code, I think I saw that other errors have a .Code() in the type that allowed getting an error-code (which probably is a lot better to detect what kind of error was produced)?

@thaJeztah
Copy link
Member Author

@dperny could this be the cause? grpc/grpc-go@7aea499#diff-3081f1aaaf958d7d5e00ccec7a62ea6f

Not sure if we hit that part, but it's changing codes.Internal to codes.Unauthenticated. If that's indeed the code we hit, then this part would ignore the error;

https://github.com/docker/swarmkit/blob/8a2b6fd64944bcef8154ced28f90aeec6abfeb04/node/node.go#L1315-L1318

@codecov
Copy link

codecov bot commented May 18, 2018

Codecov Report

Merging #2631 into master will decrease coverage by 8.95%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master    #2631      +/-   ##
==========================================
- Coverage   67.21%   58.26%   -8.96%     
==========================================
  Files         104       10      -94     
  Lines       15450     1253   -14197     
==========================================
- Hits        10384      730    -9654     
+ Misses       3998      453    -3545     
+ Partials     1068       70     -998

@thaJeztah
Copy link
Member Author

thaJeztah commented May 18, 2018

YOLO - pushed a commit to see if that was indeed the problem 😅

edit: nope, that didn't work; removed again

@dperny
Copy link
Collaborator

dperny commented May 18, 2018

Negative, the error that's bubbling up instead of the x509 error is this:

rpc error: code = Unavailable desc = all SubConns are in TransientFailure

my understanding of what changed in GRPC is that it now integrally uses a Balancer, which hides the actual connection to the server behind a SubConn. The idea, I think, might be to have multiple open connections to various servers. I'm not sure, exactly.

The end result, though, is that calling grpc tries to get one of these SubConns to use. But, because the x509 error is considered "Transient", we get back an error that all of the SubConns are in a transient failure state (meaning we could expect them to work again at some point, except the whole point of the broken code is to catch a case where we know the system won't work again at any point).

@dmcgowan
Copy link
Member

@dperny do you see a path forward on this or need help looking into it?

@cyli
Copy link
Contributor

cyli commented May 22, 2018

Looking into the Clone issue. From https://groups.google.com/forum/#!topic/grpc-io/yCUwuHycNWk we may have to write our own custom dialer if we want to surface the X509 errors? :|

@cyli
Copy link
Contributor

cyli commented May 24, 2018

Update: So for some of the tests, we can just use our own GRPC dialer. I'm less happy about doing this in our actual code however, because we'd have to write a dialer that did the TLS handshake ourselves, and then pass the error back via some other channel.

If we don't use our own dialer for the code, then:

  1. For some of our integration tests, we just will not be able assert anything about test what kind of error was returned, and we'll have to wait for a timeout instead of waiting for the test to fail immediately because GRPC will just retry on transient failure.

  2. Actual behavior if a node cert expired, or there is some other error with the certificate - rather than the node failing on startup, it will keep trying to reconnect, and the user will have to look at the logs to diagnose what's wrong. We'd specifically be regressing on Joining a cluster with wrong certificate hides error moby#33380.

There is apparently a workaround in 1.11 (see grpc/grpc-go#1855 and grpc/grpc-go#1917) where if you make an RPC call instead of dialing with WithBlock(), you should generally get the correct error back. In our non-test code, we do not ever dial with WithBlock() so that seems fine, if we can upgrade to 1.11.

However, in 1.11, transport.StreamFromContext has been removed, which we depend on in our raftproxy (although it's mainly for nicer error messages) and for forcibly disconnecting a node in the dispatcher (but we can probably do that some other way).

Can we upgrade swarmkit to 1.11+ instead, or do we have to be on 1.10? If we want to stay on 1.10, then maybe we just live with the slight change in behavior, and I'll just skip checking those errors on tests I guess.

If we want to upgrade to 1.11+ (which we'd eventually need to do anyway) then I can work on trying to get around our use of transport.StreamFromContext and try to make sure our tests pass with their workaround.

@thaJeztah
Copy link
Member Author

Yes, in the long run, we will be updating to newer versions (and it's probably good to update more frequent so that we catch breaking changes early)

@dmcgowan - are there plans to update containerd to a more recent version, and if so, would that be a change that's acceptable for an 1.1.x patch release of containerd?

@dmcgowan
Copy link
Member

are there plans to update containerd to a more recent version, and if so, would that be a change that's acceptable for an 1.1.x patch release of containerd?

We would like to keep containerd as up to date with GRPC as possible. We don't need to backport to a release since our primary concern here is the client end, which does not need to be vendored to a release.

@cyli cyli mentioned this pull request May 25, 2018
thaJeztah added 4 commits June 1, 2018 17:23
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
Signed-off-by: Sebastiaan van Stijn <[email protected]>
@thaJeztah
Copy link
Member Author

superseded by #2649

@thaJeztah thaJeztah closed this Jun 4, 2018
@thaJeztah thaJeztah deleted the update-grpc-etc branch June 4, 2018 20:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants