Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

containerd hangs on start in low-entropy situations #2451

Closed
hairyhenderson opened this issue Jul 10, 2018 · 13 comments
Closed

containerd hangs on start in low-entropy situations #2451

hairyhenderson opened this issue Jul 10, 2018 · 13 comments

Comments

@hairyhenderson
Copy link
Contributor

Description

Based on discussion in linuxkit/linuxkit#3032 and linuxkit/linuxkit#3096, it seems that a call to crypto/rand's Read function will block when running on Linux kernel versions 4.14.36 and up, due to a fix for CVE-2018-1108.

As I understand it, prior to the fix, /dev/random was usable before enough entropy was gathered to consider it as a crytographically-safe source. The fix effectively causes reads to /dev/random to block until enough entropy was gathered to fully initialize the CRNG. And Go's crypto/rand Read function uses the Linux getrandom(2) syscall (without setting GRND_NONBLOCK), which reads from /dev/random and therefore now blocks.

Anyway, all this boils down to this line:

rand.Read(b[:])

which will now block when not enough entropy is available. In my case it blocks for ~2mins, but it varies.

Based on the PR comment when that code was originally introduced, I believe that using math/rand instead would be "good enough" - especially:

Since there is already a nanosecond granular time component, a failure to get a unique value here is not a huge concern.

(from @dmcgowan)

Steps to reproduce the issue:

  1. start containerd on a system with a recent-enough kernel, and not enough entropy and without a RNG, for example with LinuxKit - the https://github.com/linuxkit/linuxkit/blob/master/examples/getty.yml example displays this every time for me on my PCEngines APU2C4, which has no HWRNG, though QEMU without a virtio-rng-pci device may work too
  2. watch containerd hang for seconds-to-minutes until CRNG is initialized

Describe the results you received:

containerd blocks for ~2mins on start

Describe the results you expected:

containerd starting ~2mins faster 😉

Output of containerd --version:

containerd github.com/containerd/containerd v1.1.0-rc.2 f630d5f0a639d7d73a806f19f1a6157e768756a5
@hairyhenderson
Copy link
Contributor Author

It looks like this patch allows containerd to start "enough" to not block other services in LinuxKit from starting:

diff --git a/services/leases/local.go b/services/leases/local.go
index d3e0c2f2c5..739e58ea9e 100644
--- a/services/leases/local.go
+++ b/services/leases/local.go
@@ -17,9 +17,9 @@
 package leases
 
 import (
-	"crypto/rand"
 	"encoding/base64"
 	"fmt"
+	"math/rand"
 	"time"
 
 	"google.golang.org/grpc"

but containerd itself still hangs, presumably due to the other uses of crypto/rand - I'll continue hacking away at this...

@dmcgowan
Copy link
Member

The hanging was not expected, the lease generator is intended to be designed that in low entropy situations, the ID generated is just less random rather than blocking or erroring. Looks like we need to re-address this assumption that rand.Read will error and not block when encountering low entropy.

@dmcgowan
Copy link
Member

@hairyhenderson
Copy link
Contributor Author

The hanging was not expected, the lease generator is intended to be designed that in low entropy situations, the ID generated is just less random rather than blocking or erroring.

Yeah, I think a lot of userspace code that makes the non-blocking assumption is going to start hanging in very hard-to-diagnose ways on early-boot when the newer kernel versions become more prevalent 😞.

Would you accept a PR to simply change crypto/rand to math/rand?

The Read method in both packages is effectively the same, just that math/rand will never block or error, but also isn't guaranteed to return cryptographically-unpredictable random data (which I don't see as a requirement for this).

@dmcgowan
Copy link
Member

Would you accept a PR to simply change crypto/rand to math/rand?

Yes, I agree cryptographically random is not necessary. We also add a nanosecond component, we could update this in the future to use UUID logic.

@crosbymichael
Copy link
Member

We can also just implement a reader with unix.GetRandom(bytes, unix.GRND_NONBLOCK)

@hairyhenderson
Copy link
Contributor Author

@crosbymichael I agree that it'd be useful to have a GRND_NONBLOCK version available (it's already available internally at https://github.com/golang/go/blob/master/src/internal/syscall/unix/getrandom_linux.go#L26..L49).

One thing to watch for is that GRND_NONBLOCK simply changes the behaviour in low-entropy situations from blocking to erroring. So, in this case, containerd would need to handle the error and retry (thus effectively blocking start-up anyway), or fall back to some other behaviour such as using math/rand.

Probably simpler and less surprising to stick to math/rand in all cases, especially since this isn't a situation that calls for cryptographically random IDs.

@crosbymichael
Copy link
Member

@hairyhenderson ok, that makes sense. Thanks!

@hairyhenderson
Copy link
Contributor Author

#2454 got merged, which is great, and that'll solve the main blocking issue. But there appears to be something else blocking - in my LinuxKit setup I get a shell from getty now, but the containerd logs don't start appearing until crng_init is done.

So I'm going to keep this issue open for now - I want to dig a little deeper and see if there's another blocking getrandom call (maybe in a dependency)... 🕵️

@dmcgowan
Copy link
Member

We really need to get rid of the docker/docker imports
https://github.com/containerd/containerd/blob/master/vendor/github.com/docker/docker/pkg/stringid/stringid.go#L85

This does raise an important issue which was not part of the PR, random should be seeded, but we should seed it in the binary command and not in the imports.

@hairyhenderson
Copy link
Contributor Author

Yup, that'll do it. 😂

@estesp
Copy link
Member

estesp commented Aug 22, 2018

Is there anything else we still need to investigate here or are the 2 PRs already merged sufficient?

@hairyhenderson
Copy link
Contributor Author

👋 @estesp 😁 TBH I'm not sure. I seem to recall that the patched containerd still had some entropy-related hanging. But my memory's fuzzy on exactly what combination of things I tried.

That particular project has been a bit on hold while I've enjoyed some summer vacation, but I'll try to get back to this in the next few days to see if the latest containerd still stalls.

robertgzr added a commit to balena-os/balena-engine that referenced this issue Dec 7, 2018
robertgzr added a commit to balena-os/balena-engine that referenced this issue Dec 11, 2018
Update vendor.conf and vendor/ to include
balena-os/balena-containerd#2

Connects-to: #105
Connects-to: containerd/containerd#2451

Signed-off-by: Robert Günzler <[email protected]>
robertgzr added a commit to balena-os/balena-engine that referenced this issue Dec 18, 2018
Update vendor.conf and vendor/ to include
balena-os/balena-containerd#2

Connects-to: #105
Connects-to: containerd/containerd#2451
Signed-off-by: Robert Günzler <[email protected]>
robertgzr added a commit to balena-os/balena-engine that referenced this issue Jan 14, 2019
Update vendor.conf and vendor/ to include
balena-os/balena-containerd#2

Connects-to: #105
Connects-to: containerd/containerd#2451
Signed-off-by: Robert Günzler <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants