Use getrandom(2), if available, to implement Interop.GetRandomBytes() #37906

lpereira · 2019-05-23T20:36:31Z

(These changes are based on similar changes I made in coreclr.)

The getrandom(2) system call, introduced in Linux 3.17, allows one to obtain numbers from the same pool exported as /dev/urandom (if called with the GRND_NONBLOCK flag). The main advantage is that there's no need to keep a file descriptor open, but this also means that a good source of entropy is always available, even if /dev isn't present (such as in a tight chroot environment).

Refactor how the method to obtain random numbers by using an ifunc, which will continue to prefer arc4random_buf(), if available, and pick between getrandom(2) or reading from /dev/urandom. Falling back to lrand48(3) happens if either of these last two methods fail for whatever reason.

It's important to note that the entropy quality for /dev/urandom and whatever getrandom(2) returns is already pretty good, so it might not be a good idea to XOR it with values from lrand48(3). Since we're not dealing with a CSPRNG here, though, it's good to leave that as the fallback mechanism.

vcsjones · 2019-05-23T21:07:40Z

src/Native/Unix/System.Native/pal_random.c

        {
-            int32_t offset = 0;
-            do
+            if (errno == EINTR)


According to the docs, getrandom will return -1 and errno will be EAGAIN when GRND_NONBLOCK is used and the entropy pool is not available. Should this include EAGAIN to try again?

I suspect that is probably not likely as I would think that would only happen only very early in the boot process, but, perhaps worth considering.

Yes, it's often the case during early boot scenarios, or on embedded systems. However, calling getrandom() in a tight loop if it keeps returning EAGAIN might turn into an infinite loop in such cases; I'm not sure it's advisable.

I have at least one VM that can take minutes to fully initialize.

Ubuntu 16.04.6

$ dmesg | grep random [ 1.844973] random: udevadm: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.846829] random: udevadm: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.849528] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.850996] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.852320] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.864731] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.866058] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.867459] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.870948] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 1.872280] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available) [ 180.310815] random: nonblocking pool is initialized

Newer kernels trust the CPU, so it's less likely:

Ubuntu 19.04:

$ dmesg | grep random [ 0.017218] random: get_random_bytes called from start_kernel+0x97/0x516 with crng_init=0 [ 0.536646] random: crng done (trusting CPU's manufacturer)

Though not necessarily:

Ubuntu 18.04:

$ dmesg | grep random [ 0.000000] random: get_random_bytes called from start_kernel+0x99/0x4fd with crng_init=0 [ 1.452697] random: fast init done [ 1.454713] random: systemd-udevd: uninitialized urandom read (16 bytes read) [ 1.462508] random: systemd-udevd: uninitialized urandom read (16 bytes read) [ 1.466497] random: systemd-udevd: uninitialized urandom read (16 bytes read) [ 10.935718] random: crng init done [ 10.935721] random: 7 urandom warning(s) missed due to ratelimiting

So the EAGAIN either needs to fail over (e.g. to "dirty reads" from /dev/urandom), or the read should just be a blocking read. But it's definitely an important case to handle.

(Possibly it's already being handled OK, just that the assert needs to be removed)

The assert was just a copy from the read(/dev/urandom) codepath. Maybe a good idea to remove it from there.

BTW, not sure if you're able to set this up in your VM, but there is a virtio-rng device that some hypervisors implement (qemu-kvm being one of them) that provide entropy from the host to the guest; the Linux kernel is able to use it if detected.

@bartonjs Is the "no entropy available" message you're receiving on a fresh VM, or on a VM that has been rebooted? Per http://man7.org/linux/man-pages/man4/random.4.html:

If a seed file is saved across reboots as recommended below (all
major Linux distributions have done this since 2000 at least), the
output is cryptographically secure against attackers without local
root access as soon as it is reloaded in the boot sequence, and
perfectly adequate for network encryption session keys.

IMO blocking is the preferred behavior, especially if it happens only very rarely. If somebody's calling our APIs expecting high-entropy random values then we need to give them high-entropy random values. We can't pull a sleight of hand and give them low-entropy values.

IMO blocking is the preferred behavior

This caused significant problems when we were initially bringing up .NET on Linux. We initially had a blocking implementation, and start-up of the process could end up getting blocked for seconds, until the user wildly moved their mouse around to introduce entropy into the system.

Maybe what we're talking about here is a special case somehow, or maybe things have been improved on all distros/versions we now support, but I don't believe blocking waiting for sufficient entropy is viable.

FWIW, the Python crew considered this same problem a few years ago. They took a breaking change between Python 3.5.2 and Python 3.6 in order to make the os.urandom() function block on low-entropy scenarios because of the security risks inherent in the original behavior. See https://www.python.org/dev/peps/pep-0524/ for more information.

os.urandom returns bytes suitable for cryptographic use (https://docs.python.org/2/library/os.html#os.urandom).

SystemNative_GetNonCryptographicallySecureRandomBytes is not meant to be suitable for cryptographic use as the name suggests.

If this API isn't intended to produce strong randomness, we should reconsider having Guid.NewGuid depend on it. We know from experience that for better or worse, devs use Guid.NewGuid to produce random identifiers and expect those identifiers to have strong randomness.

stephentoub · 2019-05-24T14:11:38Z

cc: @janvorli, @bartonjs

danmoseley · 2019-05-24T15:20:50Z

@tmds

lpereira · 2019-05-24T21:45:09Z

(Some of the build break is because this uses the ifunc attribute that's not supported by some of the older compilers used in CI. I'll address this with a different strategy, but I still want input on the approach.)

src/Native/Unix/System.Native/pal_random.c

src/Native/Unix/configure.cmake

lpereira · 2019-05-29T21:58:55Z

Thanks everybody for the review -- I force-pushed a new version addressing most of the comments.

src/Native/Unix/System.Native/pal_random.c

lpereira · 2019-07-15T20:55:38Z

Force-pushed a new version with the requested changes:

No static constructors, but atomic CAS is back
No changes to the build system to determine availability of getrandom() -- rely on SYS_getrandom to be defined on Linux
Buffer is always XOR-ed with result from lrand48()

maryamariyan · 2019-08-01T22:13:39Z

/azp run corefx-ci

azure-pipelines · 2019-08-01T22:13:50Z

Azure Pipelines successfully started running 1 pipeline(s).

maryamariyan · 2019-08-01T22:24:28Z

There was failure on Windows x86_Release, so I am retriggering the CI.

@lpereira thank you for the PR.
Some questions for you:

There seems to be a few unresolved comments remaining on this PR. Are you going to be addressing them this PR?

We are planning to reduce the number of open PRs in corefx and only keep the active ones open.
If you think this will not be merged soon, then it is best if we comment on what are the actionable items left (needs review, blocked on another issue/PR, etc.) on each and then perhaps close and reopen when it is ready.

Thanks again and please let me know how you see the timeline and actionable items for the open PRs.

lpereira · 2019-08-01T22:56:31Z

@maryamariyan Comments were resolved already, just not marked as resolved here on GitHub. I've marked them as such here.

I don't think there are any impediments to having this merged (with the exception of the Windows x86_Release, which we'll know soon enough if there's something else to be done).

bartonjs

The race condition between srand48 and potential parallel calls to lrand48 needs to be addressed (potentially including explaining why it isn't a problem)

safern · 2019-09-30T21:11:06Z

@lpereira what's the current state of this PR, do you have any plans on addressing feedback soon?

lpereira · 2019-10-01T23:38:23Z

@lpereira what's the current state of this PR, do you have any plans on addressing feedback soon?

Feedback will be addressed shortly. Notifications for this item ended up falling through the cracks of my email client.

The getrandom(2) system call, introduced in Linux 3.17, allows one to obtain numbers from the same pool exported as /dev/urandom (if called with the GRND_NONBLOCK flag). The main advantage is that there's no need to keep a file descriptor open, but this also means that a good source of entropy is always available, even if /dev isn't present (such as in a tight chroot environment). Refactor how the method to obtain random numbers by using an ifunc, which will continue to prefer arc4random_buf(), if available, or pick between getrandom(2) or reading from /dev/urandom. The behavior of XOR-ing the results with lrand48(3) is still there as a workaround broken hardware RNGs. This PRNG is still initialized with time(NULL), which doesn't have a lot of entropy, and #38034 has been filed to address that in the future.

lpereira · 2019-10-01T23:57:57Z

Rebased on top of current master and pushed addressing comments.

maryamariyan · 2019-10-08T23:15:24Z

@bartonjs @joperezr this PR seems to be ready to review again.

bartonjs · 2019-10-09T00:19:21Z

src/Native/Unix/System.Native/pal_random.c

-                    // Another thread has already set the rand_des
-                    close(fd);
-                }
+                sched_yield();


Is it possible that a normal priority thread won the compare exchange, but a high priority thread gets stuck in here? Will that just deadlock, or will the high priority thread eventually yield anyways? (the description of sched_yield suggests that a single high priority thread just ignores the yield call).

bartonjs · 2019-10-09T00:20:13Z

The rebase (well, the force-push) made it difficult to review what was different, but it looks like the srand race condition has been alleviated (modulo one thread priority question). Any other feedback, @jkotas?

jkotas · 2019-10-09T00:37:58Z

Any other feedback, @jkotas?

This: #37906 (comment)

stephentoub · 2019-10-09T14:30:38Z

src/Native/Unix/System.Native/pal_random.c

-        {
-            int fd;
+        return fd;
+    }


There are no issues with the thread-safety around the reads of this field? e.g. is there any situation where the compiler/hardware might introduce a read such that it effectively became int a = fd; if (fd >= 0) return a;? (If that could happen we could end up returning -1.)

src/Native/Unix/System.Native/pal_random.c

stephentoub · 2019-10-09T14:40:27Z

Has any perf testing been done on this? What is the impact, positive or negative?

stephentoub · 2019-10-22T19:16:58Z

@lpereira, are you planning to continue working on this?

lpereira · 2019-10-22T21:01:50Z

@lpereira, are you planning to continue working on this?

I am. I'll address these comments this week; been busy with other things.

maryamariyan · 2019-11-06T21:10:22Z

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

In your corefx repository clone, create patch by running git format-patch origin
In your runtime repository clone, apply the patch by running git apply --directory src/libraries <path to the patch created in step 1>

joperezr · 2019-11-12T23:03:12Z

Hey @lpereira still planning on finishing this? The cutoff for the migration is tomorrow.

maryamariyan · 2019-12-02T19:43:23Z

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

vcsjones reviewed May 23, 2019

View reviewed changes

lpereira force-pushed the use-getrandom branch from 4c49f42 to 9e02ab5 Compare May 24, 2019 16:19

am11 reviewed May 26, 2019

View reviewed changes

src/Native/Unix/System.Native/pal_random.c Outdated Show resolved Hide resolved

lpereira force-pushed the use-getrandom branch 2 times, most recently from 6ba47dc to c7d1517 Compare May 28, 2019 19:49