Skip to content
This repository was archived by the owner on Jan 23, 2023. It is now read-only.

Conversation

@lpereira
Copy link

(These changes are based on similar changes I made in coreclr.)

The getrandom(2) system call, introduced in Linux 3.17, allows one to obtain numbers from the same pool exported as /dev/urandom (if called with the GRND_NONBLOCK flag). The main advantage is that there's no need to keep a file descriptor open, but this also means that a good source of entropy is always available, even if /dev isn't present (such as in a tight chroot environment).

Refactor how the method to obtain random numbers by using an ifunc, which will continue to prefer arc4random_buf(), if available, and pick between getrandom(2) or reading from /dev/urandom. Falling back to lrand48(3) happens if either of these last two methods fail for whatever reason.

It's important to note that the entropy quality for /dev/urandom and whatever getrandom(2) returns is already pretty good, so it might not be a good idea to XOR it with values from lrand48(3). Since we're not dealing with a CSPRNG here, though, it's good to leave that as the fallback mechanism.

{
int32_t offset = 0;
do
if (errno == EINTR)
Copy link
Member

@vcsjones vcsjones May 23, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the docs, getrandom will return -1 and errno will be EAGAIN when GRND_NONBLOCK is used and the entropy pool is not available. Should this include EAGAIN to try again?

I suspect that is probably not likely as I would think that would only happen only very early in the boot process, but, perhaps worth considering.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's often the case during early boot scenarios, or on embedded systems. However, calling getrandom() in a tight loop if it keeps returning EAGAIN might turn into an infinite loop in such cases; I'm not sure it's advisable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have at least one VM that can take minutes to fully initialize.

Ubuntu 16.04.6

$ dmesg | grep random
[    1.844973] random: udevadm: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.846829] random: udevadm: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.849528] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.850996] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.852320] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.864731] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.866058] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.867459] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.870948] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[    1.872280] random: systemd-udevd: uninitialized urandom read (16 bytes read, 2 bits of entropy available)
[  180.310815] random: nonblocking pool is initialized

Newer kernels trust the CPU, so it's less likely:

Ubuntu 19.04:

$ dmesg | grep random
[    0.017218] random: get_random_bytes called from start_kernel+0x97/0x516 with crng_init=0
[    0.536646] random: crng done (trusting CPU's manufacturer)

Though not necessarily:

Ubuntu 18.04:

$ dmesg | grep random
[    0.000000] random: get_random_bytes called from start_kernel+0x99/0x4fd with crng_init=0
[    1.452697] random: fast init done
[    1.454713] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.462508] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.466497] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[   10.935718] random: crng init done
[   10.935721] random: 7 urandom warning(s) missed due to ratelimiting

So the EAGAIN either needs to fail over (e.g. to "dirty reads" from /dev/urandom), or the read should just be a blocking read. But it's definitely an important case to handle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(Possibly it's already being handled OK, just that the assert needs to be removed)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The assert was just a copy from the read(/dev/urandom) codepath. Maybe a good idea to remove it from there.

BTW, not sure if you're able to set this up in your VM, but there is a virtio-rng device that some hypervisors implement (qemu-kvm being one of them) that provide entropy from the host to the guest; the Linux kernel is able to use it if detected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bartonjs Is the "no entropy available" message you're receiving on a fresh VM, or on a VM that has been rebooted? Per http://man7.org/linux/man-pages/man4/random.4.html:

If a seed file is saved across reboots as recommended below (all
major Linux distributions have done this since 2000 at least), the
output is cryptographically secure against attackers without local
root access as soon as it is reloaded in the boot sequence, and
perfectly adequate for network encryption session keys.

IMO blocking is the preferred behavior, especially if it happens only very rarely. If somebody's calling our APIs expecting high-entropy random values then we need to give them high-entropy random values. We can't pull a sleight of hand and give them low-entropy values.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO blocking is the preferred behavior

This caused significant problems when we were initially bringing up .NET on Linux. We initially had a blocking implementation, and start-up of the process could end up getting blocked for seconds, until the user wildly moved their mouse around to introduce entropy into the system.

Maybe what we're talking about here is a special case somehow, or maybe things have been improved on all distros/versions we now support, but I don't believe blocking waiting for sufficient entropy is viable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the Python crew considered this same problem a few years ago. They took a breaking change between Python 3.5.2 and Python 3.6 in order to make the os.urandom() function block on low-entropy scenarios because of the security risks inherent in the original behavior. See https://www.python.org/dev/peps/pep-0524/ for more information.

Copy link
Member

@jkotas jkotas Oct 29, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

os.urandom returns bytes suitable for cryptographic use (https://docs.python.org/2/library/os.html#os.urandom).

SystemNative_GetNonCryptographicallySecureRandomBytes is not meant to be suitable for cryptographic use as the name suggests.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this API isn't intended to produce strong randomness, we should reconsider having Guid.NewGuid depend on it. We know from experience that for better or worse, devs use Guid.NewGuid to produce random identifiers and expect those identifiers to have strong randomness.

@stephentoub
Copy link
Member

cc: @janvorli, @bartonjs

@danmoseley
Copy link
Member

@tmds

@lpereira
Copy link
Author

(Some of the build break is because this uses the ifunc attribute that's not supported by some of the older compilers used in CI. I'll address this with a different strategy, but I still want input on the approach.)

@lpereira lpereira force-pushed the use-getrandom branch 2 times, most recently from 6ba47dc to c7d1517 Compare May 28, 2019 19:49
@lpereira
Copy link
Author

Thanks everybody for the review -- I force-pushed a new version addressing most of the comments.

@lpereira
Copy link
Author

Force-pushed a new version with the requested changes:

  • No static constructors, but atomic CAS is back
  • No changes to the build system to determine availability of getrandom() -- rely on SYS_getrandom to be defined on Linux
  • Buffer is always XOR-ed with result from lrand48()

@lpereira lpereira added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Jul 16, 2019
@lpereira lpereira force-pushed the use-getrandom branch 2 times, most recently from b73f370 to 4d2df52 Compare July 17, 2019 01:03
@stephentoub stephentoub removed * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) labels Jul 17, 2019
@maryamariyan
Copy link

/azp run corefx-ci

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@maryamariyan
Copy link

maryamariyan commented Aug 1, 2019

There was failure on Windows x86_Release, so I am retriggering the CI.

@lpereira thank you for the PR.
Some questions for you:

  • There seems to be a few unresolved comments remaining on this PR. Are you going to be addressing them this PR?

We are planning to reduce the number of open PRs in corefx and only keep the active ones open.
If you think this will not be merged soon, then it is best if we comment on what are the actionable items left (needs review, blocked on another issue/PR, etc.) on each and then perhaps close and reopen when it is ready.

Thanks again and please let me know how you see the timeline and actionable items for the open PRs.

@lpereira
Copy link
Author

lpereira commented Aug 1, 2019

@maryamariyan Comments were resolved already, just not marked as resolved here on GitHub. I've marked them as such here.

I don't think there are any impediments to having this merged (with the exception of the Windows x86_Release, which we'll know soon enough if there's something else to be done).

Copy link
Member

@bartonjs bartonjs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The race condition between srand48 and potential parallel calls to lrand48 needs to be addressed (potentially including explaining why it isn't a problem)

@safern
Copy link
Member

safern commented Sep 30, 2019

@lpereira what's the current state of this PR, do you have any plans on addressing feedback soon?

@lpereira
Copy link
Author

lpereira commented Oct 1, 2019

@lpereira what's the current state of this PR, do you have any plans on addressing feedback soon?

Feedback will be addressed shortly. Notifications for this item ended up falling through the cracks of my email client.

The getrandom(2) system call, introduced in Linux 3.17, allows one to
obtain numbers from the same pool exported as /dev/urandom (if called
with the GRND_NONBLOCK flag).  The main advantage is that there's no
need to keep a file descriptor open, but this also means that a good
source of entropy is always available, even if /dev isn't present (such
as in a tight chroot environment).

Refactor how the method to obtain random numbers by using an ifunc,
which will continue to prefer arc4random_buf(), if available, or pick
between getrandom(2) or reading from /dev/urandom.

The behavior of XOR-ing the results with lrand48(3) is still there as a
workaround broken hardware RNGs.  This PRNG is still initialized with
time(NULL), which doesn't have a lot of entropy, and #38034 has been
filed to address that in the future.
@lpereira
Copy link
Author

lpereira commented Oct 1, 2019

Rebased on top of current master and pushed addressing comments.

@maryamariyan
Copy link

@bartonjs @joperezr this PR seems to be ready to review again.

// Another thread has already set the rand_des
close(fd);
}
sched_yield();
Copy link
Member

@bartonjs bartonjs Oct 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that a normal priority thread won the compare exchange, but a high priority thread gets stuck in here? Will that just deadlock, or will the high priority thread eventually yield anyways? (the description of sched_yield suggests that a single high priority thread just ignores the yield call).

@bartonjs
Copy link
Member

bartonjs commented Oct 9, 2019

The rebase (well, the force-push) made it difficult to review what was different, but it looks like the srand race condition has been alleviated (modulo one thread priority question). Any other feedback, @jkotas?

@jkotas
Copy link
Member

jkotas commented Oct 9, 2019

Any other feedback, @jkotas?

This: #37906 (comment)

{
int fd;
return fd;
}
Copy link
Member

@stephentoub stephentoub Oct 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are no issues with the thread-safety around the reads of this field? e.g. is there any situation where the compiler/hardware might introduce a read such that it effectively became int a = fd; if (fd >= 0) return a;? (If that could happen we could end up returning -1.)

@stephentoub
Copy link
Member

Has any perf testing been done on this? What is the impact, positive or negative?

@stephentoub
Copy link
Member

@lpereira, are you planning to continue working on this?

@lpereira
Copy link
Author

@lpereira, are you planning to continue working on this?

I am. I'll address these comments this week; been busy with other things.

@maryamariyan
Copy link

maryamariyan commented Nov 6, 2019

Thank you for your contribution. As announced in dotnet/coreclr#27549 this repository will be moving to dotnet/runtime on November 13. If you would like to continue working on this PR after this date, the easiest way to move the change to dotnet/runtime is:

  1. In your corefx repository clone, create patch by running git format-patch origin
  2. In your runtime repository clone, apply the patch by running git apply --directory src/libraries <path to the patch created in step 1>

@joperezr
Copy link
Member

Hey @lpereira still planning on finishing this? The cutoff for the migration is tomorrow.

@lpereira lpereira added the post-consolidation PRs which will be hand ported to dotnet/runtime label Nov 13, 2019
@ViktorHofer ViktorHofer added the * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) label Nov 14, 2019
@maryamariyan
Copy link

Thank you for your contribution. As announced in #27549 the dotnet/runtime repository will be used going forward for changes to this code base. Closing this PR as no more changes will be accepted into master for this repository. If you’d like to continue working on this change please move it to dotnet/runtime.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

area-System.Runtime * NO MERGE * The PR is not ready for merge yet (see discussion for detailed reasons) post-consolidation PRs which will be hand ported to dotnet/runtime

Projects

None yet

Development

Successfully merging this pull request may close these issues.