8334015: Add Support for UUID Version 7 (UUIDv7) defined in RFC 9562#25303
8334015: Add Support for UUID Version 7 (UUIDv7) defined in RFC 9562#25303kieran-farrell wants to merge 26 commits intoopenjdk:masterfrom
Conversation
|
👋 Welcome back kieran-farrell! A progress list of the required criteria for merging this PR into |
|
@kieran-farrell This change now passes all automated pre-integration checks. ℹ️ This project also has non-automated pre-integration requirements. Please see the file CONTRIBUTING.md for details. After integration, the commit message for the final commit will be: You can use pull request commands such as /summary, /contributor and /issue to adjust it as needed. At the time when this comment was updated there had been 1853 new commits pushed to the
As there are no conflicts, your changes will automatically be rebased on top of these commits when integrating. If you prefer to avoid this automatic rebasing, please check the documentation for the /integrate command for further details. As you do not have Committer status in this project an existing Committer must agree to sponsor your change. Possible candidates are the reviewers of this PR (@RogerRiggs, @jaikiran, @AlanBateman) but any other Committer may sponsor as well. ➡️ To flag this PR as ready for integration with the above commit message, type |
|
@kieran-farrell The following label will be automatically applied to this pull request:
When this pull request is ready to be reviewed, an "RFR" email will be sent to the corresponding mailing list. If you would like to change these labels, use the /label pull request command. |
Webrevs
|
| * remaining bytes with cryptographically strong random data. | ||
| * | ||
| * @return A {@code UUID} generated from the current system time | ||
| */ |
There was a problem hiding this comment.
Seems like there should be a reference to the RFC somewhere here using the @SPEC javadoc tag.
And also in the class javadoc.
There was a problem hiding this comment.
spec updated in class javadoc to RFC 9562 which obsolete RFC 4122, @SPEC tag also added above the method.
There was a problem hiding this comment.
I'd say "create" instead of "retrieve" in the first line comment. (Though that word is used in the other static factories).
The "sub-millisecond precision" can't be relied upon. Its the precision that gives the impression that it can be added to the millisecond value and get an exact time. But the nano-second value is read from a different clock and the bits being used from it may wrap-around in the time between the reads of the clocks.
The second description ("derived from") is makes fewer promises than the first-line comment.
An application should not use them for any purpose other than random. And for that the buffer already contains random bytes.
An alternative is to capture the MS time and the nano-time on first use to compute an offset and then use only the nano-time plus/minus the offset to create the version 7 UUIDs.
There was a problem hiding this comment.
The timestamp method may mislead an app developer thinking its the time from the version 7 timestamp.
The first-line comment might make it more obvious if it says:
The timestamp value associated with a version 1 UUID.
The Unix Epoch time would be easier to use if a method was added to return it.
long UnixEpochTimeNanos() (name subject to bike-shedding).
And throwing if it was not version 7.
There was a problem hiding this comment.
The first 48 bits need to be from the unix epoch time stamp in ms only to remain complaint. If I understand correctly, to be able to guarantee that the nsBits are actually time orderable, System.nanoTime() would have to be pinned to a point where System.currentTimeMillis() increments by 1ms (which doesnt seem possible). Otherwise any offset could be at any arbitrary point in a given millisecond and nsBits would wrap around from that.
I think the options are to either use the random only approach with ms precision or an guaranteed monotonic counter based approach and take the hit on performance.
I agree with updating to "create" from "retrieve" on the first line comment and also with the method name change to avoid confusion with UUIDv1, however to either unixEpochTimeNanos() or unixEpochTimeMillis() or otherwise depending on which implementation we settle on.
There was a problem hiding this comment.
We don't have a reliable sub-microsecond source across platforms, so can only be confident using random numbers for the low order bits. (and I don't expect a higher resolution monotonic clock to be available anytime soon).
Would the static factory method be more useful if it took the mstime as a parameter?
That would allow the caller to pick the time source. For example,
UUID ts = UUID.timesteampUUID(System.currentTimeMillis());
It would avoid locking into the System.currentTimeMillis() and allows the application to create UUIDs for whatever time meets its needs.
Adding a unixEpochTimeMillis() makes sense, re-assembling the time bits into the ms time.
There was a problem hiding this comment.
In terms of using another clock it could open up possiblities of passing non complaint time formats to the method, but maybe thats something that could be mostly prevented with input validation and clear documentation i.e. “timestamp must be Unix time in milliseconds.” As far as I know all other standard Java methods that gives the current time in milliseconds ultimately delegates to System.currentTimeMillis().
Maybe there would be a use case in generating UUIDs by passing compliant timestamps for entries created in the past, like updating a database to use UUIDv7?
There was a problem hiding this comment.
RFC 9562 does not indicate any relation between the timestamp value and the "current" time.
So yes, document it as the Unix Epoch, milliseconds since 1970.
There could be plenty of use cases for creating a UUID from some other source of time, like the timestamp of a file or a time field in a database.
There was a problem hiding this comment.
Latest update:
- changed to millis + randomByte allocation implementation
- updated method to take user provided timestamp
- added API which delegates to method with current timestamp if no user input provided
- validated user input (this is far from fool proof, but prevents negative time values and some other time formats)
- updated method name to unixEpochTimeMillis
- added a note with ref to rfc to make user aware of time format required for rfc compliance
- added test coverage
Updated benchmark:
Benchmark Mode Cnt Score Error Units
UUIDBenchmark.unixEpochTimeMillis avgt 25 138.076 ± 11.438 ns/op
| ng.nextBytes(randomBytes); | ||
|
|
||
| // Set the first 6 bytes to the ms time | ||
| for (int i = 0; i < 6; i++) { | ||
| randomBytes[i] = (byte) (msTime >>> (40 - 8 * i)); | ||
| } | ||
|
|
||
| // Scale sub-ms nanoseconds to a 12-bit value | ||
| int nsBits = (int) ((nsTime % 1_000_000) / 1_000_000.0 * 4096); | ||
|
|
||
| // Set version and increased percision time bits | ||
| randomBytes[6] = (byte) (0x70 | ((nsBits >>> 8) & 0x0F)); | ||
| randomBytes[7] = (byte) (nsBits & 0xFF); | ||
|
|
||
| // Set variant to 2 | ||
| randomBytes[8] &= 0x3F; | ||
| randomBytes[8] |= (byte) 0x80; | ||
|
|
||
| return new UUID(randomBytes); |
There was a problem hiding this comment.
This could remove the allocation by composing the high and low longs using shifts and binary operations and ng.next().
Can the sub-microsecond value just be truncated and avoid the expensive divide operation?
There was a problem hiding this comment.
Can the sub-microsecond value just be truncated and avoid the expensive divide operation?'
method 3 of secion 6.2 of https://www.rfc-editor.org/rfc/rfc9562.html#name-monotonicity-and-counters states
start with the portion of the timestamp expressed as a fraction of the clock's tick value (fraction of a millisecond for UUIDv7). Compute the count of possible values that can be represented in the available bit space, 4096 for the UUIDv7 rand_a field. Using floating point or scaled integer arithmetic, multiply this fraction of a millisecond value by 4096 and round down (toward zero) to an integer result to arrive at a number between 0 and the maximum allowed for the indicated bits, which sorts monotonically based on time. '
so i think we might have to keep the division? though i re-shuffled the equation to
int nsBits = (int) ((nsTime % 1_000_000) / 1_000_000.0 * 4096);
which gives scaled integer division rather than floating point and gave a very slight imporved perfomance to 143.758 ± 2.135 ns/op
There was a problem hiding this comment.
This could remove the allocation by composing the high and low longs using shifts and binary operations and ng.next().
do you mean to create the UUID using most and least significant bytes? if so, I've tried out some variations, i found creating the 64 bit lsb with ng.nextLong() brings a large pefomance decrease over using the nextBytes method, but the below implemntation keeping with the nextByte(byte[]) api brings a performance increase to 121.128 ± 30.486 ns/op, though the code might appear a little roundabout.
public static UUID timestampUUID() {
long msTime = System.currentTimeMillis();
long nsTime = System.nanoTime();
// Scale sub-ms nanoseconds to a 12-bit value
int nsBits = (int) ((nsTime % 1_000_000L) * 4096L / 1_000_000L);
// Compose the 64 most significant bits: [48-bit msTime | 4-bit version | 12-bit nsBits]
long mostSigBits =
((msTime & 0xFFFFFFFFFFFFL) << 16) |
(0x7L << 12) |
nsBits;
// Generate 8 random bytes for least significant bits
byte[] randomBytes = new byte[8];
SecureRandom ng = UUID.Holder.numberGenerator;
ng.nextBytes(randomBytes);
long leastSigBits = 0;
for (int i = 0; i < 8; i++) {
leastSigBits = (leastSigBits << 8) | (randomBytes[i] & 0xFF);
}
// Set variant (bits 62–63) to '10'
leastSigBits &= 0x3FFFFFFFFFFFFFFFL;
leastSigBits |= 0x8000000000000000L;
return new UUID(mostSigBits, leastSigBits);
}
There was a problem hiding this comment.
There's no (time-based) relationship between the currentTimeMillis() value and the nanoTime value.
They are independent clocks and are read separately and are un-correlated. They won't be usable as lsb of the millis value.
I'm surprised that the nextBytes is slower, since it looks like it calls nextLong and puts it in a newly allocated byte[8]. Normal perf measurements won't account for the gc overhead to recover it.
The nsBits computation looks odd, nanoTme returns nanoseconds (10^9), the remainder (% 1_000_000) is then milliseconds.
There was a problem hiding this comment.
if so, I've tried out some variations, i found creating the 64 bit lsb with ng.nextLong() brings a large pefomance decrease over using the nextBytes method
Probably because SecureRandom gets #nextLong() from Random, which ends up calling #next(int) twice, so it allocates twice. Overriding #nextLong() in SecureRandom may help a little but will still have to allocate as long as SecureRandomSpi is not updated.
There was a problem hiding this comment.
The nsBytes were included to give some additional but not guaranteed monotonicity in the event of msTime collisions on a single instance of a JVM. I updated the method to print the msTime and nsBits for visual purposes seen below. The nsBits increase until cycling back to a lower value on the last or +/- 1 of the last instance of the same msTime value as tested on Linux and MacOS . I'm not sure of the reason why the cycles are almost in sync given that the times are not linked. However Marschall has pointed out that nanoTime() resolution is weaker on windows so maybe its not something that can be depended on. In that case I could revert to the random byte only implementation and remove the nsBytes fro random data.
As for the nsBits computation, in order to align with the RFC, my understanding was to get the remainder of nsTime (% 1_000_000) to get the nanoseconds within the current millisecond, divide by 1_000_000 to convert to a fraction of a ms and multiply by 4096L to scale to the integer range to fit into the 12 bits permitted for the ns timestamp.
msTime: 1747835228108, nsBits: 1336
msTime: 1747835228137, nsBits: 1333
msTime: 1747835228137, nsBits: 1794
msTime: 1747835228137, nsBits: 2206
msTime: 1747835228137, nsBits: 2766
msTime: 1747835228137, nsBits: 3040
msTime: 1747835228137, nsBits: 3485
msTime: 1747835228137, nsBits: 3901
msTime: 1747835228138, nsBits: 239
msTime: 1747835228138, nsBits: 651
msTime: 1747835228138, nsBits: 918
msTime: 1747835228138, nsBits: 1321
msTime: 1747835228138, nsBits: 1729
msTime: 1747835228138, nsBits: 2140
msTime: 1747835228138, nsBits: 2535
msTime: 1747835228138, nsBits: 2874
msTime: 1747835228138, nsBits: 3243
msTime: 1747835228138, nsBits: 3641
msTime: 1747835228138, nsBits: 3967
msTime: 1747835228139, nsBits: 227
msTime: 1747835228139, nsBits: 524
There was a problem hiding this comment.
This thread can be resolved.
| */ | ||
| public static UUID timestampUUID() { | ||
| long msTime = System.currentTimeMillis(); | ||
| long nsTime = System.nanoTime(); |
There was a problem hiding this comment.
I is my understanding that System#nanoTime() does not provide the nanosecond fraction and is not guaranteed to have nanosecond resolution. You'd have to look at VM#getNanoTimeAdjustment(long) to get the nanotime adjustment. However the resolution of this is OS dependent, usually microseconds, likely 10 microseconds on Windows.
| ng.nextBytes(randomBytes); | ||
|
|
||
| // Set the first 6 bytes to the ms time | ||
| for (int i = 0; i < 6; i++) { | ||
| randomBytes[i] = (byte) (msTime >>> (40 - 8 * i)); | ||
| } | ||
|
|
||
| // Scale sub-ms nanoseconds to a 12-bit value | ||
| int nsBits = (int) ((nsTime % 1_000_000) / 1_000_000.0 * 4096); | ||
|
|
||
| // Set version and increased percision time bits | ||
| randomBytes[6] = (byte) (0x70 | ((nsBits >>> 8) & 0x0F)); | ||
| randomBytes[7] = (byte) (nsBits & 0xFF); | ||
|
|
||
| // Set variant to 2 | ||
| randomBytes[8] &= 0x3F; | ||
| randomBytes[8] |= (byte) 0x80; | ||
|
|
||
| return new UUID(randomBytes); |
There was a problem hiding this comment.
if so, I've tried out some variations, i found creating the 64 bit lsb with ng.nextLong() brings a large pefomance decrease over using the nextBytes method
Probably because SecureRandom gets #nextLong() from Random, which ends up calling #next(int) twice, so it allocates twice. Overriding #nextLong() in SecureRandom may help a little but will still have to allocate as long as SecureRandomSpi is not updated.
|
Hi @BsoBird, thanks for making a comment in an OpenJDK project! All comments and discussions in the OpenJDK Community must be made available under the OpenJDK Terms of Use. If you already are an OpenJDK Author, Committer or Reviewer, please click here to open a new issue so that we can record that fact. Please Use "Add GitHub user BsoBird" for the summary. If you are not an OpenJDK Author, Committer or Reviewer, simply check the box below to accept the OpenJDK Terms of Use for your comments.
Your comment will be automatically restored once you have accepted the OpenJDK Terms of Use. |
| } | ||
|
|
||
| /** | ||
| * Static factory to create a version 7 (time-based) {@code UUID} with the current |
There was a problem hiding this comment.
| * Static factory to create a version 7 (time-based) {@code UUID} with the current | |
| * Static factory to create a version 7 (Unix epoch time-based) {@code UUID} with the current |
|
The RFC states that 'Implementations SHOULD utilize a cryptographically secure pseudorandom number generator (CSPRNG) to provide values that are both difficult to predict ("unguessable") and have a low likelihood of collision ("unique")." That implementation uses ThreadLocalRandom which does not generate cryptographically secure randomness. For improved security and uniqueness of UUIDs, it might be better to use SecureRandom, also aligning with the behavoir of the randomUUID() method. |
| for (int i = 0; i < 6; i++) { | ||
| randomBytes[i] = (byte) (timestamp >>> (40 - 8 * i)); | ||
| } |
There was a problem hiding this comment.
If you unroll this loop by hand, by shifting and assigning to each byte, there will be a slight performance boost.
And it can take advantage of an optimization the recognized assignment to adjacent bytes and generates a single assignment for 4 bytes and then 2 bytes.
Otherwise, looks good.
Co-authored-by: Andrey Turbanov <[email protected]>
|
@kieran-farrell Please do not rebase or force-push to an active PR as it invalidates existing review comments. Note for future reference, the bots always squash all changes into a single commit automatically as part of the integration. See OpenJDK Developers’ Guide for more information. |
| throw new AssertionError("Generated UUID should not be null for timestamp: " + timestamp); | ||
| } | ||
| } catch (Exception e) { | ||
| throw new AssertionError("Unexpected exception with timestamp " + timestamp + ": " + e); |
There was a problem hiding this comment.
Please pass the original exception e as the cause of this AssertionError, something like:
throw new AssertionError("Unexpected exception with timestamp " + timestamp, e);
That will help debug any failures if at all they happen. Same comment for a few other places in this new test method.
| } | ||
| } | ||
|
|
||
| private static void epochMillis_userInputTest() { |
There was a problem hiding this comment.
Nit - it might be better to name this method testEpochMillisTimestamp()
|
I've added |
jaikiran
left a comment
There was a problem hiding this comment.
Thank you for all these updates and the patience during the reviews. The latest changes in b2239d8 look good to me. The latest state of the CSR too looks good to me and I see it has been Reviewed by others too. I think you can go ahead and move it to Finalized.
RogerRiggs
left a comment
There was a problem hiding this comment.
Looks good.
Please re-finalize the CSR.
|
thanks all for the reviews! |
There was a problem hiding this comment.
As follow-up (not this PR), we might consider re-visiting the switch from UUID in the first sentence/paragraph to "global identifiers" in the second paragraph. We seem to missing an "also known" to allow this switch.
There was a problem hiding this comment.
| @@ -180,6 +180,62 @@ public static UUID nameUUIDFromBytes(byte[] name) { | |||
| return new UUID(md5Bytes); | |||
| } | |||
|
|
|||
| /** | |||
| * Creates a {@code UUIDv7} {@code UUID} from the given Unix Epoch timestamp. | |||
There was a problem hiding this comment.
The existing UUID docs, including the existing static factory methods, use "type $N UUID" rather than "UUIDv$N". It would be okay to say "Creates a type 7 UUID (UUIDv7) ..." so that it fits better with the existing docs, and allows you to use UUIDv7 in this description and API note.
| * version value of 1, 2, 3 and 4, respectively. | ||
| * <p> See <a href="https://www.rfc-editor.org/rfc/rfc9562.html"> | ||
| * <i>RFC 9562: Universally Unique Identifiers (UUIDs)</i></a> for the complete specification, | ||
| * including algorithms used to create {@code UUID}s. |
There was a problem hiding this comment.
Moving the dependency from RFC 4122 to RFC 95262 is good.
With the new reference then it would be good to expand a bit of what the reader will find. Instead of "including algorithms ..." then it would be clearer to say "including the UUID format, layouts, and algorithms for creating UUIDs.".
AlanBateman
left a comment
There was a problem hiding this comment.
New method + spec look good.
We should create a follow-up issue to re-visit the switch from UUID in the first sentence/paragraph to "global identifiers" in the second paragraph.
|
/integrate |
|
@kieran-farrell |
|
I'll sponsor it when the CI run finishes. Thanks! |
|
/sponsor |
|
Going to push as commit 642ba4c.
Your commit was automatically rebased without conflicts. |
|
@RogerRiggs @kieran-farrell Pushed as commit 642ba4c. 💡 You may see a message that your pull request was closed with unmerged commits. This can be safely ignored. |
| // Should not throw for valid currentTimeMillis() timestamp | ||
| long timestamp = System.currentTimeMillis(); | ||
| try { | ||
| UUID u = UUID.ofEpochMillis(timestamp); |
There was a problem hiding this comment.
Note that this test will start failing in 20889 HE.
There was a problem hiding this comment.
Hello @ExE-Boss, what would be the failure in that case? I had a very brief look at the linked page and I couldn't relate how that impacts the timestamp based UUID.ofEpochMillis(...).
There was a problem hiding this comment.
@jaikiran UUID::ofEpochMillis throws IllegalArgumentException when the timestamp doesn’t fit within 48 bits, and +10889-08-02T05:31:50.656Z1 needs 49 bits to represent, which is tested on line 183:
jdk/test/jdk/java/util/UUID/UUIDTest.java
Lines 180 to 185 in 2761361
With the recent approval of UUIDv7 (https://datatracker.ietf.org/doc/rfc9562/), this PR aims to add a new static method UUID.timestampUUID() which constructs and returns a UUID in support of the new time generated UUID version.
The specification requires embedding the current timestamp in milliseconds into the first bits 0–47. The version number in bits 48–51, bits 52–63 are available for sub-millisecond precision or for pseudorandom data. The variant is set in bits 64–65. The remaining bits 66–127 are free to use for more pseudorandom data or to employ a counter based approach for increased time percision (https://www.rfc-editor.org/rfc/rfc9562.html#name-uuid-version-7).
The choice of implementation comes down to balancing the sensitivity level of being able to distingush UUIDs created below <1ms apart with performance. A test simulating a high-concurrency environment with 4 threads generating 10000 UUIDv7 values in parallel to measure the collision rate of each implementation (the amount of times the time based portion of the UUID was not unique and entries could not distinguished by time) yeilded the following results for each implemtation:
Performance tests show a decrease in performance as expected with the counter based implementation due to the introduction of synchronization:
The best balance here might be to employ a higher-precision implementation as the large increase in time sensitivity comes at a very slight performance cost.
Progress
Issues
Reviewers
Reviewing
Using
gitCheckout this PR locally:
$ git fetch https://git.openjdk.org/jdk.git pull/25303/head:pull/25303$ git checkout pull/25303Update a local copy of the PR:
$ git checkout pull/25303$ git pull https://git.openjdk.org/jdk.git pull/25303/headUsing Skara CLI tools
Checkout this PR locally:
$ git pr checkout 25303View PR using the GUI difftool:
$ git pr show -t 25303Using diff file
Download this PR as a diff file:
https://git.openjdk.org/jdk/pull/25303.diff
Using Webrev
Link to Webrev Comment