Skip to content

pkg/stringid: optimize GenerateRandomID#48706

Merged
thaJeztah merged 1 commit intomoby:masterfrom
thaJeztah:stringid_optimize
Oct 21, 2024
Merged

pkg/stringid: optimize GenerateRandomID#48706
thaJeztah merged 1 commit intomoby:masterfrom
thaJeztah:stringid_optimize

Conversation

@thaJeztah
Copy link
Copy Markdown
Member

@thaJeztah thaJeztah commented Oct 20, 2024

pkg/stringid: optimize GenerateRandomID

GenerateRandomID has a check to verify if the generated ID was numeric. This
check was added because a container's short-ID is used as default hostname for
containers, which isn't allowed to be consisting of only numbers (see moby#3869
and https://bugzilla.redhat.com/show_bug.cgi?id=1059122.

Producing an random ID with only numbers is a rare corner-case, but the check
would always be executed and wasn't optimized.

This patch applies some optimizations:

  • The code was using strconv.ParseUInt, which has additional checks for
    signs ("+" or "-"); hex.EncodeToString would never produce these, so
    we can use strconv.ParseInt instead (which doesn't have these checks).
  • The code was using TruncateID(id) to get the short-ID. The TruncateID
    function is designed to also handle digests, and for that checks for
    the given ID to contain colons (:), which it would split to remove
    the algorithm (sha256:) before truncating to the short-ID length.
    That check wasn't needed either, because those would not be produced
    by hex.EncodeToString, so instead, we can just truncate the ID.
  • Finally, all we really need to check for is if the ID consists of only
    numeric characters (0-9) so, let's do just that; if any non-numeric
    value is found, the ID is valid, and we can terminate the loop.

I did some basic benchmark to compare all of the above in isolation;

  • BenchmarkParseInt: strconv.ParseInt(TruncateID(id), 10, 64)
  • BenchmarkParseUInt: strconv.ParseUint(TruncateID(id), 10, 64)
  • BenchmarkParseUIntNoTrunc: strconv.ParseUint(id[:shortLen], 10, 64)
  • BenchmarkAllNum: allNum(id[:shortLen])

Results of the above:

BenchmarkParseInt-10                1713937       691.0 ns/op     480 B/op      18 allocs/op
BenchmarkParseIntNoTrunc-10         3385483       356.1 ns/op     480 B/op      18 allocs/op
BenchmarkParseUInt-10               2112538       567.7 ns/op     384 B/op      12 allocs/op
BenchmarkParseUIntNoTrunc-10        4325847       266.7 ns/op     384 B/op      12 allocs/op
BenchmarkAllNum-10                 77277264        15.29 ns/op      0 B/op       0 allocs/op

Difference for GenerateRandomID as a whole is less dramatic, as in most
cases ParseInt would bail out early, but still saves some allocations, and
performance is ~14% better:

BenchmarkGenerateRandomID-10        2807764       424.5 ns/op     240 B/op       6 allocs/op
BenchmarkGenerateRandomIDNew-10     3288866       366.6 ns/op     160 B/op       3 allocs/op

@thaJeztah thaJeztah added status/2-code-review area/performance area/daemon Core Engine kind/refactor PR's that refactor, or clean-up code labels Oct 20, 2024
@thaJeztah thaJeztah added this to the 28.0.0 milestone Oct 20, 2024
@thaJeztah thaJeztah self-assigned this Oct 20, 2024
@thaJeztah thaJeztah changed the title pkg/stringid: optimize BenchmarkGenerateRandomID pkg/stringid: optimize GenerateRandomID Oct 20, 2024
@thaJeztah thaJeztah force-pushed the stringid_optimize branch 3 times, most recently from b99dd7b to 70c3ec4 Compare October 21, 2024 08:08
@thaJeztah thaJeztah marked this pull request as ready for review October 21, 2024 08:08
Comment thread pkg/stringid/stringid_test.go Outdated
func BenchmarkGenerateRandomID(b *testing.B) {
b.ReportAllocs()
for i := 0; i < b.N; i++ {
_ = GenerateRandomID()
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This benchmark isn't able to measure the performance consistently as the behavior of GenerateRandomID is not deterministic due to usage of crypto/rand:

  1. It has a random number of iterations (each calls rand.Read and performs the check)
  2. The rand.Read itself doesn't guarantee a consistent performance.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To make this consistent we could swap the Reader with a math/rand reader created from a source with a fixed seed (just for the benchmark).

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yeah, good point. Honestly, I initially had the separate benchmark for the "numeric" check in isolation (for this reason), and added the other benchmark mostly to see if the effect was (somewhat) measurable in the wider scope of this function;

I did some basic benchmark to compare all of the above in isolation;

  • BenchmarkParseInt: strconv.ParseInt(TruncateID(id), 10, 64)
  • BenchmarkParseUInt: strconv.ParseUint(TruncateID(id), 10, 64)
  • BenchmarkParseUIntNoTrunc: strconv.ParseUint(id[:shortLen], 10, 64)
  • BenchmarkAllNum: allNum(id[:shortLen])

Perhaps it's not worth keeping the benchmark, and I should just remove it; WDYT?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went ahead and dropped the first commit.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can still keep the benchmark, but perhaps it would be worth to add a comment that it's not suitable for comparing different implementations without addressing the RNG predictability first.

GenerateRandomID has a check to verify if the generated ID was numeric. This
check was added because a container's short-ID is used as default hostname for
containers, which isn't allowed to be consisting of only numbers (see [moby#3869]
and https://bugzilla.redhat.com/show_bug.cgi?id=1059122.

Producing an random ID with only numbers is a rare corner-case, but the check
would always be executed and wasn't optimized.

This patch applies some optimizations:

- The code was using `strconv.ParseUInt`, which has additional checks for
  signs ("+" or "-"); `hex.EncodeToString` would never produce these, so
  we can use `strconv.ParseInt` instead (which doesn't have these checks).
- The code was using `TruncateID(id)` to get the short-ID. The `TruncateID`
  function is designed to also handle digests, and for that checks for
  the given ID to contain colons (`:`), which it would split to remove
  the algorithm (`sha256:`) before truncating to the short-ID length.
  That check wasn't needed either, because those would not be produced
  by `hex.EncodeToString`, so instead, we can just truncate the ID.
- Finally, all we _really_ need to check for is if the ID consists of only
  numeric characters (`0-9`) so, let's do just that; if any non-numeric
  value is found, the ID is valid, and we can terminate the loop.

I did some basic benchmark to compare all of the above in isolation;

- BenchmarkParseInt: `strconv.ParseInt(TruncateID(id), 10, 64)`
- BenchmarkParseUInt: `strconv.ParseUint(TruncateID(id), 10, 64)`
- BenchmarkParseUIntNoTrunc: `strconv.ParseUint(id[:shortLen], 10, 64)`
- BenchmarkAllNum: `allNum(id[:shortLen])`

Results of the above:

    BenchmarkParseInt-10                1713937       691.0 ns/op     480 B/op      18 allocs/op
    BenchmarkParseIntNoTrunc-10         3385483       356.1 ns/op     480 B/op      18 allocs/op
    BenchmarkParseUInt-10               2112538       567.7 ns/op     384 B/op      12 allocs/op
    BenchmarkParseUIntNoTrunc-10        4325847       266.7 ns/op     384 B/op      12 allocs/op
    BenchmarkAllNum-10                 77277264        15.29 ns/op      0 B/op       0 allocs/op

Difference for `GenerateRandomID` as a whole is less dramatic, as in most
cases `ParseInt` would bail out early, but still saves some allocations, and
performance is ~14% better:

    BenchmarkGenerateRandomID-10        2807764       424.5 ns/op     240 B/op       6 allocs/op
    BenchmarkGenerateRandomIDNew-10     3288866       366.6 ns/op     160 B/op       3 allocs/op

[moby#3869]: moby#3869

Signed-off-by: Sebastiaan van Stijn <[email protected]>
Copy link
Copy Markdown
Member

@laurazard laurazard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM overall, with a nit.

Comment thread pkg/stringid/stringid.go
Comment on lines +56 to +63
func allNum(id string) bool {
for _, c := range []byte(id) {
if c > '9' || c < '0' {
return false
}
}
return true
}
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nothing wrong with this, but it's worth calling out that this is only fine because the output of hex.EncodeToString will only return strings composed of characters belonging to [0-9a-f] – iirc, in general, this isn't safe due to strings being represented as utf8 code-points which might be made up of more than 1 byte.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think, albeit maybe overly cautious, it wouldn't hurt to add a comment about this.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, yes, that makes sense. I looked at it from a "this is non-exported" perspective, but agreed, it's better to be cautious here.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! ("erm, actually") I don't think we need a warning here. I had this in the back of my mind, but had to refresh my memory a bit to verify, but UTF8 was designed with that in mind; for multibyte characters, the first byte is always outside of the ASCII range, so unless we would manage to skip a byte, we would never hit this.

Yes, whoever designed UTF8 was really smart!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, that makes sense! TIL, thanks @thaJeztah!

@thaJeztah thaJeztah merged commit 3f9e489 into moby:master Oct 21, 2024
@thaJeztah thaJeztah deleted the stringid_optimize branch October 21, 2024 14:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area/daemon Core Engine area/performance kind/refactor PR's that refactor, or clean-up code status/4-merge

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants