Faster StringDictionaryBuilder (~60% faster) (#1851) #1861
Faster StringDictionaryBuilder (~60% faster) (#1851) #1861tustvold merged 1 commit intoapache:masterfrom
Conversation
7aed1fe to
14bbd49
Compare
arrow/Cargo.toml
Outdated
There was a problem hiding this comment.
This is "technically" a new dependency as indexmap depends on hashbrown = 0.11, I wasn't sure whether to use an out of data dependency or not here...
There was a problem hiding this comment.
maybe it is time to submit a PR to update index map?
There was a problem hiding this comment.
There has been an open issue since February - indexmap-rs/indexmap#217
There was a problem hiding this comment.
Update here -- I spent a few minutes looking into the use of indexmap -- I think with a small amount of effort we could remove that dependency. I'll write up a ticket and maybe a PR for it shortly
arrow/Cargo.toml
Outdated
There was a problem hiding this comment.
This is a new dependency, although it is the default hash function for hashbrown so...
14bbd49 to
c054466
Compare
c054466 to
0bffee1
Compare
0bffee1 to
8032029
Compare
8032029 to
0e893cf
Compare
There was a problem hiding this comment.
These APIs are effectively part of #1860
0e893cf to
75f4945
Compare
| bench = false | ||
|
|
||
| [dependencies] | ||
| ahash = { version = "0.7", default-features = false } |
There was a problem hiding this comment.
Adding these new dependencies are my only hesitation for this PR.
Any other thoughts from maintainers or users?
There was a problem hiding this comment.
Hashbrown isn't a new dependency as it is used by indexmap (and the standard library). It also by default brings in ahash so that I think isn't new either.
There was a problem hiding this comment.
What do you mean that hashbrown is used by the standard library?
There was a problem hiding this comment.
The hashmap implementation in the standard library is hashbrown since Rust 1.36 - https://blog.rust-lang.org/2019/07/04/Rust-1.36.0.html#a-new-hashmapk-v-implementation
There was a problem hiding this comment.
The standard library includes its own version though (so it won't pull in the same version / dependency as adding the dependency in a project does)
There was a problem hiding this comment.
Yeah, I was being a bit disingenuous. I more meant from the perspective, "can I rely on this to be well-supported going forward", etc... it is pretty safe 😆
alamb
left a comment
There was a problem hiding this comment.
I will find time to review this PR in detail in a day or two. I think the dependency issue has been resolved (as in the new dependencies are ok, as they were transitive dependencies anyways)
|
@alamb if it helps this is a direct port of the string interner I added to IOx in https://github.com/influxdata/influxdb_iox/pull/1273 |
| for (idx, maybe_value) in dictionary_values.iter().enumerate() { | ||
| match maybe_value { | ||
| Some(value) => { | ||
| let hash = compute_hash(&state, value.as_bytes()); |
There was a problem hiding this comment.
For my own curiosity, I stared at this for a while
I would describie this as using HashMap as a HashSet of pointers (indexes) to canonical string values in StringBuilder. It then uses the low level hash brown APIs to checks if the same string has previously been inserted.
Very nice
I wondered a little why we needed to store any value for key 🤔 at all, but then I realized it is needed so we the code can find the corresponding string value
Which issue does this PR close?
Closes #1851
Relates to #1843
Rationale for this change
StringDictionaryBuilder can be made significantly faster
What changes are included in this PR?
There are two major changes in this PR
The first is ~40% uplift regardless of data shape, the latter adds a further performance improvement ranging from ~10-20% depending on the dictionary size.
Are there any user-facing changes?
No