Faster StringDictionaryBuilder (~60% faster) (#1851) by tustvold · Pull Request #1861 · apache/arrow-rs

tustvold · 2022-06-13T11:41:59Z

Which issue does this PR close?

Closes #1851
Relates to #1843

Rationale for this change

StringDictionaryBuilder can be made significantly faster.

What changes are included in this PR?

There are two major changes in this PR:

1. Switch to ahash
2. Avoid caching string keys in HashMap

The first gives a ~40% uplift regardless of data shape; the latter adds a further improvement of ~10-20%, depending on the dictionary size.

string_dictionary_builder/(dict_size:20, len:1000, key_len: 5)                                                                             
                        time:   [15.148 us 15.179 us 15.213 us]
                        change: [-49.937% -49.607% -49.183%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:1000, key_len: 5)                                                                             
                        time:   [15.334 us 15.372 us 15.408 us]
                        change: [-60.780% -60.676% -60.577%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:1000, key_len: 10)                                                                             
                        time:   [14.638 us 14.653 us 14.668 us]
                        change: [-66.763% -66.716% -66.673%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:10000, key_len: 10)                                                                            
                        time:   [131.08 us 131.15 us 131.23 us]
                        change: [-61.008% -60.966% -60.922%] (p = 0.00 < 0.05)
                        Performance has improved.
string_dictionary_builder/(dict_size:100, len:10000, key_len: 100)                                                                            
                        time:   [379.73 us 379.89 us 380.06 us]
                        change: [-61.999% -61.946% -61.887%] (p = 0.00 < 0.05)
                        Performance has improved.
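
The first change amounts to swapping the hash function behind the map. Below is a minimal sketch of that idea using only the standard library: `std::collections::HashMap` takes its hasher as a third type parameter, so a cheaper non-cryptographic hasher can be plugged in without changing how the map is used. FNV-1a appears here purely as a stand-in so the example builds without external crates; the PR itself uses ahash, and `Fnv1a`, `FastMap`, and `count_words` are hypothetical names that do not appear in the PR.

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

/// FNV-1a, a tiny non-cryptographic hasher. Assumption: used only as a
/// stand-in for ahash so that this sketch has no external dependencies.
struct Fnv1a(u64);

impl Default for Fnv1a {
    fn default() -> Self {
        Fnv1a(0xcbf2_9ce4_8422_2325) // FNV-1a 64-bit offset basis
    }
}

impl Hasher for Fnv1a {
    fn write(&mut self, bytes: &[u8]) {
        for &b in bytes {
            self.0 ^= u64::from(b);
            self.0 = self.0.wrapping_mul(0x0000_0100_0000_01b3); // FNV 64-bit prime
        }
    }

    fn finish(&self) -> u64 {
        self.0
    }
}

/// Same map type as the default one; only the hasher parameter differs.
type FastMap<K, V> = HashMap<K, V, BuildHasherDefault<Fnv1a>>;

/// Hypothetical helper showing that call sites look the same as with the
/// default hasher.
fn count_words(words: &[&str]) -> FastMap<String, usize> {
    let mut counts: FastMap<String, usize> = FastMap::default();
    for w in words {
        *counts.entry(w.to_string()).or_insert(0) += 1;
    }
    counts
}
```

Calling `count_words(&["a", "b", "a"])` returns a map with two entries, in which `"a"` maps to 2 and `"b"` maps to 1.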

Are there any user-facing changes?

No

tustvold · 2022-06-13T11:43:41Z

arrow/Cargo.toml

This is "technically" a new dependency, as indexmap depends on hashbrown = 0.11. I wasn't sure whether to use an out-of-date dependency or not here...

Maybe it is time to submit a PR to update indexmap?

There has been an open issue since February - indexmap-rs/indexmap#217

Update here: I spent a few minutes looking into the use of indexmap, and I think with a small amount of effort we could remove that dependency. I'll write up a ticket and maybe a PR for it shortly.

tustvold · 2022-06-13T11:45:11Z

arrow/Cargo.toml

This is a new dependency, although it is the default hash function for hashbrown so...

tustvold · 2022-06-24T11:21:35Z

arrow/src/array/builder/generic_list_builder.rs

These APIs are effectively part of #1860

alamb · 2022-06-26T13:04:05Z

arrow/Cargo.toml

 bench = false

 [dependencies]
+ahash = { version = "0.7", default-features = false }


Adding these new dependencies is my only hesitation for this PR.

Any other thoughts from maintainers or users?

Hashbrown isn't a new dependency, as it is used by indexmap (and the standard library). It also brings in ahash by default, so I think that isn't new either.

What do you mean that hashbrown is used by the standard library?

The hashmap implementation in the standard library is hashbrown since Rust 1.36 - https://blog.rust-lang.org/2019/07/04/Rust-1.36.0.html#a-new-hashmapk-v-implementation

The standard library includes its own version though (so it won't pull in the same version / dependency as adding the dependency in a project does)

Yeah, I was being a bit disingenuous. I meant it more from the perspective of "can I rely on this to be well-supported going forward", etc. It is pretty safe 😆

alamb

I will find time to review this PR in detail in a day or two. I think the dependency issue has been resolved (as in the new dependencies are ok, as they were transitive dependencies anyways)

tustvold · 2022-06-29T11:21:20Z

@alamb if it helps, this is a direct port of the string interner I added to IOx in https://github.com/influxdata/influxdb_iox/pull/1273

Dandandan

LGTM

alamb

nice 👍

alamb · 2022-06-29T23:51:07Z

arrow/src/array/builder/string_dictionary_builder.rs

+        for (idx, maybe_value) in dictionary_values.iter().enumerate() {
+            match maybe_value {
+                Some(value) => {
+                    let hash = compute_hash(&state, value.as_bytes());


For my own curiosity, I stared at this for a while

I would describe this as using HashMap as a HashSet of pointers (indexes) to canonical string values in StringBuilder. It then uses the low-level hashbrown APIs to check if the same string has previously been inserted.

Very nice

I wondered a little why we needed to store any value for the key 🤔 at all, but then I realized it is needed so the code can find the corresponding string value.
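
The structure described in this comment can be sketched with the standard library alone. This is an illustration under stated assumptions, not the code merged in the PR: `Interner`, `buckets`, and `intern` are hypothetical names, `RandomState` stands in for ahash, and a `HashMap<u64, Vec<usize>>` from hash to candidate indexes stands in for the low-level hashbrown API. The point it tries to show is that the table holds only indexes, and each distinct string is stored exactly once in the canonical value storage.

```rust
use std::collections::hash_map::RandomState;
use std::collections::HashMap;
use std::hash::BuildHasher;

/// Hypothetical interner: the table stores only indexes into `values`,
/// never a second copy of the string itself.
struct Interner {
    /// Hasher state; assumption: stands in for the ahash state used in the PR.
    state: RandomState,
    /// Hash of a string -> indexes in `values` whose string has that hash.
    buckets: HashMap<u64, Vec<usize>>,
    /// Canonical storage, one entry per distinct string.
    values: Vec<String>,
}

impl Interner {
    fn new() -> Self {
        Interner {
            state: RandomState::new(),
            buckets: HashMap::new(),
            values: Vec::new(),
        }
    }

    /// Returns the dictionary key (index) for `value`, inserting it if new.
    fn intern(&mut self, value: &str) -> usize {
        // Borrow the fields separately so the table and the value storage
        // can be used at the same time.
        let Interner { state, buckets, values } = self;
        let hash = state.hash_one(value);
        let candidates = buckets.entry(hash).or_default();
        // Equal hashes do not imply equal strings, so compare against the
        // canonical string that each stored index points to.
        if let Some(&idx) = candidates.iter().find(|&&i| values[i] == value) {
            return idx;
        }
        let idx = values.len();
        values.push(value.to_owned());
        candidates.push(idx);
        idx
    }
}
```

Interning `"foo"`, `"bar"`, and `"foo"` again returns the same index for both occurrences of `"foo"`, and `values` ends up holding just the two distinct strings. This is also why an index has to be stored at all: without it, a hash match could not be checked against the actual string, and the caller would have no key to emit.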

github-actions bot added the arrow Changes to the arrow crate label Jun 13, 2022

tustvold force-pushed the faster-string-dictionary-builder branch from 7aed1fe to 14bbd49 Compare June 13, 2022 11:42

tustvold force-pushed the faster-string-dictionary-builder branch from 14bbd49 to c054466 Compare June 13, 2022 11:46

alamb mentioned this pull request Jun 15, 2022

Remove indexmap dependency #1882

Closed

tustvold force-pushed the faster-string-dictionary-builder branch from c054466 to 0bffee1 Compare June 24, 2022 11:19

github-actions bot added the arrow-flight Changes to the arrow-flight crate label Jun 24, 2022

tustvold force-pushed the faster-string-dictionary-builder branch from 0bffee1 to 8032029 Compare June 24, 2022 11:19

tustvold marked this pull request as ready for review June 24, 2022 11:20

tustvold force-pushed the faster-string-dictionary-builder branch from 8032029 to 0e893cf Compare June 24, 2022 11:20

Faster StringDictionaryBuilder (apache#1851)

75f4945

tustvold force-pushed the faster-string-dictionary-builder branch from 0e893cf to 75f4945 Compare June 24, 2022 11:36

tustvold removed the arrow-flight Changes to the arrow-flight crate label Jun 24, 2022

alamb reviewed Jun 26, 2022

alamb reviewed Jun 28, 2022

Dandandan approved these changes Jun 29, 2022

tustvold merged commit 903b24a into apache:master Jun 29, 2022

alamb reviewed Jun 29, 2022

tustvold mentioned this pull request Jul 21, 2022

Faster parquet DictEncoder (~20%) #2123

Merged
