Skip to content

Conversation

@gf2121
Copy link
Contributor

@gf2121 gf2121 commented Sep 20, 2023

Description

Recently, we captured a flame graph in a scene with frequent updates, which showed that sorting deleted terms occupied a high CPU ratio.

In scenarios with many deleted terms, most terms could have the same field name. So a data structure like Map<String, Map<BytesRef, Integer>> instead of Map<Term, Integer> could be better here —— We can avoid field name compare and use MSBRadixSort to sort the bytes for each field.

We can also take advantage of BytesRefHash to implement Map<BytesRef, Integer> to get a more efficient memory layout, and there's already a MSBRadixSort impl in BytesRefHash we can reuse.

Benchmark

We benchmarked the sort logic for 1,000,000 terms, showing 66% took decreasing:

  Baseline Candidate Took Diff
total took 692 234 -66.18%

An E2E benchmark that delete document with term 1,000,000 times showing took decreased 43% (with default iw config).

  Baseline Candidate Took Diff
total took 1629 923 -43.34%

@gf2121 gf2121 changed the title Speed up sort on deleted terms Use radix sort to speed up the sorting of deleted terms Sep 21, 2023
Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! I left some minor comments but it looks great in general.

Copy link
Contributor

@jpountz jpountz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I like that it also makes buffered deletes more memory-efficient as a side-effect.

@gf2121
Copy link
Contributor Author

gf2121 commented Sep 22, 2023

Thanks @jpountz !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants