Skip to content

Refactor the Numeric Tree to use HyperLogLogs - [MOD-7966, MOD-8116]#5049

Merged
GuyAv46 merged 52 commits intomasterfrom
guyav-POC_numeric_tree_with_hll
Dec 19, 2024
Merged

Refactor the Numeric Tree to use HyperLogLogs - [MOD-7966, MOD-8116]#5049
GuyAv46 merged 52 commits intomasterfrom
guyav-POC_numeric_tree_with_hll

Conversation

@GuyAv46
Copy link
Collaborator

@GuyAv46 GuyAv46 commented Sep 26, 2024

Refactoring the cardinality mechanism of the numeric tree.

Background

  • Until version 2.6, each leaf held a cardinality values array that held all its unique values. This allowed the system to track its cardinality and decide when to split.
  • In 2.6, a modification to this mechanism was introduced, which only takes into account 10% of the leaf's data (checking cardinality every 10th insertion) to speed up indexing time.
  • In the splitting logic, it was assumed that the cardinality of 10% of the data is 10% of the overall cardinality (which may be a good estimate if the data cardinality is high).

Issues:

  1. This assumption may be false, causing the cardinality estimation to become up to X10 higher than it is, yielding premature node splitting, higher memory consumption, slower indexing, and a less optimal numeric index.
  2. Cardinality check is done linearly on the cardinality array, which is slow
  3. The cardinality values array can get up to 2500 entries before splitting, which by itself consumes (sizeof double + sizeof size_t) * 2500 = 40,000 Bytes

The Solution

  • In this PR, we refactor the cardinality mechanism and replace the cardinality values array with a HyperLogLog (HLL).
  • Since updating the HLL is relatively fast (O(1)), we now consult the entire dataset in the cardinality calculation with a comparable execution time, yielding a more accurate cardinality estimation
  • We move to split leaves over the leaf median, yielding a more balanced split every time
  • Memory consumption of a single HLL is 2^6 = 64 Bytes (with current settings of 2^6 registers).

Additional Details:

  1. The split cardinality of a leaf is now determined by the node's depth (instead of relying on the parent split cardinality, which may be too high after a massive data update).
  2. Our HLL implementation got a facelift, including the usage of efficient CPU instructions and a caching mechanism
  3. Dropping the definitions NF_INFINITY and NF_NEGATIVE_INFINITY, and start using the C macro INFINITY instead.
  4. Implementing heap_doubles for efficient heap of doubles.
  5. Various code cleanup

Mark if applicable

  • This PR introduces API changes
  • This PR introduces serialization changes

@GuyAv46 GuyAv46 changed the title basic implementation Numeric Tree with HLL Sep 26, 2024
@fcostaoliveira
Copy link
Contributor

fcostaoliveira commented Sep 26, 2024

Automated performance analysis summary

This comment was automatically generated given there is performance data available.

In summary:

  • Detected a total of 33 stable tests between versions.
  • Detected a total of 8 highly unstable benchmarks.
  • Detected a total of 7 improvements above the improvement water line.
  • Detected a total of 1 regressions bellow the regression water line 5.0.

You can check a comparison in detail via the grafana link

Comparison between master and guyav-POC_numeric_tree_with_hll.

Time Period from 30 days ago. (environment used: oss-standalone)

Test Case Baseline master (median obs. +- std.dev) Comparison guyav-POC_numeric_tree_with_hll (median obs. +- std.dev) % change (higher-better) Note
ftsb-10K-enwiki_abstract-hashes-fulltext-sortby 68 +- 1.7% (7 datapoints) 68.00 -0.9% No Change
ftsb-10K-enwiki_abstract-hashes-term-prefix 7473 +- 1.4% (7 datapoints) 7437.00 -0.5% No Change
ftsb-10K-enwiki_abstract-hashes-term-suffix 1944 +- 1.7% (7 datapoints) 2094.00 7.7% IMPROVEMENT
ftsb-10K-enwiki_abstract-hashes-term-suffix-withsuffixtrie 63739 +- 2.9% (7 datapoints) 60240.00 -5.5% REGRESSION
ftsb-10K-enwiki_abstract-hashes-term-wildcard 12388 +- 3.0% (7 datapoints) 12127.00 -2.1% No Change
ftsb-10K-enwiki_pages-hashes-fulltext-mixed_simple-1word-query_write_1_to_read_20.yml 1265 +- 4.3% (7 datapoints) 1294.00 2.3% No Change
ftsb-10K-enwiki_pages-hashes-load 43408 +- 2.7% (7 datapoints) 44134.00 1.7% No Change
ftsb-10K-multivalue-numeric-json 795 +- 1.7% (7 datapoints) 837.00 5.2% IMPROVEMENT
ftsb-10K-singlevalue-numeric-json 331 +- 2.4% (7 datapoints) 340.00 2.5% No Change
ftsb-1K-enwiki_abstract-hashes-term-contains 1691 +- 4.7% (7 datapoints) 1795.00 6.2% IMPROVEMENT
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query 357 +- 7.4% (7 datapoints) 348.00 -2.6% waterline=7.4%. No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable 40 +- 13.3% UNSTABLE (7 datapoints) 42.00 6.2% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query@100_ops_sec 100 +- 0.0% (7 datapoints) 100.00 -0.0% No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query 2663 +- 9.1% (7 datapoints) 2617.00 -1.7% waterline=9.1%. No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable 1106 +- 10.1% UNSTABLE (7 datapoints) 1141.00 3.2% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query@100_ops_sec 100 +- 0.0% (7 datapoints) 100.00 0.0% No Change
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query 1066 +- 12.1% UNSTABLE (7 datapoints) 1048.00 -1.6% UNSTABLE (very high variance)
ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query@100_ops_sec 100 +- 0.0% (7 datapoints) 100.00 -0.0% No Change
ftsb-1M-enwiki_abstract-hashes-load 21772 +- 2.8% (7 datapoints) 22129.00 1.6% No Change
ftsb-1M-nyc_taxis-ftadd-load 19191 +- 4.1% (7 datapoints) 19355.00 0.9% No Change
ftsb-1M-nyc_taxis-hashes-load 20516 +- 3.9% (7 datapoints) 21421.00 4.4% potential IMPROVEMENT
search-aggregate-post-filter-simple.yml 80070 +- 3.5% (7 datapoints) 77469.00 -3.2% potential REGRESSION
search-expire-doc-10-milliseconds 0.60 +- 2.1% (4 datapoints) 0.61 1.7% No Change
search-expire-doc-1000-seconds 0.60 +- 2.1% (4 datapoints) 0.59 -1.7% No Change
search-expired-numeric-field-10-milliseconds 0.61 +- 2.5% (5 datapoints) 0.61 0.0%
search-expired-numeric-field-1000-seconds 0.60 +- 1.9% (5 datapoints) 0.61 1.7% No Change
search-filtering-tag-numeric 181 +- 8.1% (7 datapoints) 274.00 51.1% waterline=8.1%. IMPROVEMENT
search-filtering-tag-numeric-filter-pipeline 18459 +- 2.3% (7 datapoints) 18456.00 -0.0% No Change
search-ftsb-10K-enwiki_abstract-hashes-term-withoutsuffix-trie 37222 +- 3.9% (7 datapoints) 37501.00 0.8% No Change
search-ftsb-10K-enwiki_abstract-hashes-term-withsuffix-trie 37510 +- 4.4% (7 datapoints) 39037.00 4.1% potential IMPROVEMENT
search-ftsb-1700K-docs-union-iterators-q3 6.1 +- 1.9% (7 datapoints) 6.00 -2.0% No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-2word-intersection-query-non-sortable@50_ops_sec 49 +- 10.8% UNSTABLE (7 datapoints) 50.00 2.7% UNSTABLE (very high variance)
search-ftsb-1M-enwiki_abstract-hashes-fulltext-2word-union-query-non-sortable@100_ops_sec 100 +- 0.0% (7 datapoints) 100.00 -0.0% No Change
search-ftsb-1M-enwiki_abstract-hashes-fulltext-simple-1word-query-non-sortable 147 +- 5.4% (7 datapoints) 154.00 4.3% waterline=5.4%. potential IMPROVEMENT
search-ftsb-370K-docs-union-iterators-q4 6.5 +- 2.2% (7 datapoints) 6.70 2.4% No Change
search-ftsb-5200K-docs-union-iterators-q1 0.60 +- 1.9% (6 datapoints) 0.62 3.3% potential IMPROVEMENT
search-ftsb-5500K-docs-union-iterators-q2 0.91 +- 0.5% (5 datapoints) 0.93 2.2% No Change
search-geo 179 +- 5.0% (7 datapoints) 180.00 0.6% waterline=5.0%. No Change
search-high-cardinality-negation-term-baseline 20 +- 5.2% (7 datapoints) 20.00 2.1% waterline=5.2%. No Change
search-high-cardinality-negation-term-comparison_union_all_other_terms 7.0 +- 3.7% (7 datapoints) 6.80 -1.7% No Change
search-numeric 2769 +- 35.2% UNSTABLE (7 datapoints) 2592.00 -6.4% UNSTABLE (very high variance)
search-numeric-optimize 9052 +- 2.3% (7 datapoints) 9764.00 7.9% IMPROVEMENT
search-numeric-sortby 3328 +- 20.9% UNSTABLE (7 datapoints) 4433.00 33.2% UNSTABLE (very high variance)
search-numeric-sortby-desc 2993 +- 27.4% UNSTABLE (7 datapoints) 2364.00 -21.0% UNSTABLE (very high variance)
search-numeric-sortby-desc-optimize 40 +- 26.6% UNSTABLE (7 datapoints) 26.00 -34.9% UNSTABLE (very high variance)
search-numeric-sortby-optimize 19 +- 3.3% (7 datapoints) 21.00 12.4% IMPROVEMENT
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-fulltext-filter 485 +- 4.9% (7 datapoints) 509.00 4.9% potential IMPROVEMENT
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-numeric-filter 98 +- 9.3% (7 datapoints) 132.00 34.5% waterline=9.3%. IMPROVEMENT
vecsim-arxiv-titles-384-angular-filters-m16-ef-128-tag-filter 50750 +- 3.7% (7 datapoints) 50984.00 0.5% No Change

@GuyAv46 GuyAv46 force-pushed the guyav-POC_numeric_tree_with_hll branch from 6ae69c3 to 16ad7d7 Compare October 9, 2024 07:42
@GuyAv46 GuyAv46 force-pushed the guyav-POC_numeric_tree_with_hll branch from f6a01f3 to c4b3266 Compare October 13, 2024 06:34
@GuyAv46 GuyAv46 changed the title Numeric Tree with HLL Numeric Tree with HLL - [MOD-7966] Oct 13, 2024
meiravgri added a commit that referenced this pull request Dec 9, 2024
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049
github-merge-queue bot pushed a commit that referenced this pull request Dec 9, 2024
* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049
redisearch-backport-pull-request bot pushed a commit that referenced this pull request Dec 9, 2024
* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)
redisearch-backport-pull-request bot pushed a commit that referenced this pull request Dec 9, 2024
* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)
meiravgri added a commit that referenced this pull request Dec 9, 2024
* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)
github-merge-queue bot pushed a commit that referenced this pull request Dec 9, 2024
[MOD-8115] Free spec resources in the main thread (#5324)

* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)

Co-authored-by: meiravgri <[email protected]>
meiravgri added a commit that referenced this pull request Dec 9, 2024
* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)
github-merge-queue bot pushed a commit that referenced this pull request Dec 9, 2024
[MOD-8115] Free spec resources in the main thread (#5324)

* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)

Co-authored-by: meiravgri <[email protected]>
github-merge-queue bot pushed a commit that referenced this pull request Dec 10, 2024
[MOD-8115] Free spec resources in the main thread (#5324)

* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)
github-merge-queue bot pushed a commit that referenced this pull request Dec 10, 2024
* [MOD-8115] Free spec resources in the main thread (#5324)

* free spec resources in the main thread.

* improve cpp gc tests suite:
allow multiple gc runs in each test

move args to test suite members
use pthread_join instead of pthread_detach

loop to enable multiple gc interval taken from #5049

(cherry picked from commit fc50801)

* fix
@benoitdion benoitdion requested a review from Copilot December 13, 2024 09:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot reviewed 7 out of 25 changed files in this pull request and generated no comments.

Files not reviewed (18)
  • src/aggregate/reducers/count_distinct.c: Language not supported
  • src/debug_commands.c: Language not supported
  • src/fork_gc.c: Language not supported
  • src/hll/hll.c: Language not supported
  • src/hll/hll.h: Language not supported
  • src/inverted_index.c: Language not supported
  • src/inverted_index.h: Language not supported
  • src/numeric_filter.c: Language not supported
  • src/numeric_filter.h: Language not supported
  • src/numeric_index.c: Language not supported
  • src/numeric_index.h: Language not supported
  • src/optimizer_reader.c: Language not supported
  • src/query_param.c: Language not supported
  • src/util/heap_doubles.c: Language not supported
  • src/util/heap_doubles.h: Language not supported
  • tests/cpptests/test_cpp_forkgc.cpp: Language not supported
  • tests/cpptests/test_cpp_range.cpp: Language not supported
  • tests/cpptests/test_cpp_utils.cpp: Language not supported
Comments suppressed due to low confidence (2)

tests/pytests/test_json_multi_geo.py:96

  • The 'depth' parameter is not a standard parameter for 'env.assertEqual'. It should be removed.
env.assertEqual(int(info['num_docs']), num_docs, depth=1)

tests/pytests/test_json_multi_geo.py:97

  • The 'depth' parameter is not a standard parameter for 'env.assertEqual'. It should be removed.
env.assertEqual(float(info['inverted_sz_mb']), inverted_sz_mb, depth=1)

Comment on lines +444 to +445
t->emptyLeaves += rv.numLeaves;
t->numLeaves += rv.numLeaves;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: NumericRangeTree should hold its stats as struct and move its reference as a bookkeeping context to the functions, to avoid this copy

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. I can add it to the story for improving the parent range feature

@GuyAv46 GuyAv46 added this pull request to the merge queue Dec 19, 2024
Merged via the queue into master with commit ecb85b7 Dec 19, 2024
@GuyAv46 GuyAv46 deleted the guyav-POC_numeric_tree_with_hll branch December 19, 2024 11:11
redisearch-backport-pull-request bot pushed a commit that referenced this pull request Dec 19, 2024
…5049)

* basic implementation

* fix compilation

* minor changes

* implement split by median

* more cleanup

* implicit element size allocation

* minor cleanups and caching 32 - bits

* use standard API

* revert using stdc API

* added comments

* typo fix

* implement a simple doubles max heap

* gc code cleanup and preparations

* implement GC numeric fix

* simplified implementation

* more cleanup

* more cleanup

* more tidy up

* minor fixes

* tidy up

* tidy up

* minor improvement

* add an assertion

* add caching mechanism to HLL

* minor improvements

* tidy up

* more cleanup and simplifications

* fix tests and cleanup

* remove multiplier from testNumericTree

* implement minimal version of numeric tree debug print

* fix tests

* delete dead code

* remove code duplication

* address review comments

* another test case

* add a failing test

* fix GC logic

* flip ownership over flag

* move some calculations to the child process

* flip condition order

* address review comments and start counting leaves

* tidy up

* review fixes

* minor improvements

* tidy up

* address some review fixes

* address some more review comments

(cherry picked from commit ecb85b7)
@redisearch-backport-pull-request
Copy link
Contributor

Successfully created backport PR for 8.0:

github-merge-queue bot pushed a commit that referenced this pull request Dec 19, 2024
…8116] (#5380)

Refactor the Numeric Tree to use HyperLogLogs - [MOD-7966, MOD-8116] (#5049)

* basic implementation

* fix compilation

* minor changes

* implement split by median

* more cleanup

* implicit element size allocation

* minor cleanups and caching 32 - bits

* use standard API

* revert using stdc API

* added comments

* typo fix

* implement a simple doubles max heap

* gc code cleanup and preparations

* implement GC numeric fix

* simplified implementation

* more cleanup

* more cleanup

* more tidy up

* minor fixes

* tidy up

* tidy up

* minor improvement

* add an assertion

* add caching mechanism to HLL

* minor improvements

* tidy up

* more cleanup and simplifications

* fix tests and cleanup

* remove multiplier from testNumericTree

* implement minimal version of numeric tree debug print

* fix tests

* delete dead code

* remove code duplication

* address review comments

* another test case

* add a failing test

* fix GC logic

* flip ownership over flag

* move some calculations to the child process

* flip condition order

* address review comments and start counting leaves

* tidy up

* review fixes

* minor improvements

* tidy up

* address some review fixes

* address some more review comments

(cherry picked from commit ecb85b7)

Co-authored-by: GuyAv46 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants