Refactor the Numeric Tree to use HyperLogLogs - [MOD-7966, MOD-8116]#5049
Refactor the Numeric Tree to use HyperLogLogs - [MOD-7966, MOD-8116]#5049
Conversation
Automated performance analysis summaryThis comment was automatically generated given there is performance data available. In summary:
You can check a comparison in detail via the grafana link Comparison between master and guyav-POC_numeric_tree_with_hll.Time Period from 30 days ago. (environment used: oss-standalone)
|
6ae69c3 to
16ad7d7
Compare
f6a01f3 to
c4b3266
Compare
allow multiple gc runs in each test move args to test suite members use pthread_join instead of pthread_detach loop to enable multiple gc interval taken from #5049
* free spec resources in the main thread. * improve cpp gc tests suite: allow multiple gc runs in each test move args to test suite members use pthread_join instead of pthread_detach loop to enable multiple gc interval taken from #5049
[MOD-8115] Free spec resources in the main thread (#5324) * free spec resources in the main thread. * improve cpp gc tests suite: allow multiple gc runs in each test move args to test suite members use pthread_join instead of pthread_detach loop to enable multiple gc interval taken from #5049 (cherry picked from commit fc50801) Co-authored-by: meiravgri <[email protected]>
[MOD-8115] Free spec resources in the main thread (#5324) * free spec resources in the main thread. * improve cpp gc tests suite: allow multiple gc runs in each test move args to test suite members use pthread_join instead of pthread_detach loop to enable multiple gc interval taken from #5049 (cherry picked from commit fc50801) Co-authored-by: meiravgri <[email protected]>
[MOD-8115] Free spec resources in the main thread (#5324) * free spec resources in the main thread. * improve cpp gc tests suite: allow multiple gc runs in each test move args to test suite members use pthread_join instead of pthread_detach loop to enable multiple gc interval taken from #5049 (cherry picked from commit fc50801)
* [MOD-8115] Free spec resources in the main thread (#5324) * free spec resources in the main thread. * improve cpp gc tests suite: allow multiple gc runs in each test move args to test suite members use pthread_join instead of pthread_detach loop to enable multiple gc interval taken from #5049 (cherry picked from commit fc50801) * fix
There was a problem hiding this comment.
Copilot reviewed 7 out of 25 changed files in this pull request and generated no comments.
Files not reviewed (18)
- src/aggregate/reducers/count_distinct.c: Language not supported
- src/debug_commands.c: Language not supported
- src/fork_gc.c: Language not supported
- src/hll/hll.c: Language not supported
- src/hll/hll.h: Language not supported
- src/inverted_index.c: Language not supported
- src/inverted_index.h: Language not supported
- src/numeric_filter.c: Language not supported
- src/numeric_filter.h: Language not supported
- src/numeric_index.c: Language not supported
- src/numeric_index.h: Language not supported
- src/optimizer_reader.c: Language not supported
- src/query_param.c: Language not supported
- src/util/heap_doubles.c: Language not supported
- src/util/heap_doubles.h: Language not supported
- tests/cpptests/test_cpp_forkgc.cpp: Language not supported
- tests/cpptests/test_cpp_range.cpp: Language not supported
- tests/cpptests/test_cpp_utils.cpp: Language not supported
Comments suppressed due to low confidence (2)
tests/pytests/test_json_multi_geo.py:96
- The 'depth' parameter is not a standard parameter for 'env.assertEqual'. It should be removed.
env.assertEqual(int(info['num_docs']), num_docs, depth=1)
tests/pytests/test_json_multi_geo.py:97
- The 'depth' parameter is not a standard parameter for 'env.assertEqual'. It should be removed.
env.assertEqual(float(info['inverted_sz_mb']), inverted_sz_mb, depth=1)
| t->emptyLeaves += rv.numLeaves; | ||
| t->numLeaves += rv.numLeaves; |
There was a problem hiding this comment.
TODO: NumericRangeTree should hold its stats as struct and move its reference as a bookkeeping context to the functions, to avoid this copy
There was a problem hiding this comment.
Good idea. I can add it to the story for improving the parent range feature
…5049) * basic implementation * fix compilation * minor changes * implement split by median * more cleanup * implicit element size allocation * minor cleanups and caching 32 - bits * use standard API * revert using stdc API * added comments * typo fix * implement a simple doubles max heap * gc code cleanup and preparations * implement GC numeric fix * simplified implementation * more cleanup * more cleanup * more tidy up * minor fixes * tidy up * tidy up * minor improvement * add an assertion * add caching mechanism to HLL * minor improvements * tidy up * more cleanup and simplifications * fix tests and cleanup * remove multiplier from testNumericTree * implement minimal version of numeric tree debug print * fix tests * delete dead code * remove code duplication * address review comments * another test case * add a failing test * fix GC logic * flip ownership over flag * move some calculations to the child process * flip condition order * address review comments and start counting leaves * tidy up * review fixes * minor improvements * tidy up * address some review fixes * address some more review comments (cherry picked from commit ecb85b7)
|
Successfully created backport PR for |
…8116] (#5380) Refactor the Numeric Tree to use HyperLogLogs - [MOD-7966, MOD-8116] (#5049) * basic implementation * fix compilation * minor changes * implement split by median * more cleanup * implicit element size allocation * minor cleanups and caching 32 - bits * use standard API * revert using stdc API * added comments * typo fix * implement a simple doubles max heap * gc code cleanup and preparations * implement GC numeric fix * simplified implementation * more cleanup * more cleanup * more tidy up * minor fixes * tidy up * tidy up * minor improvement * add an assertion * add caching mechanism to HLL * minor improvements * tidy up * more cleanup and simplifications * fix tests and cleanup * remove multiplier from testNumericTree * implement minimal version of numeric tree debug print * fix tests * delete dead code * remove code duplication * address review comments * another test case * add a failing test * fix GC logic * flip ownership over flag * move some calculations to the child process * flip condition order * address review comments and start counting leaves * tidy up * review fixes * minor improvements * tidy up * address some review fixes * address some more review comments (cherry picked from commit ecb85b7) Co-authored-by: GuyAv46 <[email protected]>
Refactoring the cardinality mechanism of the numeric tree.
Background
Issues:
(sizeof double + sizeof size_t) * 2500= 40,000 BytesThe Solution
O(1)), we now consult the entire dataset in the cardinality calculation with a comparable execution time, yielding a more accurate cardinality estimation2^6= 64 Bytes (with current settings of 2^6 registers).Additional Details:
NF_INFINITYandNF_NEGATIVE_INFINITY, and start using the C macroINFINITYinstead.heap_doublesfor efficient heap of doubles.Mark if applicable