perf: improve high frequency term search speed #7804

haohuaijin · 2025-08-01T03:38:07Z

- Remove the deprecated ZO_FULL_TEXT_SEARCH_TYPE env.
- Improve search speed for high-frequency terms by skipping the Tantivy search.

tantivy mode

In OpenObserve, Tantivy is used in three main ways:

1. No row_ids returned (IndexOptimizeMode except SimpleSelect):
   - Always fast.
2. Row_ids returned with SimpleSelect:
   - Always fast when the number of row_ids returned is small.
   - Still fast when the number of row_ids is large, because only a few files are searched.
3. Row_ids returned without SimpleSelect:
   - Fast when the number of row_ids is small.
   - Slow when the number of row_ids is large; in this case it is even slower than using DataFusion directly.

This PR optimizes the third scenario. Specifically, if the ratio row_ids / total_ids of cpu_num files exceeds ZO_INVERTED_INDEX_SKIP_THRESHOLD, the Tantivy search is skipped to avoid performance degradation.
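The skip rule can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the `FileHits` struct, the `should_skip_index` function, and the choice to aggregate the ratio over the first `sample_size` files are all assumptions, as is treating the threshold as a percentage.

```rust
/// Per-file result of an index lookup (hypothetical shape, not from the PR).
#[derive(Debug, Clone, Copy)]
struct FileHits {
    row_ids: u64,   // rows matched by the index in this file
    total_ids: u64, // total rows in this file
}

/// Returns true when the index search should be abandoned.
/// Assumptions: the ratio is aggregated over the first `sample_size` files
/// (standing in for cpu_num) and `threshold_percent` is a value in 0..=100.
fn should_skip_index(hits: &[FileHits], sample_size: usize, threshold_percent: u64) -> bool {
    if sample_size == 0 || hits.len() < sample_size {
        return false; // not enough files sampled yet
    }
    let sample = &hits[..sample_size];
    let matched: u64 = sample.iter().map(|h| h.row_ids).sum();
    let total: u64 = sample.iter().map(|h| h.total_ids).sum();
    // Integer form of matched / total > threshold_percent / 100.
    total > 0 && matched * 100 > total * threshold_percent
}

fn main() {
    let hits = vec![
        FileHits { row_ids: 480, total_ids: 1000 },
        FileHits { row_ids: 500, total_ids: 1000 },
    ];
    println!("skip index search: {}", should_skip_index(&hits, 2, 35));
}
```

With a filter that matches roughly half of the rows, as in the test below, a check like this would trip and the query would be evaluated without the index.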

test

data size: 300GB, compressed size: 2.1GB, index size: 7.7GB
full text fields: log, message
secondary index fields: k8s_namespace_name, k8s_pod_name, k8s_container_name, code

The filter str_match(kubernetes_namespace_name, 'ziox') selects 48% of the row_ids.

main branch

SELECT kubernetes_namespace_name, count(1) as cnt from default where  str_match(kubernetes_namespace_name, 'ziox') group by kubernetes_namespace_name order by cnt desc

took: 1.6s

this branch

SELECT kubernetes_namespace_name, count(1) as cnt from default where  str_match(kubernetes_namespace_name, 'ziox') group by kubernetes_namespace_name order by cnt desc

took: 850ms

directly use datafusion

SELECT kubernetes_namespace_name, count(1) as cnt from default where  str_match(kubernetes_namespace_name, 'ziox') group by kubernetes_namespace_name order by cnt desc

took: 700ms

greptile-apps

Greptile Summary

This PR implements significant performance improvements for high-frequency term search operations in Tantivy, along with code cleanup and removal of deprecated functionality.

Core Performance Improvements:
The main enhancement is in src/service/search/grpc/storage.rs where the filter_file_list_by_tantivy_index function has been completely restructured. The previous implementation used try_join_all which waited for all parallel tasks to complete before processing results. The new approach uses streaming with buffer_unordered to process results as they become available, enabling early termination when search operations become inefficient.

Key technical changes include:

- Adaptive threshold mechanism: The system now monitors the number of row IDs returned and can terminate searches early if too many files return excessive results, preventing system overload
- Streaming task processing: Results are processed incrementally rather than waiting for all tasks to complete
- Improved file grouping: The logic has been restructured from group-first to file-first iteration for better parallelization
- Enhanced concurrency control: Semaphore acquisition has been moved inside tasks for better resource management
- New utility functions: regroup_tantivy_files and into_chunks have been added to support the new execution model
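The difference between waiting for every task and consuming results as they arrive can be illustrated with a small standard-library sketch. It deliberately swaps in threads and an `mpsc` channel for the async `buffer_unordered` stream named above, so it shows only the control flow, not the PR's implementation. `IndexOutcome`, `search_with_early_exit`, and the `(row_ids, total_ids)` tuples are hypothetical names and shapes.

```rust
use std::sync::mpsc;
use std::thread;

/// Outcome of the index phase (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum IndexOutcome {
    /// Index search finished; carries the total number of matched row ids.
    Matched(u64),
    /// Too many rows matched; the caller should scan without the index.
    Skipped,
}

/// Each file is represented by a (row_ids, total_ids) pair.
fn search_with_early_exit(
    files: Vec<(u64, u64)>,
    min_files: usize,
    threshold_percent: u64,
) -> IndexOutcome {
    let (tx, rx) = mpsc::channel();
    for (row_ids, total_ids) in files {
        let tx = tx.clone();
        thread::spawn(move || {
            // A real task would query the index here. The send fails
            // harmlessly if the receiver has already been dropped.
            let _ = tx.send((row_ids, total_ids));
        });
    }
    drop(tx); // the receive loop ends once every worker has finished

    let (mut matched, mut total, mut seen) = (0u64, 0u64, 0usize);
    // Results are handled in completion order, not submission order.
    for (row_ids, total_ids) in rx {
        matched += row_ids;
        total += total_ids;
        seen += 1;
        if seen >= min_files && matched * 100 > total * threshold_percent {
            // Returning drops the receiver, so remaining results are discarded.
            return IndexOutcome::Skipped;
        }
    }
    IndexOutcome::Matched(matched)
}

fn main() {
    let outcome = search_with_early_exit(vec![(900, 1000); 4], 2, 35);
    println!("{:?}", outcome);
}
```

One simplification to keep in mind: this sketch spawns one thread per file with no concurrency limit, whereas the summary describes semaphore-bounded tasks.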

Code Cleanup:
The PR removes deprecated functionality including the full_text_search_type configuration field that was scheduled for removal in version 0.15.0. This cleanup extends to the search pipeline where deprecated prefix search handling has been removed from the flight service. Additionally, minor formatting improvements have been made to test code and function signatures for better readability.

These changes integrate with OpenObserve's existing search infrastructure while maintaining backward compatibility for the search API. The performance improvements specifically target high-frequency term scenarios where the previous implementation could become a bottleneck.

Confidence score: 4/5

- This PR appears safe to merge with significant performance benefits and proper cleanup of deprecated code
- The confidence score reflects the complexity of the Tantivy search changes which, while well-structured, involve substantial modifications to critical search functionality
- The src/service/search/grpc/storage.rs file needs careful attention due to the significant algorithmic changes in the search processing logic

5 files reviewed, no comments

github-actions · 2025-08-01T03:39:46Z

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Dead Code
The returned `query_limit` from `partition_tantivy_files` is never used, which may indicate incomplete or dead logic.

let (index_parquet_files, query_limit) =
    partition_tantivy_files(index_parquet_files, &idx_optimize_mode, target_partitions);

Potential Bug
`group_num` is derived from the first group's length and `max_group_len` from the number of groups, which seems swapped or misnamed and could lead to incorrect loop behavior.

let group_num = index_parquet_files.first().unwrap_or(&vec![]).len();
let max_group_len = index_parquet_files.len();

Import Conflict
Both `futures::StreamExt` and `tokio_stream::StreamExt` are imported, risking ambiguous method resolution for `StreamExt` traits.

use futures::{StreamExt, stream};
use hashbrown::HashMap;
use infra::{
    cache::file_data,
    errors::{Error, ErrorCodes},
};
use itertools::Itertools;
use tantivy::Directory;
use tokio::sync::Semaphore;
use tokio_stream::StreamExt as _;
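The Import Conflict item can be reproduced without either crate. In the sketch below, the modules `futures_like` and `tokio_like` are hypothetical stand-ins for the two real crates, and the method names are invented. It shows that a method defined by only one trait resolves normally, while a method name shared by both traits has to be called with path syntax.

```rust
mod futures_like {
    pub trait StreamExt {
        fn describe(&self) -> &'static str;
    }
    impl StreamExt for u32 {
        fn describe(&self) -> &'static str {
            "futures_like"
        }
    }
}

mod tokio_like {
    pub trait StreamExt {
        fn describe(&self) -> &'static str;
        fn only_tokio(&self) -> u32;
    }
    impl StreamExt for u32 {
        fn describe(&self) -> &'static str {
            "tokio_like"
        }
        fn only_tokio(&self) -> u32 {
            *self + 1
        }
    }
}

// Same import pattern as in the review comment: one trait by name, one as `_`.
use futures_like::StreamExt;
use tokio_like::StreamExt as _;

fn pick_futures(x: u32) -> &'static str {
    // `x.describe()` would not compile here (E0034: multiple applicable
    // items in scope), so the trait is named explicitly.
    StreamExt::describe(&x)
}

fn pick_tokio(x: u32) -> &'static str {
    <u32 as tokio_like::StreamExt>::describe(&x)
}

fn unique_to_tokio(x: u32) -> u32 {
    // Only one trait in scope defines this method, so method syntax works.
    x.only_tokio()
}

fn main() {
    println!("{} {} {}", pick_futures(1), pick_tokio(1), unique_to_tokio(7));
}
```

Whether the real imports collide therefore depends on which methods the file actually calls; the sketch does not establish that the PR's code fails to compile.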

github-actions · 2025-08-01T03:41:01Z

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: Possible issue (Impact: High)
Suggestion: Fix undefined timer variable
The variable `start` is undefined in this scope; you likely meant to measure from `search_start`. Replace `start` with `search_start` to correctly compute elapsed time.
src/service/search/grpc/storage.rs [790]
-let took = start.elapsed().as_millis() as usize;
+let took = search_start.elapsed().as_millis() as usize;
Suggestion importance[1-10]: 10
Why: The code references an undefined `start`, causing a compile error; using `search_start` correctly measures elapsed time.

Category: General (Impact: Low)
Suggestion: Correct group and length assignment
The names are swapped: `group_num` should be the number of groups and `max_group_len` the maximum group size. Swap their assignments to match their intended meaning.
src/service/search/grpc/storage.rs [734-735]
-let group_num = index_parquet_files.first().unwrap_or(&vec![]).len();
-let max_group_len = index_parquet_files.len();
+let group_num = index_parquet_files.len();
+let max_group_len = index_parquet_files.iter().map(|g| g.len()).max().unwrap_or(0);
Suggestion importance[1-10]: 5
Why: Swapping these assignments fixes the logging to accurately report the number of groups and the maximum group size, improving readability with minimal impact.

haohuaijin added 2 commits July 31, 2025 17:19

perf: improve high frequency term search speed

496fb3c

improve parallel for indexoptimizenmode

a040e5c

github-actions bot added the 🧹 Updates label Aug 1, 2025

greptile-apps bot reviewed Aug 1, 2025

Merge branch 'main' into improve-tantivy-parallel

8aa67ac

github-actions bot added the Review effort 4/5 label Aug 1, 2025

add test case

10df494

haohuaijin requested review from hengfeiyang and uddhavdave August 1, 2025 09:04

uddhavdave approved these changes Aug 1, 2025

haohuaijin merged commit 5f249ed into main Aug 4, 2025
63 checks passed

haohuaijin deleted the improve-tantivy-parallel branch August 4, 2025 02:59

haohuaijin mentioned this pull request Aug 4, 2025

chore: set ZO_INVERTED_INDEX_SKIP_THRESHOLD default to 35 #7839

Merged

haohuaijin mentioned this pull request Aug 29, 2025

chore: remove special handler for negative index search #8213

Merged

2 tasks

hengfeiyang pushed a commit that referenced this pull request Aug 30, 2025

chore: remove special handler for negative index search (#8213)

9d1f672

- [x] remove ZO_FEATURE_QUERY_NOT_FILTER_WITH_INDEX env
- [x] because of PR #7804, this part of the code is no longer needed
