@haohuaijin
Collaborator

@haohuaijin haohuaijin commented Aug 1, 2025

  • remove the deprecated ZO_FULL_TEXT_SEARCH_TYPE env
  • improve search speed for high-frequency terms by skipping the Tantivy search

tantivy mode

In OpenObserve, Tantivy is used in three main ways:

  1. No row_ids returned (IndexOptimizeMode except SimpleSelect):

    • Always fast.
  2. Row_ids returned with SimpleSelect:

    • Always fast when the number of row_ids returned is small.
    • Still fast when the number of row_ids is large, because only a few files are searched.
  3. Row_ids returned without SimpleSelect:

    • Fast when the number of row_ids is small.
    • Slow when the number of row_ids is large; in this case, it is even slower than using DataFusion directly.

This PR optimizes the third scenario. Specifically, if the ratio (row_ids / total_ids) for cpu_num files exceeds ZO_INVERTED_INDEX_SKIP_THRESHOLD, the Tantivy search is skipped to avoid performance degradation.
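The skip rule can be sketched as below. This is only an illustration of the idea, not the code from this PR: `FileHit` and `should_skip_index` are hypothetical names, and the sketch assumes the threshold is a fraction in the range 0 to 1 and that "cpu_num files" means the search is abandoned once that many files each exceed the threshold. The real configuration value and counting rule may differ.

```rust
/// Per-file outcome of an index lookup (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy)]
pub struct FileHit {
    /// Number of row IDs the index matched in this file.
    pub row_ids: u64,
    /// Total number of rows in this file.
    pub total_ids: u64,
}

/// Returns true once at least `cpu_num` files have a hit ratio above
/// `skip_threshold`.
///
/// Assumption: `skip_threshold` is a fraction in [0, 1]; the real
/// ZO_INVERTED_INDEX_SKIP_THRESHOLD setting may use a different unit.
pub fn should_skip_index(hits: &[FileHit], cpu_num: usize, skip_threshold: f64) -> bool {
    if cpu_num == 0 {
        return false;
    }
    let mut dense_files = 0usize;
    for hit in hits {
        // Empty files carry no signal about term frequency.
        if hit.total_ids == 0 {
            continue;
        }
        let ratio = hit.row_ids as f64 / hit.total_ids as f64;
        if ratio > skip_threshold {
            dense_files += 1;
            if dense_files >= cpu_num {
                return true;
            }
        }
    }
    false
}
```

With a threshold of 0.4 and cpu_num of 2, two files that each match 48% of their rows would trigger the skip, while a single such file would not.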

test

data size: 300GB, compressed size: 2.1GB, index size: 7.7GB
full text field: log, message
secondary index field: k8s_namespace_name, k8s_pod_name, k8s_container_name, code

The filter str_match(kubernetes_namespace_name, 'ziox') filters out 48% of row_ids.

main branch

SELECT kubernetes_namespace_name, count(1) as cnt from default where str_match(kubernetes_namespace_name, 'ziox') group by kubernetes_namespace_name order by cnt desc

took: 1.6s

this branch

SELECT kubernetes_namespace_name, count(1) as cnt from default where str_match(kubernetes_namespace_name, 'ziox') group by kubernetes_namespace_name order by cnt desc

took: 850ms

using DataFusion directly

SELECT kubernetes_namespace_name, count(1) as cnt from default where str_match(kubernetes_namespace_name, 'ziox') group by kubernetes_namespace_name order by cnt desc

took: 700ms

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR implements significant performance improvements for high-frequency term search operations in Tantivy, along with code cleanup and removal of deprecated functionality.

Core Performance Improvements:
The main enhancement is in src/service/search/grpc/storage.rs where the filter_file_list_by_tantivy_index function has been completely restructured. The previous implementation used try_join_all which waited for all parallel tasks to complete before processing results. The new approach uses streaming with buffer_unordered to process results as they become available, enabling early termination when search operations become inefficient.

Key technical changes include:

  • Adaptive threshold mechanism: The system now monitors the number of row IDs returned and can terminate searches early if too many files return excessive results, preventing system overload
  • Streaming task processing: Results are processed incrementally rather than waiting for all tasks to complete
  • Improved file grouping: The logic has been restructured from group-first to file-first iteration for better parallelization
  • Enhanced concurrency control: Semaphore acquisition has been moved inside tasks for better resource management
  • New utility functions: regroup_tantivy_files and into_chunks have been added to support the new execution model
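The difference between waiting for every task and consuming results as they arrive can be illustrated with a small sketch. The summary above describes buffer_unordered from the futures crate; the version below deliberately substitutes standard-library threads and a channel so it runs without external crates, and it is not the code from this PR. `Outcome`, `search_with_early_exit`, and the use of a plain ratio per file are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Result of the sketch: how many per-file results were consumed, and
/// whether the search ran to completion or was abandoned early.
#[derive(Debug, PartialEq)]
pub enum Outcome {
    Completed(usize),
    Skipped(usize),
}

/// Consumes per-file hit ratios in arrival order and stops as soon as
/// `max_dense` of them exceed `threshold`.
///
/// Assumption: each `f64` stands in for one file's row_ids / total_ids.
/// One thread per file keeps the sketch short; real code would bound
/// concurrency.
pub fn search_with_early_exit(ratios: Vec<f64>, threshold: f64, max_dense: usize) -> Outcome {
    let (tx, rx) = mpsc::channel::<f64>();
    let handles: Vec<_> = ratios
        .into_iter()
        .map(|ratio| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for a per-file index search. The send fails
                // once the receiver is gone, which is fine to ignore.
                let _ = tx.send(ratio);
            })
        })
        .collect();
    // Drop the original sender so the receive loop ends when all workers finish.
    drop(tx);

    let mut seen = 0usize;
    let mut dense = 0usize;
    let mut skipped = false;
    for ratio in &rx {
        seen += 1;
        if ratio > threshold {
            dense += 1;
            if dense >= max_dense {
                skipped = true;
                break; // early termination: do not wait for the rest
            }
        }
    }
    // Closing the receiver makes any remaining sends fail immediately.
    drop(rx);
    for handle in handles {
        let _ = handle.join();
    }
    if skipped {
        Outcome::Skipped(seen)
    } else {
        Outcome::Completed(seen)
    }
}
```

A join-all approach would have to collect every ratio before making the same decision, which is the cost the streaming design avoids for high-frequency terms.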

Code Cleanup:
The PR removes deprecated functionality including the full_text_search_type configuration field that was scheduled for removal in version 0.15.0. This cleanup extends to the search pipeline where deprecated prefix search handling has been removed from the flight service. Additionally, minor formatting improvements have been made to test code and function signatures for better readability.

These changes integrate with OpenObserve's existing search infrastructure while maintaining backward compatibility for the search API. The performance improvements specifically target high-frequency term scenarios where the previous implementation could become a bottleneck.

Confidence score: 4/5

  • This PR appears safe to merge with significant performance benefits and proper cleanup of deprecated code
  • The confidence score reflects the complexity of the Tantivy search changes which, while well-structured, involve substantial modifications to critical search functionality
  • The src/service/search/grpc/storage.rs file needs careful attention due to the significant algorithmic changes in the search processing logic

5 files reviewed, no comments

@github-actions
Contributor

github-actions bot commented Aug 1, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Dead Code

The returned query_limit from partition_tantivy_files is never used, which may indicate incomplete or dead logic.

let (index_parquet_files, query_limit) =
    partition_tantivy_files(index_parquet_files, &idx_optimize_mode, target_partitions);

Potential Bug

group_num is derived from the first group's length and max_group_len from the number of groups, which seems swapped or misnamed and could lead to incorrect loop behavior.

let group_num = index_parquet_files.first().unwrap_or(&vec![]).len();
let max_group_len = index_parquet_files.len();

Import Conflict

Both futures::StreamExt and tokio_stream::StreamExt are imported, risking ambiguous method resolution for StreamExt traits.

use futures::{StreamExt, stream};
use hashbrown::HashMap;
use infra::{
    cache::file_data,
    errors::{Error, ErrorCodes},
};
use itertools::Itertools;
use tantivy::Directory;
use tokio::sync::Semaphore;
use tokio_stream::StreamExt as _;
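
To make the concern concrete: an `as _` import does not bind the trait's name, but it still brings the trait's methods into scope, so a method defined by both traits becomes ambiguous at the call site. The sketch below reproduces this with two hypothetical local modules, `futures_like` and `tokio_like`, rather than the real crates, and shows fully qualified syntax as one way to disambiguate.

```rust
mod futures_like {
    pub trait StreamExt {
        fn first_item(&self) -> Option<i32>;
    }
    impl StreamExt for [i32] {
        fn first_item(&self) -> Option<i32> {
            self.first().copied()
        }
    }
}

mod tokio_like {
    pub trait StreamExt {
        fn first_item(&self) -> Option<i32>;
        fn item_count(&self) -> usize;
    }
    impl StreamExt for [i32] {
        fn first_item(&self) -> Option<i32> {
            self.first().copied()
        }
        fn item_count(&self) -> usize {
            self.len()
        }
    }
}

// Same import shape as in the reviewed file: one trait by name, one as `_`.
use futures_like::StreamExt;
use tokio_like::StreamExt as _;

pub fn demo(values: &[i32]) -> (Option<i32>, usize) {
    // `values.first_item()` would not compile here: both traits are in
    // scope and both define `first_item` (error E0034).
    // Fully qualified syntax picks the named trait explicitly.
    let first = <[i32] as StreamExt>::first_item(values);
    // Only `tokio_like::StreamExt` defines `item_count`, so the method
    // call is unambiguous.
    let count = values.item_count();
    (first, count)
}
```

Methods that exist on only one of the two traits keep working with ordinary method syntax, which is why such an import pair can compile until a shared method is called.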

@github-actions
Contributor

github-actions bot commented Aug 1, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Category | Suggestion | Impact
Possible issue
Fix undefined timer variable

The variable start is undefined in this scope; you likely meant to measure from
search_start. Replace start with search_start to correctly compute elapsed time.

src/service/search/grpc/storage.rs [790]

-let took = start.elapsed().as_millis() as usize;
+let took = search_start.elapsed().as_millis() as usize;
Suggestion importance[1-10]: 10


Why: The code references an undefined start, causing a compile error; using search_start correctly measures elapsed time.

High
General
Correct group and length assignment

The names are swapped: group_num should be the number of groups and max_group_len
the maximum group size. Swap their assignments to match their intended meaning.

src/service/search/grpc/storage.rs [734-735]

-let group_num = index_parquet_files.first().unwrap_or(&vec![]).len();
-let max_group_len = index_parquet_files.len();
+let group_num = index_parquet_files.len();
+let max_group_len = index_parquet_files.iter().map(|g| g.len()).max().unwrap_or(0);
Suggestion importance[1-10]: 5


Why: Swapping these assignments fixes the logging to accurately report the number of groups and the maximum group size, improving readability with minimal impact.

Low
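
A small worked example shows how the two assignments differ. This is a sketch with a hypothetical helper, `group_stats`, applied to a nested vector shaped like index_parquet_files; it follows the bot's suggested definitions and does not claim to reflect what the merged code does.

```rust
/// Returns (number of groups, size of the largest group), following the
/// definitions in the suggestion above.
pub fn group_stats<T>(groups: &[Vec<T>]) -> (usize, usize) {
    let group_num = groups.len();
    let max_group_len = groups.iter().map(Vec::len).max().unwrap_or(0);
    (group_num, max_group_len)
}

/// Mirrors the original assignments for comparison:
/// (length of the first group, number of groups).
pub fn original_stats<T>(groups: &[Vec<T>]) -> (usize, usize) {
    let group_num = groups.first().map_or(0, Vec::len);
    let max_group_len = groups.len();
    (group_num, max_group_len)
}
```

For two groups holding three files and one file, `group_stats` reports 2 groups with a maximum length of 3, whereas the original assignments report the values the other way round.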

@haohuaijin haohuaijin merged commit 5f249ed into main Aug 4, 2025
63 checks passed
@haohuaijin haohuaijin deleted the improve-tantivy-parallel branch August 4, 2025 02:59
hengfeiyang pushed a commit that referenced this pull request Aug 30, 2025
- [x] remove ZO_FEATURE_QUERY_NOT_FILTER_WITH_INDEX env
- [x] due to this PR #7804, this part of the code is no longer needed