
Conversation

@hengfeiyang
Contributor

@hengfeiyang hengfeiyang commented Sep 5, 2025

User description

Add two new ENV:

ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2
ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64

The previous version only removed tokens longer than 64 characters; it did not remove short tokens.

Now we remove both very short and very long tokens.

Why we were getting errors like this:

thread 'job_runtime' panicked at /index.crates.io-1949cf8c6b5b557f/tantivy-stacker-0.3.0/src/memory_arena.rs:205:9:

Tantivy uses an int32 for memory addresses, so it cannot build a data page larger than 4GB. Because OpenObserve generates only one segment file per parquet file, a very large parquet file produces a huge number of tokens; eventually the Tantivy segment exceeds that limit and panics.

This PR improves token generation and will automatically remove:

  • single-character tokens (shorter than 2 characters)
  • large tokens longer than 64 characters

With these changes we can successfully generate a 5GB parquet file and its Tantivy index file.
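
For reference, a minimal sketch of the resulting analyzer pipeline. The constants and the import path for RemoveShortFilter are assumptions for illustration; the real code reads both limits from the config (see the reviewer excerpt below):

use tantivy::tokenizer::{RemoveLongFilter, SimpleTokenizer, TextAnalyzer};
// Assumed import path for the new filter introduced in this PR.
use config::utils::tantivy::tokenizer::remove_short::RemoveShortFilter;

// Defaults mirroring the new environment variables.
const MIN_TOKEN_LENGTH: usize = 2; // ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH
const MAX_TOKEN_LENGTH: usize = 64; // ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH

fn build_analyzer() -> TextAnalyzer {
    TextAnalyzer::builder(SimpleTokenizer::default())
        // keep only tokens of at least MIN_TOKEN_LENGTH characters
        .filter(RemoveShortFilter::limit(MIN_TOKEN_LENGTH))
        // drop overly long tokens (base64 blobs, hashes, etc.)
        .filter(RemoveLongFilter::limit(MAX_TOKEN_LENGTH))
        .build()
}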


PR Type

Enhancement, Bug fix


Description

  • Add min/max token length configs

  • Implement short-token removal filter

  • Apply token limits in analyzers

  • Fix compactor file naming collisions


Diagram Walkthrough

flowchart LR
  CFG["Config: add min/max token length + validation"] -- "used by" --> TOK["Tokenizer builder: apply RemoveShort + RemoveLong"]
  TOK -- "new filter" --> RSF["RemoveShortFilter"]
  O2["O2 Tokenizer tests"] -- "adjust thresholds" --> TOK
  ID["ider: generate_file_name()"] -- "use in" --> JOB["job/files: storage file naming"]
  ID -- "use in" --> CMP["compactor: merged file naming"]

File Walkthrough

Relevant files

Enhancement (4 files)
  • config.rs: Add token length configs and validation hook (+36/-6)
  • ider.rs: Introduce unique file name generator (+7/-0)
  • mod.rs: Wire min/max token filters into analyzers (+18/-7)
  • remove_short.rs: Implement RemoveShortFilter and tests (+120/-0)

Tests (1 file)
  • o2_tokenizer.rs: Align base64 threshold with max token length; test update (+3/-3)

Bug fix (2 files)
  • mod.rs: Use new filename generator; remove ad-hoc suffix (+2/-3)
  • merge.rs: Use new filename generator for merged outputs (+2/-2)

@github-actions
Contributor

github-actions bot commented Sep 5, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Bounds

The max token length is derived using max(config, 64). This prevents lowering the limit below 64 via config, which may be intended but contradicts the new env var description. Confirm desired behavior or switch to clamping (min/max) semantics.

pub fn o2_tokenizer_build() -> TextAnalyzer {
    let cfg = get_config();
    let min_token_length =
        std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
    let max_token_length =
        std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
    if cfg.common.inverted_index_camel_case_tokenizer_disabled {
        tantivy::tokenizer::TextAnalyzer::builder(SimpleTokenizer::default())
            .filter(RemoveShortFilter::limit(min_token_length))
            .filter(tantivy::tokenizer::RemoveLongFilter::limit(
                max_token_length,
            ))
Position Gaps

Removing short tokens preserves original token positions, resulting in non-sequential positions (e.g., first token can have position 1). Validate that downstream components expect gaps and that scoring/phrase queries in Tantivy behave as desired.

fn token_stream_helper(text: &str) -> Vec<Token> {
    let mut a = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveShortFilter::limit(2))
        .build();
    let mut token_stream = a.token_stream(text);
    let mut tokens: Vec<Token> = vec![];
    let mut add_token = |token: &Token| {
        tokens.push(token.clone());
    };
    token_stream.process(&mut add_token);
    tokens
}

#[test]
fn test_remove_short() {
    let tokens = token_stream_helper("i am a hello tantivy, happy searching!");
    assert_eq!(tokens.len(), 5);
    assert_token(&tokens[0], 1, "am", 2, 4);
    assert_token(&tokens[1], 3, "hello", 7, 12);
    assert_token(&tokens[2], 4, "tantivy", 13, 20);
    assert_token(&tokens[3], 5, "happy", 22, 27);
    assert_token(&tokens[4], 6, "searching", 28, 37);
Base64 Threshold

MIN_BASE64_LENGTH tied to MAX_TOKEN_LENGTH+1 drastically lowers the prior 1024 threshold; verify this doesn’t reintroduce noisy base64-like tokens and aligns with performance goals.

fn looks_like_base64(s: &str) -> bool {
    // Base64 characteristics:
    // 1. Only contains [A-Za-z0-9], the tokenizer will remove all non-alphanumeric characters
    // 2. Should be reasonably long (to avoid false positives)
    const MIN_BASE64_LENGTH: usize = super::MAX_TOKEN_LENGTH + 1; // Adjust this threshold as needed

    if s.len() < MIN_BASE64_LENGTH {
        return false;

@github-actions
Contributor

github-actions bot commented Sep 5, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Enforce min-max consistency

Ensure min_token_length never exceeds max_token_length; otherwise all tokens could
be dropped. After clamping individual values, enforce min <= max by adjusting one
side if needed.

src/config/src/utils/tantivy/tokenizer/mod.rs [30-33]

-let min_token_length =
-    std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let mut min_token_length = cfg
+    .limit
+    .inverted_index_min_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+let mut max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+if min_token_length > max_token_length {
+    min_token_length = max_token_length;
+}
Suggestion importance[1-10]: 8


Why: Ensuring min_token_length <= max_token_length prevents pathological configurations that could drop all tokens. The improved code accurately derives from the new hunk and meaningfully hardens behavior.

Impact: Medium
Clamp token length bounds

Using max here ignores smaller configured limits and can allow overly long tokens.
Clamp the configured value between sensible min and max bounds, and use it directly
without inflating beyond MAX_TOKEN_LENGTH.

src/config/src/utils/tantivy/tokenizer/mod.rs [33]

-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
Suggestion importance[1-10]: 7


Why: The existing code uses max with MAX_TOKEN_LENGTH, which can exceed the intended upper bound; clamping to [MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH] is a reasonable fix. It improves correctness by honoring configured limits without allowing overly long tokens, though it's not critical.

Impact: Medium
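
For reference, Rust's clamp keeps a value inside an inclusive range, which is the semantics both clamping suggestions above rely on:

fn main() {
    // usize::clamp(min, max) pins a value into [min, max]
    assert_eq!(1usize.clamp(2, 64), 2);    // below range: raised to the minimum
    assert_eq!(100usize.clamp(2, 64), 64); // above range: lowered to the maximum
    assert_eq!(16usize.clamp(2, 64), 16);  // in range: unchanged
}
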
General
Count Unicode characters

token.text.len() measures bytes, not characters, so multibyte UTF-8 tokens may be
misclassified. Use chars().count() to compare character length consistently with the
filter's description.

src/config/src/utils/tantivy/tokenizer/remove_short.rs [21-23]

 impl<T> RemoveShortFilterStream<T> {
     fn predicate(&self, token: &Token) -> bool {
-        token.text.len() >= self.token_length_limit
+        token.text.chars().count() >= self.token_length_limit
     }
 }
Suggestion importance[1-10]: 6


Why: The suggestion aligns with the filter's description about UTF-8 but trades performance for semantic accuracy; depending on intent (bytes vs chars), this may or may not be desirable. It's a valid minor improvement with limited impact.

Impact: Low
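
A quick illustration of the byte-versus-character difference this suggestion refers to:

fn main() {
    let token = "héllo"; // 'é' occupies two bytes in UTF-8
    assert_eq!(token.len(), 6);           // byte length, what token.text.len() measures
    assert_eq!(token.chars().count(), 5); // character count, what the suggestion proposes
}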

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR introduces configurable token length filtering for Tantivy's inverted index to resolve critical memory issues that were causing panics when processing large parquet files. The core problem was that Tantivy uses int32 for memory addressing, limiting data pages to 4GB, but OpenObserve generates one segment file per parquet file. When processing very large parquet files, the excessive number of tokens (including single characters and very long strings) would cause Tantivy segments to exceed this 4GB limit, resulting in panics at memory_arena.rs:205:9.

The solution implements a two-pronged token filtering approach:

New Environment Variables:

  • ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2 - Filters out tokens shorter than 2 characters
  • ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64 - Filters out tokens longer than 64 characters (previously only long token filtering existed)
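
As a standalone illustration (not OpenObserve's actual config loader), these limits could be read with the documented defaults applied when the variables are unset:

use std::env;

fn token_length_limits() -> (usize, usize) {
    // Fall back to the defaults (2 and 64) when unset or unparsable.
    let min = env::var("ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(2);
    let max = env::var("ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(64);
    (min, max)
}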

Key Changes:

  1. Token Filtering: A new RemoveShortFilter is added in src/config/src/utils/tantivy/tokenizer/remove_short.rs that complements the existing RemoveLongFilter. Both filters are now applied in the tokenizer pipeline.

  2. Configuration: The config system now validates and sets sensible defaults for token length limits, ensuring minimum=2 and maximum=64 when values are unset.

  3. Tokenizer Integration: The o2_tokenizer_build() function in mod.rs now applies both min and max token length filters to both SimpleTokenizer and O2Tokenizer paths.

  4. Base64 Detection Threshold: The O2Tokenizer's base64 detection logic is updated to use the configurable max token length (MAX_TOKEN_LENGTH + 1 = 65) instead of a hardcoded 1024, creating more consistent behavior.

  5. File Name Generation: Several files now use ider::generate_file_name() instead of ider::generate() to provide enhanced uniqueness with a 4-character hex suffix, supporting the improved token processing workflow.
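
For point 5, a hypothetical sketch of the idea only; the helper names and the rand crate usage here are placeholders, and the actual ider.rs implementation may differ:

use rand::Rng;

// Placeholder standing in for ider::generate().
fn generate() -> String {
    "1234567890".to_string()
}

// generate_file_name(): the usual id plus a random 4-hex-character suffix,
// so two files produced at the same instant cannot share a name.
fn generate_file_name() -> String {
    let suffix: u16 = rand::thread_rng().gen();
    format!("{}_{:04x}", generate(), suffix)
}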

With these changes, OpenObserve can successfully process 5GB parquet files without hitting Tantivy's memory limits, significantly improving the system's ability to handle large datasets while maintaining search functionality.
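
And for point 1, a self-contained sketch of what such a short-token filter looks like against Tantivy's tokenizer traits, modeled on the built-in RemoveLongFilter; the PR's remove_short.rs may differ in detail:

use tantivy::tokenizer::{Token, TokenFilter, TokenStream, Tokenizer};

/// Removes tokens shorter than a given number of bytes.
#[derive(Clone)]
pub struct RemoveShortFilter {
    length_limit: usize,
}

impl RemoveShortFilter {
    /// Keep only tokens whose UTF-8 byte length is at least `length_limit`.
    pub fn limit(length_limit: usize) -> RemoveShortFilter {
        RemoveShortFilter { length_limit }
    }
}

impl TokenFilter for RemoveShortFilter {
    type Tokenizer<T: Tokenizer> = RemoveShortFilterWrapper<T>;

    fn transform<T: Tokenizer>(self, tokenizer: T) -> RemoveShortFilterWrapper<T> {
        RemoveShortFilterWrapper {
            length_limit: self.length_limit,
            inner: tokenizer,
        }
    }
}

#[derive(Clone)]
pub struct RemoveShortFilterWrapper<T: Tokenizer> {
    length_limit: usize,
    inner: T,
}

impl<T: Tokenizer> Tokenizer for RemoveShortFilterWrapper<T> {
    type TokenStream<'a> = RemoveShortFilterStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        RemoveShortFilterStream {
            token_length_limit: self.length_limit,
            tail: self.inner.token_stream(text),
        }
    }
}

pub struct RemoveShortFilterStream<T> {
    token_length_limit: usize,
    tail: T,
}

impl<T: TokenStream> TokenStream for RemoveShortFilterStream<T> {
    fn advance(&mut self) -> bool {
        // Skip tokens that are too short; positions of the kept tokens are not
        // rewritten, which is why gaps appear (see the "Position Gaps" note in
        // the reviewer guide above).
        while self.tail.advance() {
            if self.tail.token().text.len() >= self.token_length_limit {
                return true;
            }
        }
        false
    }

    fn token(&self) -> &Token {
        self.tail.token()
    }

    fn token_mut(&mut self) -> &mut Token {
        self.tail.token_mut()
    }
}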

Confidence score: 3/5

  • This PR addresses a critical production issue but has some implementation inconsistencies that need attention
  • Score reflects the importance of the fix but concerns about inconsistent tokenization behavior between different code paths
  • Pay close attention to the tokenizer module files, especially the inconsistency between o2_tokenizer_build and o2_collect_tokens functions

7 files reviewed, 5 comments


@hengfeiyang hengfeiyang merged commit 890a8d4 into main Sep 6, 2025
28 of 30 checks passed
@hengfeiyang hengfeiyang deleted the feat/tantivy-tokens branch September 6, 2025 14:28
