
Conversation

@hengfeiyang
Contributor

@hengfeiyang hengfeiyang commented Sep 5, 2025

User description

Add two new ENV:

ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2
ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64

The previous version only removed tokens longer than 64 characters; it did not remove short tokens.

Now we remove both very short and very long tokens.

Why we were getting errors like this:

thread 'job_runtime' panicked at /index.crates.io-1949cf8c6b5b557f/tantivy-stacker-0.3.0/src/memory_arena.rs:205:9:

Tantivy uses an int32 for memory addresses, so it cannot build a data page larger than 4GB. Because OpenObserve generates only one segment file per parquet file, a very large parquet file produces a huge number of tokens; eventually the Tantivy segment exceeds that limit and panics.

This PR improves token generation and will automatically remove:

  • single-character tokens (shorter than 2 characters)
  • large tokens longer than 64 characters

With these changes we can successfully generate a 5GB parquet file and its Tantivy index file.
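
For reference, a minimal sketch of the resulting analyzer pipeline. The constants and the import path for RemoveShortFilter are assumptions for illustration; the real code reads both limits from the config (see the reviewer excerpt below):

use tantivy::tokenizer::{RemoveLongFilter, SimpleTokenizer, TextAnalyzer};
// Assumed import path for the new filter introduced in this PR.
use config::utils::tantivy::tokenizer::remove_short::RemoveShortFilter;

// Defaults mirroring the new environment variables.
const MIN_TOKEN_LENGTH: usize = 2; // ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH
const MAX_TOKEN_LENGTH: usize = 64; // ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH

fn build_analyzer() -> TextAnalyzer {
    TextAnalyzer::builder(SimpleTokenizer::default())
        // keep only tokens of at least MIN_TOKEN_LENGTH characters
        .filter(RemoveShortFilter::limit(MIN_TOKEN_LENGTH))
        // drop overly long tokens (base64 blobs, hashes, etc.)
        .filter(RemoveLongFilter::limit(MAX_TOKEN_LENGTH))
        .build()
}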


PR Type

Enhancement, Bug fix


Description

  • Add min/max token length configs

  • Implement short-token removal filter

  • Apply token limits in analyzers

  • Fix compactor file naming collisions


Diagram Walkthrough

flowchart LR
  CFG["Config: add min/max token length + validation"] -- "used by" --> TOK["Tokenizer builder: apply RemoveShort + RemoveLong"]
  TOK -- "new filter" --> RSF["RemoveShortFilter"]
  O2["O2 Tokenizer tests"] -- "adjust thresholds" --> TOK
  ID["ider: generate_file_name()"] -- "use in" --> JOB["job/files: storage file naming"]
  ID -- "use in" --> CMP["compactor: merged file naming"]

File Walkthrough

Relevant files

Enhancement (4 files)
  • config.rs: Add token length configs and validation hook (+36/-6)
  • ider.rs: Introduce unique file name generator (+7/-0)
  • mod.rs: Wire min/max token filters into analyzers (+18/-7)
  • remove_short.rs: Implement RemoveShortFilter and tests (+120/-0)

Tests (1 file)
  • o2_tokenizer.rs: Align base64 threshold with max token length; test update (+3/-3)

Bug fix (2 files)
  • mod.rs: Use new filename generator; remove ad-hoc suffix (+2/-3)
  • merge.rs: Use new filename generator for merged outputs (+2/-2)

@github-actions
Contributor

github-actions bot commented Sep 5, 2025

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🧪 PR contains tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Config Bounds

The max token length is derived using max(config, 64). This prevents lowering the limit below 64 via config, which may be intended but contradicts the new env var description. Confirm desired behavior or switch to clamping (min/max) semantics.

pub fn o2_tokenizer_build() -> TextAnalyzer {
    let cfg = get_config();
    let min_token_length =
        std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
    let max_token_length =
        std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
    if cfg.common.inverted_index_camel_case_tokenizer_disabled {
        tantivy::tokenizer::TextAnalyzer::builder(SimpleTokenizer::default())
            .filter(RemoveShortFilter::limit(min_token_length))
            .filter(tantivy::tokenizer::RemoveLongFilter::limit(
                max_token_length,
            ))
Position Gaps

Removing short tokens preserves original token positions, resulting in non-sequential positions (e.g., first token can have position 1). Validate that downstream components expect gaps and that scoring/phrase queries in Tantivy behave as desired.

fn token_stream_helper(text: &str) -> Vec<Token> {
    let mut a = TextAnalyzer::builder(SimpleTokenizer::default())
        .filter(RemoveShortFilter::limit(2))
        .build();
    let mut token_stream = a.token_stream(text);
    let mut tokens: Vec<Token> = vec![];
    let mut add_token = |token: &Token| {
        tokens.push(token.clone());
    };
    token_stream.process(&mut add_token);
    tokens
}

#[test]
fn test_remove_short() {
    let tokens = token_stream_helper("i am a hello tantivy, happy searching!");
    assert_eq!(tokens.len(), 5);
    assert_token(&tokens[0], 1, "am", 2, 4);
    assert_token(&tokens[1], 3, "hello", 7, 12);
    assert_token(&tokens[2], 4, "tantivy", 13, 20);
    assert_token(&tokens[3], 5, "happy", 22, 27);
    assert_token(&tokens[4], 6, "searching", 28, 37);
Base64 Threshold

MIN_BASE64_LENGTH tied to MAX_TOKEN_LENGTH+1 drastically lowers the prior 1024 threshold; verify this doesn’t reintroduce noisy base64-like tokens and aligns with performance goals.

fn looks_like_base64(s: &str) -> bool {
    // Base64 characteristics:
    // 1. Only contains [A-Za-z0-9], the tokenizer will remove all non-alphanumeric characters
    // 2. Should be reasonably long (to avoid false positives)
    const MIN_BASE64_LENGTH: usize = super::MAX_TOKEN_LENGTH + 1; // Adjust this threshold as needed

    if s.len() < MIN_BASE64_LENGTH {
        return false;

@github-actions
Contributor

github-actions bot commented Sep 5, 2025

PR Code Suggestions ✨

Explore these optional code suggestions:

Possible issue
Enforce min-max consistency

Ensure min_token_length never exceeds max_token_length; otherwise all tokens could
be dropped. After clamping individual values, enforce min <= max by adjusting one
side if needed.

src/config/src/utils/tantivy/tokenizer/mod.rs [30-33]

-let min_token_length =
-    std::cmp::max(cfg.limit.inverted_index_min_token_length, MIN_TOKEN_LENGTH);
-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let mut min_token_length = cfg
+    .limit
+    .inverted_index_min_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+let mut max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
+if min_token_length > max_token_length {
+    min_token_length = max_token_length;
+}
Suggestion importance[1-10]: 8


Why: Ensuring min_token_length <= max_token_length prevents pathological configurations that could drop all tokens. The improved code accurately derives from the new hunk and meaningfully hardens behavior.

Impact: Medium
Clamp token length bounds

Using max here ignores smaller configured limits and can allow overly long tokens.
Clamp the configured value between sensible min and max bounds, and use it directly
without inflating beyond MAX_TOKEN_LENGTH.

src/config/src/utils/tantivy/tokenizer/mod.rs [33]

-let max_token_length =
-    std::cmp::max(cfg.limit.inverted_index_max_token_length, MAX_TOKEN_LENGTH);
+let max_token_length = cfg
+    .limit
+    .inverted_index_max_token_length
+    .clamp(MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH);
Suggestion importance[1-10]: 7


Why: The existing code uses max with MAX_TOKEN_LENGTH, which can exceed the intended upper bound; clamping to [MIN_TOKEN_LENGTH, MAX_TOKEN_LENGTH] is a reasonable fix. It improves correctness by honoring configured limits without allowing overly long tokens, though it's not critical.

Impact: Medium
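
For reference, Rust's clamp keeps a value inside an inclusive range, which is the semantics both clamping suggestions above rely on:

fn main() {
    // usize::clamp(min, max) pins a value into [min, max]
    assert_eq!(1usize.clamp(2, 64), 2);    // below range: raised to the minimum
    assert_eq!(100usize.clamp(2, 64), 64); // above range: lowered to the maximum
    assert_eq!(16usize.clamp(2, 64), 16);  // in range: unchanged
}
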
General
Count Unicode characters

token.text.len() measures bytes, not characters, so multibyte UTF-8 tokens may be
misclassified. Use chars().count() to compare character length consistently with the
filter's description.

src/config/src/utils/tantivy/tokenizer/remove_short.rs [21-23]

 impl<T> RemoveShortFilterStream<T> {
     fn predicate(&self, token: &Token) -> bool {
-        token.text.len() >= self.token_length_limit
+        token.text.chars().count() >= self.token_length_limit
     }
 }
Suggestion importance[1-10]: 6


Why: The suggestion aligns with the filter's description about UTF-8 but trades performance for semantic accuracy; depending on intent (bytes vs chars), this may or may not be desirable. It's a valid minor improvement with limited impact.

Impact: Low
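
A quick illustration of the byte-versus-character difference this suggestion refers to:

fn main() {
    let token = "héllo"; // 'é' occupies two bytes in UTF-8
    assert_eq!(token.len(), 6);           // byte length, what token.text.len() measures
    assert_eq!(token.chars().count(), 5); // character count, what the suggestion proposes
}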

Contributor

@greptile-apps greptile-apps bot left a comment


Greptile Summary

This PR introduces configurable token length filtering for Tantivy's inverted index to resolve critical memory issues that were causing panics when processing large parquet files. The core problem was that Tantivy uses int32 for memory addressing, limiting data pages to 4GB, but OpenObserve generates one segment file per parquet file. When processing very large parquet files, the excessive number of tokens (including single characters and very long strings) would cause Tantivy segments to exceed this 4GB limit, resulting in panics at memory_arena.rs:205:9.

The solution implements a two-pronged token filtering approach:

New Environment Variables:

  • ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH=2 - Filters out tokens shorter than 2 characters
  • ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH=64 - Filters out tokens longer than 64 characters (previously only long token filtering existed)
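
As a standalone illustration (not OpenObserve's actual config loader), these limits could be read with the documented defaults applied when the variables are unset:

use std::env;

fn token_length_limits() -> (usize, usize) {
    // Fall back to the defaults (2 and 64) when unset or unparsable.
    let min = env::var("ZO_INVERTED_INDEX_MIN_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(2);
    let max = env::var("ZO_INVERTED_INDEX_MAX_TOKEN_LENGTH")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(64);
    (min, max)
}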

Key Changes:

  1. Token Filtering: A new RemoveShortFilter is added in src/config/src/utils/tantivy/tokenizer/remove_short.rs that complements the existing RemoveLongFilter. Both filters are now applied in the tokenizer pipeline.

  2. Configuration: The config system now validates and sets sensible defaults for token length limits, ensuring minimum=2 and maximum=64 when values are unset.

  3. Tokenizer Integration: The o2_tokenizer_build() function in mod.rs now applies both min and max token length filters to both SimpleTokenizer and O2Tokenizer paths.

  4. Base64 Detection Threshold: The O2Tokenizer's base64 detection logic is updated to use the configurable max token length (MAX_TOKEN_LENGTH + 1 = 65) instead of a hardcoded 1024, creating more consistent behavior.

  5. File Name Generation: Several files now use ider::generate_file_name() instead of ider::generate() to provide enhanced uniqueness with a 4-character hex suffix, supporting the improved token processing workflow.
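
For point 5, a hypothetical sketch of the idea only; the helper names and the rand crate usage here are placeholders, and the actual ider.rs implementation may differ:

use rand::Rng;

// Placeholder standing in for ider::generate().
fn generate() -> String {
    "1234567890".to_string()
}

// generate_file_name(): the usual id plus a random 4-hex-character suffix,
// so two files produced at the same instant cannot share a name.
fn generate_file_name() -> String {
    let suffix: u16 = rand::thread_rng().gen();
    format!("{}_{:04x}", generate(), suffix)
}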

With these changes, OpenObserve can successfully process 5GB parquet files without hitting Tantivy's memory limits, significantly improving the system's ability to handle large datasets while maintaining search functionality.
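
And for point 1, a self-contained sketch of what such a short-token filter looks like against Tantivy's tokenizer traits, modeled on the built-in RemoveLongFilter; the PR's remove_short.rs may differ in detail:

use tantivy::tokenizer::{Token, TokenFilter, TokenStream, Tokenizer};

/// Removes tokens shorter than a given number of bytes.
#[derive(Clone)]
pub struct RemoveShortFilter {
    length_limit: usize,
}

impl RemoveShortFilter {
    /// Keep only tokens whose UTF-8 byte length is at least `length_limit`.
    pub fn limit(length_limit: usize) -> RemoveShortFilter {
        RemoveShortFilter { length_limit }
    }
}

impl TokenFilter for RemoveShortFilter {
    type Tokenizer<T: Tokenizer> = RemoveShortFilterWrapper<T>;

    fn transform<T: Tokenizer>(self, tokenizer: T) -> RemoveShortFilterWrapper<T> {
        RemoveShortFilterWrapper {
            length_limit: self.length_limit,
            inner: tokenizer,
        }
    }
}

#[derive(Clone)]
pub struct RemoveShortFilterWrapper<T: Tokenizer> {
    length_limit: usize,
    inner: T,
}

impl<T: Tokenizer> Tokenizer for RemoveShortFilterWrapper<T> {
    type TokenStream<'a> = RemoveShortFilterStream<T::TokenStream<'a>>;

    fn token_stream<'a>(&'a mut self, text: &'a str) -> Self::TokenStream<'a> {
        RemoveShortFilterStream {
            token_length_limit: self.length_limit,
            tail: self.inner.token_stream(text),
        }
    }
}

pub struct RemoveShortFilterStream<T> {
    token_length_limit: usize,
    tail: T,
}

impl<T: TokenStream> TokenStream for RemoveShortFilterStream<T> {
    fn advance(&mut self) -> bool {
        // Skip tokens that are too short; positions of the kept tokens are not
        // rewritten, which is why gaps appear (see the "Position Gaps" note in
        // the reviewer guide above).
        while self.tail.advance() {
            if self.tail.token().text.len() >= self.token_length_limit {
                return true;
            }
        }
        false
    }

    fn token(&self) -> &Token {
        self.tail.token()
    }

    fn token_mut(&mut self) -> &mut Token {
        self.tail.token_mut()
    }
}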

Confidence score: 3/5

  • This PR addresses a critical production issue but has some implementation inconsistencies that need attention
  • Score reflects the importance of the fix but concerns about inconsistent tokenization behavior between different code paths
  • Pay close attention to the tokenizer module files, especially the inconsistency between o2_tokenizer_build and o2_collect_tokens functions

7 files reviewed, 5 comments


@hengfeiyang hengfeiyang merged commit 890a8d4 into main Sep 6, 2025
28 of 30 checks passed
@hengfeiyang hengfeiyang deleted the feat/tantivy-tokens branch September 6, 2025 14:28
