@hengfeiyang
Contributor

When we search for the following, the result comes back as not found:

match_all('INFO actix_web::middleware::logger: 10.1.62.162 "POST /api/quebec/file/_json HTTP/1.1" 200')

The problem is that during indexing we remove tokens whose length is less than 2 or more than 64, but we did not apply the same filter when searching.
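
A minimal sketch of the mismatch, written in plain Rust rather than against the real tantivy pipeline. The helper names (`tokenize`, `keep_by_length`, `matches_all`, `demo`) are hypothetical, the split-on-non-alphanumeric rule is only a stand-in for the actual tokenizer, and length is counted in characters as an assumption. It shows why the single-character token `1` from `10.1.62.162` and `HTTP/1.1` makes the query miss when only the index side filters by length.

```rust
use std::collections::HashSet;

/// Split on non-alphanumeric characters and lowercase.
/// Assumption: a simplified stand-in for the real tokenizer.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric())
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Keep only tokens whose length in characters is within [min, max].
fn keep_by_length(tokens: Vec<String>, min: usize, max: usize) -> Vec<String> {
    tokens
        .into_iter()
        .filter(|t| {
            let n = t.chars().count();
            n >= min && n <= max
        })
        .collect()
}

/// Assumed match_all semantics: every query token must exist in the index.
fn matches_all(index: &HashSet<String>, query_tokens: &[String]) -> bool {
    query_tokens.iter().all(|t| index.contains(t))
}

/// Returns (found_before_fix, found_after_fix) for the log line in this PR.
fn demo() -> (bool, bool) {
    let line = r#"INFO actix_web::middleware::logger: 10.1.62.162 "POST /api/quebec/file/_json HTTP/1.1" 200"#;

    // Index side: tokens shorter than 2 or longer than 64 are dropped.
    let index: HashSet<String> = keep_by_length(tokenize(line), 2, 64).into_iter().collect();

    // Search side before the fix: no length filter, so "1" is still a query token.
    let before = matches_all(&index, &tokenize(line));

    // Search side after the fix: the same length filter as indexing.
    let after = matches_all(&index, &keep_by_length(tokenize(line), 2, 64));

    (before, after)
}
```

Under these assumptions `demo()` reports a miss for the unfiltered query and a hit once the query tokens go through the same length filter as the indexed ones.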

@github-actions github-actions bot added the ☢️ Bug Something isn't working label Oct 28, 2025
@github-actions
Contributor

Failed to generate code suggestions for PR

Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Overview

Greptile Summary

Fixed a critical tokenizer inconsistency where o2_collect_tokens wasn't applying the same length filters (min 2 chars, max 64 chars) that are applied during indexing, causing search queries with single-character tokens to fail with "not found" results.

Key Changes:

  • Replaced direct instantiation of SimpleTokenizer/O2Tokenizer in o2_collect_tokens with a call to o2_tokenizer_build()
  • Now both indexing and search operations use identical tokenization logic with RemoveShortFilter and RemoveLongFilter
  • Ensures queries like match_all('INFO actix_web...') work correctly by filtering single-char tokens during search, matching the index behavior
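
The shape of this change can be sketched as a single builder that both the index path and the search path call. This is an illustration only, not the code in `mod.rs`: `TokenFilter`, `Analyzer`, `build_analyzer`, `index_tokens`, and `search_tokens` are hypothetical names, `RemoveShort` and `RemoveLong` only mimic the filters named above, and the limits (2 and 64, counted in characters) are taken from the PR description.

```rust
/// A token filter decides whether a token survives the pipeline.
trait TokenFilter {
    fn keep(&self, token: &str) -> bool;
}

/// Drops tokens shorter than `min_len` characters.
struct RemoveShort {
    min_len: usize,
}

/// Drops tokens longer than `max_len` characters.
struct RemoveLong {
    max_len: usize,
}

impl TokenFilter for RemoveShort {
    fn keep(&self, token: &str) -> bool {
        token.chars().count() >= self.min_len
    }
}

impl TokenFilter for RemoveLong {
    fn keep(&self, token: &str) -> bool {
        token.chars().count() <= self.max_len
    }
}

/// A tokenizer plus its filter chain.
struct Analyzer {
    filters: Vec<Box<dyn TokenFilter>>,
}

impl Analyzer {
    /// Split on non-alphanumeric characters, apply every filter, lowercase.
    fn tokens(&self, text: &str) -> Vec<String> {
        text.split(|c: char| !c.is_alphanumeric())
            .filter(|t| !t.is_empty())
            .filter(|t| self.filters.iter().all(|f| f.keep(t)))
            .map(|t| t.to_lowercase())
            .collect()
    }
}

/// The one place where the pipeline is configured
/// (hypothetical stand-in for o2_tokenizer_build()).
fn build_analyzer() -> Analyzer {
    let filters: Vec<Box<dyn TokenFilter>> = vec![
        Box::new(RemoveShort { min_len: 2 }),
        Box::new(RemoveLong { max_len: 64 }),
    ];
    Analyzer { filters }
}

/// Index path and search path both go through the same builder,
/// so they cannot drift apart.
fn index_tokens(text: &str) -> Vec<String> {
    build_analyzer().tokens(text)
}

fn search_tokens(text: &str) -> Vec<String> {
    build_analyzer().tokens(text)
}
```

Because both paths share one builder, a later change to the length limits would apply to indexing and searching together instead of having to be made in two places.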

Impact:

  • Resolves search failures when queries contain single-character tokens
  • Maintains consistency between index creation and search query processing

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk
  • The fix is a simple, well-targeted change that solves a clear tokenizer inconsistency bug. It replaces manual tokenizer instantiation with a call to the existing o2_tokenizer_build() function, ensuring search and indexing use identical token processing. The change is minimal (7 lines removed, 1 line added), has no side effects, and improves correctness without introducing new complexity or risks.
  • No files require special attention

Important Files Changed

File Analysis

| Filename | Score | Overview |
| --- | --- | --- |
| src/config/src/utils/tantivy/tokenizer/mod.rs | 5/5 | Fixed tokenizer consistency issue where o2_collect_tokens wasn't applying length filters during search, causing single-character token mismatches |

Sequence Diagram

sequenceDiagram
    participant User
    participant SearchAPI
    participant o2_collect_tokens
    participant o2_tokenizer_build
    participant RemoveShortFilter
    participant RemoveLongFilter
    participant Index

    User->>SearchAPI: match_all('INFO actix_web...')
    SearchAPI->>o2_collect_tokens: Tokenize search query
    
    Note over o2_collect_tokens: BEFORE: Used SimpleTokenizer/O2Tokenizer directly<br/>(no length filtering)
    Note over o2_collect_tokens: AFTER: Uses o2_tokenizer_build()<br/>(applies length filtering)
    
    o2_collect_tokens->>o2_tokenizer_build: Build tokenizer with filters
    o2_tokenizer_build->>RemoveShortFilter: Apply min_token_length >= 2
    RemoveShortFilter->>RemoveLongFilter: Apply max_token_length <= 64
    RemoveLongFilter-->>o2_collect_tokens: Configured tokenizer
    
    o2_collect_tokens->>o2_collect_tokens: Process tokens with filters
    Note over o2_collect_tokens: Single char tokens removed
    
    o2_collect_tokens-->>SearchAPI: Filtered tokens
    SearchAPI->>Index: Query with filtered tokens
    Index-->>User: Consistent results

1 file reviewed, no comments

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: 9d56927

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 341 | 0 | 19 | 6 | 93% | 4m 39s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: 927bc0d

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: 1223564

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 344 | 0 | 19 | 3 | 94% | 4m 57s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: 1223564

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 346 | 0 | 19 | 1 | 95% | 4m 38s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: 175f468

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: 175f468

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 346 | 0 | 19 | 1 | 95% | 4m 39s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: haohuaijin | Branch: fix/tantivy-search | Commit: 03dc0ea

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 345 | 0 | 19 | 2 | 94% | 4m 39s |

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/tantivy-search | Commit: ee5540e

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| All tests passed | 366 | 343 | 0 | 19 | 4 | 94% | 7m 5s |


@hengfeiyang hengfeiyang merged commit 848c604 into main Oct 29, 2025
31 of 32 checks passed
@hengfeiyang hengfeiyang deleted the fix/tantivy-search branch October 29, 2025 23:45
uddhavdave added a commit that referenced this pull request Nov 13, 2025
uddhavdave added a commit that referenced this pull request Dec 10, 2025
Labels

☢️ Bug Something isn't working

3 participants