fix: search not found with single char #8904
Conversation
Greptile Overview
Greptile Summary
Fixed a critical tokenizer inconsistency where o2_collect_tokens wasn't applying the same length filters (min 2 chars, max 64 chars) that are applied during indexing, causing search queries with single-character tokens to fail with "not found" results.
Key Changes:
- Replaced direct instantiation of `SimpleTokenizer`/`O2Tokenizer` in `o2_collect_tokens` with a call to `o2_tokenizer_build()`
- Both indexing and search operations now use identical tokenization logic with `RemoveShortFilter` and `RemoveLongFilter` (sketched below, after the Impact list)
- Ensures queries like `match_all('INFO actix_web...')` work correctly by filtering single-character tokens during search, matching the index behavior
Impact:
- Resolves search failures when queries contain single-character tokens
- Maintains consistency between index creation and search query processing
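For illustration, here is a minimal sketch of the shared-tokenizer approach, assuming tantivy's builder API. The function names mirror the PR, but the bodies are assumptions rather than the repo's actual code; tantivy ships `RemoveLongFilter`, while `RemoveShortFilter` appears to be custom to this repo, so the minimum-length check is approximated inline:

```rust
use tantivy::tokenizer::{RemoveLongFilter, SimpleTokenizer, TextAnalyzer};

// Sketch: one builder shared by indexing and search.
fn o2_tokenizer_build() -> TextAnalyzer {
    TextAnalyzer::builder(SimpleTokenizer::default())
        // RemoveLongFilter keeps tokens *shorter* than the limit,
        // so limit(65) keeps tokens of at most 64 bytes (assumed cutoff).
        .filter(RemoveLongFilter::limit(65))
        .build()
}

// Sketch: the search path now uses the same builder, so length
// filtering matches index-time behavior.
fn o2_collect_tokens(text: &str) -> Vec<String> {
    let mut analyzer = o2_tokenizer_build();
    let mut stream = analyzer.token_stream(text);
    let mut tokens = Vec::new();
    while let Some(token) = stream.next() {
        // Minimum length of 2 chars, standing in for the repo's
        // custom RemoveShortFilter.
        if token.text.chars().count() >= 2 {
            tokens.push(token.text.to_lowercase());
        }
    }
    tokens
}
```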
Confidence Score: 5/5
- This PR is safe to merge with minimal risk
- The fix is a simple, well-targeted change that solves a clear tokenizer inconsistency bug. It replaces manual tokenizer instantiation with a call to the existing `o2_tokenizer_build()` function, ensuring search and indexing use identical token processing. The change is minimal (7 lines removed, 1 line added), has no side effects, and improves correctness without introducing new complexity or risks.
- No files require special attention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/config/src/utils/tantivy/tokenizer/mod.rs | 5/5 | Fixed tokenizer consistency issue where o2_collect_tokens wasn't applying length filters during search, causing single-character token mismatches |
Sequence Diagram
```mermaid
sequenceDiagram
    participant User
    participant SearchAPI
    participant o2_collect_tokens
    participant o2_tokenizer_build
    participant RemoveShortFilter
    participant RemoveLongFilter
    participant Index

    User->>SearchAPI: match_all('INFO actix_web...')
    SearchAPI->>o2_collect_tokens: Tokenize search query
    Note over o2_collect_tokens: BEFORE: Used SimpleTokenizer/O2Tokenizer directly<br/>(no length filtering)
    Note over o2_collect_tokens: AFTER: Uses o2_tokenizer_build()<br/>(applies length filtering)
    o2_collect_tokens->>o2_tokenizer_build: Build tokenizer with filters
    o2_tokenizer_build->>RemoveShortFilter: Apply min_token_length >= 2
    RemoveShortFilter->>RemoveLongFilter: Apply max_token_length <= 64
    RemoveLongFilter-->>o2_collect_tokens: Configured tokenizer
    o2_collect_tokens->>o2_collect_tokens: Process tokens with filters
    Note over o2_collect_tokens: Single char tokens removed
    o2_collect_tokens-->>SearchAPI: Filtered tokens
    SearchAPI->>Index: Query with filtered tokens
    Index-->>User: Consistent results
```
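Assuming the sketch above, a quick check of the post-fix behavior on a query like the one in the diagram (note that `SimpleTokenizer` splits on non-alphanumeric characters, so `actix_web` becomes two tokens; the repo's `O2Tokenizer` may split differently):

```rust
#[test]
fn search_tokens_match_index_filtering() {
    // Single-char tokens ("1", "a") are dropped at search time,
    // mirroring what was stored in the index.
    let tokens = o2_collect_tokens("INFO actix_web 1 a");
    assert_eq!(tokens, vec!["info", "actix", "web"]);
}
```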
1 file reviewed, no comments
Test results across successive CI runs:

| Run | Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|---|
| 1 | All tests passed | 366 | 341 | 0 | 19 | 6 | 93% | 4m 39s |
| 2 | All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |
| 3 | All tests passed | 366 | 344 | 0 | 19 | 3 | 94% | 4m 57s |
| 4 | All tests passed | 366 | 346 | 0 | 19 | 1 | 95% | 4m 38s |
| 5 | All tests passed | 366 | 342 | 0 | 19 | 5 | 93% | 4m 39s |
| 6 | All tests passed | 366 | 346 | 0 | 19 | 1 | 95% | 4m 39s |
| 7 | All tests passed | 366 | 345 | 0 | 19 | 2 | 94% | 4m 39s |
| 8 | All tests passed | 366 | 343 | 0 | 19 | 4 | 94% | 7m 5s |
merge to rc9 Co-authored-by: Hengfei Yang <[email protected]>
When we search for this, the result comes back "not found":
The problem is that at index time we remove tokens whose length is less than 2 or greater than 64, but we didn't apply the same filtering at search time.
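To make the failure concrete, here is a hedged sketch of the pre-fix search path (simplified; per the review above, the actual code instantiated `SimpleTokenizer`/`O2Tokenizer` directly):

```rust
use tantivy::tokenizer::{SimpleTokenizer, TextAnalyzer};

// BEFORE (sketch): search-side tokens were collected without any
// length filtering, so single-char tokens leaked into the query.
fn o2_collect_tokens_old(text: &str) -> Vec<String> {
    let mut analyzer = TextAnalyzer::builder(SimpleTokenizer::default()).build();
    let mut stream = analyzer.token_stream(text);
    let mut tokens = Vec::new();
    while let Some(token) = stream.next() {
        tokens.push(token.text.to_lowercase());
    }
    tokens
}
```

For a document containing "worker a", the index stores only ["worker"] because the index-time filters drop "a", yet the old search path produced ["worker", "a"]; a query requiring every token to match therefore came back "not found". Routing search through `o2_tokenizer_build()` makes both sides produce ["worker"].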