feat: A better "Improve scan file" for wal parquet search #7876 #8242
Conversation
Greptile Summary
This PR introduces a significant performance optimization for WAL (Write-Ahead Log) parquet file scanning by implementing time-range-aware directory filtering. The key enhancement is the addition of two new functions in src/config/src/utils/file.rs:
- wal_dir_datetime_filter_builder: creates datetime-based path filters that parse directory structures with embedded dates (YYYY/MM/DD/HH format) and determine whether a directory falls within a specified time range
- scan_files_filtered: an async directory walker that applies filters during traversal, allowing early pruning of directory branches that don't match the criteria
The core problem being solved is that the existing scan_files function was gathering all nested file paths before applying any filtering, causing 8+ second delays when searching through 130K+ files even for recent data queries like "last 15 minutes". The new implementation leverages OpenObserve's hierarchical date-based WAL storage structure to skip entire directory trees that fall outside the query time range.
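A minimal sketch, not the PR's actual code, of the kind of predicate such a builder could produce, assuming chrono for date handling; the function name dir_in_range, the overlap semantics, and the expectation that skip_count covers the fixed path prefix (plus the thread_id level discussed later in this thread) are illustrative assumptions:

```rust
use chrono::{DateTime, Duration, TimeZone, Utc};
use std::path::Path;

/// Conservative filter sketch: prune a directory only when its trailing
/// YYYY/MM/DD/HH segments are fully present and that hour bucket falls
/// outside [start, end); keep partial or unparsable paths so traversal
/// can continue and decide deeper in the tree.
fn dir_in_range(path: &Path, start: DateTime<Utc>, end: DateTime<Utc>, skip_count: usize) -> bool {
    // Numeric segments after the known prefix are assumed to be Y/M/D/H.
    let nums: Vec<u32> = path
        .iter()
        .skip(skip_count)
        .filter_map(|s| s.to_str()?.parse::<u32>().ok())
        .collect();
    if nums.len() < 4 {
        return true; // partial date path: cannot decide yet, keep walking
    }
    let (y, m, d, h) = (nums[0] as i32, nums[1], nums[2], nums[3]);
    let Some(bucket) = Utc.with_ymd_and_hms(y, m, d, h, 0, 0).single() else {
        return true; // invalid date: keep rather than silently drop
    };
    // Keep the directory if its one-hour bucket overlaps the query range.
    bucket < end && bucket + Duration::hours(1) > start
}
```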
In src/service/search/grpc/wal.rs, the get_file_list_inner function now uses scan_files_filtered instead of the original scan_files, applying datetime filters when time ranges are present in the query. When no time range is specified, it falls back to an unfiltered scan, maintaining backward compatibility.
The filtering logic works by parsing date components from directory paths, handling partial paths gracefully, and using breadth-first traversal to make early pruning decisions. This approach is particularly effective for time-series data storage patterns where files are organized chronologically in the filesystem.
Confidence score: 4/5
- This PR addresses a clear performance bottleneck with a well-architected solution that should significantly improve search times
- The implementation correctly handles edge cases like partial date paths and maintains backward compatibility
- The async implementation and proper error handling demonstrate good engineering practices, though the PathBuf cloning in the async context could be monitored for performance impact
2 files reviewed, no comments
I tested this branch; it can't search data and always returns empty. Please test it again.
One tip: for local testing with WAL parquet files, you can set the env below; it will prevent the WAL files from ever being moved.
6eac407 to ee58fde (Compare)
start_time: DateTime<Utc>,
end_time: DateTime<Utc>,
extension_pattern: String,
skip_count: usize,
One suggestion: how about we change skip_count to root_path? It is difficult to understand what skip_count means, and I see the call sites always use root_path to generate it.
Context
The code needs to handle a tricky issue with the Write-Ahead Log (WAL) path. The full path is structured like this:
<WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>/<thread_id>/<YY>/<MM>/<DD>/<HH>/<timestamped_filename>.parquet
The challenge is that while we know the skippable root path (<WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>), the <thread_id> segment is a variable that is only discovered during the path traversal.
To filter correctly, we must skip one additional segment (the thread_id) beyond the known root path.
Why this is difficult
The current filter mechanism expects a known, constant path to skip. We have two options, neither of which feels ideal:
1. Introduce a Placeholder:
We could pass a placeholder like .../stream_name/thread_id to the filter. While this works because the filter only cares about the path segments, it's misleading. There's no constant named thread_id in the path; it's a dynamic value. This feels inauthentic and could confuse future developers.
2. Hardcode the 'Plus One':
A cleaner, more direct approach is to hardcode a +1 skip count. Since this code is specifically for WAL paths, we can make the assumption that we always need to skip one additional segment for the thread_id.
This second option is the better choice. It's more explicit about the specific logic for WAL paths, avoiding the need for a fake placeholder. It's a pragmatic solution that prioritizes clarity and avoids misrepresenting the path structure.
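For illustration only, a tiny sketch of how the "+1 for the thread_id" could be derived from a known root path; the helper name skip_count_for and the example paths are hypothetical, not the PR's code:

```rust
use std::path::Path;

// The caller knows <WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>,
// but the date segments only start after one more, dynamic, <thread_id> level.
fn skip_count_for(root: &Path) -> usize {
    root.iter().count() + 1 // known root segments, plus one for thread_id
}

fn main() {
    // Assumed WAL root "data/wal" for the example.
    let root = Path::new("data/wal/files/default/logs/app1");
    assert_eq!(skip_count_for(root), 7); // 6 root segments + thread_id

    // Skipping 7 segments of a full path leaves the YYYY/MM/DD/HH components.
    let full = Path::new("data/wal/files/default/logs/app1/0/2024/05/17/13/file.parquet");
    let date_segs: Vec<_> = full.iter().skip(skip_count_for(root)).take(4).collect();
    println!("{:?}", date_segs); // ["2024", "05", "17", "13"]
}
```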
aa58b53 to 951eb8a (Compare)
…etime ranges and adjust path prefix skip_count for thread_id segment in WAL paths.
951eb8a to 5843632 (Compare)
More rebases with main; unrelated tests were failing.
…ckly generated and not matching seconds and ns can give wrong results
@yaga-simha Great, I will try again. Also, we can simply set minutes and seconds to zero and only compare Y/M/D/H.
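For reference, one way the "compare only Y/M/D/H" idea could be realized with chrono's DurationRound; this is a sketch of the suggestion, not necessarily what the PR implements:

```rust
use chrono::{DateTime, Duration, DurationRound, TimeZone, Utc};

// Truncate a timestamp to the start of its hour so that minutes, seconds,
// and nanoseconds cannot affect the comparison.
fn hour_bucket(dt: DateTime<Utc>) -> DateTime<Utc> {
    dt.duration_trunc(Duration::hours(1)).expect("in-range timestamp")
}

fn main() {
    let a = Utc.with_ymd_and_hms(2024, 5, 17, 13, 7, 42).unwrap();
    let b = Utc.with_ymd_and_hms(2024, 5, 17, 13, 59, 1).unwrap();
    // Both fall into the 13:00 bucket, so sub-hour differences are ignored.
    assert_eq!(hour_bucket(a), hour_bucket(b));
}
```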

PR Type
Enhancement
Description
The user expressed concern in issue #7876 about the unacceptable time cost of a simple "last 15 min" search against 130K files. This is suspected to be partly caused by config::utils::file::scan_files, which gathers all nested paths without filtering by the query's time range.
This PR introduces scan_files_filtered, an async_walkdir-based path aggregator that pre-filters directory paths with a parameterized filter. The new function takes a root path and a predicate that is supplied to WalkDir, which iterates over the folder paths in roughly breadth-first order, allowing us to cull unnecessary search paths for nested directories that do not satisfy the supplied predicate. A minimal sketch of this pattern is shown below.
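A minimal sketch of such a pre-filtering walk, assuming the async_walkdir crate's filter hook and the futures crate for stream iteration; the function name collect_parquet_filtered, its signature, and the hard-coded "parquet" extension check are illustrative, not the PR's actual API:

```rust
use async_walkdir::{Filtering, WalkDir};
use futures::StreamExt;
use std::path::{Path, PathBuf};

// Walk `root`, pruning whole directory subtrees the predicate rejects,
// and collect the surviving parquet files.
async fn collect_parquet_filtered<F>(root: &str, keep_dir: F) -> Vec<PathBuf>
where
    F: Fn(&Path) -> bool + Copy + Send + Sync + 'static,
{
    let mut out = Vec::new();
    let mut entries = WalkDir::new(root).filter(move |entry| async move {
        // Only directories are filtered here; files are judged by extension below.
        let is_dir = entry.file_type().await.map(|t| t.is_dir()).unwrap_or(false);
        if is_dir && !keep_dir(&entry.path()) {
            return Filtering::IgnoreDir; // skip this branch entirely
        }
        Filtering::Continue
    });
    while let Some(res) = entries.next().await {
        if let Ok(entry) = res {
            let path = entry.path();
            if path.extension().is_some_and(|ext| ext == "parquet") {
                out.push(path);
            }
        }
    }
    out
}
```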
Tests
There is a single test case that matches over a general set of paths, which I believe is robust enough to capture the common usage cases. [Rigorous tests can be added, but a review can help trim down the implementation.]
AI
Used Gemini to add documentation, not code.