Skip to content

Conversation

@yaga-simha
Copy link
Contributor

PR Type

Enhancement

Description

The User expressed concern in issue #7876 about unacceptable time cost for a simple "last 15 min" search against 130K files. This is suspected to be partly caused by config::utils::file::scan_files which gathers all the nested paths without filtering over time range from the query.

This PR introduces scan_files_filtered - an async_walkdir based path aggregator that pre-filters directory paths with a parameterized filter. This new function takes a root path and a predicate that will be supplied to WalkDir which will iterate over the folder paths in relatively breadth first order - allowing us to cull unnecessary search paths for nested directories that don't qualify the predicate supplied as filter.

Tests

There is a single test case that matches over a general case of paths - which I believe is robust enough to capture the common usage cases. [Rigorous tests can be added but a review can help trim down implementation]

AI

Used Gemini to add documentation not code.

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR introduces a significant performance optimization for WAL (Write-Ahead Log) parquet file scanning by implementing time-range-aware directory filtering. The key enhancement is the addition of two new functions in src/config/src/utils/file.rs:

  1. wal_dir_datetime_filter_builder: Creates datetime-based path filters that can parse directory structures with embedded dates (YYYY/MM/DD/HH format) and determine if directories fall within a specified time range
  2. scan_files_filtered: An async directory walker that applies filters during traversal, allowing early pruning of directory branches that don't match the criteria

The core problem being solved is that the existing scan_files function was gathering all nested file paths before applying any filtering, causing 8+ second delays when searching through 130K+ files even for recent data queries like "last 15 minutes". The new implementation leverages OpenObserve's hierarchical date-based WAL storage structure to skip entire directory trees that fall outside the query time range.

In src/service/search/grpc/wal.rs, the get_file_list_inner function now uses scan_files_filtered instead of the original scan_files, applying datetime filters when time ranges are present in the query. When no time range is specified, it falls back to an unfiltered scan, maintaining backward compatibility.

The filtering logic works by parsing date components from directory paths, handling partial paths gracefully, and using breadth-first traversal to make early pruning decisions. This approach is particularly effective for time-series data storage patterns where files are organized chronologically in the filesystem.

Confidence score: 4/5

  • This PR addresses a clear performance bottleneck with a well-architected solution that should significantly improve search times
  • The implementation correctly handles edge cases like partial date paths and maintains backward compatibility
  • The async implementation and proper error handling demonstrate good engineering practices, though the PathBuf cloning in the async context could be monitored for performance impact

2 files reviewed, no comments

Edit Code Review Bot Settings | Greptile

@hengfeiyang
Copy link
Contributor

I tested this branch, it can't search data, always return empty. please test it again.

@hengfeiyang
Copy link
Contributor

One tips: for local test with wal parquets. you can set the env bellow, it will never move wal files.

ZO_FILE_PUSH_INTERVAL = 10000000000

@yaga-simha yaga-simha force-pushed the feature/improved-file-scan-util branch from 6eac407 to ee58fde Compare September 2, 2025 05:17
start_time: DateTime<Utc>,
end_time: DateTime<Utc>,
extension_pattern: String,
skip_count: usize,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one suggestion, how about we change the skip_count to root_path? skip_count is difficult to understand what is this mean. and i see the user case always use root_path to generate.

Copy link
Contributor Author

@yaga-simha yaga-simha Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Context

The code needs to handle a tricky issue with the Write-Ahead Log (WAL) path. The full path is structured like this:

<WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>/<thread_id>/<YY>/<MM>/<DD>/<HH>/<timestamped_filename>.parquet

The challenge is that while we know the skippable root path (<WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>), the <thread_id> segment is a variable that is only discovered during the path traversal.

To filter correctly, we must skip one additional segment (the thread_id) beyond the known root path.

Why this is difficult

The current filter mechanism expects a known, constant path to skip. We have two options, neither of which feels ideal:

1. Introduce a Placeholder:

We could pass a placeholder like .../stream_name/thread_id to the filter. While this works because the filter only cares about the path segments, it's misleading. There's no constant named thread_id in the path; it's a dynamic value. This feels inauthentic and could confuse future developers.

2. Hardcode the 'Plus One':

A cleaner, more direct approach is to hardcode a +1 skip count. Since this code is specifically for WAL paths, we can make the assumption that we always need to skip one additional segment for the thread_id.

This second option is the better choice. It's more explicit about the specific logic for WAL paths, avoiding the need for a fake placeholder. It's a pragmatic solution that prioritizes clarity and avoids misrepresenting the path structure.

@yaga-simha yaga-simha force-pushed the feature/improved-file-scan-util branch 2 times, most recently from aa58b53 to 951eb8a Compare September 3, 2025 06:42
@yaga-simha yaga-simha force-pushed the feature/improved-file-scan-util branch from 951eb8a to 5843632 Compare September 3, 2025 07:07
@yaga-simha
Copy link
Contributor Author

More rebases with main - unrelated tests were failing.

…ckly generated and not matching seconds and ns can give wrong results
@yaga-simha
Copy link
Contributor Author

I tested this branch, it can't search data, always return empty. please test it again.

I have figured out how I have sabotaged myself with the way my phantom log are uploaded. Add to that I was matching for only hourly boundary skipping over minutes and seconds which is critically needed for comparing date time.

Ex:

  • A wal folder with 2025/09/03/19/... represents datetime=2025-09-03T19:00:00.000.
  • But this folder can contain log files with their timestamps above the start_time=2025-09-03T19:29:15.130.
  • But a naive check like date time >= start_time will fail because we constructed the date time from the path without the minutes and seconds details.

So, I fixed it buy attaching the minutes, seconds and nanoseconds from start_time and appending them to the date time computed from the path in hopes that, in case the date time's year, month, day and hour may match the start time's and would fall short because of missing minutes, seconds and nanoseconds.

I have tested this with a renewed log generator. Here is a snippet from debugging session.

image

@hengfeiyang
Copy link
Contributor

@yaga-simha Great, i will try again. also can simple set minute, seconds to zero and only compare Y/M/D/H

@hengfeiyang hengfeiyang merged commit af1b104 into openobserve:main Sep 4, 2025
23 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants