feat: A better "Improve scan file" for wal parquet search #7876 #8242

yaga-simha · 2025-09-01T18:54:14Z

PR Type

Enhancement

Description

The User expressed concern in issue #7876 about unacceptable time cost for a simple "last 15 min" search against 130K files. This is suspected to be partly caused by config::utils::file::scan_files which gathers all the nested paths without filtering over time range from the query.

This PR introduces scan_files_filtered - an async_walkdir based path aggregator that pre-filters directory paths with a parameterized filter. This new function takes a root path and a predicate that will be supplied to WalkDir which will iterate over the folder paths in relatively breadth first order - allowing us to cull unnecessary search paths for nested directories that don't qualify the predicate supplied as filter.

Tests

There is a single test case that matches over a general case of paths - which I believe is robust enough to capture the common usage cases. [Rigorous tests can be added but a review can help trim down implementation]

AI

Used Gemini to add documentation not code.

greptile-apps

Greptile Summary

This PR introduces a significant performance optimization for WAL (Write-Ahead Log) parquet file scanning by implementing time-range-aware directory filtering. The key enhancement is the addition of two new functions in src/config/src/utils/file.rs:

wal_dir_datetime_filter_builder: Creates datetime-based path filters that can parse directory structures with embedded dates (YYYY/MM/DD/HH format) and determine if directories fall within a specified time range
scan_files_filtered: An async directory walker that applies filters during traversal, allowing early pruning of directory branches that don't match the criteria

The core problem being solved is that the existing scan_files function was gathering all nested file paths before applying any filtering, causing 8+ second delays when searching through 130K+ files even for recent data queries like "last 15 minutes". The new implementation leverages OpenObserve's hierarchical date-based WAL storage structure to skip entire directory trees that fall outside the query time range.

In src/service/search/grpc/wal.rs, the get_file_list_inner function now uses scan_files_filtered instead of the original scan_files, applying datetime filters when time ranges are present in the query. When no time range is specified, it falls back to an unfiltered scan, maintaining backward compatibility.

The filtering logic works by parsing date components from directory paths, handling partial paths gracefully, and using breadth-first traversal to make early pruning decisions. This approach is particularly effective for time-series data storage patterns where files are organized chronologically in the filesystem.

Confidence score: 4/5

This PR addresses a clear performance bottleneck with a well-architected solution that should significantly improve search times
The implementation correctly handles edge cases like partial date paths and maintains backward compatibility
The async implementation and proper error handling demonstrate good engineering practices, though the PathBuf cloning in the async context could be monitored for performance impact

_{2 files reviewed, no comments}

_{Edit Code Review Bot Settings | Greptile}

src/config/src/utils/file.rs

hengfeiyang · 2025-09-02T03:45:01Z

I tested this branch, it can't search data, always return empty. please test it again.

hengfeiyang · 2025-09-02T03:46:19Z

One tips: for local test with wal parquets. you can set the env bellow, it will never move wal files.

ZO_FILE_PUSH_INTERVAL = 10000000000

hengfeiyang · 2025-09-02T12:53:38Z

src/config/src/utils/file.rs

+    start_time: DateTime<Utc>,
+    end_time: DateTime<Utc>,
+    extension_pattern: String,
+    skip_count: usize,


one suggestion, how about we change the skip_count to root_path? skip_count is difficult to understand what is this mean. and i see the user case always use root_path to generate.

Context

The code needs to handle a tricky issue with the Write-Ahead Log (WAL) path. The full path is structured like this:

<WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>/<thread_id>/<YY>/<MM>/<DD>/<HH>/<timestamped_filename>.parquet

The challenge is that while we know the skippable root path (<WAL_ROOT>/files/<org_id>/<stream_type>/<stream_name>), the <thread_id> segment is a variable that is only discovered during the path traversal.

To filter correctly, we must skip one additional segment (the thread_id) beyond the known root path.

Why this is difficult

The current filter mechanism expects a known, constant path to skip. We have two options, neither of which feels ideal:

1. Introduce a Placeholder:

We could pass a placeholder like .../stream_name/thread_id to the filter. While this works because the filter only cares about the path segments, it's misleading. There's no constant named thread_id in the path; it's a dynamic value. This feels inauthentic and could confuse future developers.

2. Hardcode the 'Plus One':

A cleaner, more direct approach is to hardcode a +1 skip count. Since this code is specifically for WAL paths, we can make the assumption that we always need to skip one additional segment for the thread_id.

This second option is the better choice. It's more explicit about the specific logic for WAL paths, avoiding the need for a fake placeholder. It's a pragmatic solution that prioritizes clarity and avoids misrepresenting the path structure.

…etime ranges and adjust path prefix skip_count for thread_id segment in WAL paths.

yaga-simha · 2025-09-03T07:08:37Z

More rebases with main - unrelated tests were failing.

…ckly generated and not matching seconds and ns can give wrong results

yaga-simha · 2025-09-03T19:59:32Z

I tested this branch, it can't search data, always return empty. please test it again.

I have figured out how I have sabotaged myself with the way my phantom log are uploaded. Add to that I was matching for only hourly boundary skipping over minutes and seconds which is critically needed for comparing date time.

Ex:

A wal folder with 2025/09/03/19/... represents datetime=2025-09-03T19:00:00.000.
But this folder can contain log files with their timestamps above the start_time=2025-09-03T19:29:15.130.
But a naive check like date time >= start_time will fail because we constructed the date time from the path without the minutes and seconds details.

So, I fixed it buy attaching the minutes, seconds and nanoseconds from start_time and appending them to the date time computed from the path in hopes that, in case the date time's year, month, day and hour may match the start time's and would fall short because of missing minutes, seconds and nanoseconds.

I have tested this with a renewed log generator. Here is a snippet from debugging session.

hengfeiyang · 2025-09-04T01:16:45Z

@yaga-simha Great, i will try again. also can simple set minute, seconds to zero and only compare Y/M/D/H

github-actions bot added ✏️ Feature 🧹 Updates labels Sep 1, 2025

greptile-apps bot reviewed Sep 1, 2025

View reviewed changes

yaga-simha commented Sep 1, 2025

View reviewed changes

src/config/src/utils/file.rs Show resolved Hide resolved

hengfeiyang requested review from hengfeiyang and uddhavdave September 2, 2025 03:45

yaga-simha force-pushed the feature/improved-file-scan-util branch from 6eac407 to ee58fde Compare September 2, 2025 05:17

hengfeiyang reviewed Sep 2, 2025

View reviewed changes

yaga-simha force-pushed the feature/improved-file-scan-util branch 2 times, most recently from aa58b53 to 951eb8a Compare September 3, 2025 06:42

yaga-simha added 6 commits September 3, 2025 12:37

Basic WalkDir timerange filter over WAL directories

989aa78

Adding tests to validate base case for scan_files_filtered

d64a8f4

Add documentation for scan_files_filtered.

ad3df58

Remove unnecessary/incorrect comment and code - mea culpa

0907036

Improve wal_dir_datetime_filter for early escape while evaluating dat…

dd9de3e

…etime ranges and adjust path prefix skip_count for thread_id segment in WAL paths.

Make test paths look more close to real usecase paths

5843632

yaga-simha force-pushed the feature/improved-file-scan-util branch from 951eb8a to 5843632 Compare September 3, 2025 07:07

Fix bug with datetime evaluation - parquet files can be extremely qui…

411a89f

…ckly generated and not matching seconds and ns can give wrong results

Merge branch 'main' into feature/improved-file-scan-util

1395cf1

hengfeiyang approved these changes Sep 4, 2025

View reviewed changes

Merge branch 'main' into feature/improved-file-scan-util

47ea42a

hengfeiyang merged commit af1b104 into openobserve:main Sep 4, 2025
23 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat: A better "Improve scan file" for wal parquet search #7876 #8242

feat: A better "Improve scan file" for wal parquet search #7876 #8242

Uh oh!

yaga-simha commented Sep 1, 2025

Uh oh!

greptile-apps bot left a comment

Uh oh!

Uh oh!

hengfeiyang commented Sep 2, 2025

Uh oh!

hengfeiyang commented Sep 2, 2025

Uh oh!

hengfeiyang Sep 2, 2025

Uh oh!

yaga-simha Sep 3, 2025 •

edited

Loading

Uh oh!

yaga-simha commented Sep 3, 2025

Uh oh!

yaga-simha commented Sep 3, 2025

Uh oh!

hengfeiyang commented Sep 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

feat: A better "Improve scan file" for wal parquet search #7876 #8242

feat: A better "Improve scan file" for wal parquet search #7876 #8242

Uh oh!

Conversation

yaga-simha commented Sep 1, 2025

PR Type

Description

Tests

AI

Uh oh!

greptile-apps bot left a comment

Choose a reason for hiding this comment

Greptile Summary

Confidence score: 4/5

Uh oh!

Uh oh!

hengfeiyang commented Sep 2, 2025

Uh oh!

hengfeiyang commented Sep 2, 2025

Uh oh!

hengfeiyang Sep 2, 2025

Choose a reason for hiding this comment

Uh oh!

yaga-simha Sep 3, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Context

Why this is difficult

1. Introduce a Placeholder:

2. Hardcode the 'Plus One':

Uh oh!

yaga-simha commented Sep 3, 2025

Uh oh!

yaga-simha commented Sep 3, 2025

Uh oh!

hengfeiyang commented Sep 4, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yaga-simha Sep 3, 2025 •

edited

Loading