Skip to content

Conversation

@yaga-simha
Copy link
Contributor

@yaga-simha yaga-simha commented Sep 10, 2025

User description

PR type

Chore / Code improvement

Description

We are addressing Greptile's complaint about Path::canonnicalize negating the async benefits and suggested tokio::fs::cannonicalize. Now all path canonnicalizations are performed asynchronously.

create_wal_dir_datetime_filter is now refactored into async_file.rs where its companion function scan_file_filtered() exists.


PR Type

Enhancement


Description

  • Switch to async canonicalization with Tokio

  • Defer canonicalization after filtering files

  • Batch resolve paths via join_all

  • Preserve limit before async canonicalization


Diagram Walkthrough

flowchart LR
  A["Dir entries stream"] -- "filter files" --> B["PathBuf list (uncanonicalized)"]
  B -- "apply limit (optional)" --> C["Bounded list"]
  C -- "tokio::fs::canonicalize (join_all)" --> D["Canonical paths (String)"]
Loading

File Walkthrough

Relevant files
Enhancement
async_file.rs
Replace blocking canonicalize with async batched canonicalize

src/config/src/utils/async_file.rs

  • Remove blocking PathBuf::canonicalize in walker.
  • Collect uncanonicalized files, then async canonicalize.
  • Use futures::future::join_all with tokio::fs::canonicalize.
  • Preserve string conversion and error filtering.
+10/-8   

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Summary

This PR improves async performance by replacing synchronous Path::canonicalize() with asynchronous tokio::fs::canonicalize() in the scan_files_filtered function within src/config/src/utils/async_file.rs. The change addresses a performance bottleneck where blocking path canonicalization operations were negating the benefits of async file scanning.

The refactor separates the file collection phase from the canonicalization phase. Instead of canonicalizing paths synchronously during iteration, the function now:

  1. Collects all uncanonicalized paths first into a Vec<PathBuf>
  2. Performs all canonicalization operations concurrently using futures::future::join_all() with tokio::fs::canonicalize()
  3. Processes the results, filtering out any canonicalization failures

This approach maintains the same functionality while being more efficient in async contexts by avoiding thread pool blocking operations and leveraging concurrent processing of canonicalization tasks. The change fits well with the codebase's async-first architecture and improves resource utilization during file system operations.

Confidence score: 3/5

  • This PR introduces a potential issue with silent error handling that could lead to incomplete results without indication
  • Score reflects the performance improvement benefits but concerns about error handling robustness in a critical file operation
  • Pay close attention to the error handling logic in async_file.rs around lines 213-217

1 file reviewed, no comments

Edit Code Review Bot Settings | Greptile

@github-actions github-actions bot changed the title chore: Update canonnicalization step for scan_files_filtered to be more async chore: Update canonnicalization step for scan_files_filtered to be more async Sep 10, 2025
@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Possible Logic Change

Canonicalization is now deferred until after applying the optional limit. This changes behavior from "limit canonicalized files" to "canonicalize after limiting uncanonicalized paths," which may yield fewer valid results if some later canonicalizations fail or if symlinks/files disappear. Confirm that this behavior change is intended and covered by tests.

let uncanonicalized_paths: Vec<PathBuf> = if let Some(limit_count) = limit {
    walker.take(limit_count).collect().await
} else {
    walker.collect().await
};

let files =
    futures::future::join_all(uncanonicalized_paths.iter().map(tokio::fs::canonicalize))
        .await
        .into_iter()
        .map(Result::ok)
        .filter_map(|path| path.and_then(|pbuf| pbuf.to_str().map(String::from)))
        .collect();

Ok(files)
Error Handling

Errors from tokio::fs::canonicalize are silently dropped via Result::ok, potentially masking IO/path issues. Consider logging, counting failures, or returning a structured error if all canonicalizations fail.

futures::future::join_all(uncanonicalized_paths.iter().map(tokio::fs::canonicalize))
    .await
    .into_iter()
    .map(Result::ok)
    .filter_map(|path| path.and_then(|pbuf| pbuf.to_str().map(String::from)))
    .collect();
Large Batch Concurrency

join_all over the full Vec may spawn many concurrent canonicalize ops with no bound. For very large inputs this can cause IO pressure or memory spikes. Consider using a bounded concurrency strategy (e.g., FuturesUnordered with a limit or buffer_unordered).

futures::future::join_all(uncanonicalized_paths.iter().map(tokio::fs::canonicalize))
    .await
    .into_iter()

@hengfeiyang
Copy link
Contributor

@yaga-simha Please also move wal_dir_datetime_filter_builder to async_file, it is belongs to scan_file_filtered

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Move owned paths into tasks

Use into_iter() instead of iter() when feeding canonicalize to avoid cloning and
ensure owned PathBufs are moved into the async tasks, preventing accidental borrows
across await points.

src/config/src/utils/async_file.rs [213-219]

 let files =
-    futures::future::join_all(uncanonicalized_paths.iter().map(tokio::fs::canonicalize))
+    futures::future::join_all(uncanonicalized_paths.into_iter().map(tokio::fs::canonicalize))
         .await
         .into_iter()
         .map(Result::ok)
         .filter_map(|path| path.and_then(|pbuf| pbuf.to_str().map(String::from)))
         .collect();
Suggestion importance[1-10]: 8

__

Why: Using into_iter() to move PathBufs into canonicalize avoids unnecessary borrowing/cloning and potential lifetime issues across awaits, matching ownership expectations for spawned futures.

Medium
Avoid short-lived references

Consider returning Some(pb.to_path_buf()) to avoid holding a reference to pb that
may not live long enough outside the closure. This prevents potential lifetime
issues when the collected values outlive the closure scope.

src/config/src/utils/async_file.rs [203]

-if pb.is_file() { Some(pb) } else { None }
+if pb.is_file() { Some(pb.to_path_buf()) } else { None }
Suggestion importance[1-10]: 7

__

Why: Changing Some(pb) to Some(pb.to_path_buf()) avoids moving/borrowing a &Path that may not live long enough and aligns with later .iter() usage. It’s a reasonable safety and clarity improvement with moderate impact.

Medium
General
Preserve non-UTF8 file paths

Avoid calling to_str() and String::from on paths, which may drop files with non-UTF8
names. Preserve all paths by collecting OsStrings (or keep PathBufs) and convert to
string later if required.

src/config/src/utils/async_file.rs [213-219]

-let files =
-    futures::future::join_all(uncanonicalized_paths.iter().map(tokio::fs::canonicalize))
+let files: Vec<PathBuf> =
+    futures::future::join_all(uncanonicalized_paths.into_iter().map(tokio::fs::canonicalize))
         .await
         .into_iter()
-        .map(Result::ok)
-        .filter_map(|path| path.and_then(|pbuf| pbuf.to_str().map(String::from)))
+        .filter_map(Result::ok)
         .collect();
Suggestion importance[1-10]: 6

__

Why: Avoiding to_str() prevents loss of non-UTF8 paths and collecting PathBufs is more robust. However, this changes the function’s return type/behavior, so impact depends on intended API.

Low

@yaga-simha
Copy link
Contributor Author

@yaga-simha Please also move wal_dir_datetime_filter_builder to async_file, it is belongs to scan_file_filtered

I thought about this before sending the PR. I agree with you. But this helper should belong in a general purpose file_path utility as this is not an async specific helper. In fact, it has more to do with wal paths than even utils/files too. Please suggest a better place.

@yaga-simha yaga-simha force-pushed the fix/scan_file_filtered_asyn_canonnicalize branch from 503d07c to c99240b Compare September 10, 2025 11:40
@yaga-simha yaga-simha force-pushed the fix/scan_file_filtered_asyn_canonnicalize branch from c99240b to fb169f8 Compare September 10, 2025 11:44
@hengfeiyang hengfeiyang merged commit 1957495 into main Sep 10, 2025
28 checks passed
@hengfeiyang hengfeiyang deleted the fix/scan_file_filtered_asyn_canonnicalize branch September 10, 2025 12:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants