
Conversation


@YashodhanJoshi1 YashodhanJoshi1 commented Sep 11, 2024

This keeps the external search interface the same, but splits the internal interface into two parts: fetch ids, then fetch full file data by those ids. The reasoning behind this:

  • Currently we fetch the full file data while processing the search query on the leader.
  • However, when the file_list table is large, this takes a lot of time. Furthermore, sending this full data to the individual workers also has network and ser/de costs that increase with the number of items.
  • This PR therefore has the leader fetch only the ids (making the initial sql query faster) and send only those to the workers (reducing the network cost). Each worker then fetches the full file data only for the ids allocated to it.
  • The expectation is that because we only send ids, the network cost is reduced, and because the workers query their file data in parallel, the divided time should be less than the single big query done on the leader. Hence the net time for the search query should be reduced.
  • In testing, the db fetch time is reduced considerably: for example, fetching 35k records out of 450k drops from 102ms to 32ms. However, this is just the db fetch time; the end-to-end search time is still similar, as the data-searching part is unchanged, and at least in the test env no significant speed advantage is observed from offloading the data fetch to the workers.
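The two-phase flow described above can be sketched with stand-in types. The `fetch_ids`/`fetch_by_ids` functions and the in-memory "db" below are illustrative only, not the actual `query_ids`/`query_by_ids` implementations from this PR:

```rust
// Sketch of the two-phase fetch. In the real code the leader queries only
// ids and ships a Vec<i64> slice to each worker, which then hydrates the
// full records for its own slice.

#[derive(Debug, Clone, PartialEq)]
struct FileKey {
    id: i64,
    key: String,
    size: i64,
}

// Leader side: cheap query returning only ids (stand-in for query_ids).
fn fetch_ids(db: &[FileKey]) -> Vec<i64> {
    db.iter().map(|f| f.id).collect()
}

// Worker side: hydrate full records for the ids assigned to this worker
// (stand-in for query_by_ids).
fn fetch_by_ids(db: &[FileKey], ids: &[i64]) -> Vec<FileKey> {
    db.iter().filter(|f| ids.contains(&f.id)).cloned().collect()
}

fn main() {
    let db = vec![
        FileKey { id: 1, key: "files/a".into(), size: 10 },
        FileKey { id: 2, key: "files/b".into(), size: 20 },
        FileKey { id: 3, key: "files/c".into(), size: 30 },
    ];
    // Leader fetches ids only, then splits them between two workers.
    let ids = fetch_ids(&db);
    let (w1, w2) = ids.split_at(ids.len() / 2);
    // Each worker hydrates only its own slice, in parallel in the real system.
    let files1 = fetch_by_ids(&db, w1);
    let files2 = fetch_by_ids(&db, w2);
    assert_eq!(files1.len() + files2.len(), db.len());
    println!("worker1={} worker2={}", files1.len(), files2.len());
}
```

The point of the split is visible here: only `Vec<i64>` crosses the leader-to-worker boundary, while the larger `FileKey` records are materialized on the worker side.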

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new asynchronous method for retrieving file lists by IDs, enhancing file management capabilities with improved filtering based on specific criteria.
    • Enhanced search functionality to allow more granular control over file retrieval using index files.
    • Added performance monitoring for the search operation, logging the time taken to retrieve file lists.
  • Bug Fixes

    • Improved error handling and logic for processing work groups, reducing potential runtime issues.
  • Documentation

    • Updated message types and service definitions to reflect new querying capabilities and structural changes in the search request.


coderabbitai bot commented Sep 11, 2024

Walkthrough

The changes enhance the search functionality by introducing a new asynchronous method, get_file_list_by_ids, for retrieving FileKey objects based on specified IDs and filtering them according to the match_source criteria. The main search function's logic is revised to utilize a file_list_map, improving the handling of index files and file metadata. Additionally, a timing mechanism is implemented in the search_partition function to log the duration of file list retrieval.

Changes

File Path / Change Summary
  • src/service/search/grpc/mod.rs: Added get_file_list_by_ids function for retrieving FileKey objects by IDs, with enhanced logic to construct a file_list_map for improved filtering based on index files.
  • src/service/search/mod.rs: Introduced a timing mechanism in the search_partition function to log the duration of file list retrieval, and modified the step variable calculation to ensure compliance with the minimum step requirement.

Possibly related PRs

Suggested reviewers

  • hengfeiyang
  • oasisk

Signed-off-by: Yashodhan Joshi <[email protected]>
@hengfeiyang hengfeiyang self-requested a review September 11, 2024 12:00
@YashodhanJoshi1 YashodhanJoshi1 force-pushed the yj/file_list branch 2 times, most recently from c8b767f to 007455e on September 11, 2024 at 15:01

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
src/infra/src/file_list/sqlite.rs (1)

358-393: LGTM!

The query_by_ids method implementation looks good. It efficiently fetches file records by their IDs in chunks to handle SQLite's limitations.

Consider simplifying the code by using flat_map instead of iter and map:

-        Ok(ret
-            .iter()
-            .map(|r| {
-                (
-                    format!("files/{}/{}/{}", r.stream, r.date, r.file),
-                    r.into(),
-                )
-            })
-            .collect())
+        Ok(ret.into_iter().flat_map(|r| {
+            Some((format!("files/{}/{}/{}", r.stream, r.date, r.file), r.into()))
+        }).collect())
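The chunking mentioned above exists because SQLite caps the number of bound parameters per statement (historically 999; 32766 in newer builds), so an `IN (?, ?, …)` query over arbitrarily many ids must be issued in batches. A minimal sketch of that pattern, with an illustrative `run_query` closure standing in for the real sqlx query:

```rust
// Process ids in batches no larger than SQLite's bound-parameter limit,
// concatenating the results of each batch. The limit and helper name are
// illustrative; the real query_by_ids builds SQL with sqlx.
const SQLITE_MAX_PARAMS: usize = 999;

fn query_in_chunks<T>(ids: &[i64], mut run_query: impl FnMut(&[i64]) -> Vec<T>) -> Vec<T> {
    ids.chunks(SQLITE_MAX_PARAMS)
        .flat_map(|chunk| run_query(chunk))
        .collect()
}

fn main() {
    // Fake "query" that echoes the ids back, just to show the batching.
    let ids: Vec<i64> = (0..2500).collect();
    let mut batches = 0;
    let rows = query_in_chunks(&ids, |chunk| {
        batches += 1;
        chunk.to_vec()
    });
    assert_eq!(rows.len(), 2500);
    assert_eq!(batches, 3); // 999 + 999 + 502
    println!("batches={batches}");
}
```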
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4c5642e and da1c33c.

Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
Files selected for processing (7)
  • Cargo.toml (1 hunks)
  • src/infra/src/file_list/mod.rs (4 hunks)
  • src/infra/src/file_list/mysql.rs (1 hunks)
  • src/infra/src/file_list/postgres.rs (5 hunks)
  • src/infra/src/file_list/sqlite.rs (1 hunks)
  • src/service/search/cluster/mod.rs (16 hunks)
  • src/service/search/grpc/mod.rs (4 hunks)
Files skipped from review as they are similar to previous changes (3)
  • src/infra/src/file_list/mod.rs
  • src/infra/src/file_list/mysql.rs
  • src/infra/src/file_list/postgres.rs
Additional context used
Path-based instructions (3)
src/service/search/grpc/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/sqlite.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/search/cluster/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (10)
Cargo.toml (1)

20-20: LGTM!

The version update from "0.11.0" to "0.12.0" aligns with the semantic versioning convention for a minor release. This increment suggests the introduction of new features, enhancements, or bug fixes while maintaining backward compatibility.

src/service/search/grpc/mod.rs (2)

145-153: Refactoring to optimize search performance and flexibility looks good!

The changes to use file_ids instead of file_list and retrieve the file list separately using get_file_list_by_ids is a good refactoring to optimize the search process. It allows fetching only the required file data based on IDs, reducing unnecessary data transfer.

Additionally, filtering and merging the retrieved file list with segment IDs from idx_files provides flexibility to narrow down the search to specific file segments, enabling more granular control over the search scope.

Overall, these optimizations should improve search performance and flexibility.


379-425: The new get_file_list_by_ids function is a well-designed and efficient implementation for retrieving the required file list.

The function encapsulates the logic for file list retrieval and filtering based on provided IDs, index files, and additional criteria. This promotes code reusability and maintainability.

Some key benefits of this implementation:

  • Querying the file list by IDs using query_by_ids ensures that only the required files are retrieved, optimizing the data retrieval process.
  • Filtering and merging segment IDs from idx_file_list allows for fine-grained control over which file segments are included in the search.
  • Further filtering the file list based on the match_source criteria ensures that only relevant files are returned based on the specified stream type and partition keys.
  • Sorting the file list by min_ts and deduplicating ensures a consistent and unique set of files.

Overall, this function provides an optimized and flexible way to retrieve the required file list, enhancing the search functionality.
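The sort-and-dedup step described above can be sketched as follows. `FileKey` here is a stand-in with only the fields needed; sorting by `min_ts` is from the review, while the tie-break by key is an added assumption so that equal keys sort adjacently, which `dedup_by` (adjacent-only deduplication) requires:

```rust
#[derive(Debug, Clone, PartialEq)]
struct FileKey {
    key: String,
    min_ts: i64,
}

// Sort by min_ts, then drop adjacent entries with the same key. Vec::dedup_by
// only removes consecutive duplicates, so the tie-break by key guarantees
// duplicates end up next to each other before deduplication.
fn sort_and_dedup(mut files: Vec<FileKey>) -> Vec<FileKey> {
    files.sort_by(|a, b| a.min_ts.cmp(&b.min_ts).then_with(|| a.key.cmp(&b.key)));
    files.dedup_by(|a, b| a.key == b.key);
    files
}

fn main() {
    let files = vec![
        FileKey { key: "b".into(), min_ts: 2 },
        FileKey { key: "a".into(), min_ts: 1 },
        FileKey { key: "a".into(), min_ts: 1 }, // duplicate entry
    ];
    let out = sort_and_dedup(files);
    assert_eq!(out.len(), 2);
    assert_eq!(out[0].key, "a");
    println!("{:?}", out.iter().map(|f| f.key.as_str()).collect::<Vec<_>>());
}
```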

src/infra/src/file_list/sqlite.rs (1)

395-438: LGTM!

The query_ids method implementation looks good. It provides a flexible way to retrieve file IDs based on various filtering conditions.

The use of the time_range parameter and the configuration-based upper bound calculation for max_ts allows for fine-grained control over the file IDs being retrieved.

The function returns FileId objects, which encapsulate the necessary information about the files, making it convenient for further processing.

src/service/search/cluster/mod.rs (6)

148-161: LGTM!

The change from retrieving full file objects to only file IDs is a good optimization. The variable renaming and log message update are consistent with this change.


169-187: Looks good!

The variable renaming and log message update are consistent with the fact that the file list is now filtered by the inverted index. Storing the filtered file list in req.idx_files for later use is a sensible approach.


Line range hint 279-371: Excellent!

The updates to variable names, function parameters, and the req.file_ids field are all consistent with the switch from using full file objects to file IDs. This change should help reduce the amount of data transferred to worker nodes during the search process.


872-891: Nice work!

The new get_file_id_list function looks good. Retrieving only the file IDs is a smart optimization that can help improve search performance. The function handles errors appropriately and returns the file IDs wrapped in an Option for safety.


Line range hint 893-909: Looks good to me!

Updating the partition_file_by_bytes function to accept FileId objects instead of FileKey objects is a sensible change that keeps the function consistent with the rest of the codebase. The partitioning logic remains the same, so the function should still perform as expected.


Line range hint 913-944: Looks great!

Updating the partition_file_by_hash function to accept FileId objects instead of FileKey objects keeps the function consistent with the rest of the codebase. The partitioning logic is unchanged, so the function should still distribute the file IDs among the querier nodes as expected.
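Hash-based partitioning of this kind can be sketched as below. The bucket assignment and names are illustrative, not the actual partition_file_by_hash implementation; the property being relied on is only that the same id deterministically maps to the same querier node:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assign each file id to one of `nodes` buckets by hashing it, so a given id
// always lands on the same querier node.
fn partition_by_hash(ids: &[i64], nodes: usize) -> Vec<Vec<i64>> {
    let mut buckets = vec![Vec::new(); nodes];
    for &id in ids {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        let bucket = (h.finish() as usize) % nodes;
        buckets[bucket].push(id);
    }
    buckets
}

fn main() {
    let ids: Vec<i64> = (0..100).collect();
    let buckets = partition_by_hash(&ids, 4);
    // Every id is assigned exactly once.
    assert_eq!(buckets.iter().map(|b| b.len()).sum::<usize>(), 100);
    // Assignment is deterministic across calls.
    assert_eq!(buckets, partition_by_hash(&ids, 4));
    println!("bucket sizes: {:?}", buckets.iter().map(|b| b.len()).collect::<Vec<_>>());
}
```

Note that DefaultHasher is only stable within one process; a real cluster would need a hash that is stable across nodes and releases.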


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (1)
src/service/search/grpc/mod.rs (1)

385-432: Consider adding error handling and logging.

The new get_file_list_by_ids function performs several operations, such as querying files by IDs, filtering files based on index files and partition keys, and sorting and deduplicating the file list. Consider adding error handling and logging statements to help diagnose issues and monitor the function's performance.

Here's an example of how you can add error handling and logging:

+use tracing::info;

 #[tracing::instrument(skip_all, fields(org_id = sql.org_id, stream_name = sql.stream_name))]
 pub(crate) async fn get_file_list_by_ids(
     _trace_id: &str,
     ids: &[i64],
     sql: &super::sql::Sql,
     stream_type: StreamType,
     partition_keys: &[StreamPartition],
     idx_file_list: &[IdxFileName],
 ) -> (Vec<FileKey>, usize) {
     let start = std::time::Instant::now();
-    let file_list = query_by_ids(ids).await.unwrap_or_default();
+    let file_list = match query_by_ids(ids).await {
+        Ok(files) => files,
+        Err(e) => {
+            tracing::error!("Failed to query files by IDs: {}", e);
+            return (Vec::new(), 0);
+        }
+    };
+
+    info!("Retrieved {} files by IDs", file_list.len());
     // ... rest of the function ...
+    let elapsed = start.elapsed().as_millis() as usize;
+    info!("get_file_list_by_ids completed in {} ms", elapsed);
+    (files, elapsed)
 }
Review details


Commits

Files that changed from the base of the PR and between da1c33c and 934a728.

Files selected for processing (1)
  • src/service/search/grpc/mod.rs (4 hunks)
Additional context used
Path-based instructions (1)
src/service/search/grpc/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (5)
src/service/search/grpc/mod.rs (5)

25-25: LGTM!

The import of FileKey, StreamPartition, and StreamType from config::meta::stream is necessary for the new functionality.


34-34: LGTM!

The import of cluster_rpc and IdxFileName from proto is necessary for the new functionality.


38-38: LGTM!

The import of query_by_ids from file_list is necessary for the new functionality.


433-433: LGTM!

The ReleaseFileGuard struct and its Drop implementation ensure that the session data is cleared and WAL lock files are released when the guard goes out of scope. This is a good practice for resource management.


145-159: Verify the usage of get_file_list_by_ids in the codebase.

The new get_file_list_by_ids function is called to retrieve the file list based on the provided IDs, which is a significant change to the file retrieval logic. Ensure that this change is compatible with the rest of the codebase and does not introduce any unintended side effects.

Run the following script to verify the usage of get_file_list_by_ids:

Verification successful

To gather more information about the get_file_list_by_ids function and its potential impact on the codebase, I'll run additional scripts to examine its implementation and search for related functions and tests.


Based on the analysis of the get_file_list_by_ids function and its usage in the codebase, I can provide the final response:

The get_file_list_by_ids function appears to be compatible with the codebase, but lacks test coverage.

The function is only used once in the file where it's defined (src/service/search/grpc/mod.rs), which suggests it's unlikely to introduce unintended side effects on existing functionality. However, to ensure its correctness and maintainability, consider the following recommendations:

  • Add unit tests specifically for the get_file_list_by_ids function to validate its behavior and edge cases.
  • Document the function's purpose, parameters, and return value to improve code clarity for other developers.
  • If this function is intended to be used more widely in the future, consider moving it to a more central location in the codebase.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `get_file_list_by_ids` in the codebase.

# Test: Search for the function usage. Expect: Only valid occurrences.
rg --type rust -A 5 $'get_file_list_by_ids'

Length of output: 785


Script:

#!/bin/bash
# Description: Examine the implementation of get_file_list_by_ids and search for related functions and tests

# Get the full implementation of get_file_list_by_ids
echo "Implementation of get_file_list_by_ids:"
ast-grep --lang rust --pattern $'pub(crate) async fn get_file_list_by_ids($_) {
  $$$
}'

# Search for other file retrieval or processing functions
echo -e "\nOther file retrieval or processing functions:"
rg --type rust -g '!*test*' 'fn.*file.*' -A 2

# Search for relevant test files
echo -e "\nRelevant test files:"
fd -e rs test | xargs rg 'file_list|get_file'

Length of output: 49651


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (1)
src/service/search/mod.rs (1)

269-275: Improve logging by using structured logging.

Instead of manually constructing the log message, consider using structured logging to make it easier to parse and analyze the logs later. For example:

tracing::info!(
    trace_id,
    time_range = ?meta.meta.time_range,
    num_files = files.len(),
    took_ms = file_list_took,
    "search_partition: get file_list completed"
);

This allows logging backends and tools to more easily extract and query the individual fields.

Review details


Commits

Files that changed from the base of the PR and between 934a728 and f5f51d1.

Files selected for processing (2)
  • src/service/search/grpc/mod.rs (4 hunks)
  • src/service/search/mod.rs (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/service/search/grpc/mod.rs
Additional context used
Path-based instructions (1)
src/service/search/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (1)
src/service/search/mod.rs (1)

240-240: LGTM!

Capturing the start time using std::time::Instant::now() is a good way to measure the elapsed time for performance monitoring.
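The timing pattern in question is the standard monotonic-clock measurement; a self-contained sketch (the workload and log message are placeholders, not the actual search_partition code):

```rust
use std::time::Instant;

fn main() {
    // Measure elapsed wall-clock time around a unit of work, as the
    // search_partition change does around the file-list fetch.
    let start = Instant::now();
    let n: u64 = (1..=1_000).sum(); // placeholder workload
    let took_ms = start.elapsed().as_millis();
    assert_eq!(n, 500_500);
    println!("file_list fetch took {took_ms} ms");
}
```

Instant is monotonic, so it is preferable to SystemTime for durations: it cannot go backwards if the wall clock is adjusted mid-measurement.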

