-
Notifications
You must be signed in to change notification settings - Fork 715
feat: optimise file list querying for better search perf #4518
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
WalkthroughThe changes enhance the search functionality by introducing a new asynchronous method, Changes
Possibly related PRs
Suggested reviewers
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
c8b767f to
007455e
Compare
Signed-off-by: Yashodhan Joshi <[email protected]>
007455e to
7254f2c
Compare
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
Signed-off-by: Yashodhan Joshi <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
src/infra/src/file_list/sqlite.rs (1)
358-393: LGTM!The
query_by_idsmethod implementation looks good. It efficiently fetches file records by their IDs in chunks to handle SQLite's limitations.Consider simplifying the code by using
flat_mapinstead ofiterandmap:- Ok(ret - .iter() - .map(|r| { - ( - format!("files/{}/{}/{}", r.stream, r.date, r.file), - r.into(), - ) - }) - .collect()) + Ok(ret.into_iter().flat_map(|r| { + Some((format!("files/{}/{}/{}", r.stream, r.date, r.file), r.into())) + }).collect())
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
Files selected for processing (7)
- Cargo.toml (1 hunks)
- src/infra/src/file_list/mod.rs (4 hunks)
- src/infra/src/file_list/mysql.rs (1 hunks)
- src/infra/src/file_list/postgres.rs (5 hunks)
- src/infra/src/file_list/sqlite.rs (1 hunks)
- src/service/search/cluster/mod.rs (16 hunks)
- src/service/search/grpc/mod.rs (4 hunks)
Files skipped from review as they are similar to previous changes (3)
- src/infra/src/file_list/mod.rs
- src/infra/src/file_list/mysql.rs
- src/infra/src/file_list/postgres.rs
Additional context used
Path-based instructions (3)
src/service/search/grpc/mod.rs (1)
Pattern
**/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestionsMake sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.
src/infra/src/file_list/sqlite.rs (1)
Pattern
**/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestionsMake sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.
src/service/search/cluster/mod.rs (1)
Pattern
**/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestionsMake sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.
Additional comments not posted (10)
Cargo.toml (1)
20-20: LGTM!The version update from "0.11.0" to "0.12.0" aligns with the semantic versioning convention for a minor release. This increment suggests the introduction of new features, enhancements, or bug fixes while maintaining backward compatibility.
src/service/search/grpc/mod.rs (2)
145-153: Refactoring to optimize search performance and flexibility looks good!The changes to use
file_idsinstead offile_listand retrieve the file list separately usingget_file_list_by_idsis a good refactoring to optimize the search process. It allows fetching only the required file data based on IDs, reducing unnecessary data transfer.Additionally, filtering and merging the retrieved file list with segment IDs from
idx_filesprovides flexibility to narrow down the search to specific file segments, enabling more granular control over the search scope.Overall, these optimizations should improve search performance and flexibility.
379-425: The newget_file_list_by_idsfunction is a well-designed and efficient implementation for retrieving the required file list.The function encapsulates the logic for file list retrieval and filtering based on provided IDs, index files, and additional criteria. This promotes code reusability and maintainability.
Some key benefits of this implementation:
- Querying the file list by IDs using
query_by_idsensures that only the required files are retrieved, optimizing the data retrieval process.- Filtering and merging segment IDs from
idx_file_listallows for fine-grained control over which file segments are included in the search.- Further filtering the file list based on the
match_sourcecriteria ensures that only relevant files are returned based on the specified stream type and partition keys.- Sorting the file list by
min_tsand deduplicating ensures a consistent and unique set of files.Overall, this function provides an optimized and flexible way to retrieve the required file list, enhancing the search functionality.
src/infra/src/file_list/sqlite.rs (1)
395-438: LGTM!The
query_idsmethod implementation looks good. It provides a flexible way to retrieve file IDs based on various filtering conditions.The use of the
time_rangeparameter and the configuration-based upper bound calculation formax_tsallows for fine-grained control over the file IDs being retrieved.The function returns
FileIdobjects, which encapsulate the necessary information about the files, making it convenient for further processing.src/service/search/cluster/mod.rs (6)
148-161: LGTM!The change from retrieving full file objects to only file IDs is a good optimization. The variable renaming and log message update are consistent with this change.
169-187: Looks good!The variable renaming and log message update are consistent with the fact that the file list is now filtered by the inverted index. Storing the filtered file list in
req.idx_filesfor later use is a sensible approach.
Line range hint
279-371: Excellent!The updates to variable names, function parameters, and the
req.file_idsfield are all consistent with the switch from using full file objects to file IDs. This change should help reduce the amount of data transferred to worker nodes during the search process.
872-891: Nice work!The new
get_file_id_listfunction looks good. Retrieving only the file IDs is a smart optimization that can help improve search performance. The function handles errors appropriately and returns the file IDs wrapped in anOptionfor safety.
Line range hint
893-909: Looks good to me!Updating the
partition_file_by_bytesfunction to acceptFileIdobjects instead ofFileKeyobjects is a sensible change that keeps the function consistent with the rest of the codebase. The partitioning logic remains the same, so the function should still perform as expected.
Line range hint
913-944: Looks great!Updating the
partition_file_by_hashfunction to acceptFileIdobjects instead ofFileKeyobjects keeps the function consistent with the rest of the codebase. The partitioning logic is unchanged, so the function should still distribute the file IDs among the querier nodes as expected.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
src/service/search/grpc/mod.rs (1)
385-432: Consider adding error handling and logging.The new
get_file_list_by_idsfunction performs several operations, such as querying files by IDs, filtering files based on index files and partition keys, and sorting and deduplicating the file list. Consider adding error handling and logging statements to help diagnose issues and monitor the function's performance.Here's an example of how you can add error handling and logging:
+use tracing::info; #[tracing::instrument(skip_all, fields(org_id = sql.org_id, stream_name = sql.stream_name))] pub(crate) async fn get_file_list_by_ids( _trace_id: &str, ids: &[i64], sql: &super::sql::Sql, stream_type: StreamType, partition_keys: &[StreamPartition], idx_file_list: &[IdxFileName], ) -> (Vec<FileKey>, usize) { let start = std::time::Instant::now(); - let file_list = query_by_ids(ids).await.unwrap_or_default(); + let file_list = match query_by_ids(ids).await { + Ok(files) => files, + Err(e) => { + tracing::error!("Failed to query files by IDs: {}", e); + return (Vec::new(), 0); + } + }; + + info!("Retrieved {} files by IDs", file_list.len()); // ... rest of the function ... + let elapsed = start.elapsed().as_millis() as usize; + info!("get_file_list_by_ids completed in {} ms", elapsed); + (files, elapsed) }
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (1)
- src/service/search/grpc/mod.rs (4 hunks)
Additional context used
Path-based instructions (1)
src/service/search/grpc/mod.rs (1)
Pattern
**/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestionsMake sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.
Additional comments not posted (5)
src/service/search/grpc/mod.rs (5)
25-25: LGTM!The import of
FileKey,StreamPartition, andStreamTypefromconfig::meta::streamis necessary for the new functionality.
34-34: LGTM!The import of
cluster_rpcandIdxFileNamefromprotois necessary for the new functionality.
38-38: LGTM!The import of
query_by_idsfromfile_listis necessary for the new functionality.
433-433: LGTM!The
ReleaseFileGuardstruct and itsDropimplementation ensure that the session data is cleared and WAL lock files are released when the guard goes out of scope. This is a good practice for resource management.
145-159: Verify the usage ofget_file_list_by_idsin the codebase.The new
get_file_list_by_idsfunction is called to retrieve the file list based on the provided IDs, which is a significant change to the file retrieval logic. Ensure that this change is compatible with the rest of the codebase and does not introduce any unintended side effects.Run the following script to verify the usage of
get_file_list_by_ids:Verification successful
To gather more information about the
get_file_list_by_idsfunction and its potential impact on the codebase, I'll run additional scripts to examine its implementation and search for related functions and tests.
Based on the analysis of the
get_file_list_by_idsfunction and its usage in the codebase, I can provide the final response:The
get_file_list_by_idsfunction appears to be compatible with the codebase, but lacks test coverage.The function is only used once in the file where it's defined (src/service/search/grpc/mod.rs), which suggests it's unlikely to introduce unintended side effects on existing functionality. However, to ensure its correctness and maintainability, consider the following recommendations:
- Add unit tests specifically for the
get_file_list_by_idsfunction to validate its behavior and edge cases.- Document the function's purpose, parameters, and return value to improve code clarity for other developers.
- If this function is intended to be used more widely in the future, consider moving it to a more central location in the codebase.
Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `get_file_list_by_ids` in the codebase. # Test: Search for the function usage. Expect: Only valid occurrences. rg --type rust -A 5 $'get_file_list_by_ids'Length of output: 785
Script:
#!/bin/bash # Description: Examine the implementation of get_file_list_by_ids and search for related functions and tests # Get the full implementation of get_file_list_by_ids echo "Implementation of get_file_list_by_ids:" ast-grep --lang rust --pattern $'pub(crate) async fn get_file_list_by_ids($_) { $$$ }' # Search for other file retrieval or processing functions echo -e "\nOther file retrieval or processing functions:" rg --type rust -g '!*test*' 'fn.*file.*' -A 2 # Search for relevant test files echo -e "\nRelevant test files:" fd -e rs test | xargs rg 'file_list|get_file'Length of output: 49651
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
src/service/search/mod.rs (1)
269-275: Improve logging by using structured logging.Instead of manually constructing the log message, consider using structured logging to make it easier to parse and analyze the logs later. For example:
log::info!( "search_partition: get file_list completed"; "trace_id" => trace_id, "time_range" => ?meta.meta.time_range, "num_files" => files.len(), "took_ms" => file_list_took, );This allows logging backends and tools to more easily extract and query the individual fields.
Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files selected for processing (2)
- src/service/search/grpc/mod.rs (4 hunks)
- src/service/search/mod.rs (2 hunks)
Files skipped from review as they are similar to previous changes (1)
- src/service/search/grpc/mod.rs
Additional context used
Path-based instructions (1)
src/service/search/mod.rs (1)
Pattern
**/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestionsMake sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.
Additional comments not posted (1)
src/service/search/mod.rs (1)
240-240: LGTM!Capturing the start time using
std::time::Instant::now()is a good way to measure the elapsed time for performance monitoring.
This basically keeps the external interface for search the same, but splits the internal interface into two parts - fetch ids and fetch full file data by the ids. Reasoning behind this is that -
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation