
Conversation


@YashodhanJoshi1 YashodhanJoshi1 commented Sep 11, 2024

This keeps the external search interface the same, but splits the internal interface into two parts: fetch ids, then fetch full file data by those ids. The reasoning behind this:

  • Currently we fetch the full file data while processing the search query on the leader.
  • However, when the file_list table is large, this takes a lot of time. Furthermore, sending this full data to the individual workers also has network and ser/de costs that increase with the number of items.
  • This PR therefore has the leader fetch only the ids (making the initial sql query faster) and send only those to the workers (reducing the network cost). Each worker then fetches the full file data only for the ids allocated to it.
  • The expectation is that because we only send ids, the network cost is reduced, and because the workers query their file data in parallel, the divided time should be less than the single big query done on the leader. Hence the net time for the search query should be reduced.
  • In testing, the db fetch time is reduced considerably: for example, fetching 35k records out of 450k drops from 102ms to 32ms. However, this is just the db fetch time; the end-to-end search time is still similar, as the data-searching part is unchanged, and at least in the test env no significant speed advantage is observed from offloading the data fetch to the workers.
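The two-phase flow described above can be sketched with stand-in types. The `fetch_ids`/`fetch_by_ids` functions and the in-memory "db" below are illustrative only, not the actual `query_ids`/`query_by_ids` implementations from this PR:

```rust
// Sketch of the two-phase fetch. In the real code the leader queries only
// ids and ships a Vec<i64> slice to each worker, which then hydrates the
// full records for its own slice.

#[derive(Debug, Clone, PartialEq)]
struct FileKey {
    id: i64,
    key: String,
    size: i64,
}

// Leader side: cheap query returning only ids (stand-in for query_ids).
fn fetch_ids(db: &[FileKey]) -> Vec<i64> {
    db.iter().map(|f| f.id).collect()
}

// Worker side: hydrate full records for the ids assigned to this worker
// (stand-in for query_by_ids).
fn fetch_by_ids(db: &[FileKey], ids: &[i64]) -> Vec<FileKey> {
    db.iter().filter(|f| ids.contains(&f.id)).cloned().collect()
}

fn main() {
    let db = vec![
        FileKey { id: 1, key: "files/a".into(), size: 10 },
        FileKey { id: 2, key: "files/b".into(), size: 20 },
        FileKey { id: 3, key: "files/c".into(), size: 30 },
    ];
    // Leader fetches ids only, then splits them between two workers.
    let ids = fetch_ids(&db);
    let (w1, w2) = ids.split_at(ids.len() / 2);
    // Each worker hydrates only its own slice, in parallel in the real system.
    let files1 = fetch_by_ids(&db, w1);
    let files2 = fetch_by_ids(&db, w2);
    assert_eq!(files1.len() + files2.len(), db.len());
    println!("worker1={} worker2={}", files1.len(), files2.len());
}
```

The point of the split is visible here: only `Vec<i64>` crosses the leader-to-worker boundary, while the larger `FileKey` records are materialized on the worker side.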

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new asynchronous method for retrieving file lists by IDs, enhancing file management capabilities with improved filtering based on specific criteria.
    • Enhanced search functionality to allow more granular control over file retrieval using index files.
    • Added performance monitoring for the search operation, logging the time taken to retrieve file lists.
  • Bug Fixes

    • Improved error handling and logic for processing work groups, reducing potential runtime issues.
  • Documentation

    • Updated message types and service definitions to reflect new querying capabilities and structural changes in the search request.


coderabbitai bot commented Sep 11, 2024

Walkthrough

The changes enhance the search functionality by introducing a new asynchronous method, get_file_list_by_ids, for retrieving FileKey objects based on specified IDs and filtering them according to the match_source criteria. The main search function's logic is revised to utilize a file_list_map, improving the handling of index files and file metadata. Additionally, a timing mechanism is implemented in the search_partition function to log the duration of file list retrieval.

Changes

File Path / Change Summary
  • src/service/search/grpc/mod.rs: Added get_file_list_by_ids function for retrieving FileKey objects by IDs, with enhanced logic to construct a file_list_map for improved filtering based on index files.
  • src/service/search/mod.rs: Introduced a timing mechanism in the search_partition function to log the duration of file list retrieval, and modified the step variable calculation to ensure compliance with the minimum step requirement.

Possibly related PRs

Suggested reviewers

  • hengfeiyang
  • oasisk

Signed-off-by: Yashodhan Joshi <[email protected]>
@hengfeiyang hengfeiyang self-requested a review September 11, 2024 12:00
@YashodhanJoshi1 YashodhanJoshi1 force-pushed the yj/file_list branch 2 times, most recently from c8b767f to 007455e on September 11, 2024 at 15:01

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

Outside diff range and nitpick comments (1)
src/infra/src/file_list/sqlite.rs (1)

358-393: LGTM!

The query_by_ids method implementation looks good. It efficiently fetches file records by their IDs in chunks to handle SQLite's limitations.

Consider simplifying the code by using flat_map instead of iter and map:

-        Ok(ret
-            .iter()
-            .map(|r| {
-                (
-                    format!("files/{}/{}/{}", r.stream, r.date, r.file),
-                    r.into(),
-                )
-            })
-            .collect())
+        Ok(ret.into_iter().flat_map(|r| {
+            Some((format!("files/{}/{}/{}", r.stream, r.date, r.file), r.into()))
+        }).collect())
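The chunking mentioned above exists because SQLite caps the number of bound parameters per statement (historically 999; 32766 in newer builds), so an `IN (?, ?, …)` query over arbitrarily many ids must be issued in batches. A minimal sketch of that pattern, with an illustrative `run_query` closure standing in for the real sqlx query:

```rust
// Process ids in batches no larger than SQLite's bound-parameter limit,
// concatenating the results of each batch. The limit and helper name are
// illustrative; the real query_by_ids builds SQL with sqlx.
const SQLITE_MAX_PARAMS: usize = 999;

fn query_in_chunks<T>(ids: &[i64], mut run_query: impl FnMut(&[i64]) -> Vec<T>) -> Vec<T> {
    ids.chunks(SQLITE_MAX_PARAMS)
        .flat_map(|chunk| run_query(chunk))
        .collect()
}

fn main() {
    // Fake "query" that echoes the ids back, just to show the batching.
    let ids: Vec<i64> = (0..2500).collect();
    let mut batches = 0;
    let rows = query_in_chunks(&ids, |chunk| {
        batches += 1;
        chunk.to_vec()
    });
    assert_eq!(rows.len(), 2500);
    assert_eq!(batches, 3); // 999 + 999 + 502
    println!("batches={batches}");
}
```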
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 4c5642e and da1c33c.

Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
Files selected for processing (7)
  • Cargo.toml (1 hunks)
  • src/infra/src/file_list/mod.rs (4 hunks)
  • src/infra/src/file_list/mysql.rs (1 hunks)
  • src/infra/src/file_list/postgres.rs (5 hunks)
  • src/infra/src/file_list/sqlite.rs (1 hunks)
  • src/service/search/cluster/mod.rs (16 hunks)
  • src/service/search/grpc/mod.rs (4 hunks)
Files skipped from review as they are similar to previous changes (3)
  • src/infra/src/file_list/mod.rs
  • src/infra/src/file_list/mysql.rs
  • src/infra/src/file_list/postgres.rs
Additional context used
Path-based instructions (3)
src/service/search/grpc/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/sqlite.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/search/cluster/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (10)
Cargo.toml (1)

20-20: LGTM!

The version update from "0.11.0" to "0.12.0" aligns with the semantic versioning convention for a minor release. This increment suggests the introduction of new features, enhancements, or bug fixes while maintaining backward compatibility.

src/service/search/grpc/mod.rs (2)

145-153: Refactoring to optimize search performance and flexibility looks good!

The changes to use file_ids instead of file_list and retrieve the file list separately using get_file_list_by_ids is a good refactoring to optimize the search process. It allows fetching only the required file data based on IDs, reducing unnecessary data transfer.

Additionally, filtering and merging the retrieved file list with segment IDs from idx_files provides flexibility to narrow down the search to specific file segments, enabling more granular control over the search scope.

Overall, these optimizations should improve search performance and flexibility.


379-425: The new get_file_list_by_ids function is a well-designed and efficient implementation for retrieving the required file list.

The function encapsulates the logic for file list retrieval and filtering based on provided IDs, index files, and additional criteria. This promotes code reusability and maintainability.

Some key benefits of this implementation:

  • Querying the file list by IDs using query_by_ids ensures that only the required files are retrieved, optimizing the data retrieval process.
  • Filtering and merging segment IDs from idx_file_list allows for fine-grained control over which file segments are included in the search.
  • Further filtering the file list based on the match_source criteria ensures that only relevant files are returned based on the specified stream type and partition keys.
  • Sorting the file list by min_ts and deduplicating ensures a consistent and unique set of files.

Overall, this function provides an optimized and flexible way to retrieve the required file list, enhancing the search functionality.
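The sort-and-dedup step described above can be sketched as follows. `FileKey` here is a stand-in with only the fields needed; sorting by `min_ts` is from the review, while the tie-break by key is an added assumption so that equal keys sort adjacently, which `dedup_by` (adjacent-only deduplication) requires:

```rust
#[derive(Debug, Clone, PartialEq)]
struct FileKey {
    key: String,
    min_ts: i64,
}

// Sort by min_ts, then drop adjacent entries with the same key. Vec::dedup_by
// only removes consecutive duplicates, so the tie-break by key guarantees
// duplicates end up next to each other before deduplication.
fn sort_and_dedup(mut files: Vec<FileKey>) -> Vec<FileKey> {
    files.sort_by(|a, b| a.min_ts.cmp(&b.min_ts).then_with(|| a.key.cmp(&b.key)));
    files.dedup_by(|a, b| a.key == b.key);
    files
}

fn main() {
    let files = vec![
        FileKey { key: "b".into(), min_ts: 2 },
        FileKey { key: "a".into(), min_ts: 1 },
        FileKey { key: "a".into(), min_ts: 1 }, // duplicate entry
    ];
    let out = sort_and_dedup(files);
    assert_eq!(out.len(), 2);
    assert_eq!(out[0].key, "a");
    println!("{:?}", out.iter().map(|f| f.key.as_str()).collect::<Vec<_>>());
}
```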

src/infra/src/file_list/sqlite.rs (1)

395-438: LGTM!

The query_ids method implementation looks good. It provides a flexible way to retrieve file IDs based on various filtering conditions.

The use of the time_range parameter and the configuration-based upper bound calculation for max_ts allows for fine-grained control over the file IDs being retrieved.

The function returns FileId objects, which encapsulate the necessary information about the files, making it convenient for further processing.

src/service/search/cluster/mod.rs (6)

148-161: LGTM!

The change from retrieving full file objects to only file IDs is a good optimization. The variable renaming and log message update are consistent with this change.


169-187: Looks good!

The variable renaming and log message update are consistent with the fact that the file list is now filtered by the inverted index. Storing the filtered file list in req.idx_files for later use is a sensible approach.


Line range hint 279-371: Excellent!

The updates to variable names, function parameters, and the req.file_ids field are all consistent with the switch from using full file objects to file IDs. This change should help reduce the amount of data transferred to worker nodes during the search process.


872-891: Nice work!

The new get_file_id_list function looks good. Retrieving only the file IDs is a smart optimization that can help improve search performance. The function handles errors appropriately and returns the file IDs wrapped in an Option for safety.


Line range hint 893-909: Looks good to me!

Updating the partition_file_by_bytes function to accept FileId objects instead of FileKey objects is a sensible change that keeps the function consistent with the rest of the codebase. The partitioning logic remains the same, so the function should still perform as expected.


Line range hint 913-944: Looks great!

Updating the partition_file_by_hash function to accept FileId objects instead of FileKey objects keeps the function consistent with the rest of the codebase. The partitioning logic is unchanged, so the function should still distribute the file IDs among the querier nodes as expected.
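Hash-based partitioning of this kind can be sketched as below. The bucket assignment and names are illustrative, not the actual partition_file_by_hash implementation; the property being relied on is only that the same id deterministically maps to the same querier node:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Assign each file id to one of `nodes` buckets by hashing it, so a given id
// always lands on the same querier node.
fn partition_by_hash(ids: &[i64], nodes: usize) -> Vec<Vec<i64>> {
    let mut buckets = vec![Vec::new(); nodes];
    for &id in ids {
        let mut h = DefaultHasher::new();
        id.hash(&mut h);
        let bucket = (h.finish() as usize) % nodes;
        buckets[bucket].push(id);
    }
    buckets
}

fn main() {
    let ids: Vec<i64> = (0..100).collect();
    let buckets = partition_by_hash(&ids, 4);
    // Every id is assigned exactly once.
    assert_eq!(buckets.iter().map(|b| b.len()).sum::<usize>(), 100);
    // Assignment is deterministic across calls.
    assert_eq!(buckets, partition_by_hash(&ids, 4));
    println!("bucket sizes: {:?}", buckets.iter().map(|b| b.len()).collect::<Vec<_>>());
}
```

Note that DefaultHasher is only stable within one process; a real cluster would need a hash that is stable across nodes and releases.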


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (1)
src/service/search/grpc/mod.rs (1)

385-432: Consider adding error handling and logging.

The new get_file_list_by_ids function performs several operations, such as querying files by IDs, filtering files based on index files and partition keys, and sorting and deduplicating the file list. Consider adding error handling and logging statements to help diagnose issues and monitor the function's performance.

Here's an example of how you can add error handling and logging:

+use tracing::info;

 #[tracing::instrument(skip_all, fields(org_id = sql.org_id, stream_name = sql.stream_name))]
 pub(crate) async fn get_file_list_by_ids(
     _trace_id: &str,
     ids: &[i64],
     sql: &super::sql::Sql,
     stream_type: StreamType,
     partition_keys: &[StreamPartition],
     idx_file_list: &[IdxFileName],
 ) -> (Vec<FileKey>, usize) {
     let start = std::time::Instant::now();
-    let file_list = query_by_ids(ids).await.unwrap_or_default();
+    let file_list = match query_by_ids(ids).await {
+        Ok(files) => files,
+        Err(e) => {
+            tracing::error!("Failed to query files by IDs: {}", e);
+            return (Vec::new(), 0);
+        }
+    };
+
+    info!("Retrieved {} files by IDs", file_list.len());
     // ... rest of the function ...
+    let elapsed = start.elapsed().as_millis() as usize;
+    info!("get_file_list_by_ids completed in {} ms", elapsed);
+    (files, elapsed)
 }
Review details


Commits

Files that changed from the base of the PR and between da1c33c and 934a728.

Files selected for processing (1)
  • src/service/search/grpc/mod.rs (4 hunks)
Additional context used
Path-based instructions (1)
src/service/search/grpc/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (5)
src/service/search/grpc/mod.rs (5)

25-25: LGTM!

The import of FileKey, StreamPartition, and StreamType from config::meta::stream is necessary for the new functionality.


34-34: LGTM!

The import of cluster_rpc and IdxFileName from proto is necessary for the new functionality.


38-38: LGTM!

The import of query_by_ids from file_list is necessary for the new functionality.


433-433: LGTM!

The ReleaseFileGuard struct and its Drop implementation ensure that the session data is cleared and WAL lock files are released when the guard goes out of scope. This is a good practice for resource management.


145-159: Verify the usage of get_file_list_by_ids in the codebase.

The new get_file_list_by_ids function is called to retrieve the file list based on the provided IDs, which is a significant change to the file retrieval logic. Ensure that this change is compatible with the rest of the codebase and does not introduce any unintended side effects.

Run the following script to verify the usage of get_file_list_by_ids:

Verification successful

To gather more information about the get_file_list_by_ids function and its potential impact on the codebase, I'll run additional scripts to examine its implementation and search for related functions and tests.


Based on the analysis of the get_file_list_by_ids function and its usage in the codebase, I can provide the final response:

The get_file_list_by_ids function appears to be compatible with the codebase, but lacks test coverage.

The function is only used once in the file where it's defined (src/service/search/grpc/mod.rs), which suggests it's unlikely to introduce unintended side effects on existing functionality. However, to ensure its correctness and maintainability, consider the following recommendations:

  • Add unit tests specifically for the get_file_list_by_ids function to validate its behavior and edge cases.
  • Document the function's purpose, parameters, and return value to improve code clarity for other developers.
  • If this function is intended to be used more widely in the future, consider moving it to a more central location in the codebase.
Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the usage of `get_file_list_by_ids` in the codebase.

# Test: Search for the function usage. Expect: Only valid occurrences.
rg --type rust -A 5 $'get_file_list_by_ids'

Length of output: 785


Script:

#!/bin/bash
# Description: Examine the implementation of get_file_list_by_ids and search for related functions and tests

# Get the full implementation of get_file_list_by_ids
echo "Implementation of get_file_list_by_ids:"
ast-grep --lang rust --pattern $'pub(crate) async fn get_file_list_by_ids($_) {
  $$$
}'

# Search for other file retrieval or processing functions
echo -e "\nOther file retrieval or processing functions:"
rg --type rust -g '!*test*' 'fn.*file.*' -A 2

# Search for relevant test files
echo -e "\nRelevant test files:"
fd -e rs test | xargs rg 'file_list|get_file'

Length of output: 49651


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (1)
src/service/search/mod.rs (1)

269-275: Improve logging by using structured logging.

Instead of manually constructing the log message, consider using structured logging to make it easier to parse and analyze the logs later. For example:

tracing::info!(
    trace_id,
    time_range = ?meta.meta.time_range,
    num_files = files.len(),
    took_ms = file_list_took,
    "search_partition: get file_list completed"
);

This allows logging backends and tools to more easily extract and query the individual fields.

Review details


Commits

Files that changed from the base of the PR and between 934a728 and f5f51d1.

Files selected for processing (2)
  • src/service/search/grpc/mod.rs (4 hunks)
  • src/service/search/mod.rs (2 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/service/search/grpc/mod.rs
Additional context used
Path-based instructions (1)
src/service/search/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (1)
src/service/search/mod.rs (1)

240-240: LGTM!

Capturing the start time using std::time::Instant::now() is a good way to measure the elapsed time for performance monitoring.
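The timing pattern in question is the standard monotonic-clock measurement; a self-contained sketch (the workload and log message are placeholders, not the actual search_partition code):

```rust
use std::time::Instant;

fn main() {
    // Measure elapsed wall-clock time around a unit of work, as the
    // search_partition change does around the file-list fetch.
    let start = Instant::now();
    let n: u64 = (1..=1_000).sum(); // placeholder workload
    let took_ms = start.elapsed().as_millis();
    assert_eq!(n, 500_500);
    println!("file_list fetch took {took_ms} ms");
}
```

Instant is monotonic, so it is preferable to SystemTime for durations: it cannot go backwards if the wall clock is adjusted mid-measurement.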

