fix: add file list id cache #4643

hengfeiyang · 2024-09-25T14:15:36Z

We only retrieve the id from the file_list on the leader. Then, we use this id to get the file name from the same list. By adding a local SQLite cache for each node, we can significantly reduce calls to the remote database.

Implement cache for querier nodes
Develop a garbage collection strategy
Fix search with FST

Summary by CodeRabbit

Release Notes

New Features
- Increased default maximum file size for compacting from 256 MB to 512 MB.
- Added new asynchronous methods for local cache management and file list operations, including get_local_cache, get_min_pk_value, and clean_by_min_pk_value.
- Enhanced file querying with IDs included in results and added batch addition of files with IDs across various implementations.
Bug Fixes
- Improved control flow in various functions to handle cases where parameters may be None.
Documentation
- Updated logging for better traceability and clarity in cache and file operations.
Refactor
- Simplified return types in several functions for improved clarity and error management.

coderabbitai · 2024-09-25T14:15:43Z

Walkthrough

This pull request introduces various changes across multiple files, primarily focusing on configuration updates, database access improvements, and modifications to file list management. Key updates include adding a new field for meta_id_batch_size in the configuration, enhancing caching mechanisms, and introducing new asynchronous methods for database interactions. Additionally, several functions have been refactored to improve error handling and logging, particularly in relation to caching and file querying processes.

Changes

File Path	Change Summary
`src/config/src/config.rs`	Added `meta_id_batch_size` field to `Limit` struct; updated `max_file_size` default from 256 MB to 512 MB; modified `init` and `check_common_config` functions for additional logic and checks.
`src/infra/src/db/mod.rs`	Added asynchronous functions `get_local_cache` and `init_local_cache` for local cache management.
`src/infra/src/file_list/mod.rs`	Renamed `connect` to `connect_default`, added `connect_local_cache`, updated `FileList` trait with new methods for primary key management, and modified `query_by_ids` return type to include primary key.
`src/infra/src/file_list/mysql.rs`	Added `batch_add_with_id` method; updated `query_by_ids` return type to include ID; introduced `get_min_pk_value` and `clean_by_min_pk_value` methods.
`src/infra/src/file_list/postgres.rs`	Similar updates as in `mysql.rs`, adding methods for batch addition and primary key management.
`src/infra/src/file_list/sqlite.rs`	Similar updates as in `mysql.rs`, enhancing ID handling and batch addition.

Possibly related PRs

feat: optimize sql to plan execution #4100: Related changes in the src/config/src/config.rs file, focusing on configuration settings.
feat: new search engine #4182: Introduces changes to the src/service/search/datafusion/exec.rs file, potentially relating to configuration handling.
feat: results based on multiple caches #4251: Adds new configuration fields, aligning with the main PR's focus on configuration management.
fix: add max_lifetime for db connection pool #4551: Introduces a configuration parameter for managing database connection lifetimes, relevant to the main PR's optimizations.
fix: optimise file list fetch from db #4573: Adds fields in the Limit struct for transaction timestamps, relating to the main PR's configuration changes.
fix: skip fetching file list for aggregate query #4602: Changes to fetching file lists for aggregate queries, potentially connecting to performance optimizations in the main PR.
feat: check indices exists before running create index commands #4618: Focus on checking indices before running create index commands, aligning with the main PR's database interaction optimizations.
fix: cache issue for multiple group by clause #4620: Modifications to caching logic for multiple group by clauses, relating to the main PR's improvements in caching mechanisms.

Suggested labels

🧹 Updates

Suggested reviewers

haohuaijin
oasisk

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 11

🧹 Outside diff range and nitpick comments (16)

src/infra/src/lib.rs (1)
30-31: LGTM! Consider enhancing error handling.

The addition of local cache initialization and garbage collection is well-placed in the initialization sequence and aligns with the PR objectives.

Consider adding more specific error handling or logging for these new operations to aid in troubleshooting if initialization fails. For example:
file_list::LOCAL_CACHE.create_table().await.map_err(|e| {
    log::error!("Failed to create local cache table: {}", e);
    e
})?;

file_list::local_cache_gc().await.map_err(|e| {
    log::error!("Failed to initialize local cache garbage collection: {}", e);
    e
})?;
This would provide more context in case of failures during the initialization process.
src/infra/src/dist_lock.rs (1)
58-60: Approve the early return for None locker, with a minor suggestion.

The addition of an early return when locker is None is a good optimization. It prevents unnecessary processing and aligns well with Rust's idiomatic use of Option.

Consider adding a debug log before the early return to maintain observability:
if locker.is_none() {
    log::debug!("[trace_id {trace_id}] No locker to unlock");
    return Ok(());
}
This will help in tracking the flow of the function, especially in cases where None lockers are expected.
src/job/mod.rs (1)
182-185: LGTM with a minor suggestion for error handling.

The addition of infra_file_list::LOCAL_CACHE.create_table_index() aligns well with the PR objective of implementing a local SQLite cache. The placement and error handling are consistent with the existing code.

Consider using a more descriptive error message in the expect() call to differentiate it from the remote cache index creation:
infra_file_list::LOCAL_CACHE
    .create_table_index()
    .await
    .expect("local file list cache: create table index failed");
src/infra/src/db/mod.rs (2)
44-46: LGTM: Addition of get_local_cache function

The get_local_cache function is well-implemented and consistent with the existing code structure. It correctly uses get_or_init for lazy initialization of the local cache.

Consider adding a brief documentation comment explaining the purpose of this function and when it should be used, similar to other public functions in the module.

81-83: LGTM: Addition of init_local_cache function

The init_local_cache function is correctly implemented to create a default SQLite database instance for the local cache, aligning with the PR objectives.

Consider adding error handling to this function. If the SQLite database initialization fails, it might be helpful to log the error or return a Result type instead of panicking. This would allow for more graceful error handling in the get_local_cache function.

Example:
async fn init_local_cache() -> Result<Box<dyn Db>> {
    Box::<sqlite::SqliteDb>::default().map_err(|e| {
        log::error!("Failed to initialize local cache: {}", e);
        Error::DbError(DbError::InitializationError(e.to_string()))
    })
}
Then update get_local_cache to handle potential errors:
pub async fn get_local_cache() -> Result<&'static Box<dyn Db>> {
    LOCAL_CACHE.get_or_try_init(init_local_cache).await
}
src/service/db/file_list/remote.rs (1)
106-106: LGTM! Consider adding a log message for the local cache index creation.

The addition of infra_file_list::LOCAL_CACHE.create_table_index().await? is a good implementation of the local caching mechanism described in the PR objectives. It's correctly placed after the main index creation.

Consider adding a log message after this line to indicate the successful creation of the local cache index, similar to the log message for the main index. This would improve observability:
log::info!("Load file_list create local cache table index done");
src/infra/src/file_list/postgres.rs (3)

81-83: Implement the batch_add_with_id method.

The batch_add_with_id method is currently unimplemented. This could lead to runtime panics if called.

Would you like assistance in implementing this method? I can provide a basic implementation or open a GitHub issue to track this task.

510-512: Implement the clean_by_min_pk_value method.

The clean_by_min_pk_value method is currently unimplemented. This could lead to runtime panics if called.

Would you like assistance in implementing this method? I can provide a basic implementation or open a GitHub issue to track this task.

Line range hint 1-1293: Overall review: Good progress, but some methods need implementation.

The changes made to this file, particularly the query_by_ids and get_min_pk_value methods, are well-implemented and consistent with the codebase. However, the batch_add_with_id and clean_by_min_pk_value methods are still unimplemented, which could lead to runtime issues if called.

Consider adding appropriate error handling or temporary implementations for these methods to prevent potential runtime panics. Additionally, it might be helpful to add TODO comments explaining the intended functionality of these methods for future implementation.

src/config/src/config.rs (2)

1429-1430: Approved: Extended external meta store detection

The check_common_config function has been updated to include MySQL and PostgreSQL as potential external meta stores. This change aligns with the PR objectives of enhancing data retrieval processes.

Consider adding error handling for cases where the necessary connection details for MySQL or PostgreSQL are not provided, similar to the existing checks for these databases later in the function.

Line range hint 1-1730: Consider improving code organization and documentation

While the changes made are appropriate and the existing code is well-structured, consider the following suggestions to further improve the file:

Group related configuration structs and their implementation blocks together to improve readability.

Add more inline documentation for complex configuration options, explaining their purpose and potential impacts.

Consider breaking down large functions like init() into smaller, more focused functions to improve maintainability.

Add unit tests for new configuration options and edge cases introduced by these changes.

These suggestions are not directly related to the current changes but could improve the overall quality and maintainability of the configuration code.
src/infra/src/file_list/mod.rs (2)
429-430: Clarify comment regarding interval.tick().await

The call to interval.tick().await delays the first tick to after the interval duration, preventing immediate execution. Update the comment to reflect this behavior.

Apply this diff to correct the comment:
-        interval.tick().await; // the first tick is immediate
+        interval.tick().await; // delay the first tick to after the interval duration
449-461: Review the use of #[sqlx(default)] attributes

Ensure that the #[sqlx(default)] attributes on the id, deleted, and flattened fields are necessary and align with the database schema.
src/infra/src/file_list/sqlite.rs (1)
965-966: Refactor to reduce code duplication

Lines 965-966 show that inner_add now calls inner_add_with_id with None as the ID:
self.inner_add_with_id(table, None, file, meta).await
While this is a valid approach, consider whether both methods are necessary or if inner_add_with_id can be used exclusively, simplifying the codebase.
src/infra/src/file_list/mysql.rs (2)

81-83: Unimplemented method batch_add_with_id

The method batch_add_with_id is currently unimplemented and contains a todo!() macro. If this method is required for the cache implementation or future functionality, please implement it or provide a timeline for its completion.

Would you like assistance in implementing this method or opening a GitHub issue to track its progress?

496-498: Unimplemented method clean_by_min_pk_value

The method clean_by_min_pk_value is currently unimplemented and contains a todo!() macro. Please implement this method to handle cache cleanup based on the minimum primary key value or provide an estimated time for its completion.

Would you like assistance in implementing this method or opening a GitHub issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 80908c5 and 861d4d1.

📒 Files selected for processing (14)

src/config/src/config.rs (2 hunks)
src/infra/src/db/mod.rs (2 hunks)
src/infra/src/dist_lock.rs (1 hunks)
src/infra/src/file_list/mod.rs (8 hunks)
src/infra/src/file_list/mysql.rs (11 hunks)
src/infra/src/file_list/postgres.rs (5 hunks)
src/infra/src/file_list/sqlite.rs (9 hunks)
src/infra/src/lib.rs (1 hunks)
src/job/mod.rs (1 hunks)
src/service/db/file_list/remote.rs (1 hunks)
src/service/file_list.rs (8 hunks)
src/service/search/cluster/flight.rs (3 hunks)
src/service/search/grpc/flight.rs (5 hunks)
src/service/search/grpc/storage.rs (1 hunks)

🧰 Additional context used

📓 Path-based instructions (14)

src/config/src/config.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/db/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/dist_lock.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/mysql.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/postgres.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/sqlite.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/lib.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/job/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/db/file_list/remote.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/file_list.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/search/cluster/flight.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/search/grpc/flight.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/service/search/grpc/storage.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments not posted (27)

src/infra/src/db/mod.rs (2)

36-36: LGTM: Addition of LOCAL_CACHE static variable

The addition of the LOCAL_CACHE static variable using OnceCell is consistent with the existing code structure and aligns with the PR objective of implementing a local cache. This approach ensures thread-safe lazy initialization.

Line range hint 1-359: Overall assessment: Changes look good with minor suggestions

The implemented changes for adding a local SQLite cache are well-structured and align with the PR objectives. The code is consistent with the existing patterns in the file. Consider implementing the suggested improvements for documentation and error handling to further enhance the robustness of the implementation.
src/service/search/grpc/storage.rs (1)
555-559: Improved logging for bitmap results

The change replaces logging the entire bitmap with logging its length, which is a good improvement for large bitmaps. This change reduces log verbosity and potential performance impact for large result sets.

However, consider the following suggestions:

Add a debug-level log that includes the full bitmap for troubleshooting purposes.

Consider using a structured logging approach for better parsing and analysis of logs.

Here's an example of how you could implement these suggestions:
 log::info!(
-    "[trace_id {}] search->storage: Final bitmap for fts_terms {:?} and index_terms: {:?} length {}",
-    query.trace_id,
-    *full_text_terms,
-    index_terms,
-    res.len(),
+    "Final bitmap results for search terms";
+    "trace_id" => query.trace_id,
+    "fts_terms" => ?*full_text_terms,
+    "index_terms" => ?index_terms,
+    "bitmap_length" => res.len()
 );
+log::debug!(
+    "Full bitmap result";
+    "trace_id" => query.trace_id,
+    "bitmap" => ?res
+);
This change maintains the performance improvement while providing more detailed logging when needed.
src/infra/src/file_list/postgres.rs (2)

Line range hint 362-394: LGTM: query_by_ids method update looks good.

The changes to the query_by_ids method, including the updated return type and the inclusion of the id in the query and result mapping, are well-implemented and consistent.

501-508: LGTM: New get_min_pk_value method is well-implemented.

The get_min_pk_value method is correctly implemented to retrieve the minimum primary key value from the file_list table. This could be useful for various operations requiring knowledge of the oldest record.

src/config/src/config.rs (1)

991-992: Approved: Increased maximum file size for compaction

The default value for max_file_size has been increased from 256 MB to 512 MB. This change aligns with the PR objectives of enhancing efficiency in file retrieval processes. However, be aware that this may impact memory usage and processing time for larger files during compaction.

src/service/file_list.rs (2)

68-128: New caching logic in query_by_ids enhances performance

The updated query_by_ids function introduces local caching, effectively reducing remote database calls. The implementation correctly handles cached and uncached IDs and updates the local cache as needed.

228-236: Improved error handling in delete_parquet_file_s3 function

Mapping errors to Error::Message provides clearer error reporting, aligning with the overall error handling strategy.

src/service/search/grpc/flight.rs (4)

Line range hint 155-165: Correctly updated function call with error handling

The get_file_list_by_ids function is now called with the additional trace_id parameter, and the returned Result is appropriately handled using .await?.

Line range hint 273-282: Improved function signature for better traceability and error handling

The get_file_list_by_ids function now includes trace_id in its parameters, enhancing logging and traceability. The return type has been updated to Result<(Vec<FileKey>, usize), Error>, allowing for proper error propagation.

284-284: Proper error handling in query_by_ids call

The call to query_by_ids now includes trace_id and correctly handles errors using the ? operator, ensuring that any errors are propagated upwards.

326-326: Returning Result with accurate timing information

The function now returns a Result containing the files and the elapsed time in milliseconds, ensuring accurate performance metrics.

src/infra/src/file_list/mod.rs (6)

34-34: Initialization of LOCAL_CACHE is appropriate

The use of Lazy to initialize LOCAL_CACHE ensures thread-safe lazy initialization.

46-48: New function connect_local_cache added correctly

The function connect_local_cache correctly returns a default SqliteFileList, facilitating local caching.

58-58: New method batch_add_with_id added to FileList trait

The method batch_add_with_id is correctly defined in the FileList trait.

105-106: New methods get_min_pk_value and clean_by_min_pk_value added

The methods are correctly added to the FileList trait for cache garbage collection purposes.

306-308: Function get_min_pk_value correctly implemented

The function correctly delegates to CLIENT.get_min_pk_value().await.

421-445: Implementation of local_cache_gc is appropriate

The background task for garbage collection is correctly implemented using tokio::task::spawn and runs at the specified hourly interval.
src/service/search/cluster/flight.rs (3)
121-124: Verify the Correctness of use_fst_inverted_index Assignment

The variable use_fst_inverted_index is assigned from the get_inverted_index_file_lists function and then used to set req.use_inverted_index. Ensure that the logic determining use_fst_inverted_index is correct and aligns with the intended behavior, especially considering different configurations of inverted_index_type.

799-800: Ensure Early Return Handles All Cases Appropriately

The early return when !use_parquet_inverted_index is true results in:
return Ok((use_fst_inverted_index, vec![], 0, 0));
Verify that returning empty idx_file_list, and zero values for idx_scan_size and idx_took is appropriate in this context. Ensure that downstream code properly handles these values without causing unintended side effects.

822-827: Check Consistency of Return Values

At the end of the get_inverted_index_file_lists function, the return statement is:
Ok((
    use_fst_inverted_index,
    idx_file_list,
    idx_scan_size,
    idx_took,
))
Confirm that idx_file_list, idx_scan_size, and idx_took are correctly initialized in all execution paths, especially when use_parquet_inverted_index is false and the function returns early. This will prevent potential runtime errors due to uninitialized or default values.
src/infra/src/file_list/sqlite.rs (5)

Line range hint 366-396: Validate handling of empty result sets in query_by_ids

In the query_by_ids method, ensure that the function correctly handles cases where the provided ids do not exist in the database, returning an empty vector as expected.

You can verify this behavior by adding unit tests that pass non-existent IDs and checking that the method returns an empty result without errors.

505-511: Confirm correct default behavior in get_min_pk_value

The method get_min_pk_value returns the minimum primary key value or defaults to zero if none is found. Ensure that this default behavior aligns with the application's expectations when the table is empty.

Consider adding a unit test to validate that the method returns zero when the file_list table is empty.

513-521: Ensure proper error handling in clean_by_min_pk_value

In clean_by_min_pk_value, lines 513-521, ensure that the method handles scenarios where the deletion might fail due to database constraints or locks.

Add error handling to catch exceptions from the delete operation and log appropriate error messages. This will aid in troubleshooting issues in production environments.

968-974: Handle optional IDs appropriately in inner_add_with_id

In inner_add_with_id, ensure that when id is None, the database handles the auto-increment of the primary key as expected.

Review the database schema and constraints to confirm that inserting a NULL value for id results in the primary key being auto-generated.

72-78: Ensure consistency with existing methods in batch_add_with_id

The newly added batch_add_with_id method at lines 72-78 introduces functionality to batch add files with IDs. Ensure that this method aligns with the behavior of batch_add and properly handles cases where IDs might already exist in the database.

It's important to confirm that the method handles duplicate IDs gracefully. You can use the following script to check for any inconsistencies:

src/infra/src/file_list/mysql.rs (1)

297-299: Potential performance impact due to removal of FORCE INDEX

The removal of FORCE INDEX (file_list_stream_ts_idx) from multiple SQL queries may lead to suboptimal query performance if the database does not automatically choose the optimal index. It's important to ensure that the queries still utilize the appropriate indexes to maintain efficiency.

Please verify that the database query planner continues to use the desired indexes. If not, consider adding optimizer hints or analyzing the need for additional indexing.

Also applies to: 314-316, 328-330

src/job/mod.rs

src/service/file_list.rs

src/infra/src/file_list/mod.rs

src/infra/src/file_list/sqlite.rs

src/infra/src/file_list/mysql.rs

src/service/file_list.rs

YashodhanJoshi1

lgtm 👍

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

src/config/src/config.rs (2)
949-954: LGTM! Consider making meta_id_batch_size configurable.

The addition of meta_id_batch_size is a good improvement for batch processing of IDs. The default value of 5000 seems reasonable. However, to provide more flexibility, consider making this value configurable through an environment variable or configuration file.

You could modify the field definition to allow for configuration:
#[env_config(
    name = "ZO_META_ID_BATCH_SIZE",
    default = 5000,
    help = "Batch size for ID processing"
)]
pub meta_id_batch_size: usize,
1439-1440: LGTM! Consider using an enum for database types.

The addition of MySQL and PostgreSQL as potential external metadata stores is a good improvement, expanding the system's flexibility. This change correctly sets meta_store_external to true for these new database types.

To make the code more maintainable and less error-prone, consider using an enum for the database types. This could look like:
enum MetaStoreType {
    Sqlite,
    Etcd,
    MySQL,
    PostgreSQL,
}

impl MetaStoreType {
    fn is_external(&self) -> bool {
        matches!(self, MetaStoreType::MySQL | MetaStoreType::PostgreSQL)
    }
}

// Then you could use it like this:
let meta_store_type = MetaStoreType::from_str(&cfg.common.meta_store).unwrap();
cfg.common.meta_store_external = meta_store_type.is_external();
This approach would make it easier to add new database types in the future and reduce the risk of errors in string comparisons.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between b102fed and 2a3d67a.

📒 Files selected for processing (5)

src/config/src/config.rs (4 hunks)
src/infra/src/file_list/mod.rs (8 hunks)
src/infra/src/file_list/mysql.rs (11 hunks)
src/infra/src/file_list/postgres.rs (5 hunks)
src/infra/src/file_list/sqlite.rs (9 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

src/infra/src/file_list/mod.rs
src/infra/src/file_list/mysql.rs
src/infra/src/file_list/postgres.rs
src/infra/src/file_list/sqlite.rs

🧰 Additional context used

📓 Path-based instructions (1)

src/config/src/config.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments (1)

src/config/src/config.rs (1)

1337-1340: LGTM! Good default value handling.

The addition of a check for meta_id_batch_size in the init function is a good practice. It ensures that the value is always set to a reasonable default (5000) if not specified or set to 0. This prevents potential issues that could arise from an invalid batch size.

fix: add file list id cache

5251888

Merge branch 'main' into fix/query-filelist

7f98236

github-actions bot added the ☢️ Bug Something isn't working label Sep 25, 2024

hengfeiyang requested a review from YashodhanJoshi1 September 25, 2024 14:15

github-actions bot added the ✏️ Feature label Sep 25, 2024

hengfeiyang requested review from YashodhanJoshi1 and oasisk and removed request for YashodhanJoshi1 September 25, 2024 14:15

hengfeiyang marked this pull request as draft September 25, 2024 14:20

hengfeiyang added 4 commits September 25, 2024 23:23

feat: add file list local cache gc

70a70d2

Merge branch 'main' into fix/query-filelist

3bbc2f3

fix: search with fst

253613f

fix: remove some big debug message

86737e7

hengfeiyang requested a review from haohuaijin September 25, 2024 15:55

hengfeiyang marked this pull request as ready for review September 26, 2024 00:21

Merge branch 'main' into fix/query-filelist

861d4d1

coderabbitai bot reviewed Sep 26, 2024

View reviewed changes

YashodhanJoshi1 reviewed Sep 26, 2024

View reviewed changes

src/infra/src/file_list/mysql.rs Show resolved Hide resolved

src/infra/src/file_list/mysql.rs Show resolved Hide resolved

src/service/file_list.rs Show resolved Hide resolved

src/service/file_list.rs Outdated Show resolved Hide resolved

hengfeiyang added 2 commits September 26, 2024 13:31

fix: optimize code

b102fed

perf: optimize query batch size

f881639

hengfeiyang requested a review from YashodhanJoshi1 September 26, 2024 10:04

YashodhanJoshi1 approved these changes Sep 26, 2024

View reviewed changes

fix: let the id batch size become configure

2a3d67a

coderabbitai bot reviewed Sep 26, 2024

View reviewed changes

YashodhanJoshi1 approved these changes Sep 26, 2024

View reviewed changes

Merge branch 'main' into fix/query-filelist

c20e656

hengfeiyang merged commit dca8a1c into main Sep 26, 2024

hengfeiyang deleted the fix/query-filelist branch September 26, 2024 12:40

coderabbitai bot mentioned this pull request Sep 30, 2024

perf: query file_list in parallels by day #4669

Merged

This was referenced Oct 1, 2024

feat: add metrics for db realated requests #4676

Merged

feat: short url service #4697

Merged

coderabbitai bot mentioned this pull request Oct 15, 2024

fix: increase file field size in file_list* tables #4789

Merged

This was referenced Oct 25, 2024

fix: bugs of v0.13.0-rc1 #4916

Merged

feat: add jobs for compact old data #4952

Merged

coderabbitai bot mentioned this pull request Nov 7, 2024

fix: panic when the inverted index file is empty #5020

Merged

coderabbitai bot mentioned this pull request Nov 20, 2024

feat: implement tantivy index cache #5109

Merged

fix: add file list id cache #4643

fix: add file list id cache #4643

Uh oh!

Conversation

hengfeiyang commented Sep 25, 2024 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Uh oh!

coderabbitai bot commented Sep 25, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Possibly related PRs

Suggested labels

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

YashodhanJoshi1 left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hengfeiyang commented Sep 25, 2024 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Sep 25, 2024 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)