
Conversation

@hengfeiyang
Contributor

@hengfeiyang hengfeiyang commented Sep 30, 2024

Summary by CodeRabbit

Release Notes

  • New Features

    • Streamlined file ID retrieval process for enhanced efficiency in search functionality.
    • Improved logging for better traceability during inverted index searches.
    • Enhanced handling of time-based queries with new partitioning logic for file metadata retrieval.
    • Added support for multi-threaded processing of file list queries.
    • Upgraded dependencies to version 0.12.0 for improved performance and compatibility.
  • Bug Fixes

    • Enhanced error handling in file ID retrieval to ensure robustness.
  • Refactor

    • Restructured search functionality to optimize query processing and data retrieval.
    • Updated internal logic for handling time-based queries and filtering mechanisms.

@coderabbitai

coderabbitai bot commented Sep 30, 2024

Walkthrough

The pull request implements significant changes across multiple files, focusing on updating dependencies and enhancing file querying and search functionalities. Key modifications include the refactoring of the query_ids function to accept an optional time range, the removal of the get_file_id_list and get_file_list functions in favor of direct calls to query_ids, and improvements in logging for better traceability. Additionally, various dependencies in Cargo.toml have been updated to newer versions, reflecting a comprehensive enhancement of the project's architecture.

Changes

  • Cargo.toml: Updated package version to 0.12.0, upgraded several dependencies including rayon, actix-web, datafusion, and arrow to newer versions, and added new dependencies like async-walkdir and dotenvy.
  • src/service/file_list.rs: Updated query_ids function to accept an optional time range parameter, enhanced query_by_ids for local cache checks, and modified delete_parquet_file to conditionally delete files based on configuration.
  • src/service/search/cluster/flight.rs: Removed get_file_id_list function, replaced with direct calls to query_ids. Updated get_inverted_index_file_lists for improved handling of inverted indexes and refined SQL query construction.
  • src/service/search/mod.rs: Removed get_ts_col function from imports, replaced with a call to cache::cacher::get_ts_col. Removed get_file_list function, replaced with direct calls to query_ids. Updated search_partition_multi for better record aggregation.
  • src/service/search/super_cluster/follower.rs: Replaced get_file_id_list function call with query_ids, explicitly passing partition_time_level. Enhanced logging in get_inverted_index_file_lists for better traceability regarding inverted index usage.
  • src/service/search/grpc/flight.rs: Introduced parallel sorting for the files vector in get_file_list_by_ids using par_sort_unstable_by, improving performance during sorting operations.
  • src/config/src/config.rs: Renamed meta_id_batch_size to file_list_id_batch_size, added file_list_multi_thread, and updated initialization and validation logic to reflect these changes.
  • src/infra/src/file_list/mysql.rs: Refactored query_ids to support daily time partitioning for efficient querying, added new internal methods for adding files, and improved error handling in database operations.
  • src/infra/src/file_list/postgres.rs: Enhanced query_ids for better time range handling with daily partitions, added utility functions for time manipulation, and updated table creation logic.
  • src/infra/src/file_list/sqlite.rs: Updated query_ids to handle time partitions, refactored database interaction methods, and improved error handling in job processing.
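The daily time-partitioning added to query_ids in the mysql, postgres, and sqlite backends can be sketched as follows. This is a minimal illustration, not the PR's code: DAY_MICRO_SECS and end_of_the_day mirror the utilities reviewed later in this thread, while split_into_day_partitions is a hypothetical helper name.

```rust
// Illustrative day-partitioning for time-range queries over the file list.
const DAY_MICRO_SECS: i64 = 86_400 * 1_000_000;

// Last microsecond of the UTC day containing `ts` (non-negative timestamps).
fn end_of_the_day(ts: i64) -> i64 {
    (ts / DAY_MICRO_SECS + 1) * DAY_MICRO_SECS - 1
}

// Hypothetical helper: split an inclusive [min, max] range into per-day
// sub-ranges so each day's partition can be queried independently.
fn split_into_day_partitions(min: i64, max: i64) -> Vec<(i64, i64)> {
    let mut parts = Vec::new();
    let mut start = min;
    while start <= max {
        let end = end_of_the_day(start).min(max);
        parts.push((start, end));
        start = end + 1;
    }
    parts
}

fn main() {
    // 2021-01-01 00:00:00 UTC through 2021-01-02 00:00:00 UTC spans two days.
    let parts = split_into_day_partitions(1_609_459_200_000_000, 1_609_545_600_000_000);
    assert_eq!(parts[0], (1_609_459_200_000_000, 1_609_545_599_999_999));
    assert_eq!(parts[1], (1_609_545_600_000_000, 1_609_545_600_000_000));
    println!("{parts:?}");
}
```

Each per-day sub-range can then be issued as its own SQL query (and, per the PR description, fanned out across threads), which is the partitioning behavior the table above summarizes.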

Possibly related PRs

  • perf: use parquet file static cache #3604: The changes in Cargo.toml regarding dependency updates, particularly the async-nats version, are related to the main PR's focus on dependency updates in Cargo.toml.
  • feat: sync hotfix from v0.10.6-rc3 #3738: Updates to src/config/src/config.rs include changes to dependency management and configuration, which align with the main PR's updates to dependencies and configurations.
  • feat: add custom table provider #3846: The introduction of a new dependency on rayon for parallel processing in src/service/search/grpc/flight.rs relates to the main PR's updates to the rayon dependency in Cargo.toml.
  • feat: secondary index #3870: The changes in Cargo.toml for updating dependencies like datafusion, arrow, and parquet are directly related to the main PR's updates to these dependencies.
  • fix: caching to consider query fn when caching #4062: The modifications to caching mechanisms in src/service/search/cache/mod.rs and src/service/search/mod.rs may relate to the overall improvements in performance and efficiency that the main PR aims to achieve.
  • feat: optimize sql to plan execution #4100: The changes in Cargo.toml for version updates and the introduction of datafusion-proto are relevant to the main PR's focus on enhancing functionality through dependency updates.
  • feat: return multiple partition when the query have _timestamp #4152: The changes in src/service/search/mod.rs regarding the handling of partitions and query functions relate to the main PR's focus on improving search functionality.
  • feat: new search engine #4182: The introduction of a new search engine and modifications to the search function in src/service/search/grpc/mod.rs align with the main PR's objectives of enhancing search capabilities.
  • fix: target partition for noCache #4484: The changes in src/service/search/grpc/storage.rs regarding the handling of search results and the introduction of new return types are relevant to the main PR's focus on optimizing search performance.
  • feat: optimise file list querying for better search perf #4518: The enhancements to the search operation in src/service/search/grpc/mod.rs and the introduction of new functions for file retrieval are related to the main PR's objectives of improving search performance.
  • fix: search partition loop #4591: The performance logging and adjustments in the search_partition function in src/service/search/mod.rs are relevant to the main PR's focus on optimizing search functionality.
  • feat: check indices exists before running create index commands #4618: The changes in index management and the introduction of checks for existing indices before running create index commands are relevant to the main PR's focus on improving database operations.
  • fix: add file list id cache #4643: The implementation of a local SQLite cache for file IDs in src/config/src/config.rs and related files aligns with the main PR's objectives of enhancing data retrieval processes.

Suggested labels

🧹 Updates, ✏️ Feature

Suggested reviewers

  • oasisk
  • haohuaijin

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (6)
src/config/src/utils/time.rs (2)

218-226: LGTM: end_of_the_day function is well-implemented

The function correctly calculates the end of the day timestamp. Consider adding a brief comment explaining the logic for future maintainers.

 pub fn end_of_the_day(timestamp: i64) -> i64 {
+    // Calculate the start of the next day and subtract 1 microsecond
     let t = Utc.timestamp_nanos((timestamp + DAY_MICRO_SECS) * 1000);
     let t_next_day_zero = Utc
         .with_ymd_and_hms(t.year(), t.month(), t.day(), 0, 0, 0)
         .unwrap()
         .timestamp_micros();
     t_next_day_zero - 1
 }

391-398: LGTM: Test case for end_of_the_day is well-implemented

The test case covers multiple scenarios effectively. Consider adding a test case for a timestamp that's already at the end of a day to ensure edge case handling.

 #[test]
 fn test_end_of_the_day() {
-    let t = [1609459200000000, 1727740800000000];
-    let d = [1609545599999999, 1727827199999999];
+    let t = [1609459200000000, 1727740800000000, 1609545599999999];
+    let d = [1609545599999999, 1727827199999999, 1609545599999999];
     for i in 0..t.len() {
         assert_eq!(end_of_the_day(t[i]), d[i]);
     }
 }
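For UTC microsecond timestamps (Unix time ignores leap seconds), end_of_the_day is equivalent to pure integer arithmetic for non-negative inputs, which makes the suggested end-of-day edge case easy to verify without chrono. A sketch under that assumption, not the PR's implementation:

```rust
const DAY_MICRO_SECS: i64 = 86_400 * 1_000_000;

// Arithmetic equivalent of end_of_the_day for non-negative UTC timestamps:
// round up to the next day boundary, then step back one microsecond.
fn end_of_the_day(ts: i64) -> i64 {
    (ts / DAY_MICRO_SECS + 1) * DAY_MICRO_SECS - 1
}

fn main() {
    // Existing cases from the test above.
    assert_eq!(end_of_the_day(1_609_459_200_000_000), 1_609_545_599_999_999);
    assert_eq!(end_of_the_day(1_727_740_800_000_000), 1_727_827_199_999_999);
    // Suggested edge case: an end-of-day timestamp is a fixed point.
    assert_eq!(end_of_the_day(1_609_545_599_999_999), 1_609_545_599_999_999);
}
```

The fixed-point property in the last assertion is exactly what the proposed extra test case would exercise.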
src/infra/src/file_list/mod.rs (2)

Line range hint 449-461: LGTM. Consider updating documentation.

The changes to the FileRecord struct look good. The addition of the stream field and the #[sqlx(default)] attribute on flattened improve the struct's flexibility.

Consider updating the documentation for this struct to reflect these changes, especially explaining the purpose of the new stream field and the default behavior of flattened.


Line range hint 518-522: LGTM. Consider using a more specific type for offsets.

The MergeJobRecord struct is a good addition for managing merge jobs. It contains appropriate fields and implements sqlx::FromRow for database operations.

Consider using a more specific type for the offsets field, such as u64 or chrono::DateTime<Utc>, instead of i64. This would make the intent clearer and potentially prevent misuse.

src/infra/src/file_list/postgres.rs (2)

437-464: Consider limiting the number of concurrent tasks

When spawning tasks for each day partition (lines 437-464), a large time range may result in creating a high number of concurrent tasks, which could lead to resource exhaustion. Consider using a concurrency limiter, such as tokio::sync::Semaphore, to cap the number of tasks running simultaneously and prevent overwhelming the system.


440-440: Avoid unnecessary cloning of stream_key

In line 440, cloning stream_key inside the loop may not be necessary if stream_key is an Arc or if you can pass it by reference. This can help reduce overhead and improve performance.
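Both suggestions can be sketched together. The async code would use tokio::sync::Semaphore; as a dependency-free analogue, the same bounded-concurrency idea is shown here with scoped threads processing day partitions in fixed-size batches, sharing stream_key through an Arc instead of cloning the String per task. All names here (query_day, query_days_bounded) are hypothetical, not the PR's API.

```rust
use std::{sync::Arc, thread};

// Hypothetical per-day query; stands in for the real database call.
fn query_day(stream_key: &str, day: i64) -> usize {
    stream_key.len() + day as usize
}

// Cap concurrency by processing day partitions in batches of `max_concurrency`,
// sharing `stream_key` via Arc instead of cloning the String for every task.
fn query_days_bounded(stream_key: Arc<String>, days: &[i64], max_concurrency: usize) -> Vec<usize> {
    let mut results = Vec::with_capacity(days.len());
    for chunk in days.chunks(max_concurrency) {
        thread::scope(|s| {
            let handles: Vec<_> = chunk
                .iter()
                .map(|&day| {
                    let key = Arc::clone(&stream_key); // cheap pointer bump, not a String copy
                    s.spawn(move || query_day(&key, day))
                })
                .collect();
            for h in handles {
                results.push(h.join().expect("worker panicked"));
            }
        });
    }
    results
}

fn main() {
    let key = Arc::new(String::from("org/logs/default")); // 16 bytes long
    let out = query_days_bounded(key, &[1, 2, 3, 4, 5], 2);
    assert_eq!(out, vec![17, 18, 19, 20, 21]);
    println!("{out:?}");
}
```

In the actual async code, acquiring a semaphore permit before each tokio::spawn achieves the same cap without batching; the point in both cases is that a wide time range cannot spawn an unbounded number of concurrent queries.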

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7825b8e and 3f97b2e.

📒 Files selected for processing (5)
  • src/config/src/utils/time.rs (4 hunks)
  • src/infra/src/file_list/mod.rs (1 hunks)
  • src/infra/src/file_list/mysql.rs (2 hunks)
  • src/infra/src/file_list/postgres.rs (2 hunks)
  • src/infra/src/file_list/sqlite.rs (2 hunks)
🧰 Additional context used
📓 Path-based instructions (5)
src/config/src/utils/time.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/mod.rs (1), src/infra/src/file_list/mysql.rs (1), src/infra/src/file_list/postgres.rs (1), src/infra/src/file_list/sqlite.rs (1)

Same **/*.rs reviewer pattern as above.

🔇 Additional comments (5)
src/config/src/utils/time.rs (1)

26-26: LGTM: Constant DAY_MICRO_SECS is well-defined

The constant DAY_MICRO_SECS is correctly calculated and improves code readability.

src/infra/src/file_list/mod.rs (4)

Line range hint 486-494: LGTM. Well-structured addition.

The StatsRecord struct is a well-designed addition for capturing comprehensive stream statistics. It includes all necessary fields with appropriate types and implements sqlx::FromRow for database operations.


Line range hint 511-516: LGTM. Appropriate structure for deleted file records.

The FileDeletedRecord struct is a well-designed addition for tracking deleted files. It contains all necessary information and implements sqlx::FromRow for database operations.


Line range hint 539-542: LGTM. Concise and focused struct.

The FileId struct is a well-designed, focused addition. It contains essential information (id and original size) and implements appropriate traits (sqlx::FromRow and Default) for its intended use.


Line range hint 449-542: Overall, the changes look good and improve the file management system.

The additions and modifications to the structs in this file enhance the data model for file management. The new structs (StatsRecord, FileDeletedRecord, MergeJobRecord, and FileId) and the updates to FileRecord provide more comprehensive and flexible data structures for handling various aspects of file management.

A few minor suggestions were made:

  1. Update documentation for the FileRecord struct.
  2. Consider using a more specific type for the offsets field in MergeJobRecord.

These changes should improve the overall functionality and maintainability of the file list management system.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
src/handler/http/request/traces/mod.rs (1)

338-342: Use a constant for time unit conversion to improve readability.

Instead of hardcoding the value 1000, consider defining a constant for converting nanoseconds to microseconds. This enhances code readability and maintainability.

Apply this diff to define a constant and use it:

+const NANOS_PER_MICRO: i64 = 1_000;

            // trace time is in nanoseconds, need to compare to microseconds
-           if trace_start_time / 1000 < start_time {
-               start_time = trace_start_time / 1000;
+           if trace_start_time / NANOS_PER_MICRO < start_time {
+               start_time = trace_start_time / NANOS_PER_MICRO;
            }
-           if trace_end_time / 1000 > end_time {
-               end_time = trace_end_time / 1000;
+           if trace_end_time / NANOS_PER_MICRO > end_time {
+               end_time = trace_end_time / NANOS_PER_MICRO;
            }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 3f97b2e and 18e55bf.

📒 Files selected for processing (1)
  • src/handler/http/request/traces/mod.rs (1 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/handler/http/request/traces/mod.rs (1)

Same **/*.rs reviewer pattern as above.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 8

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between a22650d and 31d5282.

📒 Files selected for processing (9)
  • src/infra/src/db/sqlite.rs (1 hunks)
  • src/infra/src/file_list/mod.rs (3 hunks)
  • src/infra/src/file_list/mysql.rs (2 hunks)
  • src/infra/src/file_list/postgres.rs (2 hunks)
  • src/infra/src/file_list/sqlite.rs (2 hunks)
  • src/service/file_list.rs (1 hunks)
  • src/service/search/cache/cacher.rs (1 hunks)
  • src/service/search/cluster/flight.rs (2 hunks)
  • src/service/search/mod.rs (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • src/infra/src/file_list/mod.rs
  • src/infra/src/file_list/mysql.rs
  • src/infra/src/file_list/sqlite.rs
🧰 Additional context used
📓 Path-based instructions (6)
src/infra/src/db/sqlite.rs (1)

src/infra/src/file_list/postgres.rs (1)
src/service/file_list.rs (1)
src/service/search/cache/cacher.rs (1)
src/service/search/cluster/flight.rs (1)
src/service/search/mod.rs (1)

Same **/*.rs reviewer pattern as above for all six files.

🔇 Additional comments (4)
src/service/search/cluster/flight.rs (1)

26-26: Import of QueryPartitionStrategy is appropriate.

The addition of QueryPartitionStrategy to the imports ensures that the enum is available for partitioning strategies used later in the code.

src/service/search/mod.rs (2)

295-304: ⚠️ Potential issue

Missing the stream_type and partition_time_level parameters in query_ids function call.

The query_ids function requires stream_type and partition_time_level as parameters. Omitting them may lead to incorrect file retrieval or runtime errors.

Apply this diff to include the missing parameters:

 let stream_files = crate::service::file_list::query_ids(
     &sql.org_id,
+    stream_type,
     stream,
+    partition_time_level,
     time_min,
     time_max,
 )
 .await
 .unwrap_or_default();

Likely invalid or redundant comment.


284-284: Verify that all references to get_ts_col are updated to cache::cacher::get_ts_col.

Ensure that there are no remaining usages of get_ts_col without the cache::cacher:: prefix to prevent potential inconsistencies or build errors.

Run the following script to check for any outdated references:

✅ Verification successful

All references to get_ts_col are correctly updated with the cache::cacher:: prefix.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for any occurrences of `get_ts_col` not prefixed with `cache::cacher::`

# Expected: No occurrences returned
rg --type rust -P '(^|\s)get_ts_col\(' | grep -v 'cache::cacher::get_ts_col'

Length of output: 326


Script:

#!/bin/bash
# Description: Search for any occurrences of `get_ts_col` not prefixed with `cache::cacher::`, excluding internal references in cache/cacher.rs

# Expected: No occurrences returned
rg --type rust -P '(^|\s)get_ts_col\(' --glob '!src/service/search/cache/cacher.rs' | grep -v 'cache::cacher::get_ts_col'

Length of output: 121

src/infra/src/file_list/postgres.rs (1)

22-26: Use statements correctly import necessary utilities

The added use statements for hash::Sum64, parquet::parse_file_key_columns, and time::{end_of_the_day, DAY_MICRO_SECS} appropriately import the required utilities for the code's functionality.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
src/service/search/super_cluster/follower.rs (2)

119-127: LGTM: Improved file list retrieval with better error handling

The change to use crate::service::file_list::query_ids aligns with the PR's objective of optimizing file list queries. The explicit use of partition_time_level and the safe fallback with unwrap_or_default() are good practices.

Consider logging the error if query_ids fails, to aid in debugging:

let file_id_list = crate::service::file_list::query_ids(
    &req.org_id,
    req.stream_type,
    stream_name,
    partition_time_level,
    req.time_range,
)
.await
.unwrap_or_else(|e| {
    log::error!("[trace_id {trace_id}] Failed to query file IDs: {}", e);
    Vec::new()
});

Line range hint 252-257: LGTM: Enhanced logging for inverted index usage

The addition of logging statements improves traceability for inverted index usage decisions. The inclusion of trace_id in the log message is good for log correlation.

Consider emitting the decision as key=value pairs for easier parsing. Note that the originally suggested form, with extra named arguments after the format string, would not compile, since format_args! rejects named arguments that are never used in the format string. A version that compiles:

log::info!(
    "[trace_id {trace_id}] flight->follower_leader: inverted index usage: use_inverted_index={use_inverted_index} format=parquet"
);

This format makes it easier to parse and analyze logs in production environments; if fully structured fields are wanted, the log crate's optional kv feature supports them.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 31d5282 and 4dd66db.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (4)
  • src/service/file_list.rs (1 hunks)
  • src/service/search/cluster/flight.rs (2 hunks)
  • src/service/search/mod.rs (4 hunks)
  • src/service/search/super_cluster/follower.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/service/file_list.rs
  • src/service/search/cluster/flight.rs
🧰 Additional context used
📓 Path-based instructions (2)
src/service/search/mod.rs (1)

src/service/search/super_cluster/follower.rs (1)

Same **/*.rs reviewer pattern as above for both files.

🔇 Additional comments (5)
src/service/search/super_cluster/follower.rs (2)

38-40: LGTM: Import addition for inverted index functionality

The addition of get_inverted_index_file_list to the imports is consistent with the changes in inverted index handling within the module.


Line range hint 144-151: LGTM: Updated destructuring for inverted index function return values

The addition of _use_inverted_index and _idx_took to the destructuring is consistent with changes in the get_inverted_index_file_lists function. The use of underscores as prefixes for unused variables is a good Rust practice.

src/service/search/mod.rs (3)

26-26: Addition of necessary imports

The inclusion of usage::{RequestStats, UsageType} is appropriate, as these types are utilized in the code for tracking request statistics and usage.


284-285: Correct usage of cache::cacher::get_ts_col

Replacing the previous get_ts_col function with cache::cacher::get_ts_col ensures that the timestamp column is retrieved through the cache module, which may improve performance due to caching mechanisms.


341-341: Setting compressed_size to zero is acceptable

Since there is no compressed size available in the file list, initializing compressed_size to 0 is appropriate. Ensure that any downstream logic accounts for this and handles a zero value correctly to prevent potential division by zero errors or misleading calculations.
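A small sketch of the downstream guard suggested above; compression_ratio is a hypothetical helper, not part of the PR:

```rust
// Hypothetical helper: compression ratio that tolerates a zero compressed_size,
// as happens when the file-list query carries only original sizes.
fn compression_ratio(original_size: u64, compressed_size: u64) -> Option<f64> {
    if compressed_size == 0 {
        None // unknown, not infinite: avoids division by zero
    } else {
        Some(original_size as f64 / compressed_size as f64)
    }
}

fn main() {
    assert_eq!(compression_ratio(100, 0), None);
    assert_eq!(compression_ratio(100, 50), Some(2.0));
    println!("ok");
}
```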

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4dd66db and de5cd0f.

📒 Files selected for processing (3)
  • src/service/search/cluster/flight.rs (2 hunks)
  • src/service/search/mod.rs (4 hunks)
  • src/service/search/super_cluster/follower.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/service/search/cluster/flight.rs
  • src/service/search/super_cluster/follower.rs
🧰 Additional context used
📓 Path-based instructions (1)
src/service/search/mod.rs (1)

Same **/*.rs reviewer pattern as above.

🔇 Additional comments (5)
src/service/search/mod.rs (5)

26-26: Import addition looks good.

The addition of FileKey import is appropriate and likely necessary for the changes in the rest of the file.


284-284: Caching optimization for get_ts_col

The change to use cache::cacher::get_ts_col is a good optimization that likely improves performance by caching the timestamp column.


332-334: Efficient calculation of records and original_size

The use of a single fold operation to calculate both records and original_size is an improvement in code efficiency. It reduces the number of iterations over the files collection.

Note: A previous review comment suggested further optimization using iterator methods. Consider implementing that suggestion if it hasn't been addressed yet.


340-340: Explicit setting of compressed_size with clear explanation

Setting compressed_size to 0 with a comment explaining the reason improves code clarity and prevents potential misunderstandings or bugs related to this value.


Line range hint 1-1: Verify complete migration from get_file_list to query_ids

The removal of the get_file_list function is consistent with the earlier change to use query_ids. However, it's crucial to ensure that all previous uses of get_file_list throughout the codebase have been properly updated to use query_ids.

Run the following script to check for any remaining uses of get_file_list:

#!/bin/bash
# Description: Check for any remaining uses of get_file_list

# Test: Search for any remaining calls to get_file_list
rg --type rust 'get_file_list'

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/service/search/grpc/flight.rs (1)

325-326: LGTM: Parallel sorting for improved performance

The change to par_sort_unstable_by is a good optimization that should improve performance, especially for large file lists. The unstable sort is appropriate here: any entries with equal keys are removed by the subsequent dedup, so their relative order does not matter.

Consider combining the sort and dedup operations for a slight optimization:

files.par_sort_unstable_by(|a, b| a.key.cmp(&b.key));
files.dedup_by(|a, b| a.key == b.key);

This change maintains the current behavior while potentially reducing the number of comparisons.
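The sort-then-dedup pattern behaves as follows. This sketch uses the sequential stdlib sort_unstable_by (rayon's par_sort_unstable_by has the same ordering semantics, just parallelized across threads), and FileKey here is a hypothetical minimal stand-in for the real type in flight.rs:

```rust
// Hypothetical minimal stand-in for the file entry type.
#[derive(Debug, Clone, PartialEq)]
struct FileKey {
    key: String,
}

fn sort_and_dedup(files: &mut Vec<FileKey>) {
    // With rayon in scope, `files.par_sort_unstable_by(...)` would
    // parallelize this step; the result is identical.
    files.sort_unstable_by(|a, b| a.key.cmp(&b.key));
    // After sorting, equal keys are adjacent, so dedup_by removes
    // duplicates in a single linear pass.
    files.dedup_by(|a, b| a.key == b.key);
}

fn main() {
    let mut files = vec![
        FileKey { key: "b".into() },
        FileKey { key: "a".into() },
        FileKey { key: "b".into() },
    ];
    sort_and_dedup(&mut files);
    let keys: Vec<&str> = files.iter().map(|f| f.key.as_str()).collect();
    assert_eq!(keys, vec!["a", "b"]);
}
```

Note that dedup_by only collapses adjacent equal elements, which is why it must run after the sort.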

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between de5cd0f and 60c6361.

📒 Files selected for processing (3)
  • Cargo.toml (1 hunks)
  • src/service/file_list.rs (2 hunks)
  • src/service/search/grpc/flight.rs (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/service/file_list.rs
🧰 Additional context used
📓 Path-based instructions (1)
src/service/search/grpc/flight.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments (8)
Cargo.toml (6)

Line range hint 16-16: LGTM: Package version update

The package version has been incremented to 0.12.0, which is in line with semantic versioning practices. This update likely reflects the new features or changes introduced in this PR.


Line range hint 1-279: Summary of Cargo.toml changes

The updates in this Cargo.toml file reflect a significant upgrade to the project's dependencies, including:

  1. Package version update to 0.12.0
  2. Updates to key libraries like rayon, actix-web, datafusion, and arrow ecosystems
  3. Refinement of feature specifications for some dependencies

These changes are likely to bring performance improvements, bug fixes, and new capabilities to the project. However, given the scope of the updates, particularly the major version bumps for datafusion and arrow, it's crucial to:

  1. Thoroughly test all affected components
  2. Review changelogs for all updated dependencies
  3. Address any deprecation warnings or breaking changes
  4. Update documentation to reflect any API changes

To ensure a smooth transition with these updates, please run a full test suite and perform a careful review of all affected code paths. Consider creating a checklist of all major changes and their potential impacts on the codebase.


Line range hint 243-249: Major updates to datafusion and arrow ecosystems

The datafusion and arrow ecosystems have received significant updates:

  • datafusion, datafusion-proto, and datafusion-functions-aggregate-common to version 42.0.0
  • arrow, arrow-flight, arrow-json, and arrow-schema to version 53.0.0

These major version updates likely bring substantial improvements in query processing and data handling capabilities. However, they may also introduce breaking changes.

It is crucial to ensure compatibility with these major updates. Please perform the following actions:

  1. Review the changelogs for both datafusion and arrow to identify any breaking changes.
  2. Run comprehensive tests focusing on query processing and data handling functionalities.
  3. Check for any deprecation warnings or compile-time errors by running:
cargo check --all-targets
  4. Verify that all custom implementations using these libraries are still compatible:
rg --type rust 'use (datafusion|arrow)' src/

279-279: LGTM: Tonic update with feature specifications

The tonic dependency has been updated to version 0.12.0, with the "prost" and "gzip" features explicitly specified. This update likely brings improvements to the gRPC framework implementation in the project.

To ensure that the feature changes don't impact existing functionality, please:

  1. Review the tonic changelog for any changes related to the "prost" and "gzip" features.
  2. Verify that all gRPC-related code still functions as expected by running relevant tests.
  3. Check for any usage of tonic features that might have changed:
rg --type rust 'use tonic::(.*)(prost|gzip)' src/

276-276: LGTM: Rayon dependency update

The rayon dependency has been updated to version 1.10. This update likely brings performance improvements and bug fixes for parallel processing. As this is a minor version bump, it should maintain backwards compatibility.

To ensure compatibility, review the rayon changelog between the previously pinned version and 1.10 for any breaking changes or deprecations, and run the test suite for code paths that use parallel iterators.


Line range hint 68-69: LGTM: Actix-web ecosystem updates

The actix-web dependency has been updated to version 4.8, and actix-web-prometheus has been updated to version 0.1 with the addition of the "process" feature. These updates likely bring performance improvements and new features to the web server and monitoring capabilities.

To ensure proper usage of the new "process" feature in actix-web-prometheus, review the crate's documentation for the process metrics that feature enables and confirm they appear in the exporter output.

src/service/search/grpc/flight.rs (2)

34-34: LGTM: Rayon import for parallel processing

The addition of the Rayon crate for parallel slice mutation is a good choice for improving performance in sorting operations.


Line range hint 1-328: Summary: Effective performance optimization

The changes in this file effectively address the PR's objective of improving performance in file list querying. The introduction of parallel sorting using Rayon is a well-chosen optimization that should significantly improve performance for large datasets. The changes are minimal and focused, which reduces the risk of introducing new issues while achieving the desired performance gains.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/service/search/mod.rs (1)

Line range hint 284-301: Good optimization, but consider error handling.

The replacement of get_file_list with a direct call to query_ids is a good optimization. Including sql.time_range in the query allows for more precise file filtering, which should improve performance.

However, consider adding explicit error handling for the query_ids call instead of relying on the ? operator, which propagates the error unchanged. This would allow for more granular error management and potentially better error reporting.

Consider updating the error handling:

-            .await?;
+            .await
+            .map_err(|e| Error::ErrorCode(ErrorCodes::ServerInternalError(format!("Failed to query file IDs: {}", e))))?;
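A standalone sketch of the suggested map_err wrapping (error types reduced to plain strings; the real code uses the crate's Error and ErrorCodes types):

```rust
// Stand-in for the infra-layer call; always fails here to exercise the path.
fn query_ids() -> Result<Vec<i64>, String> {
    Err("connection refused".to_string())
}

// Wrap the low-level error with search-level context instead of
// propagating it unchanged via `?`.
fn search() -> Result<Vec<i64>, String> {
    let ids = query_ids().map_err(|e| format!("Failed to query file IDs: {e}"))?;
    Ok(ids)
}

fn main() {
    let err = search().unwrap_err();
    assert!(err.starts_with("Failed to query file IDs:"));
    assert!(err.contains("connection refused"));
}
```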
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 60c6361 and cd7620e.

📒 Files selected for processing (2)
  • src/service/search/grpc/flight.rs (2 hunks)
  • src/service/search/mod.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/service/search/grpc/flight.rs
🧰 Additional context used
📓 Path-based instructions (1)
src/service/search/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments (4)
src/service/search/mod.rs (4)

26-26: Import addition looks good.

The addition of FileKey to the import list is appropriate, indicating its usage in this file.


332-334: Existing refactoring suggestion is still valid.

The current implementation can be simplified as per the previous review comment.


684-687: Effective performance optimization.

The addition of a fast path check in the match_file function is a good performance optimization. It quickly returns true for cases where detailed matching is unnecessary, potentially reducing computation time for many scenarios.


689-689: Helpful comment for code clarity.

The addition of the "slow path" comment effectively separates the fast and slow execution paths, improving code readability and helping developers understand the performance implications of different scenarios.
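A minimal sketch of the fast-path/slow-path split these two comments describe (the signature and the matching rule are assumptions, not the project's actual match_file, which works with parsed partition filters):

```rust
fn match_file(source_key: &str, equal_items: &[(String, String)]) -> bool {
    // Fast path: no partition filters, or a key with no `field=value`
    // segments, means there is nothing to match against.
    if equal_items.is_empty() || !source_key.contains('=') {
        return true;
    }
    // Slow path: for every filter whose field appears in the key,
    // the key must carry the filtered value. (Substring matching is a
    // simplification; real code would parse path segments.)
    equal_items.iter().all(|(field, value)| {
        let prefix = format!("{field}=");
        !source_key.contains(&prefix) || source_key.contains(&format!("{prefix}{value}"))
    })
}

fn main() {
    assert!(match_file("files/logs/file.parquet", &[("env".into(), "prod".into())]));
    assert!(match_file("files/env=prod/file.parquet", &[("env".into(), "prod".into())]));
    assert!(!match_file("files/env=dev/file.parquet", &[("env".into(), "prod".into())]));
}
```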

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between cd7620e and 071f86a.

📒 Files selected for processing (4)
  • src/config/src/config.rs (2 hunks)
  • src/infra/src/file_list/mysql.rs (3 hunks)
  • src/infra/src/file_list/postgres.rs (3 hunks)
  • src/infra/src/file_list/sqlite.rs (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/infra/src/file_list/mysql.rs
  • src/infra/src/file_list/sqlite.rs
🧰 Additional context used
📓 Path-based instructions (2)
src/config/src/config.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

src/infra/src/file_list/postgres.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments (3)
src/infra/src/file_list/postgres.rs (1)

22-26: Imports are appropriate.

The added imports for end_of_the_day and DAY_MICRO_SECS are necessary for the new time manipulation functionality.

src/config/src/config.rs (2)

1345-1346: Ensure consistency of default values for 'file_list_id_batch_size'.

In the initialization logic, file_list_id_batch_size is set to 5000 if it is 0. Verify that this default value is consistent across the configuration and that it aligns with the intended behavior.

Run the following script to confirm the default value is consistently applied:

#!/bin/bash
# Description: Check that 'file_list_id_batch_size' defaults to 5000 where applicable.

rg 'file_list_id_batch_size.*= 5000'

950-960: Verify the update of configuration parameters.

The new configuration options file_list_id_batch_size and file_list_multi_thread have been introduced to replace meta_id_batch_size. Please ensure:

  • All occurrences of meta_id_batch_size have been updated to file_list_id_batch_size.
  • The new variables are correctly utilized throughout the codebase.

Run the following script to check for any remaining references to meta_id_batch_size:

Additionally, list all usages of the new variables to confirm correct implementation:

✅ Verification successful

Verification Successful: All meta_id_batch_size references have been updated and the new configuration parameters are correctly utilized throughout the codebase.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for any remaining instances of 'meta_id_batch_size'. Expect: No matches should be found.

rg 'meta_id_batch_size'

Length of output: 207


Script:

#!/bin/bash
# Description: List all occurrences of 'file_list_id_batch_size' and 'file_list_multi_thread'.

rg 'file_list_id_batch_size'

rg 'file_list_multi_thread'

Length of output: 946


Script:

#!/bin/bash
# Description: Search for any remaining instances of 'meta_id_batch_size'. Expect: No matches should be found.

rg 'meta_id_batch_size'

Length of output: 23

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)
src/infra/src/file_list/sqlite.rs (3)

Line range hint 1057-1084: Approve changes to get_pending_jobs_count with minor suggestion

The updates to get_pending_jobs_count improve efficiency by using a single query to fetch all statuses. The grouping by stream and status provides more detailed information.

Consider using the entry API for both levels of the HashMap to simplify the code:

*job_status
    .entry(org)
    .or_default()
    .entry(stream_type)
    .or_default() += counts;

This eliminates the need for the separate and_modify and or_insert calls.
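Expanded into a self-contained form (the key and value types are illustrative; the real code keys by org and stream type):

```rust
use std::collections::HashMap;

fn main() {
    // org -> stream_type -> pending-job count
    let mut job_status: HashMap<String, HashMap<String, i64>> = HashMap::new();

    for (org, stream_type, counts) in [
        ("org1", "logs", 3i64),
        ("org1", "logs", 2),
        ("org1", "metrics", 1),
    ] {
        // entry()/or_default() at both levels replaces the separate
        // and_modify/or_insert pair in the code under review.
        *job_status
            .entry(org.to_string())
            .or_default()
            .entry(stream_type.to_string())
            .or_default() += counts;
    }

    assert_eq!(job_status["org1"]["logs"], 5);
    assert_eq!(job_status["org1"]["metrics"], 1);
}
```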


Line range hint 1170-1189: Approve table creation changes with suggestion for error handling

The updates to the create_table function are well-structured, ensuring backward compatibility by adding new columns when necessary. Using the add_column function to handle potentially existing columns is good practice.

Consider improving error handling in the add_column function:

if let Err(e) = sqlx::query(&check_sql).execute(client).await {
    if !e.to_string().contains("duplicate column name") {
        log::warn!("Failed to add column {}: {}", column, e);
        // Optionally, you might want to return the error here
        // return Err(e.into());
    }
}

This will provide more detailed logging about failed column additions without breaking the function execution.


Line range hint 1191-1270: Approve index creation changes with suggestion for progress logging

The updates to the create_table_index function are well implemented. The new indices will improve query performance, and handling potential duplicates when creating unique indices is a good approach to ensuring data integrity.

Consider improving the progress logging for deleting duplicate records:

for (i, r) in ret.iter().enumerate() {
    // ... existing code ...
    if (i + 1) % 1000 == 0 || i + 1 == ret.len() {
        log::warn!("[SQLITE] delete duplicate records: {}/{}", i + 1, ret.len());
    }
}

This will ensure that the final progress log is always printed, even if the total number of records is not a multiple of 1000.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 071f86a and 68a9fd4.

📒 Files selected for processing (3)
  • src/infra/src/file_list/mysql.rs (5 hunks)
  • src/infra/src/file_list/postgres.rs (5 hunks)
  • src/infra/src/file_list/sqlite.rs (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/infra/src/file_list/mysql.rs
  • src/infra/src/file_list/postgres.rs
🧰 Additional context used
📓 Path-based instructions (1)
src/infra/src/file_list/sqlite.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments (2)
src/infra/src/file_list/sqlite.rs (2)

22-25: LGTM: New imports for time operations

The addition of end_of_the_day and DAY_MICRO_SECS imports from the config::utils::time module is appropriate. These are likely used in the updated query_ids function to handle time-based partitioning.


423-437: Verify the day partitioning logic

The implementation of day partitioning is a good approach to improve query performance. Check the boundary handling carefully, though: with inclusive microsecond bounds, the increment start = end_of_day + 1 is gap-free, because the next partition begins exactly one microsecond after the previous one ends. A gap only appears if the SQL comparison treats the upper bound as exclusive, so verify which convention the generated query uses. (Do not widen the step to 1_000_000: that would skip almost a full second of data between partitions.)

#!/bin/bash
# Verify the day partitioning logic
ast-grep --lang rust --pattern 'start = end_of_day + 1'
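A self-contained sketch of the partitioning loop under review, assuming inclusive microsecond bounds and UTC days (end_of_the_day is assumed to return the last microsecond of the day containing its argument, mirroring the config::utils::time helper):

```rust
// Microseconds in one day; mirrors the DAY_MICRO_SECS constant referenced above.
const DAY_MICRO_SECS: i64 = 24 * 3600 * 1_000_000;

// Assumed semantics: last microsecond of the UTC day containing `ts`.
fn end_of_the_day(ts: i64) -> i64 {
    ts - ts.rem_euclid(DAY_MICRO_SECS) + DAY_MICRO_SECS - 1
}

// Split an inclusive [start, end] range into per-day inclusive sub-ranges.
fn day_partitions(mut start: i64, end: i64) -> Vec<(i64, i64)> {
    let mut parts = Vec::new();
    while start <= end {
        let day_end = end_of_the_day(start).min(end);
        parts.push((start, day_end));
        // +1 microsecond: with inclusive bounds this leaves no gap.
        start = day_end + 1;
    }
    parts
}

fn main() {
    const D: i64 = DAY_MICRO_SECS;
    let parts = day_partitions(0, D); // one microsecond into the second day
    assert_eq!(parts, vec![(0, D - 1), (D, D)]);
    // Adjacent partitions are contiguous: no microsecond is skipped.
    for w in parts.windows(2) {
        assert_eq!(w[1].0, w[0].1 + 1);
    }
}
```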

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
src/service/search/mod.rs (2)

363-366: Approved: Limiting maximum number of partitions

Adding a limit to the maximum number of partitions is a good practice to prevent potential performance issues from an excessive number of partitions.

Consider making this limit configurable, perhaps through a constant or configuration setting, to allow for easier adjustments in the future if needed.
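The cap can be as small as a clamp at partition-planning time (the constant name and value are illustrative; as noted above, it would ideally come from configuration):

```rust
// Illustrative upper bound; ideally read from configuration.
const MAX_QUERY_PARTITIONS: usize = 1000;

fn clamp_partitions(requested: usize) -> usize {
    // At least one partition, at most the configured maximum.
    requested.clamp(1, MAX_QUERY_PARTITIONS)
}

fn main() {
    assert_eq!(clamp_partitions(0), 1);
    assert_eq!(clamp_partitions(50), 50);
    assert_eq!(clamp_partitions(10_000), 1000);
}
```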


688-691: Approved: Fast path optimization in match_file

The addition of a fast path for simple cases is an excellent optimization that can significantly improve performance.

Consider adding a brief comment explaining the fast path condition, such as:

// Fast path: If there are no partition keys or the source key doesn't contain '=',
// we can skip the more complex matching logic.

This will help future maintainers understand the purpose of this optimization.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 68a9fd4 and 1352ded.

📒 Files selected for processing (1)
  • src/service/search/mod.rs (6 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
src/service/search/mod.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

🔇 Additional comments (2)
src/service/search/mod.rs (2)

284-284: Improved timestamp column retrieval

The use of cache::cacher::get_ts_col for retrieving the timestamp column is a good optimization. It likely improves performance by caching the result.


Line range hint 693-707: Improved filter generation in match_file

The use of generate_filter_from_equal_items to consolidate filter generation logic is a good refactoring. It improves code organization and potentially reusability.
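The consolidation being praised can be sketched as follows (only the helper's name comes from the diff; its exact signature here is an assumption):

```rust
use std::collections::HashMap;

// Group `field = value` equality predicates by field, so each field
// maps to the set of values it may take (an implicit OR per field).
fn generate_filter_from_equal_items(
    items: &[(String, String)],
) -> HashMap<String, Vec<String>> {
    let mut filters: HashMap<String, Vec<String>> = HashMap::new();
    for (field, value) in items {
        filters.entry(field.clone()).or_default().push(value.clone());
    }
    filters
}

fn main() {
    let items = vec![
        ("env".to_string(), "prod".to_string()),
        ("env".to_string(), "staging".to_string()),
        ("region".to_string(), "eu".to_string()),
    ];
    let filters = generate_filter_from_equal_items(&items);
    assert_eq!(filters["env"], ["prod", "staging"]);
    assert_eq!(filters["region"], ["eu"]);
}
```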

@hengfeiyang hengfeiyang merged commit 309cdda into main Oct 1, 2024
@hengfeiyang hengfeiyang deleted the fix/query-filelist branch October 1, 2024 15:21