Conversation


@YashodhanJoshi1 YashodhanJoshi1 commented Sep 23, 2024

Fixes: #4601

This runs a query to get and cache existing indices before running queries to create indices. This way we only run the create index... command if the index does not exist.

Note that since we create indices at the start/init of the application, this does not add the newly created indices to the cache: once an index is created at init, no other code path will retry creating it. The issue only occurs at pod start, i.e. the first application init, so this only caches pre-existing indices, i.e. ones that existed before the pod started.

As this cache is in-memory, it is per-pod / per-node, so when starting on a clean DB this can still cause lock-ups if all nodes try to create the indices on the blank table simultaneously. I have not handled that, since index creation on a blank table is not very time-consuming anyway, and handling it would require etcd or something similar to coordinate the cache.
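The mechanism can be sketched with std types only. `DBIndex` and `INDICES` mirror the names this PR introduces in src/infra/src/db/mod.rs, but the real cache is populated by a query against the database's index catalog; the `cache_indices` stub below is a hypothetical stand-in for that query:

```rust
use std::collections::HashSet;
use std::sync::OnceLock;

// Mirrors the DBIndex struct added in src/infra/src/db/mod.rs (fields assumed).
#[derive(Hash, PartialEq, Eq, Debug)]
struct DBIndex {
    name: String,
    table: String,
}

// Per-process cache of indices that existed before startup.
static INDICES: OnceLock<HashSet<DBIndex>> = OnceLock::new();

// Stand-in for the real cache_indices_* query against the index catalog.
fn cache_indices() -> HashSet<DBIndex> {
    let mut set = HashSet::new();
    set.insert(DBIndex { name: "meta_module_idx".into(), table: "meta".into() });
    set
}

// Returns true when the CREATE INDEX statement can be skipped.
fn index_exists(name: &str, table: &str) -> bool {
    let indices = INDICES.get_or_init(cache_indices);
    indices.contains(&DBIndex { name: name.into(), table: table.into() })
}

fn main() {
    assert!(index_exists("meta_module_idx", "meta"));
    assert!(!index_exists("meta_module_start_dt_idx", "meta"));
    println!("ok");
}
```

Because the cache is filled once per process, an index created after init is invisible to this check; that matches the per-pod limitation described above.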

PS: Maybe we can also solve this by adding IF NOT EXISTS everywhere; currently only some queries use it.

Summary by CodeRabbit

  • New Features

    • Introduced a mechanism to check for existing database indices before creating or dropping them across MySQL, PostgreSQL, and SQLite implementations.
    • Added new SQL index creation logic for various tables, ensuring that indices are only created if they do not already exist.
  • Bug Fixes

    • Enhanced error handling during index creation and deletion to prevent redundant operations and improve reliability.
  • Documentation

    • Updated logging statements for clarity regarding index creation and error contexts.


coderabbitai bot commented Sep 23, 2024

Walkthrough

The changes introduce new functionality for managing database indices across multiple database implementations (MySQL, PostgreSQL, SQLite). This includes the addition of a DBIndex struct, caching mechanisms for indices, and logic to check for existing indices before creating or dropping them in various database operations. The modifications enhance the structure and maintainability of index management.

Changes

File(s) Change Summary
src/infra/src/db/mod.rs Added DBIndex struct, public static variable INDICES, and caching for database indices.
src/infra/src/db/sqlite.rs Implemented create_index, delete_index, and cache_indices functions for index management.
src/infra/src/db/mysql.rs Added cache_indices, create_index, and delete_index functions for index management.
src/infra/src/db/postgres.rs Introduced cache_indices, create_index, and delete_index functions for index management.
src/infra/src/file_list/mysql.rs Refactored create_table_index to use create_index for managing indices.
src/infra/src/file_list/postgres.rs Updated create_table_index to utilize create_index for index creation.
src/infra/src/file_list/sqlite.rs Refactored create_table_index to call create_index for creating indices.
src/infra/src/scheduler/mysql.rs Modified create_table_index to call create_index for index creation.
src/infra/src/scheduler/postgres.rs Updated create_table_index to utilize create_index for index management.
src/infra/src/scheduler/sqlite.rs Refactored create_table_index to use create_index for creating indices.
src/infra/src/schema/history/mysql.rs Replaced SQL statements in create_table_index with calls to create_index.
src/infra/src/schema/history/postgres.rs Updated create_table_index to use create_index for index creation.
src/infra/src/schema/history/sqlite.rs Refactored create_table_index to call create_index for managing indices.

Suggested labels

☢️ Bug

Suggested reviewers

  • Subhra264



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 16

Outside diff range and nitpick comments (15)
src/infra/src/schema/history/sqlite.rs (1)

112-134: LGTM: Improved index naming and existence checking

The changes improve the code in two ways:

  1. More descriptive index names enhance readability and maintainability.
  2. The new logic for checking existing indices prevents redundant index creation, aligning with the PR objectives.

Minor suggestion: Consider extracting the index names into constants at the top of the file for easier maintenance and reuse.

const SCHEMA_HISTORY_ORG_IDX: &str = "schema_history_org_idx";
const SCHEMA_HISTORY_STREAM_IDX: &str = "schema_history_stream_idx";
const SCHEMA_HISTORY_STREAM_VERSION_IDX: &str = "schema_history_stream_version_idx";
src/infra/src/db/mod.rs (4)

245-257: LGTM: cache_indices_mysql function with suggestion.

The cache_indices_mysql function is well-implemented. It correctly queries the MySQL database for index information and handles potential errors.

Consider adding error logging when the query fails. This could help with debugging in production environments:

Err(e) => {
    log::error!("Failed to cache MySQL indices: {}", e);
    HashSet::new()
}

259-271: LGTM: cache_indices_pg function with suggestion.

The cache_indices_pg function is well-implemented. It correctly queries the PostgreSQL database for index information and handles potential errors.

Consider adding error logging when the query fails, similar to the suggestion for the MySQL function:

Err(e) => {
    log::error!("Failed to cache PostgreSQL indices: {}", e);
    HashSet::new()
}

273-285: LGTM: cache_indices_sqlite function with suggestion.

The cache_indices_sqlite function is well-implemented. It correctly queries the SQLite database for index information and handles potential errors.

Consider adding error logging when the query fails, similar to the suggestions for the MySQL and PostgreSQL functions:

Err(e) => {
    log::error!("Failed to cache SQLite indices: {}", e);
    HashSet::new()
}

239-285: Overall implementation aligns well with PR objectives.

The introduced changes effectively implement the caching mechanism for database indices across MySQL, PostgreSQL, and SQLite. This aligns well with the PR objective of checking for index existence before running create index commands.

Some points to consider:

  1. The caching mechanism is implemented as described in the PR summary, operating during the initial startup of the application.
  2. The implementation is consistent across all three database types, which is good for maintainability.
  3. As mentioned in the PR summary, this approach may not capture indices created during initialization. Consider adding a comment in the code to highlight this limitation.

To address the potential issue of lock-ups in a multi-node scenario, consider implementing a distributed locking mechanism or exploring the use of IF NOT EXISTS in all relevant queries, as suggested in the PR summary.

src/infra/src/scheduler/postgres.rs (1)

85-95: Redundant index existence checks

The code checks for existing indices before creating them, and the SQL statements already include IF NOT EXISTS. This duplication might be unnecessary. Consider removing the code-level checks and relying solely on the SQL IF NOT EXISTS clause.

src/infra/src/scheduler/sqlite.rs (1)

95-95: Consider renaming variable idx to index_name for clarity

The variable idx represents an index name, not an index position. Renaming it to index_name or name will enhance readability and reduce potential confusion.

src/infra/src/db/postgres.rs (1)

Line range hint 614-629: Incorrect condition when creating index

In the add_start_dt_column function, the condition for creating the index meta_module_start_dt_idx is incorrect. Currently, it checks if the index exists and then proceeds to create it. The logic should be inverted to create the index only if it doesn't already exist.

Apply this diff to fix the condition:

-if indices.contains(&DBIndex {
+if !indices.contains(&DBIndex {
     name: "meta_module_start_dt_idx".into(),
     table: "meta".into(),
 }) {
     // Proceed to create the index
 }

This ensures that the index is only created if it does not already exist, aligning with the intended functionality.

src/infra/src/db/sqlite.rs (2)

Line range hint 743-751: Match specific error types instead of error message strings

In add_start_dt_column, the code checks for a duplicate column error by matching the error message string:

if !e.to_string().contains("duplicate column name")

Relying on error message strings is brittle and may fail if messages change or differ across database versions or locales. It's better to match on the specific error types or codes provided by sqlx or the database driver.

Apply this diff to improve error handling:

 if let Err(e) =
     sqlx::query(r#"ALTER TABLE meta ADD COLUMN start_dt INTEGER NOT NULL DEFAULT 0;"#)
         .execute(client)
         .await
 {
-    // Check if the error is about the duplicate column
-    if !e.to_string().contains("duplicate column name") {
-        // If the error is not about the duplicate column, return it
-        return Err(e.into());
+    if let sqlx::Error::Database(db_err) = &e {
+        if db_err.message().contains("duplicate column name") {
+            // Column already exists, ignore the error
+        } else {
+            // Return other database errors
+            return Err(e.into());
+        }
+    } else {
+        // Return non-database errors
+        return Err(e.into());
     }
 }

Line range hint 689-775: Ensure indices cache is updated after modifying indices

After creating or dropping indices in create_table and add_start_dt_column, the INDICES cache may become stale. To prevent inconsistencies, consider refreshing the INDICES cache after modifying indices.

Apply this diff to update the indices cache:

     }
 
+    // Refresh indices cache after modifying indices
+    *INDICES.get_mut().unwrap() = cache_indices_sqlite(client).await;
 
     Ok(())
 }
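Note that the diff above assumes `INDICES.get_mut()` is available, which a static cell does not provide (get_mut needs exclusive access). If the cache should stay fresh after modifications, a static with interior mutability is one option. A std-only sketch, with hypothetical names and the index set reduced to plain strings:

```rust
use std::collections::HashSet;
use std::sync::RwLock;

// A refreshable cache needs interior mutability; a OnceCell-style static
// cannot be reassigned after initialization without a &mut reference.
static INDICES: RwLock<Option<HashSet<String>>> = RwLock::new(None);

// Record a successful CREATE INDEX so later checks see it.
fn record_created(index_name: &str) {
    let mut guard = INDICES.write().unwrap();
    guard.get_or_insert_with(HashSet::new).insert(index_name.to_string());
}

fn exists(index_name: &str) -> bool {
    INDICES
        .read()
        .unwrap()
        .as_ref()
        .map_or(false, |s| s.contains(index_name))
}

fn main() {
    assert!(!exists("meta_module_idx"));
    record_created("meta_module_idx"); // update cache right after CREATE INDEX succeeds
    assert!(exists("meta_module_idx"));
    println!("ok");
}
```

Updating the cache entry immediately after each successful create/drop avoids re-querying the catalog, at the cost of lock overhead on every check.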
src/infra/src/db/mysql.rs (1)

593-593: Potential issue with indices caching mechanism.

Caching the indices at the beginning of the function may cause the INDICES cache to become outdated if indices are created or dropped afterward within the same runtime. Since the cache is in-memory and not refreshed, subsequent index operations may operate on stale data. Consider refreshing the INDICES cache after index creation or deletion to ensure it reflects the current state of the database.

Also applies to: 664-664

src/infra/src/file_list/postgres.rs (3)

Line range hint 1254-1277: Incorrect logging condition in duplicate record deletion loop.

The condition if i / 1000 == 0 will always be true when i is less than 1000, causing excessive logging.

Update the condition to correctly log progress every 1000 iterations.

Apply this diff to fix the logging condition:

- if i / 1000 == 0 {
+ if i % 1000 == 0 {
     log::warn!("[POSTGRES] delete duplicate records: {}/{}", i, ret.len());
}
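The difference is easy to verify in isolation. This minimal sketch (hypothetical iteration count, not the actual deletion loop) shows that `i / 1000 == 0` fires for every i below 1000, while `i % 1000 == 0` fires once per thousand iterations:

```rust
fn main() {
    let total = 2500usize;
    // Buggy condition: true for every i in 0..1000, so it logs 1000 times.
    let div_hits = (0..total).filter(|i| i / 1000 == 0).count();
    // Fixed condition: true only at i = 0, 1000, 2000.
    let mod_hits = (0..total).filter(|i| i % 1000 == 0).count();
    assert_eq!(div_hits, 1000);
    assert_eq!(mod_hits, 3);
    println!("div: {div_hits}, mod: {mod_hits}");
}
```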

Line range hint 1254-1286: Inefficient deletion of duplicate records.

Deleting duplicates one by one can be time-consuming for large datasets.

Consider performing the deletion in a single SQL query using DELETE ... WHERE id IN (...) or using a CTE (Common Table Expression) to improve performance.
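One way to batch the deletes is to build a single `DELETE ... WHERE id IN (...)` statement from the collected ids. The helper below is a hypothetical sketch: the ids are placeholders for those gathered by the duplicate-finding SELECT, and the real logic must keep the `min(id)` row per (stream, date, file) group rather than delete every listed id:

```rust
// Builds one batched DELETE instead of issuing one query per duplicate row.
fn batched_delete_sql(table: &str, ids: &[i64]) -> String {
    let id_list = ids
        .iter()
        .map(|id| id.to_string())
        .collect::<Vec<_>>()
        .join(", ");
    format!("DELETE FROM {table} WHERE id IN ({id_list});")
}

fn main() {
    let sql = batched_delete_sql("file_list", &[3, 17, 42]);
    assert_eq!(sql, "DELETE FROM file_list WHERE id IN (3, 17, 42);");
    println!("{sql}");
}
```

For very large id sets, chunking the IN list (or using a CTE, as suggested above) keeps individual statements within parameter and packet limits.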


Line range hint 1249-1286: Lack of transactional integrity during index recreation.

The process of deleting duplicates and recreating the index is not enclosed in a transaction, which may lead to data inconsistency if an error occurs midway.

Wrap the deletion and index recreation within a transaction to ensure atomicity.

Apply this diff to include transaction handling:

+ let mut tx = pool.begin().await?;
  // delete duplicate records
  log::warn!("[POSTGRES] starting delete duplicate records");
  let ret = sqlx::query(
      r#"SELECT stream, date, file, min(id) as id FROM file_list GROUP BY stream, date, file HAVING COUNT(*) > 1;"#
  )
  .fetch_all(&mut *tx).await?;
  // ... existing deletion logic ...
  // create index again
  sqlx::query(unique_index_sql).execute(&mut *tx).await?;
+ tx.commit().await?;
src/infra/src/file_list/mysql.rs (1)

1280-1283: Use specific error codes instead of parsing error messages

Relying on error message strings like "Duplicate key" is fragile and may not handle localization or changes in error messages. Consider checking for specific database error codes to handle duplicate key errors more robustly.
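A code-based check can be sketched as follows. MySQL reports duplicate keys as ER_DUP_ENTRY, vendor error code 1062 (SQLSTATE 23000); with sqlx this code would come from the database error's `code()` accessor, modeled here as a plain `Option<&str>` so the sketch stays self-contained:

```rust
// Classifies a duplicate-key failure by error code instead of message text,
// which survives localization and message wording changes.
fn is_duplicate_key(code: Option<&str>) -> bool {
    // 1062 = MySQL ER_DUP_ENTRY; 23000 = the corresponding SQLSTATE class.
    matches!(code, Some("1062") | Some("23000"))
}

fn main() {
    assert!(is_duplicate_key(Some("1062")));
    assert!(is_duplicate_key(Some("23000")));
    assert!(!is_duplicate_key(Some("1054"))); // unknown column, a different failure
    assert!(!is_duplicate_key(None));
    println!("ok");
}
```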

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between ba70302 and ae6a31e.

Files selected for processing (13)
  • src/infra/src/db/mod.rs (3 hunks)
  • src/infra/src/db/mysql.rs (5 hunks)
  • src/infra/src/db/postgres.rs (6 hunks)
  • src/infra/src/db/sqlite.rs (4 hunks)
  • src/infra/src/file_list/mysql.rs (3 hunks)
  • src/infra/src/file_list/postgres.rs (3 hunks)
  • src/infra/src/file_list/sqlite.rs (2 hunks)
  • src/infra/src/scheduler/mysql.rs (2 hunks)
  • src/infra/src/scheduler/postgres.rs (2 hunks)
  • src/infra/src/scheduler/sqlite.rs (2 hunks)
  • src/infra/src/schema/history/mysql.rs (2 hunks)
  • src/infra/src/schema/history/postgres.rs (2 hunks)
  • src/infra/src/schema/history/sqlite.rs (2 hunks)
Additional context used
Path-based instructions (13)

The same instruction applies to all 13 processed .rs files listed above:

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Additional comments not posted (32)
src/infra/src/schema/history/sqlite.rs (3)

21-21: LGTM: New import aligns with PR objectives

The added import brings in the necessary components for the index caching mechanism, which is consistent with the PR's goal of implementing index existence checks.


136-136: LGTM: Improved error logging

The error logging message has been updated to be more specific, mentioning the "schema_history" table. This change enhances error traceability and makes debugging easier.


Line range hint 1-141: Overall assessment: Changes align with PR objectives and improve code quality

The modifications in this file successfully implement the index existence checking mechanism as described in the PR objectives. The changes improve code readability, prevent redundant index creation, and enhance error logging. The only suggestion is a minor refactoring to extract index names as constants for easier maintenance.

src/infra/src/db/mod.rs (2)

16-16: LGTM: New imports and static variable.

The new imports and static variable are correctly added to support the new functionality for managing database indices.

Also applies to: 22-22, 39-39


239-243: LGTM: DBIndex struct definition.

The DBIndex struct is well-defined with appropriate fields and derives the necessary traits for its intended use in a HashSet.

src/infra/src/schema/history/postgres.rs (5)

21-21: Updated imports to include index caching functionality

The import statement now includes cache_indices_pg, DBIndex, and INDICES, which are required for caching indices and checking their existence before creation.


110-110: Cache existing indices before creating new ones

Initializing indices with INDICES.get_or_init(|| cache_indices_pg(&pool)).await; ensures that existing indices are cached and checked before attempting to create new ones. This enhances efficiency and prevents redundant operations.


113-121: Use specific index names with conditional creation

The SQL statements now define specific index names and use CREATE INDEX IF NOT EXISTS, which prevents errors from attempting to create indices that already exist.


125-131: Check for existing indices before creation

The loop now checks if each index exists in indices before attempting to create it:

if indices.contains(&DBIndex {
    name: idx.into(),
    table: "schema_history".into(),
}) {
    continue;
}

This prevents unnecessary creation attempts and potential errors.


133-133: Improve error logging specificity

The error message now includes context:

log::error!("[POSTGRES] create table schema_history index error: {}", e);

This makes it clearer where the error occurred, aiding in debugging.

src/infra/src/schema/history/mysql.rs (5)

21-21: Imported modules are appropriate

The added imports are necessary for the index caching functionality and are correctly specified.


109-109: Efficient initialization of the indices cache

Initializing indices using INDICES.get_or_init(|| cache_indices_mysql(&pool)).await ensures that the index cache is populated once and reused, which improves performance by avoiding redundant database queries.


112-120: Index creation statements are correctly defined

The index names and corresponding SQL statements for the schema_history table are accurately specified, aligning with the intended schema.


124-130: Index existence check logic is accurate

The check for existing indices using the indices.contains() method before creating new ones effectively prevents attempts to create duplicate indices, which could cause errors.


136-136: Consistent error logging

The error message provides clear context for any index creation errors related to the schema_history table, aiding in troubleshooting.

src/infra/src/scheduler/sqlite.rs (1)

80-80: Verify that get_or_init supports asynchronous initialization

The use of INDICES.get_or_init(|| cache_indices_sqlite(&*client)).await may not work as intended because get_or_init from OnceCell does not support asynchronous initializers out of the box. This could lead to race conditions or panics if accessed concurrently.

Please confirm that INDICES is an async-aware cell, e.g. tokio::sync::OnceCell, whose get_or_init accepts an async initializer. If it is the synchronous once_cell::sync::OnceCell (whose get_or_init takes a plain closure), refactor the initialization logic to ensure thread safety and correct behavior.

src/infra/src/scheduler/mysql.rs (4)

23-23: Importing necessary database components

The addition of cache_indices_mysql, DBIndex, and INDICES to the import statements is appropriate for managing index caching.


81-81: Proper initialization of index cache

The code correctly initializes and caches the existing indices using INDICES.get_or_init(|| cache_indices_mysql(&pool)).await;, ensuring that index existence checks are efficient.


84-96: Index definitions are correctly specified

The index creation queries for the scheduled_jobs table are well-defined, which will help optimize query performance:

  • scheduled_jobs_key_idx on (module_key)
  • scheduled_jobs_org_key_idx on (org, module_key)
  • scheduled_jobs_org_module_key_idx on (org, module, module_key)

98-103: Efficient check for existing indices before creation

The logic to check if an index already exists before attempting creation is implemented correctly, preventing redundant operations:

if indices.contains(&DBIndex {
    name: idx.into(),
    table: "scheduled_jobs".into(),
}) {
    continue;
}
src/infra/src/db/postgres.rs (1)

29-29: Importing necessary modules

The import of cache_indices_pg, DBIndex, and INDICES is appropriate for the new index caching functionality.

src/infra/src/db/mysql.rs (4)

29-29: Import statements are appropriate.

The necessary modules cache_indices_mysql, DBIndex, and INDICES are correctly imported.


593-593: Indices initialization is correctly implemented.

The indices are properly cached and initialized asynchronously.


664-664: Indices initialization is correctly implemented.

The indices are properly cached and initialized asynchronously.


686-692: Consider refactoring to reduce code duplication.

As previously mentioned, the index checking and creation logic for meta_module_start_dt_idx is repetitive. Refactoring this into a reusable function can improve maintainability.

src/infra/src/file_list/sqlite.rs (2)

28-32: Import necessary database modules

The added imports correctly bring in the modules for caching and index management.


1223-1230: Efficiently skip existing indices

The logic to cache and check for existing indices before creation is efficient and prevents unnecessary operations.

src/infra/src/file_list/postgres.rs (5)

28-28: Approved import of INDICES for index caching.

The addition of INDICES to the imports ensures that the index cache is accessible in this module.


1172-1172: Asynchronous initialization of INDICES is appropriate.

Using get_or_init with an asynchronous initializer allows for lazy initialization of the index cache without blocking.


1175-1225: Efficient index creation by checking for existing indices before creation.

The addition of index names and tables to the sqls vector, along with the existence check against indices, prevents redundant index creation and improves performance.


1230-1236: Correctly skipping existing indices during index creation loop.

The loop properly checks if each index exists and continues to the next iteration if it does, avoiding unnecessary SQL execution.


Line range hint 1172-1286: Concurrency considerations with INDICES cache initialization.

Ensure that INDICES is thread-safe and suitable for asynchronous contexts to prevent race conditions during initialization.

Confirm that INDICES is using an appropriate async-aware synchronization primitive like tokio::sync::OnceCell.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (12)
src/infra/src/scheduler/mysql.rs (1)

85-109: LGTM: Improved index creation implementation

The changes to the create_table_index function look good. The use of the create_index function improves efficiency by handling index existence checks internally. The indices being created are consistent with the previous implementation, and the unique constraint is correctly applied.

Consider extracting the table name "scheduled_jobs" into a constant to improve readability and maintainability. This would make it easier to update the table name if needed in the future.

const SCHEDULED_JOBS_TABLE: &str = "scheduled_jobs";

// Then use it in the function calls:
create_index(
    &pool,
    "scheduled_jobs_key_idx",
    SCHEDULED_JOBS_TABLE,
    false,
    &["module_key"],
)
.await?;
// ... and so on for the other create_index calls
src/infra/src/db/postgres.rs (3)

606-617: Improved index management in add_start_dt_column function

The modifications to the add_start_dt_column function enhance consistency by using the new create_index and delete_index functions. This approach improves safety and maintainability.

Consider adding error handling for the create_index and delete_index calls to ensure the function fails gracefully if these operations encounter issues.

- create_index(
-     &pool,
-     "meta_module_start_dt_idx",
-     "meta",
-     true,
-     &["module", "key1", "key2", "start_dt"],
- )
- .await?;
- delete_index(&pool, "meta_module_key2_idx", "meta").await?;
+ if let Err(e) = create_index(
+     &pool,
+     "meta_module_start_dt_idx",
+     "meta",
+     true,
+     &["module", "key1", "key2", "start_dt"],
+ )
+ .await
+ {
+     log::error!("[POSTGRES] Error creating meta_module_start_dt_idx: {}", e);
+     return Err(e.into());
+ }
+ if let Err(e) = delete_index(&pool, "meta_module_key2_idx", "meta").await {
+     log::error!("[POSTGRES] Error deleting meta_module_key2_idx: {}", e);
+     return Err(e.into());
+ }

648-674: Well-implemented create_index function

The create_index function is efficiently implemented, using a cache to avoid unnecessary database queries and constructing a safe SQL query. The logging is appropriate and informative.

Consider adding more detailed error logging to provide context about which index creation failed:

- sqlx::query(&sql).execute(client).await?;
+ sqlx::query(&sql).execute(client).await.map_err(|e| {
+     log::error!("[POSTGRES] Failed to create index {}: {}", idx_name, e);
+     e
+ })?;

676-689: Well-implemented delete_index function

The delete_index function is efficiently implemented, using a cache to avoid unnecessary database queries and constructing a safe SQL query. The logging is appropriate and informative.

Similar to the create_index function, consider adding more detailed error logging:

- sqlx::query(&sql).execute(client).await?;
+ sqlx::query(&sql).execute(client).await.map_err(|e| {
+     log::error!("[POSTGRES] Failed to delete index {}: {}", idx_name, e);
+     e
+ })?;
src/infra/src/db/mysql.rs (2)

54-67: Consider using a more descriptive variable name for the SQL query.

The intermediate variable var_name is not descriptive, and the alias let sql = var_name; is redundant. Assign the query string to sql directly:

-    let var_name = r#"SELECT INDEX_NAME,TABLE_NAME FROM information_schema.statistics;"#;
-    let sql = var_name;
+    let sql = r#"SELECT INDEX_NAME,TABLE_NAME FROM information_schema.statistics;"#;

Line range hint 659-690: LGTM: Improved index management in add_start_dt_column.

The modifications to use the new create_index and delete_index functions are well-implemented. However, there's an outdated comment that needs to be addressed.

Please update or remove the following comment as it's no longer relevant to the current implementation:

-        // Check for the specific MySQL error code for duplicate column
+        // Check for existing column to avoid duplicate column error
src/infra/src/db/sqlite.rs (5)

724-741: LGTM: Improved index creation using new create_index function

The changes to the create_table function look good. Using the new create_index function for creating indices improves code consistency and maintainability. The indices created ("meta_module_idx", "meta_module_key1_idx", and "meta_module_start_dt_idx") seem appropriate for the table structure.

Consider extracting the index names and table name into constants at the top of the file for easier maintenance and to avoid magic strings.
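A sketch of the suggested extraction; the constant names are hypothetical, while the index and table names come from the review above:

```rust
// Hypothetical constants for the sqlite.rs index and table names;
// extracting them avoids repeating magic strings at each call site.
const META_TABLE: &str = "meta";
const META_MODULE_IDX: &str = "meta_module_idx";
const META_MODULE_KEY1_IDX: &str = "meta_module_key1_idx";
const META_MODULE_START_DT_IDX: &str = "meta_module_start_dt_idx";

fn main() {
    // Every CREATE/DROP statement can now be built from the same constants.
    for idx in [META_MODULE_IDX, META_MODULE_KEY1_IDX, META_MODULE_START_DT_IDX] {
        println!("CREATE INDEX IF NOT EXISTS {} ON {} (...);", idx, META_TABLE);
    }
}
```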


760-768: LGTM: Improved index management in add_start_dt_column

The changes to the add_start_dt_column function look good. Using the new create_index and delete_index functions improves code consistency. Creating a new index with the start_dt column and deleting the old one is a good practice for maintaining optimal database performance.

Consider adding a comment explaining why the "meta_module_key2_idx" index is being deleted, as it might not be immediately obvious to future maintainers.


797-823: LGTM: Well-implemented create_index function

The create_index function is well-implemented. It efficiently uses the INDICES cache to avoid unnecessary index creation attempts, handles both unique and non-unique indices, and includes appropriate logging.

Consider adding error logging in case the index creation fails. This would help with debugging in production environments.

-sqlx::query(&sql).execute(client).await?;
-log::info!("[SQLITE] index {} created successfully", idx_name);
+match sqlx::query(&sql).execute(client).await {
+    Ok(_) => log::info!("[SQLITE] index {} created successfully", idx_name),
+    Err(e) => {
+        log::error!("[SQLITE] failed to create index {}: {}", idx_name, e);
+        return Err(e.into());
+    }
+}
 Ok(())

825-838: LGTM: Well-implemented delete_index function

The delete_index function is well-implemented. It efficiently uses the INDICES cache to avoid unnecessary index deletion attempts and includes appropriate logging.

Consider adding error logging in case the index deletion fails. Also, there's a minor typo in the success log message (missing space after "index").

 let sql = format!("DROP INDEX IF EXISTS {};", idx_name,);
-sqlx::query(&sql).execute(client).await?;
-log::info!("[SQLITE] index {}deleted successfully", idx_name);
+match sqlx::query(&sql).execute(client).await {
+    Ok(_) => log::info!("[SQLITE] index {} deleted successfully", idx_name),
+    Err(e) => {
+        log::error!("[SQLITE] failed to delete index {}: {}", idx_name, e);
+        return Err(e.into());
+    }
+}
 Ok(())

Line range hint 100-838: Overall: Good improvements to index management, with room for minor enhancements

The changes introduce well-implemented functionality for managing database indices, including caching existing indices for performance optimization. The code is generally well-structured and follows good practices.

Consider creating a separate module or struct for index management, which could encapsulate the INDICES cache and provide methods for creating, deleting, and checking indices. This would further improve code organization and make it easier to maintain and potentially extend index-related functionality in the future.

Example structure:

mod index_manager {
    use std::collections::HashSet;

    use once_cell::sync::Lazy;
    use sqlx::{Pool, Sqlite};
    use tokio::sync::RwLock;

    use super::DBIndex;

    static INDICES: Lazy<RwLock<HashSet<DBIndex>>> = Lazy::new(|| RwLock::new(HashSet::new()));

    pub struct IndexManager;

    impl IndexManager {
        pub async fn init(client: &Pool<Sqlite>) {
            let indices = cache_indices(client).await;
            *INDICES.write().await = indices;
        }

        pub async fn create_index(/* ... */) -> Result<()> {
            // Implementation
        }

        pub async fn delete_index(/* ... */) -> Result<()> {
            // Implementation
        }

        pub async fn index_exists(/* ... */) -> bool {
            // Implementation
        }

        // Other index-related methods...
    }

    async fn cache_indices(client: &Pool<Sqlite>) -> HashSet<DBIndex> {
        // Implementation
    }
}

This structure would allow for better encapsulation of index-related functionality and make the code more modular and easier to maintain.

src/infra/src/file_list/mysql.rs (1)

1247-1299: Special case handling for file_list table index

The implementation includes a special case for creating a unique index on the file_list table. It handles potential duplicate entries by deleting them before creating the index. This approach ensures data integrity but has some considerations:

  1. The error message check on line 1257 might be too broad. Consider using a more specific error code check.
  2. The duplicate record deletion process could potentially be optimized for large datasets.

Consider the following improvements:

  1. Use a more specific error code check:
-if !e.to_string().contains("Duplicate entry") {
+if let Some(db_err) = e.as_database_error() {
+    if db_err.code() != Some("1062".into()) { // MySQL error code for duplicate entry
         return Err(e);
+    }
+} else {
+    return Err(e);
}
  2. For large datasets, consider batching the delete operations. Note that VALUES() is only valid in INSERT ... ON DUPLICATE KEY UPDATE, so the delete should key on the duplicate row ids directly (assuming ret holds the ids of the rows to remove):
let chunk_size = 1000;
for chunk in ret.chunks(chunk_size) {
    let mut query_builder: QueryBuilder<MySql> =
        QueryBuilder::new("DELETE FROM file_list WHERE id IN (");
    let mut ids = query_builder.separated(", ");
    for r in chunk {
        ids.push_bind(r.get::<i64, _>("id"));
    }
    query_builder.push(")");
    query_builder.build().execute(&pool).await?;
    log::warn!("[MYSQL] Deleted {} duplicate records", chunk.len());
}
Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 5f148a4 and 8ec0336.

Files selected for processing (13)
  • src/infra/src/db/mod.rs (3 hunks)
  • src/infra/src/db/mysql.rs (6 hunks)
  • src/infra/src/db/postgres.rs (6 hunks)
  • src/infra/src/db/sqlite.rs (6 hunks)
  • src/infra/src/file_list/mysql.rs (2 hunks)
  • src/infra/src/file_list/postgres.rs (4 hunks)
  • src/infra/src/file_list/sqlite.rs (4 hunks)
  • src/infra/src/scheduler/mysql.rs (2 hunks)
  • src/infra/src/scheduler/postgres.rs (2 hunks)
  • src/infra/src/scheduler/sqlite.rs (2 hunks)
  • src/infra/src/schema/history/mysql.rs (2 hunks)
  • src/infra/src/schema/history/postgres.rs (2 hunks)
  • src/infra/src/schema/history/sqlite.rs (2 hunks)
Files skipped from review as they are similar to previous changes (8)
  • src/infra/src/db/mod.rs
  • src/infra/src/file_list/postgres.rs
  • src/infra/src/file_list/sqlite.rs
  • src/infra/src/scheduler/postgres.rs
  • src/infra/src/scheduler/sqlite.rs
  • src/infra/src/schema/history/mysql.rs
  • src/infra/src/schema/history/postgres.rs
  • src/infra/src/schema/history/sqlite.rs
Additional context used
Path-based instructions (5)
src/infra/src/db/mysql.rs (1)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

The same pattern applies to src/infra/src/db/postgres.rs, src/infra/src/db/sqlite.rs, src/infra/src/file_list/mysql.rs, and src/infra/src/scheduler/mysql.rs.
Additional comments not posted (10)
src/infra/src/scheduler/mysql.rs (1)

23-26: LGTM: Import statement updated correctly

The import statement has been appropriately updated to include the create_index function, which is consistent with its usage in the create_table_index function.

src/infra/src/db/postgres.rs (2)

54-66: New function cache_indices looks good

The cache_indices function efficiently retrieves and caches existing database indices. It uses a SQL query to fetch index names and table names, then maps the results to a HashSet<DBIndex>. The error handling is appropriate, returning an empty HashSet if the query fails.
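The mapping step can be sketched with plain data in place of a live pool; the DBIndex field names are assumed from the walkthrough, and the real function reads sqlx rows:

```rust
use std::collections::HashSet;

// Assumed shape of DBIndex from the PR walkthrough.
#[derive(Hash, PartialEq, Eq)]
struct DBIndex {
    name: String,
    table: String,
}

// Stand-in for cache_indices: turn (index_name, table_name) pairs into a set.
fn cache_indices(rows: &[(&str, &str)]) -> HashSet<DBIndex> {
    rows.iter()
        .map(|(name, table)| DBIndex {
            name: name.to_string(),
            table: table.to_string(),
        })
        .collect()
}

fn main() {
    let cache = cache_indices(&[("meta_module_idx", "meta"), ("meta_module_key1_idx", "meta")]);
    // create_index can now skip work when the index is already cached.
    let hit = cache.contains(&DBIndex { name: "meta_module_idx".into(), table: "meta".into() });
    println!("{}", hit); // prints "true"
}
```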


570-584: Improved index creation in create_table function

The modifications to the create_table function enhance code maintainability by using the new create_index function for creating indices. This centralized approach allows for better error handling and logging. The changes are well-structured and consistent.

src/infra/src/db/mysql.rs (4)

635-649: LGTM: Improved index creation using the new create_index function.

The modifications to use the new create_index function for creating indices are well-implemented. This approach is more consistent and allows for better error handling and index existence checking.


740-766: LGTM: Well-implemented create_index function.

The new create_index function is well-designed and implemented. It efficiently checks for existing indices using the cache, constructs the appropriate SQL statement, and handles the index creation process with proper logging and error handling.
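The statement construction described here can be illustrated in isolation; the parameter names mirror the review text, but this is a sketch, not the exact PR code:

```rust
// Build a CREATE INDEX statement the way create_index is described to:
// optional UNIQUE keyword, then index name, table, and comma-joined fields.
fn build_create_index(idx_name: &str, table: &str, unique: bool, fields: &[&str]) -> String {
    let unique_kw = if unique { "UNIQUE " } else { "" };
    format!(
        "CREATE {}INDEX {} ON {} ({});",
        unique_kw, idx_name, table, fields.join(", ")
    )
}

fn main() {
    let sql = build_create_index(
        "meta_module_start_dt_idx",
        "meta",
        true,
        &["module", "key1", "key2", "start_dt"],
    );
    // prints: CREATE UNIQUE INDEX meta_module_start_dt_idx ON meta (module, key1, key2, start_dt);
    println!("{}", sql);
}
```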


768-781: LGTM: Well-implemented delete_index function.

The new delete_index function is well-designed and implemented. It efficiently checks for existing indices using the cache, constructs the appropriate SQL statement, and handles the index deletion process with proper logging and error handling.


Line range hint 1-781: Summary: Improved index management with caching mechanism.

The changes introduced in this file significantly improve the efficiency of index management for the MySQL database. Key improvements include:

  1. A new caching mechanism for existing indices.
  2. New create_index and delete_index functions that utilize the cache to prevent unnecessary operations.
  3. Updates to existing functions like create_table and add_start_dt_column to use the new index management functions.

These changes should result in more efficient database operations, especially when dealing with index creation and deletion. The code is well-structured, consistent, and includes proper error handling and logging.

src/infra/src/db/sqlite.rs (1)

100-112: LGTM: Efficient caching of database indices

The cache_indices function is well-implemented. It efficiently retrieves existing indices from the database and returns them as a HashSet<DBIndex>. This approach is good for performance, as it allows for quick lookups when checking if an index exists.

src/infra/src/file_list/mysql.rs (2)

1196-1228: Improved index creation structure

The new implementation uses a vector of tuples to define indices, which improves code organization and readability. This approach makes it easier to add or modify indices in the future.
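The layout reads roughly like this; the index and column names below are illustrative, not the actual file_list definitions:

```rust
fn main() {
    // Each tuple: (index name, table, columns). Adding an index is one new entry.
    let indices: Vec<(&str, &str, &[&str])> = vec![
        ("file_list_org_idx", "file_list", &["org"]),
        ("file_list_stream_ts_idx", "file_list", &["stream", "min_ts", "max_ts"]),
    ];
    for (name, table, fields) in &indices {
        println!("CREATE INDEX {} ON {} ({});", name, table, fields.join(", "));
    }
}
```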


1230-1245: Separation of unique indices

Separating unique indices from regular indices is a good practice. It clearly distinguishes between the two types and allows for different handling if needed in the future.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Outside diff range and nitpick comments (4)
src/infra/src/db/sqlite.rs (4)

793-820: LGTM: Efficient index creation with room for minor improvement

The create_index function is well-implemented, efficiently checking for existing indices before creation and supporting unique indices. The use of proper logging is commendable.

Consider adding error logging in case the index creation fails. You can modify the last few lines like this:

-    sqlx::query(&sql).execute(&*client).await?;
-    log::info!("[SQLITE] index {} created successfully", idx_name);
-    Ok(())
+    match sqlx::query(&sql).execute(&*client).await {
+        Ok(_) => {
+            log::info!("[SQLITE] index {} created successfully", idx_name);
+            Ok(())
+        }
+        Err(e) => {
+            log::error!("[SQLITE] Failed to create index {}: {}", idx_name, e);
+            Err(e.into())
+        }
+    }

822-837: LGTM: Efficient index deletion with room for minor improvement

The delete_index function is well-implemented, efficiently checking for existing indices before deletion. The use of proper logging is commendable.

For consistency with the create_index function, consider adding error logging in case the index deletion fails. You can modify the last few lines like this:

-    sqlx::query(&sql).execute(&*client).await?;
-    log::info!("[SQLITE] index {}deleted successfully", idx_name);
-    Ok(())
+    match sqlx::query(&sql).execute(&*client).await {
+        Ok(_) => {
+            log::info!("[SQLITE] index {} deleted successfully", idx_name);
+            Ok(())
+        }
+        Err(e) => {
+            log::error!("[SQLITE] Failed to delete index {}: {}", idx_name, e);
+            Err(e.into())
+        }
+    }

693-694: LGTM: Improved data safety with backup creation

The changes to the add_start_dt_column method in the SqliteDb implementation improve data safety by creating a backup before modifying the table structure. The use of await ensures proper asynchronous execution.

Consider adding error handling to ensure that if the backup creation fails, the column addition is not performed. You can modify the code like this:

-        create_meta_backup().await?;
-        add_start_dt_column().await?;
+        create_meta_backup().await.map_err(|e| {
+            log::error!("[SQLITE] Failed to create backup before adding start_dt column: {}", e);
+            e
+        })?;
+        add_start_dt_column().await

This ensures that if the backup creation fails, an error is logged, and the column addition is not attempted.


Dropping the "meta_module_key2_idx" index may impact query performance.

Several queries in sqlite.rs, postgres.rs, and mysql.rs utilize the key2 field in their WHERE clauses, which may have previously benefited from the now-dropped index. This change could lead to slower query performance if these operations relied on the index for efficiency.

Analysis chain

Line range hint 737-762: Improved index management, but verify impact of dropped index

The changes to add_start_dt_column improve consistency by using the new create_index and delete_index functions. However, dropping the "meta_module_key2_idx" index might impact query performance if it was previously used.

Please verify that dropping the "meta_module_key2_idx" index doesn't negatively impact query performance. Run the following script to check for queries that might have used this index:

Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for queries that might have used the dropped index
rg --type rust 'WHERE.*key2.*=' src/

Length of output: 2682

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 2ec3a0a and cb39e69.

Files selected for processing (2)
  • src/infra/src/db/sqlite.rs (7 hunks)
  • src/infra/src/file_list/sqlite.rs (3 hunks)
Files skipped from review as they are similar to previous changes (1)
  • src/infra/src/file_list/sqlite.rs
Additional context used
Path-based instructions (1)
src/infra/src/db/sqlite.rs (1)

Pattern **/*.rs: the same rustlang reviewer instructions as in the first review.

Additional comments not posted (3)
src/infra/src/db/sqlite.rs (3)

100-113: LGTM: Efficient implementation of index caching

The cache_indices function is well-implemented. It efficiently retrieves existing indices from the database and handles potential errors gracefully by returning an empty HashSet if an error occurs. The use of a HashSet for storing indices is a good choice for efficient lookups.


Line range hint 766-791: LGTM: Simplified function signature and consistent client usage

The changes to create_meta_backup improve consistency by using the CLIENT_RW lazy static variable instead of passing the client as a parameter. This simplifies the function signature and ensures consistent use of the write client across the module.


718-733: LGTM: Improved consistency in index creation

The changes to the create_table function improve consistency by using the new create_index function to create indices. The indices created ("meta_module_idx", "meta_module_key1_idx", and "meta_module_start_dt_idx") seem appropriate for the table structure and should improve query performance.

Development

Successfully merging this pull request may close these issues.

CREATE INDEX IF NOT EXISTS will lock the table
