feat: add custom table provider #3846

hengfeiyang · 2024-06-28T11:28:46Z

Summary by CodeRabbit

New Features
- Introduced various user-defined functions (UDFs) for DataFusion processing, including array operations, casting, date formatting, matching, and regular expressions.
Improvements
- Added new utility functions and structs to improve partition management and file scanning in DataFusion.
- Implemented a new NewListingTable struct for efficient file listing and scanning in file systems.
- Introduced a test object store for DataFusion testing.
Bug Fixes
- Updated default value for the started_at column in the file_list_jobs table across multiple databases (MySQL, PostgreSQL, SQLite) to ensure consistent default behavior.

coderabbitai · 2024-06-28T11:28:58Z

Walkthrough

These changes primarily focus on restructuring and enhancing the DataFusion integration within the project. Key updates include reorganizing user-defined functions (UDFs) under a new udf module, introducing new table provider utilities for partitioning and filtering, and modifying SQL data type definitions to include default values. These adjustments aim to improve code organization, clarity, and functionality in handling data processing tasks.

Changes

| File Path | Change Summary |
| --- | --- |
| src/handler/http/request/status/mod.rs | Updated import statement for `DEFAULT_FUNCTIONS` to new `udf` module. |
| src/service/promql/engine.rs | Updated paths for regex matching UDFs to reflect new module structure. |
| ...datafusion/mod.rs | Removed multiple UDF-related modules; added new modules for `table_provider` and `udf`. |
| ...datafusion/table_provider/helpers.rs | Introduced helper functions for handling partitions and expressions in DataFusion. |
| ...datafusion/table_provider/mod.rs | Introduced `NewListingTable` struct implementing `TableProvider` trait for file system listing and scanning. |
| ...datafusion/table_provider/test_util.rs | Added test object store for DataFusion testing with utility function. |
| ...datafusion/udf/mod.rs | Added various UDF modules, constants, and a function for stringifying JSON values. |
| ...infra/src/file_list/mysql.rs | Modified `started_at` column data type to include default value in `file_list_jobs` table. |
| ...infra/src/file_list/postgres.rs | Modified `started_at` column data type to include default value in `file_list_jobs` table. |
| ...infra/src/file_list/sqlite.rs | Modified `started_at` column data type to include default value in `file_list_jobs` table. |

Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai generate interesting stats about this repository and render them as a table.
- @coderabbitai show all the console.log statements in this repository.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (invoked as PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai resolve to resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Additionally, you can add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 7

Outside diff range and nitpick comments (1)

src/service/search/datafusion/udf/arrzip_udf.rs (1)
Line range hint 161-161: Consider handling errors gracefully in arr_zip_impl.

The function uses .expect which could cause panics if the assumptions are violated. It would be safer to handle these potential errors gracefully and return a DataFusionError if necessary.
- json::from_str(arr_field1).expect("Failed to deserialize arrzip field1");
+ json::from_str(arr_field1).map_err(|e| DataFusionError::Execution(format!("Failed to deserialize arrzip field1: {}", e)))?;
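
The suggested change can be illustrated in isolation. The following is a minimal sketch, not the PR's code: `SketchError` is a hypothetical stand-in for `DataFusionError::Execution`, and `parse_int_list` is a hypothetical stand-in for `json::from_str`, so that the example compiles with the standard library alone. It shows a parse failure becoming an `Err` that the caller can handle instead of a panic.

```rust
/// Hypothetical stand-in for `DataFusionError::Execution`, so the sketch
/// compiles without the DataFusion crate.
#[derive(Debug, PartialEq)]
pub enum SketchError {
    Execution(String),
}

/// Hypothetical stand-in for `json::from_str`: parses "1,2,3" into integers.
fn parse_int_list(field: &str) -> Result<Vec<i64>, std::num::ParseIntError> {
    field.split(',').map(|s| s.trim().parse::<i64>()).collect()
}

/// Zips two serialized lists. A malformed input is reported as an error
/// value through `map_err` and `?` rather than aborting with `.expect(..)`.
pub fn zip_fields(field1: &str, field2: &str) -> Result<Vec<(i64, i64)>, SketchError> {
    let left = parse_int_list(field1).map_err(|e| {
        SketchError::Execution(format!("Failed to deserialize arrzip field1: {}", e))
    })?;
    let right = parse_int_list(field2).map_err(|e| {
        SketchError::Execution(format!("Failed to deserialize arrzip field2: {}", e))
    })?;
    Ok(left.into_iter().zip(right).collect())
}
```

With this shape, a single malformed row surfaces as a query error instead of taking down the worker thread that evaluates the UDF.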

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 62bd7b2 and 8d343a7.

Files selected for processing (13)

src/handler/http/request/status/mod.rs (1 hunks)
src/service/promql/engine.rs (1 hunks)
src/service/search/datafusion/exec.rs (7 hunks)
src/service/search/datafusion/mod.rs (2 hunks)
src/service/search/datafusion/table_provider/helpers.rs (1 hunks)
src/service/search/datafusion/table_provider/mod.rs (1 hunks)
src/service/search/datafusion/table_provider/test_util.rs (1 hunks)
src/service/search/datafusion/udf/arrjoin_udf.rs (2 hunks)
src/service/search/datafusion/udf/arrzip_udf.rs (2 hunks)
src/service/search/datafusion/udf/cast_to_arr_udf.rs (2 hunks)
src/service/search/datafusion/udf/mod.rs (1 hunks)
src/service/search/datafusion/udf/spath_udf.rs (2 hunks)
src/service/search/datafusion/udf/transform_udf.rs (1 hunks)

Files skipped from review due to trivial changes (3)

src/handler/http/request/status/mod.rs
src/service/search/datafusion/mod.rs
src/service/search/datafusion/udf/transform_udf.rs

Additional context used

Path-based instructions (10)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Applied to:
- src/service/search/datafusion/table_provider/test_util.rs (1)
- src/service/search/datafusion/udf/mod.rs (1)
- src/service/search/datafusion/udf/cast_to_arr_udf.rs (1)
- src/service/search/datafusion/udf/spath_udf.rs (1)
- src/service/search/datafusion/udf/arrjoin_udf.rs (1)
- src/service/search/datafusion/udf/arrzip_udf.rs (1)
- src/service/search/datafusion/table_provider/mod.rs (1)
- src/service/search/datafusion/table_provider/helpers.rs (1)
- src/service/promql/engine.rs (1)
- src/service/search/datafusion/exec.rs (1)

Additional comments not posted (12)

src/service/search/datafusion/table_provider/test_util.rs (1)

32-37: Consider handling errors more gracefully in test environments.

While using unwrap() in test environments is common, it's generally a good practice to handle errors gracefully even in tests to avoid panics and to make test failures more informative.
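
One way to follow this advice is to let the fixture builder and the test itself return `Result`. The sketch below is an assumption-laden illustration, not the contents of `test_util.rs`: `TestStore` and `make_test_store` are hypothetical names, and a `BTreeMap` stands in for the object store.

```rust
use std::collections::BTreeMap;

/// Hypothetical in-memory "store": path -> size in bytes.
pub type TestStore = BTreeMap<String, u64>;

/// Builds the store, returning an error instead of panicking on bad input.
pub fn make_test_store(files: &[(&str, u64)]) -> Result<TestStore, String> {
    let mut store = TestStore::new();
    for (path, size) in files {
        if path.is_empty() {
            return Err("empty path in test fixture".to_string());
        }
        if store.insert(path.to_string(), *size).is_some() {
            return Err(format!("duplicate path in test fixture: {}", path));
        }
    }
    Ok(store)
}

#[cfg(test)]
mod tests {
    use super::*;

    // Returning `Result` lets the test use `?`; a failure is reported with
    // the error value rather than a bare `unwrap()` panic.
    #[test]
    fn store_lists_files() -> Result<(), String> {
        let store = make_test_store(&[("a.parquet", 10), ("b.parquet", 20)])?;
        assert_eq!(store.len(), 2);
        Ok(())
    }
}
```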
src/service/search/datafusion/udf/spath_udf.rs (1)
89-89: Optimize JSON path processing in spath_impl.

The function currently re-parses the JSON object for each path segment. Consider maintaining the parsed JSON object and traversing it according to the path, which could improve performance.
- let mut field: json::Value = json::from_str(field).expect("Failed to deserialize arrzip field1");
+ let mut field: json::Value = json::from_str(field).map_err(|e| DataFusionError::Execution(format!("JSON parsing error: {}", e)))?;
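
The parse-once idea can be sketched as follows. This is a hedged illustration only: `Value` is a hypothetical minimal type standing in for `json::Value`, `lookup` is a hypothetical helper, and the path is assumed to be dot-separated. The document is parsed a single time by the caller and each path segment only borrows deeper into the same tree.

```rust
use std::collections::BTreeMap;

/// Hypothetical minimal JSON-like value, standing in for `json::Value`.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Object(BTreeMap<String, Value>),
    Text(String),
}

/// Walks `root` along a dot-separated `path` by borrowing each nested value.
/// No segment triggers a re-parse or a clone of the document.
pub fn lookup<'a>(root: &'a Value, path: &str) -> Option<&'a Value> {
    let mut current = root;
    for segment in path.split('.') {
        match current {
            // Descend into the child named by this segment, if present.
            Value::Object(map) => current = map.get(segment)?,
            // A leaf cannot be traversed any further.
            Value::Text(_) => return None,
        }
    }
    Some(current)
}
```

Returning `Option<&Value>` keeps a missing key distinct from a parse failure, which would still be reported through the `map_err` path shown in the diff.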
src/service/search/datafusion/table_provider/mod.rs (1)

83-106: Ensure robust error handling in try_new.

The function uses ? for error propagation, which is good. However, consider adding more descriptive error messages or handling specific cases differently to improve the clarity and maintainability of error handling.

src/service/search/datafusion/table_provider/helpers.rs (1)

129-149: Optimize the split_files function.

The function handles file partitioning well. Consider adding more comments to explain the logic, especially around the calculation of chunk_size which might not be immediately clear to other developers.
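
For readers unfamiliar with the chunking arithmetic, here is a minimal sketch under stated assumptions: it assumes `chunk_size` is computed by ceiling division, and `split_into_groups` is a hypothetical generic helper rather than the PR's `split_files`. The guard for `n == 0` is an addition for this sketch.

```rust
/// Splits `files` into at most `n` groups of near-equal size.
pub fn split_into_groups<T: Clone>(files: &[T], n: usize) -> Vec<Vec<T>> {
    if files.is_empty() || n == 0 {
        return Vec::new();
    }
    // Ceiling division: the smallest chunk size for which `n` chunks
    // are enough to hold every file. Plain `len / n` would round down
    // and could produce more than `n` groups.
    let chunk_size = (files.len() + n - 1) / n;
    files.chunks(chunk_size).map(|chunk| chunk.to_vec()).collect()
}
```

For example, five files split across two partitions give a chunk size of 3, hence groups of 3 and 2 files. The result can contain fewer than `n` groups: five files with `n = 4` also give a chunk size of 2 and therefore three groups.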

src/service/promql/engine.rs (2)

1029-1029: Updated import path for REGEX_MATCH_UDF is correct.

The updated import path aligns with the restructuring of the udf module as described in the AI-generated summary.

1036-1037: Updated import path for REGEX_NOT_MATCH_UDF is correct.

The updated import path correctly points to the new location of REGEX_NOT_MATCH_UDF within the regexp_udf sub-module of udf, consistent with the restructuring described.

src/service/search/datafusion/exec.rs (6)

63-65: Review of imports and module usage.

The imports from super have been updated to include NewListingTable and get_all_transform functions, which aligns with the restructuring mentioned in the PR summary. This is a positive change as it helps in keeping the code modular and organized.

620-620: Review of NewListingTable instantiation.

The instantiation of NewListingTable is done correctly according to the new table_provider module's API. It's good to see that error handling is properly implemented with the ? operator, which will propagate any errors upwards.

987-987: Review of NewListingTable instantiation in a different context.

Similar to the previous comment, the instantiation here is done correctly. This consistency in using the NewListingTable across different parts of the codebase is good for maintainability.

1093-1093: Review of NewListingTable registration in a session context.

The registration of NewListingTable in the session context is handled properly. This part of the code also demonstrates good practice by cloning the table before registering, which ensures that the original object isn't modified unexpectedly elsewhere.

1214-1230: Review of UDF registrations.

The registration of multiple UDFs is done in a clean and concise manner. It's important to ensure that all these UDFs are indeed implemented and tested as they play a critical role in the functionality of the system. The use of clone() here is necessary due to the ownership rules in Rust, which is correctly applied.
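
The ownership point can be illustrated with a small sketch. Everything here is hypothetical: `Udf`, `Registry`, and `register_all` are stand-ins invented for illustration, and the sketch assumes UDF definitions are held as shared `Arc` handles. It shows why a registration call that takes its argument by value needs a clone, and that cloning an `Arc` copies a pointer rather than the definition.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical UDF definition; a real one would carry a signature and body.
#[derive(Debug)]
pub struct Udf {
    pub name: &'static str,
}

/// Hypothetical registry standing in for a session context.
#[derive(Default)]
pub struct Registry {
    udfs: HashMap<&'static str, Arc<Udf>>,
}

impl Registry {
    /// Takes the handle by value, so a caller that wants to keep its own
    /// handle must pass a clone.
    pub fn register(&mut self, udf: Arc<Udf>) {
        self.udfs.insert(udf.name, udf);
    }

    pub fn get(&self, name: &str) -> Option<&Arc<Udf>> {
        self.udfs.get(name)
    }
}

/// Registers every shared definition without giving up the originals.
pub fn register_all(registry: &mut Registry, shared: &[Arc<Udf>]) {
    for udf in shared {
        // `Arc::clone` bumps a reference count; the `Udf` itself is not copied.
        registry.register(Arc::clone(udf));
    }
}
```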

1311-1311: Review of conditional cache handling in NewListingTable.

This line demonstrates an important aspect of performance optimization by conditionally adding a cache based on the storage type. This is a smart use of resources and helps in optimizing performance for different deployment configurations.

src/service/search/datafusion/udf/arrzip_udf.rs

src/service/search/datafusion/table_provider/mod.rs

src/service/search/datafusion/udf/mod.rs

src/service/search/datafusion/udf/cast_to_arr_udf.rs

src/service/search/datafusion/udf/arrjoin_udf.rs

src/service/search/datafusion/table_provider/mod.rs

src/service/search/datafusion/table_provider/helpers.rs

coderabbitai

Actionable comments posted: 0

Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

Commits

Files that changed from the base of the PR and between 8d343a7 and 27f1e33.

Files selected for processing (3)

src/infra/src/file_list/mysql.rs (1 hunks)
src/infra/src/file_list/postgres.rs (1 hunks)
src/infra/src/file_list/sqlite.rs (1 hunks)

Additional context used

Path-based instructions (3)

Pattern **/*.rs: You are a smart rustlang pull request reviewer.
You are going to review all the rustlang files.
Be concise, and add a brief explanation to your suggestions

Make sure the responses are not too verbose and keep the suggestions to the point i.e. actionable.

Applied to:
- src/infra/src/file_list/postgres.rs (1)
- src/infra/src/file_list/sqlite.rs (1)
- src/infra/src/file_list/mysql.rs (1)

Additional comments not posted (3)

src/infra/src/file_list/postgres.rs (1)
1001-1001: Ensure compatibility with existing data when changing default column values.

Changing the default value of the started_at column to 0 might affect existing operations expecting a NULL or different default. It's crucial to verify that this change aligns with all uses of this column in the application to prevent unexpected behaviors.

Verification successful

Changing the default value of the started_at column to 0 is compatible with existing operations.

The INSERT statement already uses 0 for started_at, and the UPDATE statement explicitly sets this column, ensuring no conflicts with the default value change.

src/infra/src/file_list/postgres.rs, Line 1001
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Search for usages of the `started_at` column in the PostgreSQL-related files.
rg --type rust --glob '*postgres.rs' 'started_at'
Length of output: 627
src/infra/src/file_list/sqlite.rs (1)
1009-1009: Check implications of the default value change on existing SQLite data.

The addition of a default value of 0 for started_at in SQLite mirrors the change in the PostgreSQL file. It's important to assess how this default value impacts existing records and queries, especially in cases where NULL might have been expected.

Verification successful

Default value change for started_at verified

The change to set the default value of the started_at column to 0 in SQLite does not impact existing records or queries, as there are no instances of NULL handling for this column in the codebase.

Instances of started_at are consistently set to the current timestamp, and no NULL values are expected.
Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Search for usages of the `started_at` column in the SQLite-related files.
rg --type rust --glob '*sqlite.rs' 'started_at'
Length of output: 592

Script:
#!/bin/bash
# Description: Search for instances where the `started_at` column is set to NULL or queries that handle NULL values for this column.
rg --type rust --glob '*sqlite.rs' 'started_at.*NULL'
Length of output: 55

Script:
#!/bin/bash
# Description: Search for usages of the `started_at` column in the entire codebase to identify any potential handling of NULL values or implications of the default value change.
rg --type rust 'started_at'
Length of output: 4667
src/infra/src/file_list/mysql.rs (1)

1044-1044: Good practice to ensure data consistency.

Setting a default value of 0 for the started_at column in the file_list_jobs table is a good practice. It ensures that every record has a predictable initial state, which is crucial for processes that depend on this timestamp. This change enhances the robustness of the database schema by preventing null values in this column.

hengfeiyang added 2 commits June 28, 2024 19:28

feat: move udf to udf directory

9b7125d

feat: add custom table provider

2b08160

hengfeiyang requested a review from haohuaijin June 28, 2024 11:28

github-actions bot added the ✏️ Feature label Jun 28, 2024

update license

8d343a7

coderabbitai bot reviewed Jun 28, 2024

fix: file list job upgrade

27f1e33

coderabbitai bot reviewed Jun 28, 2024

oasisk approved these changes Jun 28, 2024

Merge branch 'main' into fix/datafusion-listtable

a13cc85

haohuaijin approved these changes Jun 29, 2024

hengfeiyang added 2 commits June 29, 2024 16:07

Merge branch 'main' into fix/datafusion-listtable

61617fa

Merge branch 'main' into fix/datafusion-listtable

d5dfe6d

hengfeiyang merged commit ac4b804 into main Jun 29, 2024

hengfeiyang deleted the fix/datafusion-listtable branch June 29, 2024 12:08

taimingl pushed a commit that referenced this pull request Jul 12, 2024

feat: add custom table provider (#3846)

fdec5ba

This was referenced Sep 30, 2024

perf: query file_list in parallels by day #4669

Merged

feat: add metrics for db realated requests #4676

Merged

coderabbitai bot mentioned this pull request Nov 1, 2024

feat: refactor merge file on compactor #4971

Merged

coderabbitai bot mentioned this pull request Nov 18, 2024

perf: optimizer generate access plan #5088

Merged

coderabbitai bot mentioned this pull request Nov 26, 2024

fix: sort by time enable in follow #5151

Merged

This was referenced Jan 8, 2025

feat: support limit join's right side #5616

Merged

feat: impl join right table only match once #5661

Merged

feat: ignore time_range when join on enrichment_tables #5684

Merged
