-
Notifications
You must be signed in to change notification settings - Fork 715
perf: use parquet file static cache #3604
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
WalkthroughThe recent changes primarily focus on updating dependencies, modifying configuration structures, enhancing cache management, and refining DataFusion execution logic. The Changes
Sequence Diagram(s) (Beta)sequenceDiagram
participant User
participant HTTPHandler as HTTP Handler
participant CacheManager as Cache Manager
participant DataFusion as DataFusion Engine
User->>HTTPHandler: Send request to check cache status
HTTPHandler->>CacheManager: Request cache statistics
CacheManager->>HTTPHandler: Return file statistics cache data
HTTPHandler->>User: Respond with cache status
User->>DataFusion: Submit query
DataFusion->>CacheManager: Check file statistics cache
alt Cache Hit
CacheManager->>DataFusion: Return cached statistics
else Cache Miss
DataFusion->>CacheManager: Update cache with new statistics
end
DataFusion->>User: Return query results
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (invoked as PR comments)
Additionally, you can add CodeRabbit Configration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
Outside diff range and nitpick comments (1)
src/service/search/datafusion/storage/file_list.rs (1)
Line range hint
33-47: Refactor thesetfunction to improve key management and data insertion.The addition of
schema_keyenhances the granularity of file metadata management. Ensure that all dependent functions and modules are updated to handle this new key structure effectively.
Review Details
Configuration used: CodeRabbit UI
Review profile: CHILL
Files ignored due to path filters (1)
Cargo.lockis excluded by!**/*.lock
Files selected for processing (8)
- Cargo.toml (1 hunks)
- src/config/src/config.rs (2 hunks)
- src/handler/http/request/status/mod.rs (2 hunks)
- src/infra/src/cache/mod.rs (1 hunks)
- src/service/search/datafusion/exec.rs (8 hunks)
- src/service/search/datafusion/storage/file_list.rs (2 hunks)
- src/service/search/datafusion/storage/file_statistics_cache.rs (1 hunks)
- src/service/search/datafusion/storage/mod.rs (1 hunks)
Files skipped from review due to trivial changes (3)
- Cargo.toml
- src/config/src/config.rs
- src/infra/src/cache/mod.rs
Additional comments not posted (8)
src/service/search/datafusion/storage/mod.rs (1)
23-23: Successfully addedfile_statistics_cachemodule to enhance modularity.This change promotes better separation of concerns and encapsulation within the system architecture.
src/service/search/datafusion/storage/file_statistics_cache.rs (1)
26-140: Implement robust caching mechanism withFileStatisticsCache.The use of
DashMapfor concurrency andVecDequefor eviction policy is well thought out. Ensure that the cache eviction logic is thoroughly tested to prevent potential data loss or inconsistencies.src/handler/http/request/status/mod.rs (1)
313-316: Enhance monitoring capabilities by addingfile_statistics_cachestatistics.The integration of
file_statistics_cachestatistics into thecache_statusfunction is a valuable addition for better visibility and monitoring of cache performance.src/service/search/datafusion/exec.rs (5)
1240-1247: Thewith_collect_statoption has been added to the listing options for both JSON and PARQUET file types. Ensure this change aligns with the intended data collection and performance metrics.This change should help in better statistics collection for performance tuning.
27-27: Ensure the new importschema_ext::SchemaExtis utilized effectively in the code.Verification successful
The
SchemaExttrait is indeed utilized in various parts of the codebase, including the filesrc/service/search/datafusion/exec.rs. This confirms that the new importschema_ext::SchemaExtis effectively used.
Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the usage of `schema_ext::SchemaExt` throughout the codebase. # Test: Search for the usage of `SchemaExt`. Expect: At least one usage in the code. ast-grep --lang rust --pattern $'use $_ { $$$, SchemaExt, $$$ }'Length of output: 65
Script:
#!/bin/bash # Description: Verify the usage of `SchemaExt` throughout the codebase. # Test: Search for any references to `SchemaExt`. Expect: At least one usage in the code. rg 'SchemaExt'Length of output: 2064
Line range hint
1141-1176: Review the configuration of theRuntimeEnv. Ensure that the memory pool settings and cache manager configurations are correctly applied based on the system's configuration.Verification successful
The recent script outputs provide useful insights into the configuration of the
RuntimeEnv, specifically regarding memory pool settings and cache manager configurations. Here are the key findings:
Memory Pool Settings:
- The
memory_poolreferences insrc/service/search/datafusion/exec.rsindicate that bothGreedyMemoryPoolandFairSpillPoolare being configured based on theCONFIG.memory_cache.datafusion_memory_poolsetting.- The
datafusion_memory_poolconfiguration is defined insrc/config/src/config.rs.Cache Manager Settings:
- The
cache_managerreferences insrc/service/search/datafusion/exec.rsshow that theCacheManagerConfigis being used and applied torn_config.These findings confirm that the memory pool settings and cache manager configurations are being correctly applied based on the system's configuration.
src/service/search/datafusion/exec.rs:
- Memory pool settings are configured with
GreedyMemoryPoolandFairSpillPool.- Cache manager settings are applied using
CacheManagerConfig.
src/config/src/config.rs:
- The
datafusion_memory_poolconfiguration is defined.Based on this information, the review comment is verified.
Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the correct application of memory pool settings and cache manager configurations. # Test: Search for the usage of `RuntimeEnv` configuration. Expect: Correct settings based on system configuration. ast-grep --lang rust --pattern $'let $_ = RuntimeEnv::new($_)'Length of output: 63
Script:
#!/bin/bash # Description: Verify the correct application of memory pool settings and cache manager configurations. # Test: Search for the usage of `RuntimeEnv` configuration and related settings. # Search for any instantiation or configuration of RuntimeEnv ast-grep --lang rust --pattern $'let $_ = RuntimeEnv::new($_)' # Search for any method calls on RuntimeEnv that might configure it rg 'RuntimeEnv::with_' # Search for any references to memory pool settings rg 'memory_pool' # Search for any references to cache manager settings rg 'cache_manager'Length of output: 1016
Line range hint
1256-1291: The handling of different storage types (Memory,Wal,Tmpfs) in theregister_tablefunction has been updated. Ensure that these changes do not affect the existing functionality and that the schema key generation and file list management are correctly implemented.Verification successful
The
register_tablefunction insrc/service/search/datafusion/exec.rshandles different storage types (Memory,Wal,Tmpfs) and ensures schema key generation and file list management are correctly implemented. The function also includes error handling for unsupported file types and URL parsing errors.
- The function correctly sets the prefix based on the storage type.
- It manages the file list for
MemoryandWalstorage types.- It handles schema inference and caching based on configuration.
The changes appear to be correctly implemented without affecting existing functionality.
Scripts Executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the handling of different storage types in the `register_table` function. # Test: Search for the usage of storage type handling in `register_table`. Expect: Correct implementation without affecting existing functionality. ast-grep --lang rust --pattern $'let $_ = register_table($_, $_, $_, $_, $_, $_)'Length of output: 82
Script:
#!/bin/bash # Search for the definition of the `register_table` function in the codebase ast-grep --lang rust --pattern 'fn register_table($_, $_, $_, $_, $_, $_) -> $_ { $$$ }' # Search for the usage of the `register_table` function in the codebase rg 'register_table\('Length of output: 3508
Script:
#!/bin/bash # Extract the definition of the `register_table` function from `src/service/search/datafusion/exec.rs` ast-grep --lang rust --pattern 'pub async fn register_table($_, $_, $_, $_, $_, $_) -> $_ { $$$ }' src/service/search/datafusion/exec.rsLength of output: 6298
45-45: The addition ofCacheManagerConfigshould be aligned with the overall caching strategy. Confirm that this configuration is properly integrated and utilized.
Last PR i added a new ENV
ZO_DATAFUSION_PARQUET_STAT_DISABLE_FOR_AGGS, but now we can use parquet file statistics cache to do better, so i deleted the ENV.Query for last 3 days within 100GB data:
This is for all the query not only for aggregation.
Also added a new ENV: