-
Notifications
You must be signed in to change notification settings - Fork 715
perf: improve get_min_ts to reduce db calls #8861
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Summary
This PR optimizes retention operations by removing the get_min_ts database query method across all file_list implementations (MySQL, Postgres, SQLite). The retention logic now uses date_end directly as the new minimum timestamp instead of querying the database, reducing DB load during retention operations.
Key Changes:
- Retention optimization: Updated
delete_by_dateinretention.rsto calculatemin_tsfromdate_end.timestamp_micros()instead of querying file_list for the actual minimum timestamp. This is a performance-accuracy tradeoff - slightly conservative estimates for better performance. - Code cleanup: Removed unused
get_min_tsmethod from the FileList trait and all three database backend implementations (22 lines removed per implementation). - Bug fix: Fixed double-counting in
get_stream_schema_statusby removing counter increments when iterating overSTREAM_SCHEMAS_LATEST, which is a separate cache from the versionedSTREAM_SCHEMAS.
The change is safe because the retention logic only updates doc_time_min when min_ts > stats.doc_time_min, preventing incorrect backward updates.
Confidence Score: 4/5
- This PR is safe to merge with minimal risk - removes unused code and optimizes retention performance
- The changes are straightforward: removing an unused method across multiple implementations and updating retention logic with a performance optimization. The retention change trades query accuracy for performance, but includes a safeguard (only updating when min_ts increases). The status endpoint fix corrects a double-counting bug. No security concerns identified.
- src/service/compact/retention.rs should be monitored to ensure the simplified min_ts calculation doesn't cause issues with streams that have data gaps after retention
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/service/compact/retention.rs | 4/5 | Updated min_ts calculation to use date_end directly instead of querying database, reducing DB calls while maintaining reasonable accuracy |
| src/infra/src/file_list/mod.rs | 5/5 | Removed unused get_min_ts trait method and public function wrapper |
| src/handler/http/request/status/mod.rs | 5/5 | Fixed double-counting bug by removing counter increments for STREAM_SCHEMAS_LATEST which should not be included in stream counts |
Sequence Diagram
sequenceDiagram
participant Retention as Retention Process
participant FileList as File List DB
participant Stats as Stream Stats
participant Cache as Stats Cache
Note over Retention: delete_by_date called
Retention->>FileList: Delete files in date range
Note over Retention: OLD: Query DB for min_ts
Note over Retention: NEW: Use date_end directly
Retention->>Stats: reset_stream_stats_min_ts(date_end)
Stats->>FileList: Update min_ts in DB
Retention->>Retention: Check if min_ts > cached min
alt min_ts > cached doc_time_min
Retention->>Cache: Update doc_time_min
else
Note over Retention: Skip cache update
end
6 files reviewed, no comments
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 365 | 342 | 0 | 19 | 4 | 94% | 5m 15s |
User description
This pull request removes the now-unused
get_min_tsmethod and its related code from the file list infrastructure, simplifying both the trait and its implementations for MySQL, Postgres, and SQLite. Additionally, the logic for updating the minimum document timestamp in stream stats during retention is updated to use the retention end date directly.Retention logic update:
delete_by_dateinretention.rsto set the minimum document timestamp (min_ts) for stream stats directly fromdate_end.timestamp_micros()instead of querying for the minimum timestamp. This streamlines retention logic and ensures consistency.Minor code cleanup:
get_stream_schema_statusto prevent misleading statistics.PR Type
Enhancement, Bug fix
Description
Remove unused file-list
get_min_tsSimplify FileList trait and impls
Reduce DB reads during retention
Fix schema status counters inflation
Diagram Walkthrough
File Walkthrough
mod.rs
Fix overcount in latest stream schema statussrc/handler/http/request/status/mod.rs
stream_numandstream_schema_numfor latest schemas.mod.rs
Drop get_min_ts from file-list interfacesrc/infra/src/file_list/mod.rs
get_min_tsfromFileListtrait.get_min_tswrapper function.mysql.rs
Remove MySQL get_min_ts implementationsrc/infra/src/file_list/mysql.rs
get_min_tsimplementation and related query/metrics.postgres.rs
Remove Postgres get_min_ts implementationsrc/infra/src/file_list/postgres.rs
get_min_tsimplementation and related query/metrics.sqlite.rs
Remove SQLite get_min_ts implementationsrc/infra/src/file_list/sqlite.rs
get_min_tsimplementation and related query.retention.rs
Use date_end for retention min_ts updatesrc/service/compact/retention.rs
min_tsdirectly fromdate_end.get_min_ts.