Skip to content

Conversation

@hengfeiyang
Copy link
Contributor

@hengfeiyang hengfeiyang commented Oct 22, 2025

User description

This pull request removes the now-unused get_min_ts method and its related code from the file list infrastructure, simplifying both the trait and its implementations for MySQL, Postgres, and SQLite. Additionally, the logic for updating the minimum document timestamp in stream stats during retention is updated to use the retention end date directly.

Retention logic update:

  • Updated delete_by_date in retention.rs to set the minimum document timestamp (min_ts) for stream stats directly from date_end.timestamp_micros() instead of querying for the minimum timestamp. This streamlines retention logic and ensures consistency.

Minor code cleanup:

  • Removed unused counters for stream and schema numbers in get_stream_schema_status to prevent misleading statistics.

PR Type

Enhancement, Bug fix


Description

  • Remove unused file-list get_min_ts

  • Simplify FileList trait and impls

  • Reduce DB reads during retention

  • Fix schema status counters inflation


Diagram Walkthrough

flowchart LR
  Trait["FileList trait"] -- remove --> MinTsMethod["get_min_ts()"]
  ImplMy["MySQL impl"] -- drop method --> MinTsMethod
  ImplPg["Postgres impl"] -- drop method --> MinTsMethod
  ImplSql["SQLite impl"] -- drop method --> MinTsMethod
  Retention["Retention delete_by_date"] -- use --> DateEnd["date_end.timestamp_micros() as min_ts"]
  Status["HTTP status schema"] -- remove --> ExtraCounters["extra counters increments"]
Loading

File Walkthrough

Relevant files
Bug fix
mod.rs
Fix overcount in latest stream schema status                         

src/handler/http/request/status/mod.rs

  • Stop incrementing stream_num and stream_schema_num for latest schemas.
  • Prevents double counting in memory/stat reporting.
+0/-2     
Enhancement
mod.rs
Drop get_min_ts from file-list interface                                 

src/infra/src/file_list/mod.rs

  • Remove get_min_ts from FileList trait.
  • Remove public get_min_ts wrapper function.
+0/-11   
mysql.rs
Remove MySQL get_min_ts implementation                                     

src/infra/src/file_list/mysql.rs

  • Delete get_min_ts implementation and related query/metrics.
+0/-22   
postgres.rs
Remove Postgres get_min_ts implementation                               

src/infra/src/file_list/postgres.rs

  • Delete get_min_ts implementation and related query/metrics.
+0/-22   
sqlite.rs
Remove SQLite get_min_ts implementation                                   

src/infra/src/file_list/sqlite.rs

  • Delete get_min_ts implementation and related query.
+0/-19   
retention.rs
Use date_end for retention min_ts update                                 

src/service/compact/retention.rs

  • Set stream stats min_ts directly from date_end.
  • Remove DB call to file-list get_min_ts.
+2/-6     

@github-actions
Copy link
Contributor

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Recommended focus areas for review

Logic Change

Using date_end.timestamp_micros() as the new minimum document timestamp changes semantics vs previously querying actual min_ts and falling back to cached stat. Validate that date_end always aligns with the earliest retained data boundary to avoid overstating min_ts and accidentally skipping valid data in downstream processes.

// we use date_end as the new min doc time
let min_ts = date_end.timestamp_micros();
infra_file_list::reset_stream_stats_min_ts(
    org_id,
    format!("{org_id}/{stream_type}/{stream_name}").as_str(),
    min_ts,
Metrics Accuracy

The removal of increments for stream_num and stream_schema_num in the latest schemas loop changes reported counts. Confirm dashboards and consumers expect counts to exclude latest-schema entries to prevent regressions in observability.

for (key, schema) in r.iter() {
    mem_size += std::mem::size_of::<String>() + key.len();
    mem_size += schema.size();
}
drop(r);

@github-actions
Copy link
Contributor

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Possible issue
Restore missing counters

The loop no longer increments stream_num and stream_schema_num, which likely causes
undercounting in reported stats. Restore the appropriate counters to reflect items
in STREAM_SCHEMAS_LATEST.

src/handler/http/request/status/mod.rs [507-510]

 for (key, schema) in r.iter() {
+    stream_num += 1;
+    stream_schema_num += 1;
     mem_size += std::mem::size_of::<String>() + key.len();
     mem_size += schema.size();
 }
Suggestion importance[1-10]: 7

__

Why: The existing_code matches the new hunk and the PR removed increments to stream_num and stream_schema_num; restoring them may fix undercounted stats, but without full context it's an assumption rather than a confirmed bug.

Medium
Prevent regressing min timestamp

Setting the minimum doc time to date_end can move the window forward beyond existing
data and cause data before date_end to be ignored. Fall back to the existing
stats.doc_time_min when it is lower than date_end to avoid losing visibility of
older data.

src/service/compact/retention.rs [433-438]

-// we use date_end as the new min doc time
-let min_ts = date_end.timestamp_micros();
+// prefer existing min if it's earlier than the new retention boundary
+let new_min_ts = date_end.timestamp_micros();
+let min_ts = if stats.doc_time_min > 0 {
+    new_min_ts.max(stats.doc_time_min)
+} else {
+    new_min_ts
+};
 infra_file_list::reset_stream_stats_min_ts(
     org_id,
     format!("{org_id}/{stream_type}/{stream_name}").as_str(),
     min_ts,
Suggestion importance[1-10]: 5

__

Why: The code sets min_ts to date_end, and proposing to keep the max of existing stats.doc_time_min and the new boundary is a reasonable safeguard; however, the PR intentionally simplified logic to use date_end, so impact is moderate and context-dependent.

Low

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greptile Overview

Summary

This PR optimizes retention operations by removing the get_min_ts database query method across all file_list implementations (MySQL, Postgres, SQLite). The retention logic now uses date_end directly as the new minimum timestamp instead of querying the database, reducing DB load during retention operations.

Key Changes:

  • Retention optimization: Updated delete_by_date in retention.rs to calculate min_ts from date_end.timestamp_micros() instead of querying file_list for the actual minimum timestamp. This is a performance-accuracy tradeoff - slightly conservative estimates for better performance.
  • Code cleanup: Removed unused get_min_ts method from the FileList trait and all three database backend implementations (22 lines removed per implementation).
  • Bug fix: Fixed double-counting in get_stream_schema_status by removing counter increments when iterating over STREAM_SCHEMAS_LATEST, which is a separate cache from the versioned STREAM_SCHEMAS.

The change is safe because the retention logic only updates doc_time_min when min_ts > stats.doc_time_min, preventing incorrect backward updates.

Confidence Score: 4/5

  • This PR is safe to merge with minimal risk - removes unused code and optimizes retention performance
  • The changes are straightforward: removing an unused method across multiple implementations and updating retention logic with a performance optimization. The retention change trades query accuracy for performance, but includes a safeguard (only updating when min_ts increases). The status endpoint fix corrects a double-counting bug. No security concerns identified.
  • src/service/compact/retention.rs should be monitored to ensure the simplified min_ts calculation doesn't cause issues with streams that have data gaps after retention

Important Files Changed

File Analysis

Filename Score Overview
src/service/compact/retention.rs 4/5 Updated min_ts calculation to use date_end directly instead of querying database, reducing DB calls while maintaining reasonable accuracy
src/infra/src/file_list/mod.rs 5/5 Removed unused get_min_ts trait method and public function wrapper
src/handler/http/request/status/mod.rs 5/5 Fixed double-counting bug by removing counter increments for STREAM_SCHEMAS_LATEST which should not be included in stream counts

Sequence Diagram

sequenceDiagram
    participant Retention as Retention Process
    participant FileList as File List DB
    participant Stats as Stream Stats
    participant Cache as Stats Cache
    
    Note over Retention: delete_by_date called
    Retention->>FileList: Delete files in date range
    Note over Retention: OLD: Query DB for min_ts
    Note over Retention: NEW: Use date_end directly
    Retention->>Stats: reset_stream_stats_min_ts(date_end)
    Stats->>FileList: Update min_ts in DB
    Retention->>Retention: Check if min_ts > cached min
    alt min_ts > cached doc_time_min
        Retention->>Cache: Update doc_time_min
    else
        Note over Retention: Skip cache update
    end
Loading

6 files reviewed, no comments

Edit Code Review Agent Settings | Greptile

@testdino-playwright-reporter
Copy link

⚠️ Test Run Unstable


Author: hengfeiyang | Branch: fix/compactor-min-ts | Commit: 6d723a2

Testdino Test Results

Status Total Passed Failed Skipped Flaky Pass Rate Duration
All tests passed 365 342 0 19 4 94% 5m 15s

View Detailed Results

@hengfeiyang hengfeiyang merged commit 845dd15 into main Oct 22, 2025
33 of 34 checks passed
@hengfeiyang hengfeiyang deleted the fix/compactor-min-ts branch October 22, 2025 08:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants