Skip to content

Conversation

@oasisk
Copy link
Contributor

@oasisk oasisk commented Nov 20, 2025

No description provided.

@github-actions
Copy link
Contributor

Failed to generate code suggestions for PR

@oasisk oasisk force-pushed the feat-sampling branch 2 times, most recently from 7a90e21 to c7db38a Compare November 20, 2025 12:42
@oasisk oasisk requested review from haohuaijin and hengfeiyang and removed request for haohuaijin November 20, 2025 13:17
@oasisk oasisk added this to the v0.20.0 milestone Nov 20, 2025
@hengfeiyang
Copy link
Contributor

@oasisk can you add description for the API change, how to use it.

@hengfeiyang hengfeiyang marked this pull request as draft November 21, 2025 04:24
@oasisk oasisk requested a review from hengfeiyang November 21, 2025 05:45
@oasisk oasisk force-pushed the feat-sampling branch 2 times, most recently from b04ddb9 to 99ba204 Compare November 24, 2025 07:39
@Shrinath-O2 Shrinath-O2 modified the milestones: v0.20.0, v0.20.1 Nov 25, 2025
@oasisk oasisk marked this pull request as ready for review November 28, 2025 06:11
@github-actions
Copy link
Contributor

Failed to generate code suggestions for PR

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 28, 2025

Greptile Overview

Greptile Summary

This PR adds sampling support for search APIs and pattern extraction, implementing both a simplified user-facing API (sampling_ratio) and internal enterprise sampling infrastructure (SamplingConfig).

Key Changes

Backend (Rust):

  • Added sampling_ratio (Optional) to SearchQuery proto message for simplified user API
  • Created SamplingConfig proto message with fields for sampling_ratio, sampling_mode, temporal_strategy, and num_time_strata
  • Integrated sampling throughout the search execution pipeline: SQL parsing → query planning → distributed execution → parquet file access
  • Pattern extraction endpoint now applies default sampling ratio from O2 enterprise config
  • Sampling explicitly disabled for /_search and /_search_stream endpoints (backwards compatibility)
  • WAL data is never sampled - only parquet files are subject to sampling
  • Enhanced generate_access_plan() to support both row-group-level (enterprise) and row-level sampling based on BitVec size

Frontend (Vue):

  • Removed manual scan size configuration from pattern statistics UI
  • Pattern extraction now delegates to backend default sampling configuration
  • Simplified UI by removing info tooltips and editing controls
  • Updated tests to match new behavior

Architecture

The implementation follows a clean separation: users specify sampling_ratio (0.0-1.0), and enterprise backend converts this to a full SamplingConfig with optimal defaults. Community edition logs warnings when sampling is requested but doesn't fail.

Confidence Score: 3/5

  • Safe to merge with verification needed for Hash trait removal on proto types
  • The implementation is well-structured with proper enterprise/community separation and backwards compatibility. However, removing Hash and Eq derives from SearchQuery and SearchInfo (because f64 doesn't implement Hash) could break compilation if these types are used as HashMap keys elsewhere in the codebase
  • src/proto/src/generated/cluster.rs - Verify SearchQuery and SearchInfo aren't used in hash-based collections

Important Files Changed

File Analysis

Filename Score Overview
src/proto/proto/cluster/search.proto 4/5 Added sampling_ratio field to SearchQuery and new SamplingConfig message for sampling feature
src/proto/src/generated/cluster.rs 3/5 Generated code removes Hash/Eq from structs with f64 fields - potential breaking change if used as HashMap keys
src/config/src/meta/search.rs 5/5 Added sampling_ratio and sampling_config fields to Query and related structs with proper defaults
src/handler/http/request/patterns/mod.rs 5/5 Pattern extraction endpoint applies default sampling ratio from O2 config, integrated with search streaming
src/service/search/sql/mod.rs 5/5 Added sampling_config field to Sql struct and parse_sampling_config method (enterprise-only)
src/service/search/streaming/mod.rs 5/5 Updated pattern extraction to track total_scan_records and improved logging for sampling mode
src/service/search/datafusion/table_provider/helpers.rs 5/5 Enhanced generate_access_plan to support both row-group and row-level sampling based on BitVec size
src/service/search/grpc/flight.rs 5/5 Added sampling application to file lists in gRPC search, with clear comment that WAL is never sampled

Sequence Diagram

sequenceDiagram
    participant UI as Frontend UI
    participant API as Pattern Extract API
    participant Stream as Search Streaming
    participant SQL as SQL Parser
    participant Plan as Query Planner
    participant Exec as Query Executor
    participant File as File Storage

    UI->>API: POST /patterns/extract<br/>(size=-1, sampling_ratio unset)
    API->>API: Apply default sampling_ratio<br/>from O2 config if not provided
    API->>Stream: process_search_stream_request()<br/>(extract_patterns=true)
    Stream->>SQL: Sql::new_with_options()<br/>(extract_patterns flag)
    
    alt Enterprise Feature Enabled
        SQL->>SQL: parse_sampling_config()<br/>Convert ratio to SamplingConfig
        Note over SQL: Defaults:<br/>- mode: "system" (row group)<br/>- strategy: "stratified"<br/>- strata: auto-calculated
    else Community Edition
        SQL->>SQL: Log warning about<br/>enterprise feature
    end
    
    SQL->>Plan: Build execution plan<br/>(includes sampling_config)
    Plan->>Plan: Propagate sampling_config<br/>to RemoteScanNodes
    Plan->>Exec: Execute distributed query<br/>(sampling in SearchInfo)
    
    alt Sampling Enabled
        Exec->>File: Apply sampling to file_list<br/>(stratified temporal sampling)
        Note over File: Row-group-level sampling:<br/>BitVec size == row_group_count
        File->>Exec: Sampled file access plans
    else No Sampling
        Exec->>File: Access all files
        File->>Exec: Full access plans
    end
    
    Exec->>Exec: Read parquet files<br/>(with ParquetAccessPlan)
    Note over Exec: WAL data NEVER sampled<br/>(always full)
    
    Exec->>Stream: Return search results<br/>(scan_records count)
    Stream->>Stream: Accumulate results<br/>Track total_scan_records
    Stream->>Stream: Extract patterns<br/>from accumulated logs
    Stream->>API: PatternExtractionResult
    API->>UI: Return patterns + statistics
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

27 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

@oasisk oasisk merged commit 0b4e66c into main Nov 28, 2025
35 checks passed
@oasisk oasisk deleted the feat-sampling branch November 28, 2025 10:43
@Shrinath-O2 Shrinath-O2 modified the milestones: v0.20.1, v0.20.0 Dec 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants