Skip to content

Conversation

@hengfeiyang
Copy link
Contributor

@hengfeiyang hengfeiyang commented Nov 21, 2025

Metrics super cluster v1

This version we implement a simple solution that leader region fetch data from other regions and only compute the result in leader region.

The mainly logic is:

  1. User fire a http PromQL request and go to one region we call it leader region
  2. Leader region partition by time range and dispatch the request to all the queriers in this region
  3. Each querier load data from local region also load data from other regions with same time range.
  4. Calculate the result back to region leader
  5. Region leader merge the result back to user

What we changed

  1. We added a gRPC service named metrics.data that allow you get metrics data from other region.
  2. Add super cluster check and load data from other region for querier.

API changes

POST /api/{org}/prometheus/api/v1/query_range

Form paramaters

Field descriptions:

Field Type Description
query string PromQL expression
start string Start timestamp, inclusive (RFC3339 or Unix timestamp)
end string End timestamp, inclusive (RFC3339 or Unix timestamp)
step string Query resolution step width in duration format (e.g., 15s, 1m) or float number of seconds
timeout string Evaluation timeout (e.g., 30s, 1m)
use_cache boolean Whether to use cache
*use_streaming boolean Whether to use streaming output
*search_type string Search event type
*regions string[] Regions to query. Default: all regions. Use , for multiple region, e.g. c1,c2
*clusters string[] Clusters to query. Default: all clusters. Use , for multiple region, e.g. c1,c2

* new parameter

@hengfeiyang hengfeiyang marked this pull request as draft November 21, 2025 07:31
@hengfeiyang hengfeiyang force-pushed the feat/metrics-supercluster branch from 6c96941 to 5070245 Compare November 26, 2025 10:20
@hengfeiyang hengfeiyang marked this pull request as ready for review November 27, 2025 05:09
@github-actions
Copy link
Contributor

Failed to generate code suggestions for PR

@greptile-apps
Copy link
Contributor

greptile-apps bot commented Nov 27, 2025

Greptile Overview

Greptile Summary

implemented metrics super cluster v1 support that enables cross-region query aggregation by having each querier fetch data from both local and remote regions, then computing results in the leader region

Key Changes

  • added new gRPC Metrics.Data streaming endpoint that allows queriers to fetch metrics data from other regions
  • introduced QueryContext struct to encapsulate query metadata (org_id, trace_id, is_super_cluster, etc.) for cleaner parameter passing throughout the codebase
  • refactored PromQL engine to spawn async tasks that fetch remote region data in parallel with local data loading when super cluster is enabled
  • added is_super_cluster flag to protobuf definitions and HTTP request handling
  • improved database table operations to handle unique constraint violations gracefully in concurrent scenarios

Issues Found

  • potential panic in gRPC handler when query field is None (line 90 in src/handler/grpc/request/metrics/querier.rs)
  • import style inconsistency in src/infra/src/table/organizations.rs (doesn't follow custom instruction about consolidating imports)

Confidence Score: 4/5

  • This PR is safe to merge with one critical fix needed for the potential panic in the gRPC handler
  • The implementation is well-structured with proper error handling, metrics tracking, and follows Rust best practices. The super cluster logic is cleanly separated with feature flags and only executes in enterprise mode. However, there's one critical issue where unwrap() could cause a panic if the query field is None. The database unique constraint handling improvements are a nice defensive programming addition. Once the panic issue is fixed, this should be safe to deploy.
  • Pay close attention to src/handler/grpc/request/metrics/querier.rs line 90 - fix the potential panic before merging

Important Files Changed

File Analysis

Filename Score Overview
src/service/promql/engine.rs 4/5 Refactored to support super cluster by adding QueryContext struct containing query metadata, spawning async tasks to fetch data from remote regions, and merging results with local data
src/service/promql/search/grpc/mod.rs 5/5 Added new data streaming endpoint that enables remote regions to fetch metrics data, supports efficient data transfer for super cluster queries
src/handler/grpc/request/metrics/querier.rs 5/5 Implemented gRPC server-side handlers for Metrics.Query and Metrics.Data endpoints with proper error handling and metrics tracking
src/proto/proto/cluster/metrics.proto 5/5 Added is_super_cluster flag and Data streaming RPC endpoint, plus query_data and label_selector fields to support super cluster feature
src/handler/http/request/promql/mod.rs 4/5 Added super cluster enablement check based on enterprise config, passes is_super_cluster flag to search layer
src/config/src/meta/promql/value.rs 5/5 Added QueryContext struct to encapsulate query metadata (trace_id, org_id, super cluster flag, etc.) for cleaner parameter passing

Sequence Diagram

sequenceDiagram
    participant User
    participant LeaderRegion as Leader Region (HTTP Handler)
    participant Querier1 as Querier (Region 1)
    participant Querier2 as Querier (Region 2)
    participant LocalStorage as Local Storage
    participant RemoteRegion as Remote Region (gRPC)
    
    User->>LeaderRegion: POST /api/{org}/prometheus/api/v1/query_range
    Note over User,LeaderRegion: Query params: query, start, end, step, regions, clusters
    
    LeaderRegion->>LeaderRegion: Parse parameters and check super_cluster enabled
    LeaderRegion->>LeaderRegion: Partition time range by querier count
    
    par Parallel Querier Dispatch
        LeaderRegion->>Querier1: gRPC Metrics.Query (time partition 1)
        LeaderRegion->>Querier2: gRPC Metrics.Query (time partition 2)
    end
    
    par Each Querier Processing
        Querier1->>LocalStorage: Load local region data
        LocalStorage-->>Querier1: Local metrics data
        
        alt Super Cluster Enabled
            Querier1->>RemoteRegion: gRPC Metrics.Data (same time range)
            RemoteRegion->>RemoteRegion: Query metrics from remote region
            RemoteRegion-->>Querier1: Stream metrics data
            Querier1->>Querier1: Merge local + remote data
        end
        
        Querier1->>Querier1: Execute PromQL computation
        Querier1-->>LeaderRegion: Return computed results
        
        Querier2->>LocalStorage: Load local region data
        LocalStorage-->>Querier2: Local metrics data
        
        alt Super Cluster Enabled
            Querier2->>RemoteRegion: gRPC Metrics.Data (same time range)
            RemoteRegion-->>Querier2: Stream metrics data
            Querier2->>Querier2: Merge local + remote data
        end
        
        Querier2->>Querier2: Execute PromQL computation
        Querier2-->>LeaderRegion: Return computed results
    end
    
    LeaderRegion->>LeaderRegion: Merge results from all queriers
    LeaderRegion->>LeaderRegion: Apply final aggregations
    LeaderRegion-->>User: Return final PromQL result
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

18 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR implements super cluster support for metrics queries, enabling a leader region to fetch and aggregate data from multiple regions. The implementation follows a distributed query pattern where the leader region partitions requests by time range, dispatches them to queriers across all regions, and merges the results.

Key changes:

  • Added new gRPC streaming service metrics.data for cross-region data retrieval
  • Introduced QueryContext struct to encapsulate query execution parameters
  • Added API parameters: search_type, regions, and clusters for region/cluster selection
  • Migrated from std::collections::HashSet to hashbrown::HashSet across promql modules

Reviewed changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 7 comments.

Show a summary per file
File Description
src/service/search/super_cluster/leader.rs Renamed variables from "nodes" to "clusters" for clarity in super cluster context
src/service/promql/search/mod.rs Refactored cache logic to use boolean use_cache instead of negated cache_disabled; added job ID generation from trace_id
src/service/promql/search/grpc/mod.rs Added new data() function for streaming metrics responses with time-range partitioning
src/service/promql/search/grpc/wal.rs Changed label_selector parameter from Option<HashSet> to HashSet
src/service/promql/engine.rs Integrated super cluster data loading with local cluster data; refactored to use QueryContext; updated all tests
src/service/promql/exec.rs Refactored PromqlContext to use new QueryContext struct for better parameter organization
src/service/promql/utils.rs Simplified apply_label_selector to accept HashSet directly instead of Option<HashSet>
src/service/promql/mod.rs Added new fields to MetricsQueryRequest for super cluster support
src/service/alerts/mod.rs Updated alert evaluation to check for super cluster configuration
src/proto/proto/cluster/metrics.proto Added new gRPC Data streaming method and new fields for super cluster configuration
src/proto/src/generated/cluster.rs Generated code from proto changes
src/handler/grpc/request/metrics/querier.rs Implemented new data() gRPC method for streaming metrics data
src/handler/grpc/mod.rs Updated conversion logic for MetricsQueryRequest
src/handler/http/request/promql/mod.rs Added super cluster detection and new API parameters handling
src/config/src/meta/promql/value.rs Added QueryContext struct for query execution parameters
src/config/src/meta/promql/mod.rs Added custom deserializer for comma-separated or array regions/clusters parameters
src/config/src/meta/cluster.rs Added is_local() method to NodeInfo trait
src/config/src/cluster.rs Refactored to extract get_local_http_addr() and get_local_grpc_addr() helper functions
src/common/infra/cluster/nats.rs Used new helper functions for consistent address generation
src/infra/src/table/users.rs Made user inserts idempotent by treating unique constraint violations as success
src/infra/src/table/organizations.rs Made organization inserts idempotent by treating unique constraint violations as success
src/infra/src/table/org_users.rs Made org_user inserts idempotent by treating unique constraint violations as success

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@hengfeiyang hengfeiyang added the Needs-Testing Needs-Testing label Nov 27, 2025
@hengfeiyang hengfeiyang merged commit fa0672a into main Nov 27, 2025
37 of 41 checks passed
@hengfeiyang hengfeiyang deleted the feat/metrics-supercluster branch November 27, 2025 12:39
@hengfeiyang hengfeiyang mentioned this pull request Dec 4, 2025
16 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants