Skip to content

[cluster telemetry] Internal telemetry endpoint#7672

Merged
coszio merged 4 commits intodevfrom
internal-telemetry
Dec 23, 2025
Merged

[cluster telemetry] Internal telemetry endpoint#7672
coszio merged 4 commits intodevfrom
internal-telemetry

Conversation

@coszio
Copy link
Copy Markdown
Contributor

@coszio coszio commented Dec 2, 2025

Depends on #7631

Implements the internal service endpoint

@coszio coszio changed the title Internal telemetry [cluster telemetry] Internal telemetry endpoint Dec 2, 2025
@coszio coszio force-pushed the internal-telemetry branch from fed31a7 to 28636e4 Compare December 4, 2025 20:03
@coszio coszio force-pushed the telemetry-conversions branch from dc482ed to 2477dc0 Compare December 8, 2025 19:58
@coszio coszio force-pushed the internal-telemetry branch from 28636e4 to b36246c Compare December 8, 2025 20:02
@coszio coszio force-pushed the telemetry-conversions branch from 2477dc0 to d3b27aa Compare December 8, 2025 20:14
@coszio coszio force-pushed the internal-telemetry branch from b36246c to b3a25aa Compare December 8, 2025 20:14
@coszio coszio force-pushed the telemetry-conversions branch from 0af693e to a412db6 Compare December 9, 2025 13:55
@coszio coszio force-pushed the internal-telemetry branch from b3a25aa to 45e92dc Compare December 9, 2025 13:56
@coszio coszio added this to the Cluster Telemetry milestone Dec 9, 2025
@coszio coszio requested a review from timvisee December 9, 2025 17:27
@coszio coszio force-pushed the telemetry-conversions branch from a412db6 to 94c4271 Compare December 9, 2025 18:00
@coszio coszio force-pushed the internal-telemetry branch from 45e92dc to d0f8b28 Compare December 9, 2025 18:00
@coszio coszio force-pushed the telemetry-conversions branch from 756b4e8 to 063c5c3 Compare December 16, 2025 23:07
@coszio coszio force-pushed the internal-telemetry branch from 99ec597 to 5244eda Compare December 17, 2025 13:54
@coszio coszio marked this pull request as ready for review December 17, 2025 13:55
Base automatically changed from telemetry-conversions to dev December 23, 2025 17:00
@coszio coszio force-pushed the internal-telemetry branch from 5244eda to 6e4eabf Compare December 23, 2025 17:02
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 23, 2025

📝 Walkthrough

Walkthrough

This pull request refactors the internal telemetry API by renaming gRPC methods and messages from GetPeerTelemetry* to GetTelemetry*, reorganizing protobuf field tags, and adding a timeout parameter. The changes propagate through generated Rust code, consensus module parameters, and telemetry initialization logic. A new QdrantInternalService implementation is introduced to handle telemetry collection with proper async mutex wrapping, while telemetry collectors are separated into general-purpose and tonic-specific variants in the wiring layer.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

release:1.16.1

Suggested reviewers

  • timvisee
  • generall

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.15% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The PR title '[cluster telemetry] Internal telemetry endpoint' clearly and specifically describes the main change: implementing an internal telemetry service endpoint for cluster operations.
Description check ✅ Passed The PR description relates to the changeset by mentioning implementation of the internal service endpoint, which aligns with the substantial changes across proto definitions, gRPC service implementation, and telemetry routing shown in the commit messages and file changes.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch internal-telemetry

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/tonic/api/qdrant_internal_api.rs (1)

77-83: Redundant .max(2) call after validation.

Line 77-81 already validates that details_level >= 2, so the .max(2) on line 83 is redundant. Consider removing it for clarity.

🔎 Suggested simplification
-        let details_level = DetailsLevel::from(request.details_level.max(2) as usize);
+        let details_level = DetailsLevel::from(request.details_level as usize);
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 64c806d and 6e4eabf.

📒 Files selected for processing (9)
  • lib/api/src/grpc/proto/qdrant_internal_service.proto
  • lib/api/src/grpc/proto/telemetry_internal.proto
  • lib/api/src/grpc/qdrant.rs
  • src/common/telemetry.rs
  • src/consensus.rs
  • src/main.rs
  • src/tonic/api/mod.rs
  • src/tonic/api/qdrant_internal_api.rs
  • src/tonic/mod.rs
💤 Files with no reviewable changes (1)
  • src/common/telemetry.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • src/tonic/api/mod.rs
  • src/tonic/api/qdrant_internal_api.rs
  • src/consensus.rs
  • src/main.rs
  • src/tonic/mod.rs
  • lib/api/src/grpc/qdrant.rs
🧠 Learnings (8)
📓 Common learnings
Learnt from: generall
Repo: qdrant/qdrant PR: 7183
File: lib/api/src/grpc/qdrant.rs:4263-4273
Timestamp: 2025-09-16T19:14:17.614Z
Learning: In qdrant, lib/api/src/grpc/qdrant.rs is auto-generated by prost-build; do not edit it directly. Make changes in lib/api/src/grpc/proto/points.proto (e.g., add [deprecated=true], doc comments, or encoding options), then regenerate the Rust code.
📚 Learning: 2025-03-03T15:54:45.553Z
Learnt from: generall
Repo: qdrant/qdrant PR: 6088
File: lib/segment/src/index/field_index/mod.rs:18-18
Timestamp: 2025-03-03T15:54:45.553Z
Learning: In the Qdrant codebase, modules can be organized as directories with their own `mod.rs` file, rather than single files, following standard Rust module organization patterns.

Applied to files:

  • src/tonic/api/mod.rs
📚 Learning: 2025-09-16T19:14:17.614Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7183
File: lib/api/src/grpc/qdrant.rs:4263-4273
Timestamp: 2025-09-16T19:14:17.614Z
Learning: In qdrant, lib/api/src/grpc/qdrant.rs is auto-generated by prost-build; do not edit it directly. Make changes in lib/api/src/grpc/proto/points.proto (e.g., add [deprecated=true], doc comments, or encoding options), then regenerate the Rust code.

Applied to files:

  • src/tonic/api/mod.rs
  • lib/api/src/grpc/proto/telemetry_internal.proto
  • lib/api/src/grpc/proto/qdrant_internal_service.proto
  • src/tonic/api/qdrant_internal_api.rs
  • src/tonic/mod.rs
  • lib/api/src/grpc/qdrant.rs
📚 Learning: 2025-05-12T12:54:27.872Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 6503
File: lib/segment/src/index/field_index/geo_index/mmap_geo_index.rs:60-74
Timestamp: 2025-05-12T12:54:27.872Z
Learning: In the Qdrant codebase, using `pub(super)` visibility is preferred when fields need to be accessed by sibling modules (particularly for index type conversions), as it provides the necessary access without bloating the interface with numerous getters.

Applied to files:

  • src/tonic/api/mod.rs
📚 Learning: 2025-08-10T18:25:16.206Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:10626-10634
Timestamp: 2025-08-10T18:25:16.206Z
Learning: In Qdrant, docs/redoc/master/openapi.json is autogenerated from the Rust REST schema definitions (e.g., lib/api/src/rest/schema.rs). To change OpenAPI descriptions, update the Rust struct field doc comments/attributes rather than editing the JSON directly.

Applied to files:

  • src/tonic/api/mod.rs
📚 Learning: 2025-08-10T18:26:33.017Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:10393-10401
Timestamp: 2025-08-10T18:26:33.017Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated. Any description changes must be made in the source (e.g., lib/api/src/rest/schema.rs and/or lib/api/src/grpc/proto/points.proto), not edited directly in the JSON.

Applied to files:

  • lib/api/src/grpc/proto/qdrant_internal_service.proto
📚 Learning: 2025-08-15T15:56:36.821Z
Learnt from: coszio
Repo: qdrant/qdrant PR: 7065
File: lib/collection/src/operations/universal_query/shard_query.rs:0-0
Timestamp: 2025-08-15T15:56:36.821Z
Learning: Internal gRPC APIs in Qdrant don't require additional runtime validation when proto-level validation is already configured via build.rs range constraints.

Applied to files:

  • src/tonic/api/qdrant_internal_api.rs
  • src/tonic/mod.rs
📚 Learning: 2025-08-10T18:30:02.986Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: lib/collection/src/operations/verification/update.rs:158-174
Timestamp: 2025-08-10T18:30:02.986Z
Learning: In Qdrant's strict mode verification code (lib/collection/src/operations/verification/update.rs), exhaustive pattern matching without `..` is intentionally used for structs like PointsBatch and PointsList. This design pattern ensures compilation fails when new fields are added, forcing developers to explicitly consider how new fields should be handled in the indexed_filter_write method. This provides visibility and compile-time safety for struct evolution.

Applied to files:

  • src/tonic/mod.rs
🧬 Code graph analysis (4)
src/tonic/api/qdrant_internal_api.rs (3)
lib/api/src/grpc/qdrant.rs (24)
  • sync (10871-10895)
  • sync (11327-11333)
  • new (2579-2582)
  • new (3126-3128)
  • new (4031-4034)
  • new (4291-4293)
  • new (7559-7562)
  • new (8580-8582)
  • new (10788-10791)
  • new (11455-11457)
  • new (12804-12807)
  • new (12986-12988)
  • new (13319-13322)
  • new (13517-13519)
  • new (13942-13945)
  • new (14214-14216)
  • new (14774-14777)
  • new (14988-14990)
  • get_consensus_commit (12863-12887)
  • get_consensus_commit (12952-12958)
  • wait_on_consensus_commit (12889-12915)
  • wait_on_consensus_commit (12960-12966)
  • get_telemetry (12917-12941)
  • get_telemetry (12968-12974)
lib/storage/src/rbac/mod.rs (1)
  • full (99-101)
src/common/telemetry.rs (2)
  • try_from (149-180)
  • try_from (186-218)
src/consensus.rs (1)
src/actix/api/service_api.rs (1)
  • telemetry (44-71)
src/main.rs (3)
lib/storage/src/content_manager/toc/mod.rs (1)
  • new (94-214)
lib/storage/src/content_manager/consensus_manager.rs (1)
  • new (108-149)
src/common/health.rs (1)
  • spawn (41-68)
lib/api/src/grpc/qdrant.rs (1)
src/tonic/api/qdrant_internal_api.rs (2)
  • get_telemetry (71-106)
  • new (29-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: rust-tests (ubuntu-latest)
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: lint
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: Test Python bindings
  • GitHub Check: e2e-tests
  • GitHub Check: integration-tests
  • GitHub Check: test-consistency
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-consensus-compose
  • GitHub Check: test-shard-snapshot-api-s3-minio
🔇 Additional comments (17)
src/tonic/api/mod.rs (1)

5-5: LGTM!

The new qdrant_internal_api module is properly exposed following the same pattern as other internal API modules (collections_internal_api, points_internal_api).

lib/api/src/grpc/proto/telemetry_internal.proto (1)

8-18: Proto field tag renumbering is a breaking wire change.

Renaming GetPeerTelemetryRequest to GetTelemetryRequest and moving details_level from tag 2 to tag 1 breaks wire compatibility with any existing clients using the previous schema. Since this is an internal service proto used for cluster communication, ensure all cluster nodes are upgraded together to avoid protocol mismatches during rolling upgrades.

The addition of the timeout field (tag 2) is a safe, additive change.

lib/api/src/grpc/proto/qdrant_internal_service.proto (1)

17-18: LGTM!

The RPC method rename from GetPeerTelemetry to GetTelemetry aligns with the message renames in telemetry_internal.proto. Ensure coordinated cluster upgrades as noted in the proto message review.

src/consensus.rs (2)

72-73: LGTM!

The signature change properly separates the general-purpose TelemetryCollector (wrapped in tokio::sync::Mutex for async access) from the tonic-specific TonicTelemetryCollector. This clean separation allows the internal API to access full telemetry data while the middleware continues to track request metrics independently.


160-161: LGTM!

Both telemetry collectors are correctly passed to init_internal, maintaining the separation of concerns established in the function signature.

src/main.rs (4)

404-405: LGTM!

The variable declarations before the conditional block is a clean pattern that ensures both code paths properly initialize all required variables while avoiding code duplication.


425-429: LGTM!

The telemetry initialization correctly:

  1. Creates TelemetryCollector with the dispatcher
  2. Extracts the tonic_telemetry_collector before wrapping
  3. Wraps TelemetryCollector in Arc<tokio::sync::Mutex<_>> for shared async access

This matches the pattern used in src/actix/api/service_api.rs for the REST API telemetry endpoint.


453-454: LGTM!

Both telemetry collectors are correctly cloned and passed to Consensus::run, matching the updated function signature.


508-517: LGTM!

The non-distributed mode initialization mirrors the distributed mode pattern for telemetry setup, with health_checker correctly set to None since health checking is only relevant for distributed deployments.

src/tonic/api/qdrant_internal_api.rs (3)

19-40: LGTM!

The QdrantInternalService struct properly encapsulates the dependencies needed for the internal API: telemetry collection, settings, and consensus state. The constructor follows Rust conventions.


44-52: Verify integer type conversions for commit/term.

The as _ casts on lines 49-50 perform implicit type inference. The proto uses int64 for commit/term (see GetConsensusCommitResponse), but hard_state.commit and hard_state.term are u64 in raft. If these values exceed i64::MAX, the cast will produce negative values in the response.

Consider using explicit conversion with range checking or documenting that this is acceptable for your use case (e.g., if values are guaranteed to stay within i64 range during normal operation).


90-106: LGTM!

The telemetry collection logic correctly:

  1. Uses Access::full for internal cluster communication (appropriate for trusted internal traffic)
  2. Locks the telemetry collector with async mutex
  3. Passes the timeout to prepare_data
  4. Converts TelemetryData to PeerTelemetry using the established TryFrom conversion
  5. Includes timing information in the response
src/tonic/mod.rs (3)

211-212: LGTM!

The updated init_internal signature properly accepts both telemetry collectors:

  • telemetry_collector (tokio Mutex) for the internal API's telemetry data collection
  • tonic_telemetry_collector (parking_lot Mutex) for the gRPC middleware metrics

This clean separation allows different synchronization strategies optimized for each use case.


233-234: LGTM!

The QdrantInternalService is correctly constructed with the telemetry collector, settings, and consensus state, matching the new constructor signature defined in qdrant_internal_api.rs.


261-265: LGTM!

The middleware layer correctly uses tonic_telemetry_collector for request telemetry, while the internal service uses the full telemetry_collector. This ensures the middleware tracks gRPC request metrics independently from the telemetry data exposed via the internal API.

lib/api/src/grpc/qdrant.rs (2)

12452-12467: Telemetry request/response schema looks consistent; clarify timeout semantics & remember proto is the source of truth

  • GetTelemetryRequest (details_level: u32, timeout: u64) and GetTelemetryResponse (result: Option<PeerTelemetry>, time: f64) line up with the tonic handler in src/tonic/api/qdrant_internal_api.rs (types and usage match).
  • Because timeout is a non‑optional uint64, callers that omit it (or leave it at default) will send 0, which the handler currently turns into Duration::from_secs(0). If 0 is meant to mean “no extra timeout” rather than “immediate timeout”, consider documenting that explicitly in the .proto comments or making the field optional in the proto so that None can be distinguished from an explicit 0 (the change would then flow into this generated file).
  • This file is prost‑generated; any future tweaks to the telemetry schema or field docs should be done in the relevant .proto files (e.g. qdrant_internal_service.proto / telemetry_internal.proto) and then regenerated, rather than editing this Rust directly. Based on learnings, this file should not be hand‑edited.

12916-12941: New GetTelemetry gRPC wiring for QdrantInternal is coherent and correctly hooked up

  • Client stub, service trait, and server dispatch all use the same method name (GetTelemetry), request/response types (GetTelemetryRequest/GetTelemetryResponse), and fully qualified path ("/qdrant.QdrantInternal/GetTelemetry"), so there’s no mismatch between client, trait, and server router.
  • The tonic patterns (readiness check, GrpcMethod extension insertion, compression/size config in the server) follow the existing generated style for other RPCs in this file.

No issues spotted here; the internal telemetry RPC plumbing looks good.

Also applies to: 12967-12975, 13152-13197

Comment on lines +54 to +69
async fn wait_on_consensus_commit(
&self,
request: Request<WaitOnConsensusCommitRequest>,
) -> Result<Response<WaitOnConsensusCommitResponse>, Status> {
let request = request.into_inner();
let commit = request.commit as u64;
let term = request.term as u64;
let timeout = Duration::from_secs(request.timeout as u64);
let consensus_tick = Duration::from_millis(self.settings.cluster.consensus.tick_period_ms);
let ok = self
.consensus_state
.wait_for_consensus_commit(commit, term, consensus_tick, timeout)
.await
.is_ok();
Ok(Response::new(WaitOnConsensusCommitResponse { ok }))
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Potential issue with negative value handling in request parsing.

Lines 59-61 cast i64 values from the proto to u64. If a client sends negative values (e.g., commit = -1), as u64 will produce a very large positive number, which may cause unexpected behavior in wait_for_consensus_commit.

Consider adding validation for non-negative values or using try_into() with proper error handling.

🔎 Suggested validation
     async fn wait_on_consensus_commit(
         &self,
         request: Request<WaitOnConsensusCommitRequest>,
     ) -> Result<Response<WaitOnConsensusCommitResponse>, Status> {
         let request = request.into_inner();
-        let commit = request.commit as u64;
-        let term = request.term as u64;
-        let timeout = Duration::from_secs(request.timeout as u64);
+        let commit = u64::try_from(request.commit)
+            .map_err(|_| Status::invalid_argument("commit must be non-negative"))?;
+        let term = u64::try_from(request.term)
+            .map_err(|_| Status::invalid_argument("term must be non-negative"))?;
+        let timeout = Duration::from_secs(
+            u64::try_from(request.timeout)
+                .map_err(|_| Status::invalid_argument("timeout must be non-negative"))?,
+        );
         let consensus_tick = Duration::from_millis(self.settings.cluster.consensus.tick_period_ms);

@coszio coszio merged commit 8ab86b6 into dev Dec 23, 2025
15 checks passed
@coszio coszio deleted the internal-telemetry branch December 23, 2025 17:32
@coderabbitai coderabbitai Bot mentioned this pull request Feb 7, 2026
generall pushed a commit that referenced this pull request Feb 9, 2026
* move QdrantInternalService to its own file

* implement internal telemetry endpoint

cleaner diff on src/main.rs

upd response in proto

* use try_from conversion

* rename request/response in proto
@coderabbitai coderabbitai Bot mentioned this pull request Apr 15, 2026
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants