[cluster telemetry] Internal telemetry endpoint#7672
Conversation
fed31a7 to
28636e4
Compare
dc482ed to
2477dc0
Compare
28636e4 to
b36246c
Compare
2477dc0 to
d3b27aa
Compare
b36246c to
b3a25aa
Compare
0af693e to
a412db6
Compare
b3a25aa to
45e92dc
Compare
a412db6 to
94c4271
Compare
45e92dc to
d0f8b28
Compare
756b4e8 to
063c5c3
Compare
99ec597 to
5244eda
Compare
cleaner diff on src/main.rs upd response in proto
5244eda to
6e4eabf
Compare
📝 WalkthroughWalkthroughThis pull request refactors the internal telemetry API by renaming gRPC methods and messages from Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
src/tonic/api/qdrant_internal_api.rs (1)
77-83: Redundant.max(2)call after validation.Line 77-81 already validates that
details_level >= 2, so the.max(2)on line 83 is redundant. Consider removing it for clarity.🔎 Suggested simplification
- let details_level = DetailsLevel::from(request.details_level.max(2) as usize); + let details_level = DetailsLevel::from(request.details_level as usize);
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (9)
lib/api/src/grpc/proto/qdrant_internal_service.protolib/api/src/grpc/proto/telemetry_internal.protolib/api/src/grpc/qdrant.rssrc/common/telemetry.rssrc/consensus.rssrc/main.rssrc/tonic/api/mod.rssrc/tonic/api/qdrant_internal_api.rssrc/tonic/mod.rs
💤 Files with no reviewable changes (1)
- src/common/telemetry.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicitSomeType::from(x)over implicitx.into()in Rust
In new code, don't usetransmute_from_u8,transmute_to_u8,transmute_from_u8_to_slice,transmute_from_u8_to_mut_slice,transmute_to_u8_slice- usebytemuckorzerocopycrates instead
Prefer explicit exhaustive matches over catch-all_arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using: _over using..in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)
Files:
src/tonic/api/mod.rssrc/tonic/api/qdrant_internal_api.rssrc/consensus.rssrc/main.rssrc/tonic/mod.rslib/api/src/grpc/qdrant.rs
🧠 Learnings (8)
📓 Common learnings
Learnt from: generall
Repo: qdrant/qdrant PR: 7183
File: lib/api/src/grpc/qdrant.rs:4263-4273
Timestamp: 2025-09-16T19:14:17.614Z
Learning: In qdrant, lib/api/src/grpc/qdrant.rs is auto-generated by prost-build; do not edit it directly. Make changes in lib/api/src/grpc/proto/points.proto (e.g., add [deprecated=true], doc comments, or encoding options), then regenerate the Rust code.
📚 Learning: 2025-03-03T15:54:45.553Z
Learnt from: generall
Repo: qdrant/qdrant PR: 6088
File: lib/segment/src/index/field_index/mod.rs:18-18
Timestamp: 2025-03-03T15:54:45.553Z
Learning: In the Qdrant codebase, modules can be organized as directories with their own `mod.rs` file, rather than single files, following standard Rust module organization patterns.
Applied to files:
src/tonic/api/mod.rs
📚 Learning: 2025-09-16T19:14:17.614Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7183
File: lib/api/src/grpc/qdrant.rs:4263-4273
Timestamp: 2025-09-16T19:14:17.614Z
Learning: In qdrant, lib/api/src/grpc/qdrant.rs is auto-generated by prost-build; do not edit it directly. Make changes in lib/api/src/grpc/proto/points.proto (e.g., add [deprecated=true], doc comments, or encoding options), then regenerate the Rust code.
Applied to files:
src/tonic/api/mod.rslib/api/src/grpc/proto/telemetry_internal.protolib/api/src/grpc/proto/qdrant_internal_service.protosrc/tonic/api/qdrant_internal_api.rssrc/tonic/mod.rslib/api/src/grpc/qdrant.rs
📚 Learning: 2025-05-12T12:54:27.872Z
Learnt from: timvisee
Repo: qdrant/qdrant PR: 6503
File: lib/segment/src/index/field_index/geo_index/mmap_geo_index.rs:60-74
Timestamp: 2025-05-12T12:54:27.872Z
Learning: In the Qdrant codebase, using `pub(super)` visibility is preferred when fields need to be accessed by sibling modules (particularly for index type conversions), as it provides the necessary access without bloating the interface with numerous getters.
Applied to files:
src/tonic/api/mod.rs
📚 Learning: 2025-08-10T18:25:16.206Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:10626-10634
Timestamp: 2025-08-10T18:25:16.206Z
Learning: In Qdrant, docs/redoc/master/openapi.json is autogenerated from the Rust REST schema definitions (e.g., lib/api/src/rest/schema.rs). To change OpenAPI descriptions, update the Rust struct field doc comments/attributes rather than editing the JSON directly.
Applied to files:
src/tonic/api/mod.rs
📚 Learning: 2025-08-10T18:26:33.017Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:10393-10401
Timestamp: 2025-08-10T18:26:33.017Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated. Any description changes must be made in the source (e.g., lib/api/src/rest/schema.rs and/or lib/api/src/grpc/proto/points.proto), not edited directly in the JSON.
Applied to files:
lib/api/src/grpc/proto/qdrant_internal_service.proto
📚 Learning: 2025-08-15T15:56:36.821Z
Learnt from: coszio
Repo: qdrant/qdrant PR: 7065
File: lib/collection/src/operations/universal_query/shard_query.rs:0-0
Timestamp: 2025-08-15T15:56:36.821Z
Learning: Internal gRPC APIs in Qdrant don't require additional runtime validation when proto-level validation is already configured via build.rs range constraints.
Applied to files:
src/tonic/api/qdrant_internal_api.rssrc/tonic/mod.rs
📚 Learning: 2025-08-10T18:30:02.986Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: lib/collection/src/operations/verification/update.rs:158-174
Timestamp: 2025-08-10T18:30:02.986Z
Learning: In Qdrant's strict mode verification code (lib/collection/src/operations/verification/update.rs), exhaustive pattern matching without `..` is intentionally used for structs like PointsBatch and PointsList. This design pattern ensures compilation fails when new fields are added, forcing developers to explicitly consider how new fields should be handled in the indexed_filter_write method. This provides visibility and compile-time safety for struct evolution.
Applied to files:
src/tonic/mod.rs
🧬 Code graph analysis (4)
src/tonic/api/qdrant_internal_api.rs (3)
lib/api/src/grpc/qdrant.rs (24)
sync(10871-10895)sync(11327-11333)new(2579-2582)new(3126-3128)new(4031-4034)new(4291-4293)new(7559-7562)new(8580-8582)new(10788-10791)new(11455-11457)new(12804-12807)new(12986-12988)new(13319-13322)new(13517-13519)new(13942-13945)new(14214-14216)new(14774-14777)new(14988-14990)get_consensus_commit(12863-12887)get_consensus_commit(12952-12958)wait_on_consensus_commit(12889-12915)wait_on_consensus_commit(12960-12966)get_telemetry(12917-12941)get_telemetry(12968-12974)lib/storage/src/rbac/mod.rs (1)
full(99-101)src/common/telemetry.rs (2)
try_from(149-180)try_from(186-218)
src/consensus.rs (1)
src/actix/api/service_api.rs (1)
telemetry(44-71)
src/main.rs (3)
lib/storage/src/content_manager/toc/mod.rs (1)
new(94-214)lib/storage/src/content_manager/consensus_manager.rs (1)
new(108-149)src/common/health.rs (1)
spawn(41-68)
lib/api/src/grpc/qdrant.rs (1)
src/tonic/api/qdrant_internal_api.rs (2)
get_telemetry(71-106)new(29-39)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: rust-tests (ubuntu-latest)
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: lint
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: Test Python bindings
- GitHub Check: e2e-tests
- GitHub Check: integration-tests
- GitHub Check: test-consistency
- GitHub Check: integration-tests-consensus
- GitHub Check: test-consensus-compose
- GitHub Check: test-shard-snapshot-api-s3-minio
🔇 Additional comments (17)
src/tonic/api/mod.rs (1)
5-5: LGTM!The new
qdrant_internal_apimodule is properly exposed following the same pattern as other internal API modules (collections_internal_api,points_internal_api).lib/api/src/grpc/proto/telemetry_internal.proto (1)
8-18: Proto field tag renumbering is a breaking wire change.Renaming
GetPeerTelemetryRequesttoGetTelemetryRequestand movingdetails_levelfrom tag 2 to tag 1 breaks wire compatibility with any existing clients using the previous schema. Since this is an internal service proto used for cluster communication, ensure all cluster nodes are upgraded together to avoid protocol mismatches during rolling upgrades.The addition of the
timeoutfield (tag 2) is a safe, additive change.lib/api/src/grpc/proto/qdrant_internal_service.proto (1)
17-18: LGTM!The RPC method rename from
GetPeerTelemetrytoGetTelemetryaligns with the message renames intelemetry_internal.proto. Ensure coordinated cluster upgrades as noted in the proto message review.src/consensus.rs (2)
72-73: LGTM!The signature change properly separates the general-purpose
TelemetryCollector(wrapped intokio::sync::Mutexfor async access) from the tonic-specificTonicTelemetryCollector. This clean separation allows the internal API to access full telemetry data while the middleware continues to track request metrics independently.
160-161: LGTM!Both telemetry collectors are correctly passed to
init_internal, maintaining the separation of concerns established in the function signature.src/main.rs (4)
404-405: LGTM!The variable declarations before the conditional block is a clean pattern that ensures both code paths properly initialize all required variables while avoiding code duplication.
425-429: LGTM!The telemetry initialization correctly:
- Creates
TelemetryCollectorwith the dispatcher- Extracts the
tonic_telemetry_collectorbefore wrapping- Wraps
TelemetryCollectorinArc<tokio::sync::Mutex<_>>for shared async accessThis matches the pattern used in
src/actix/api/service_api.rsfor the REST API telemetry endpoint.
453-454: LGTM!Both telemetry collectors are correctly cloned and passed to
Consensus::run, matching the updated function signature.
508-517: LGTM!The non-distributed mode initialization mirrors the distributed mode pattern for telemetry setup, with
health_checkercorrectly set toNonesince health checking is only relevant for distributed deployments.src/tonic/api/qdrant_internal_api.rs (3)
19-40: LGTM!The
QdrantInternalServicestruct properly encapsulates the dependencies needed for the internal API: telemetry collection, settings, and consensus state. The constructor follows Rust conventions.
44-52: Verify integer type conversions for commit/term.The
as _casts on lines 49-50 perform implicit type inference. The proto usesint64for commit/term (seeGetConsensusCommitResponse), buthard_state.commitandhard_state.termareu64in raft. If these values exceedi64::MAX, the cast will produce negative values in the response.Consider using explicit conversion with range checking or documenting that this is acceptable for your use case (e.g., if values are guaranteed to stay within i64 range during normal operation).
90-106: LGTM!The telemetry collection logic correctly:
- Uses
Access::fullfor internal cluster communication (appropriate for trusted internal traffic)- Locks the telemetry collector with async mutex
- Passes the timeout to
prepare_data- Converts
TelemetryDatatoPeerTelemetryusing the establishedTryFromconversion- Includes timing information in the response
src/tonic/mod.rs (3)
211-212: LGTM!The updated
init_internalsignature properly accepts both telemetry collectors:
telemetry_collector(tokio Mutex) for the internal API's telemetry data collectiontonic_telemetry_collector(parking_lot Mutex) for the gRPC middleware metricsThis clean separation allows different synchronization strategies optimized for each use case.
233-234: LGTM!The
QdrantInternalServiceis correctly constructed with the telemetry collector, settings, and consensus state, matching the new constructor signature defined inqdrant_internal_api.rs.
261-265: LGTM!The middleware layer correctly uses
tonic_telemetry_collectorfor request telemetry, while the internal service uses the fulltelemetry_collector. This ensures the middleware tracks gRPC request metrics independently from the telemetry data exposed via the internal API.lib/api/src/grpc/qdrant.rs (2)
12452-12467: Telemetry request/response schema looks consistent; clarify timeout semantics & remember proto is the source of truth
GetTelemetryRequest(details_level: u32,timeout: u64) andGetTelemetryResponse(result: Option<PeerTelemetry>,time: f64) line up with the tonic handler insrc/tonic/api/qdrant_internal_api.rs(types and usage match).- Because
timeoutis a non‑optionaluint64, callers that omit it (or leave it at default) will send0, which the handler currently turns intoDuration::from_secs(0). If0is meant to mean “no extra timeout” rather than “immediate timeout”, consider documenting that explicitly in the.protocomments or making the field optional in the proto so thatNonecan be distinguished from an explicit0(the change would then flow into this generated file).- This file is prost‑generated; any future tweaks to the telemetry schema or field docs should be done in the relevant
.protofiles (e.g.qdrant_internal_service.proto/telemetry_internal.proto) and then regenerated, rather than editing this Rust directly. Based on learnings, this file should not be hand‑edited.
12916-12941: NewGetTelemetrygRPC wiring forQdrantInternalis coherent and correctly hooked up
- Client stub, service trait, and server dispatch all use the same method name (
GetTelemetry), request/response types (GetTelemetryRequest/GetTelemetryResponse), and fully qualified path ("/qdrant.QdrantInternal/GetTelemetry"), so there’s no mismatch between client, trait, and server router.- The tonic patterns (readiness check,
GrpcMethodextension insertion, compression/size config in the server) follow the existing generated style for other RPCs in this file.No issues spotted here; the internal telemetry RPC plumbing looks good.
Also applies to: 12967-12975, 13152-13197
| async fn wait_on_consensus_commit( | ||
| &self, | ||
| request: Request<WaitOnConsensusCommitRequest>, | ||
| ) -> Result<Response<WaitOnConsensusCommitResponse>, Status> { | ||
| let request = request.into_inner(); | ||
| let commit = request.commit as u64; | ||
| let term = request.term as u64; | ||
| let timeout = Duration::from_secs(request.timeout as u64); | ||
| let consensus_tick = Duration::from_millis(self.settings.cluster.consensus.tick_period_ms); | ||
| let ok = self | ||
| .consensus_state | ||
| .wait_for_consensus_commit(commit, term, consensus_tick, timeout) | ||
| .await | ||
| .is_ok(); | ||
| Ok(Response::new(WaitOnConsensusCommitResponse { ok })) | ||
| } |
There was a problem hiding this comment.
Potential issue with negative value handling in request parsing.
Lines 59-61 cast i64 values from the proto to u64. If a client sends negative values (e.g., commit = -1), as u64 will produce a very large positive number, which may cause unexpected behavior in wait_for_consensus_commit.
Consider adding validation for non-negative values or using try_into() with proper error handling.
🔎 Suggested validation
async fn wait_on_consensus_commit(
&self,
request: Request<WaitOnConsensusCommitRequest>,
) -> Result<Response<WaitOnConsensusCommitResponse>, Status> {
let request = request.into_inner();
- let commit = request.commit as u64;
- let term = request.term as u64;
- let timeout = Duration::from_secs(request.timeout as u64);
+ let commit = u64::try_from(request.commit)
+ .map_err(|_| Status::invalid_argument("commit must be non-negative"))?;
+ let term = u64::try_from(request.term)
+ .map_err(|_| Status::invalid_argument("term must be non-negative"))?;
+ let timeout = Duration::from_secs(
+ u64::try_from(request.timeout)
+ .map_err(|_| Status::invalid_argument("timeout must be non-negative"))?,
+ );
let consensus_tick = Duration::from_millis(self.settings.cluster.consensus.tick_period_ms);* move QdrantInternalService to its own file * implement internal telemetry endpoint cleaner diff on src/main.rs upd response in proto * use try_from conversion * rename request/response in proto
Depends on #7631
Implements the internal service endpoint