[cluster telemetry] Aggregate telemetries from peers #7696
📝 Walkthrough

This pull request introduces a new distributed telemetry data aggregation feature. It adds four new public data structures (`DistributedTelemetryData`, `DistributedCollectionTelemetry`, `DistributedClusterTelemetry`, `DistributedPeerInfo`) implemented in a new module that aggregates telemetry data across cluster peers. The implementation includes a `resolve_telemetries` entry point that selects base telemetry by priority, aggregates per-collection and cluster-wide metrics, and handles missing or unresponsive peers. The changes are purely additive: the OpenAPI schema is extended with new definitions, the new Rust module is exported, and the schema generator is updated to include the distributed telemetry type.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (4)
src/common/telemetry_ops/distributed_telemetry.rs (3)
1-1: Consider removing `#![expect(dead_code)]` if the module is publicly exported.

The `DistributedTelemetryData` struct and `resolve_telemetries` method are publicly exported and referenced by the schema generator. This attribute may be masking genuine unused code that could be removed, or it may be premature if the public API will be consumed soon. If this is temporary scaffolding until the calling code is merged, consider adding a comment explaining the expected timeline.
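As an illustrative aside (not from the PR), a narrower alternative is to scope the expectation to the specific item with a `reason`, so the compiler flags it once the code gains a caller:

```rust
// Hypothetical sketch: instead of a module-wide `#![expect(dead_code)]`,
// attach the expectation to the one item that is not called yet and say
// why. Once a caller appears, rustc reports the expectation as
// unfulfilled, so the attribute cannot silently go stale.
#[expect(dead_code, reason = "wired up by a follow-up change")]
fn resolve_telemetries_stub() {}

fn main() {}
```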
272-274: Consider using a `HashSet` for `missing_peers` to improve lookup performance.

`missing_peers.contains(peer_id)` performs a linear O(n) search for each peer. If the cluster has many peers and/or many missing peers, this becomes O(n×m). Converting `missing_peers` to a `HashSet` at the start of the function would make lookups O(1).

🔎 Proposed fix

```diff
 fn aggregate_peers_info(
     missing_peers: Vec<u64>,
     base_cluster: &ClusterTelemetry,
     telemetry_by_peer: &HashMap<u64, &TelemetryData>,
 ) -> Option<HashMap<u64, DistributedPeerInfo>> {
     let all_peers = base_cluster.peers.as_ref()?;
+    let missing_peers_set: ahash::HashSet<u64> = missing_peers.iter().copied().collect();
     let mut distributed_peers_info = HashMap::default();
     for (peer_id, peer_info) in all_peers {
         let responsive =
-            telemetry_by_peer.contains_key(peer_id) && !missing_peers.contains(peer_id);
+            telemetry_by_peer.contains_key(peer_id) && !missing_peers_set.contains(peer_id);
```
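Standalone, the suggested pattern looks like this (illustrative names, `std` collections instead of `ahash`):

```rust
use std::collections::{HashMap, HashSet};

fn main() {
    // Illustrative data: peers known to the cluster and peers that failed.
    let all_peers: HashMap<u64, &str> = HashMap::from([(1, "a"), (2, "b"), (3, "c")]);
    let missing_peers: Vec<u64> = vec![3];

    // One O(m) pass up front turns every later membership test into O(1),
    // instead of re-scanning the Vec inside the loop.
    let missing: HashSet<u64> = missing_peers.iter().copied().collect();

    for (peer_id, name) in &all_peers {
        let responsive = !missing.contains(peer_id);
        println!("peer {peer_id} ({name}): responsive = {responsive}");
    }
}
```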
296-325: Defensive code for unexpected missing peers.

The `debug_assert!(false, ...)` on line 298 indicates this code path is not expected to execute under normal conditions. The subsequent logic gracefully handles the case anyway, which is good defensive programming. Consider adding a `log::warn!` in release mode to surface unexpected situations in production.

🔎 Optional: Add logging for unexpected path

```diff
 // Add any failed peers that aren't in the all_peers list
 for peer_id in missing_peers {
     debug_assert!(false, "all missing peers should have been listed already");
+    log::warn!("Missing peer {peer_id} was not in the all_peers list");
     if distributed_peers_info.contains_key(&peer_id) {
         continue;
     }
```
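In isolation, the assert-in-debug, warn-in-release pattern looks like this (made-up function; assumes the `log` facade with any logger backend):

```rust
// Sketch of the pattern: debug builds abort here so the bug is caught in
// development, while release builds only emit a warning and keep going.
fn note_unexpected_peer(peer_id: u64) {
    debug_assert!(false, "peer {peer_id} should have been listed already");
    log::warn!("unexpected peer {peer_id}; continuing defensively");
}

fn main() {
    // note_unexpected_peer(42); // would panic in a debug build
}
```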
docs/redoc/master/openapi.json (1)

16342-16483: Distributed telemetry schemas look good; only minor consistency nits.

The new `DistributedTelemetryData`, `DistributedCollectionTelemetry`, `DistributedClusterTelemetry`, and `DistributedPeerInfo` definitions are structurally consistent with the existing telemetry schemas and reuse the right component types (`ReshardingInfo`, `ShardTransferInfo`, `StateRole`, `ConsensusThreadStatus`). This is a clean, additive extension and shouldn't break existing clients.

Two minor, optional consistency points you may want to consider in the Rust schemas (not here, since this JSON is generated):

- `DistributedClusterTelemetry.number_of_peers` (Line 16406) and `DistributedPeerInfo.num_pending_operations` (Line 16465) use `format: "uint64"`, whereas similar counters elsewhere (e.g. `ClusterStatusTelemetry.number_of_peers`, `RaftInfo.pending_operations`) use `format: "uint"`. If the underlying Rust types are the same, aligning formats would keep the OpenAPI surface more uniform.
- Naming-wise, `num_pending_operations` vs existing `pending_operations` is slightly inconsistent; either is fine, but using the same naming pattern everywhere would reduce cognitive load for API users.

Functionally everything here looks correct, so these can be deferred unless you particularly care about OpenAPI cosmetic consistency.

Based on learnings, if you want to adjust descriptions or field types, please do it in the Rust schema sources (e.g. lib/api/src/rest/schema.rs) so this JSON stays generated.
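For illustration, a doc comment or `schemars` attribute on the Rust type is what flows into the generated JSON; the struct below is a made-up stand-in, not the actual schema source:

```rust
use schemars::{schema_for, JsonSchema};

// Hypothetical stand-in for a type in lib/api/src/rest/schema.rs: doc
// comments and schemars attributes written here are what end up in the
// generated openapi.json.
#[derive(JsonSchema)]
pub struct ExampleClusterTelemetry {
    /// Total number of peers currently known to the cluster.
    pub number_of_peers: usize,

    #[schemars(description = "Number of consensus operations not yet applied on this peer")]
    pub pending_operations: usize,
}

fn main() {
    let schema = schema_for!(ExampleClusterTelemetry);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```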
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
docs/redoc/master/openapi.json
src/common/telemetry_ops/distributed_telemetry.rs
src/common/telemetry_ops/mod.rs
src/schema_generator.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs
📄 CodeRabbit inference engine (.github/review-rules.md)
**/*.rs: Prefer explicit `SomeType::from(x)` over implicit `x.into()` in Rust
In new code, don't use `transmute_from_u8`, `transmute_to_u8`, `transmute_from_u8_to_slice`, `transmute_from_u8_to_mut_slice`, `transmute_to_u8_slice` - use the `bytemuck` or `zerocopy` crates instead
Prefer explicit exhaustive matches over a catch-all `_` arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using `: _` over using `..` in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic); the last two guidelines are sketched after the file list below
Files:
src/common/telemetry_ops/mod.rs
src/schema_generator.rs
src/common/telemetry_ops/distributed_telemetry.rs
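A minimal sketch of the two match/destructuring guidelines above, using made-up types:

```rust
// Made-up types to illustrate the review guidelines above.
enum TransferMethod {
    Stream,
    Snapshot,
}

struct TransferInfo {
    from: u64,
    to: u64,
    sync: bool,
}

fn describe(method: TransferMethod, info: TransferInfo) -> String {
    // Exhaustive match: adding a new TransferMethod variant later becomes
    // a compile error here, instead of silently falling into a `_` arm.
    let method_name = match method {
        TransferMethod::Stream => "stream",
        TransferMethod::Snapshot => "snapshot",
    };

    // Explicit field ignoring: `sync: _` makes the unused field visible,
    // so a newly added field must also be acknowledged explicitly.
    let TransferInfo { from, to, sync: _ } = info;
    format!("{method_name} transfer {from} -> {to}")
}

fn main() {
    let msg = describe(
        TransferMethod::Stream,
        TransferInfo { from: 1, to: 2, sync: true },
    );
    println!("{msg}");
}
```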
🧠 Learnings (3)
📚 Learning: 2025-09-16T19:14:17.614Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7183
File: lib/api/src/grpc/qdrant.rs:4263-4273
Timestamp: 2025-09-16T19:14:17.614Z
Learning: In qdrant, lib/api/src/grpc/qdrant.rs is auto-generated by prost-build; do not edit it directly. Make changes in lib/api/src/grpc/proto/points.proto (e.g., add [deprecated=true], doc comments, or encoding options), then regenerate the Rust code.
Applied to files:
src/schema_generator.rs
📚 Learning: 2025-08-10T18:26:12.443Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:13645-13652
Timestamp: 2025-08-10T18:26:12.443Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated from the REST schemas. To change field docs, edit lib/api/src/rest/schema.rs (e.g., add doc comments or #[schemars(description = ...)]). Specifically, UpdateVectors.update_filter lacked a description and should state: "If specified, only update vectors for points that match this filter; points not matching the filter are left unchanged."
Applied to files:
src/schema_generator.rs
docs/redoc/master/openapi.json
📚 Learning: 2025-08-10T18:26:33.017Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:10393-10401
Timestamp: 2025-08-10T18:26:33.017Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated. Any description changes must be made in the source (e.g., lib/api/src/rest/schema.rs and/or lib/api/src/grpc/proto/points.proto), not edited directly in the JSON.
Applied to files:
docs/redoc/master/openapi.json
🧬 Code graph analysis (1)
src/common/telemetry_ops/distributed_telemetry.rs (2)
lib/collection/src/telemetry.rs (1)
  transfers (367-370)
lib/collection/src/shards/shard_holder/mod.rs (1)
  get_transfers (1022-1032)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
- GitHub Check: test-consensus-compose
- GitHub Check: Test Python bindings
- GitHub Check: lint
- GitHub Check: integration-tests
- GitHub Check: integration-tests-consensus
- GitHub Check: test-shard-snapshot-api-s3-minio
- GitHub Check: e2e-tests
- GitHub Check: test-consistency
- GitHub Check: rust-tests (macos-latest)
- GitHub Check: rust-tests (windows-latest)
- GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (7)
src/common/telemetry_ops/distributed_telemetry.rs (5)
15-75: LGTM! The struct definitions are well-designed with:

- Appropriate use of `Option` for fields that may not always be available
- Good use of `skip_serializing_if` to keep JSON output clean
- Helpful documentation comments explaining each field's purpose
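A hedged sketch of that serialization pattern (field names invented, not the PR's actual structs):

```rust
use serde::Serialize;

// Invented struct illustrating the Option + skip_serializing_if pattern:
// absent fields disappear from the JSON output instead of serializing
// as `null`, keeping the telemetry payload compact.
#[derive(Serialize)]
struct PeerTelemetrySketch {
    peer_id: u64,
    #[serde(skip_serializing_if = "Option::is_none")]
    last_responded_ms: Option<u64>,
}

fn main() {
    let responsive = PeerTelemetrySketch { peer_id: 1, last_responded_ms: Some(12) };
    let silent = PeerTelemetrySketch { peer_id: 2, last_responded_ms: None };
    println!("{}", serde_json::to_string(&responsive).unwrap()); // {"peer_id":1,"last_responded_ms":12}
    println!("{}", serde_json::to_string(&silent).unwrap());     // {"peer_id":2}
}
```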
77-110: LGTM! The `resolve_telemetries` method has sound logic (sketched below):

- Proper error handling when no valid telemetry is available
- Efficient peer lookup map construction
- Safe fallback with `unwrap_or_default()` for collections
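As a hedged sketch of that flow, with types and fields simplified (not the actual qdrant code):

```rust
use std::collections::HashMap;

// Simplified stand-in for the real telemetry type.
#[derive(Clone)]
struct PeerTelemetry {
    peer_id: u64,
    term: Option<u64>,
    commit: Option<u64>,
    collections: Option<Vec<String>>,
}

fn resolve(telemetries: Vec<PeerTelemetry>) -> Option<(PeerTelemetry, HashMap<u64, PeerTelemetry>)> {
    // Base view: the peer with the highest (term, commit) pair. Returns
    // None when no telemetry at all was collected.
    let base = telemetries.iter().max_by_key(|t| (t.term, t.commit))?.clone();
    // Lookup map so later passes can fetch any peer's report in O(1).
    let by_peer: HashMap<u64, PeerTelemetry> =
        telemetries.into_iter().map(|t| (t.peer_id, t)).collect();
    Some((base, by_peer))
}

fn main() {
    let peers = vec![
        PeerTelemetry { peer_id: 1, term: Some(3), commit: Some(10), collections: None },
        PeerTelemetry { peer_id: 2, term: Some(3), commit: Some(12), collections: Some(vec!["a".into()]) },
    ];
    let (base, by_peer) = resolve(peers).expect("at least one telemetry responded");
    assert_eq!(base.peer_id, 2); // highest (term, commit) wins
    // A missing collection list degrades to empty rather than erroring.
    let collections = base.collections.clone().unwrap_or_default();
    assert_eq!(collections, vec!["a".to_string()]);
    assert!(by_peer.contains_key(&1));
}
```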
169-177: Good use of explicit field ignoring.

The destructuring pattern correctly uses `: _` to explicitly ignore unused fields (`sync`, `method`, `comment`), which aligns with the coding guidelines and ensures new fields added to `ShardTransferInfo` will be explicitly handled.
330-345: LGTM!

The `newest_telemetry` function correctly selects the telemetry with the highest `(term, commit)` tuple. The `Option` comparison semantics ensure that telemetries with valid consensus status are preferred over those without.
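A small self-contained check of the `Option` ordering relied on here: `None` sorts below any `Some`, so peers without consensus status lose the comparison.

```rust
fn main() {
    // None < Some(_) for any value, so a telemetry with a real (term, commit)
    // always outranks one whose consensus status is missing.
    assert!(None::<u64> < Some(0));
    assert!((Some(3u64), Some(10u64)) > (Some(3), None));
    assert!((Some(3u64), Some(12u64)) > (Some(3), Some(10)));

    // max_by_key picks the lexicographically largest (term, commit) pair.
    let candidates = [(Some(2u64), Some(99u64)), (Some(3), None), (Some(3), Some(1))];
    let newest = candidates.iter().max_by_key(|&&k| k).unwrap();
    assert_eq!(*newest, (Some(3), Some(1)));
}
```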
203-241: LGTM!

The resharding aggregation logic correctly prioritizes data from the peer mentioned in `resharding.peer_id`, falling back to the base telemetry when unavailable. This aligns with the PR objective of selecting the best source for each piece of information.
src/common/telemetry_ops/mod.rs (1)

5-5: LGTM! The new module is correctly exported following the existing pattern.

src/schema_generator.rs (1)

34-34: LGTM! The import and field addition correctly integrate `DistributedTelemetryData` into the schema generator, following the existing naming convention for fields.

Also applies to: 102-102
Commits:

* aggregate telemetries
* upd openapi
* expect dead code
* Pick just the highest telemetry
* Tweak transfer selection

Co-authored-by: timvisee <[email protected]>
Builds on top of #7672
This is the meat of the feature, so review carefully.
We take the `TelemetryData` from every peer and translate it into a single `DistributedTelemetry`, by selecting the best source for each piece of information.

As a base, we select the information from the peer with the highest term/commit. Then we select shard transfer info from the source, and resharding from the receiving peer.
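A toy illustration of the per-piece override described here, with invented types (the real logic lives in distributed_telemetry.rs):

```rust
use std::collections::HashMap;

// Invented miniature of one shard transfer as each peer reports it.
#[derive(Clone, Debug)]
struct TransferView {
    source_peer: u64,
    progress_percent: u32,
}

// The source peer drives the transfer, so its report is the freshest;
// if that peer did not respond, keep the base peer's view instead.
fn pick_transfer(base: &TransferView, by_peer: &HashMap<u64, TransferView>) -> TransferView {
    by_peer
        .get(&base.source_peer)
        .cloned()
        .unwrap_or_else(|| base.clone())
}

fn main() {
    let base = TransferView { source_peer: 1, progress_percent: 5 };
    let mut by_peer = HashMap::new();
    by_peer.insert(1, TransferView { source_peer: 1, progress_percent: 80 });

    // Source peer responded: its fresher progress overrides the base.
    assert_eq!(pick_transfer(&base, &by_peer).progress_percent, 80);

    // Source peer missing: the base view is retained.
    by_peer.clear();
    assert_eq!(pick_transfer(&base, &by_peer).progress_percent, 5);
}
```

The resharding side follows the same shape, keyed on the receiving peer instead of the source.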