Skip to content

[cluster telemetry] Aggregate telemetries from peers#7696

Merged
coszio merged 5 commits intodevfrom
aggregate-telemetries
Dec 23, 2025
Merged

[cluster telemetry] Aggregate telemetries from peers#7696
coszio merged 5 commits intodevfrom
aggregate-telemetries

Conversation

@coszio
Copy link
Copy Markdown
Contributor

@coszio coszio commented Dec 5, 2025

Builds on top of #7672

This is the meat of the feature, so review carefully.

We take the TelemetryData from every peer and translate it into a single DistributedTelemetry, by selecting the best source for each piece of information.

As a base, we select the information from the peer with the highest term/commit. Then we select shard transfer info from the source, and resharding from the receiving peer.

@coszio coszio force-pushed the internal-telemetry branch from 28636e4 to b36246c Compare December 8, 2025 20:02
@coszio coszio force-pushed the aggregate-telemetries branch from ab8edb3 to fe7c664 Compare December 8, 2025 20:02
@coszio coszio force-pushed the internal-telemetry branch from b36246c to b3a25aa Compare December 8, 2025 20:14
@coszio coszio force-pushed the aggregate-telemetries branch from fe7c664 to 66a627e Compare December 8, 2025 20:14
@coszio coszio mentioned this pull request Dec 8, 2025
3 tasks
@coszio coszio force-pushed the internal-telemetry branch from b3a25aa to 45e92dc Compare December 9, 2025 13:56
@coszio coszio force-pushed the aggregate-telemetries branch from 66a627e to c0895bb Compare December 9, 2025 13:56
@coszio coszio added this to the Cluster Telemetry milestone Dec 9, 2025
@coszio coszio force-pushed the aggregate-telemetries branch from c0895bb to 852f913 Compare December 9, 2025 17:34
@coszio coszio force-pushed the internal-telemetry branch from 45e92dc to d0f8b28 Compare December 9, 2025 18:00
@coszio coszio force-pushed the aggregate-telemetries branch 2 times, most recently from 1cff9b1 to b85901c Compare December 10, 2025 16:20
@coszio coszio force-pushed the internal-telemetry branch from 99ec597 to 5244eda Compare December 17, 2025 13:54
@coszio coszio force-pushed the aggregate-telemetries branch from b85901c to 442d453 Compare December 17, 2025 13:56
@coszio coszio force-pushed the internal-telemetry branch from 5244eda to 6e4eabf Compare December 23, 2025 17:02
@coszio coszio force-pushed the aggregate-telemetries branch from 76246c1 to 85d8878 Compare December 23, 2025 17:04
@coszio coszio marked this pull request as ready for review December 23, 2025 17:16
Base automatically changed from internal-telemetry to dev December 23, 2025 17:32
@coszio coszio force-pushed the aggregate-telemetries branch from 85d8878 to 80fbfe0 Compare December 23, 2025 17:37
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Dec 23, 2025

📝 Walkthrough

Walkthrough

This pull request introduces a new distributed telemetry data aggregation feature. It adds four new public data structures (DistributedTelemetryData, DistributedCollectionTelemetry, DistributedClusterTelemetry, DistributedPeerInfo) implemented in a new module that aggregates telemetry data across cluster peers. The implementation includes a resolve_telemetries entry point that selects base telemetry by priority, aggregates per-collection and cluster-wide metrics, and handles missing or unresponsive peers. The changes are purely additive: the OpenAPI schema is extended with new definitions, the new Rust module is exported, and the schema generator is updated to include the distributed telemetry type.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • generall
  • timvisee

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: aggregating telemetries from cluster peers into a unified distributed telemetry representation.
Description check ✅ Passed The description is directly related to the changeset, explaining the core feature of aggregating telemetry data and the selection strategy used across peers.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch aggregate-telemetries

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
src/common/telemetry_ops/distributed_telemetry.rs (3)

1-1: Consider removing #![expect(dead_code)] if the module is publicly exported.

The DistributedTelemetryData struct and resolve_telemetries method are publicly exported and referenced by the schema generator. This attribute may be masking genuine unused code that could be removed, or it may be premature if the public API will be consumed soon.

If this is temporary scaffolding until the calling code is merged, consider adding a comment explaining the expected timeline.


272-274: Consider using a HashSet for missing_peers to improve lookup performance.

missing_peers.contains(peer_id) performs a linear O(n) search for each peer. If the cluster has many peers and/or many missing peers, this becomes O(n×m). Converting missing_peers to a HashSet at the start of the function would make lookups O(1).

🔎 Proposed fix
 fn aggregate_peers_info(
     missing_peers: Vec<u64>,
     base_cluster: &ClusterTelemetry,
     telemetry_by_peer: &HashMap<u64, &TelemetryData>,
 ) -> Option<HashMap<u64, DistributedPeerInfo>> {
     let all_peers = base_cluster.peers.as_ref()?;
+    let missing_peers_set: ahash::HashSet<u64> = missing_peers.iter().copied().collect();

     let mut distributed_peers_info = HashMap::default();
     for (peer_id, peer_info) in all_peers {
         let responsive =
-            telemetry_by_peer.contains_key(peer_id) && !missing_peers.contains(peer_id);
+            telemetry_by_peer.contains_key(peer_id) && !missing_peers_set.contains(peer_id);

296-325: Defensive code for unexpected missing peers.

The debug_assert!(false, ...) on line 298 indicates this code path is not expected to execute under normal conditions. The subsequent logic gracefully handles the case anyway, which is good defensive programming. Consider adding a log::warn! in release mode to surface unexpected situations in production.

🔎 Optional: Add logging for unexpected path
     // Add any failed peers that aren't in the all_peers list
     for peer_id in missing_peers {
         debug_assert!(false, "all missing peers should have been listed already");
+        log::warn!("Missing peer {peer_id} was not in the all_peers list");

         if distributed_peers_info.contains_key(&peer_id) {
             continue;
         }
docs/redoc/master/openapi.json (1)

16342-16483: Distributed telemetry schemas look good; only minor consistency nits

The new DistributedTelemetryData, DistributedCollectionTelemetry, DistributedClusterTelemetry, and DistributedPeerInfo definitions are structurally consistent with the existing telemetry schemas and reuse the right component types (ReshardingInfo, ShardTransferInfo, StateRole, ConsensusThreadStatus). This is a clean, additive extension and shouldn’t break existing clients.

Two minor, optional consistency points you may want to consider in the Rust schemas (not here, since this JSON is generated):

  • DistributedClusterTelemetry.number_of_peers (Line 16406) and DistributedPeerInfo.num_pending_operations (Line 16465) use format: "uint64", whereas similar counters elsewhere (e.g. ClusterStatusTelemetry.number_of_peers, RaftInfo.pending_operations) use format: "uint". If the underlying Rust types are the same, aligning formats would keep the OpenAPI surface more uniform.
  • Naming-wise, num_pending_operations vs existing pending_operations is slightly inconsistent; either is fine, but using the same naming pattern everywhere would reduce cognitive load for API users.

Functionally everything here looks correct, so these can be deferred unless you particularly care about OpenAPI cosmetic consistency.

Based on learnings, if you want to adjust descriptions or field types, please do it in the Rust schema sources (e.g. lib/api/src/rest/schema.rs) so this JSON stays generated.

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 8ab86b6 and 80fbfe0.

📒 Files selected for processing (4)
  • docs/redoc/master/openapi.json
  • src/common/telemetry_ops/distributed_telemetry.rs
  • src/common/telemetry_ops/mod.rs
  • src/schema_generator.rs
🧰 Additional context used
📓 Path-based instructions (1)
**/*.rs

📄 CodeRabbit inference engine (.github/review-rules.md)

**/*.rs: Prefer explicit SomeType::from(x) over implicit x.into() in Rust
In new code, don't use transmute_from_u8, transmute_to_u8, transmute_from_u8_to_slice, transmute_from_u8_to_mut_slice, transmute_to_u8_slice - use bytemuck or zerocopy crates instead
Prefer explicit exhaustive matches over catch-all _ arm in Rust match expressions, so we don't miss handling new enum variants when they are added later (exceptions: tests and benchmarks, or when sure new variants won't affect logic)
Prefer explicit field ignoring using : _ over using .. in Rust struct destructuring (exceptions: tests and benchmarks, or when sure new fields won't affect logic)

Files:

  • src/common/telemetry_ops/mod.rs
  • src/schema_generator.rs
  • src/common/telemetry_ops/distributed_telemetry.rs
🧠 Learnings (3)
📚 Learning: 2025-09-16T19:14:17.614Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7183
File: lib/api/src/grpc/qdrant.rs:4263-4273
Timestamp: 2025-09-16T19:14:17.614Z
Learning: In qdrant, lib/api/src/grpc/qdrant.rs is auto-generated by prost-build; do not edit it directly. Make changes in lib/api/src/grpc/proto/points.proto (e.g., add [deprecated=true], doc comments, or encoding options), then regenerate the Rust code.

Applied to files:

  • src/schema_generator.rs
📚 Learning: 2025-08-10T18:26:12.443Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:13645-13652
Timestamp: 2025-08-10T18:26:12.443Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated from the REST schemas. To change field docs, edit lib/api/src/rest/schema.rs (e.g., add doc comments or #[schemars(description = ...)]). Specifically, UpdateVectors.update_filter lacked a description and should state: "If specified, only update vectors for points that match this filter; points not matching the filter are left unchanged."

Applied to files:

  • src/schema_generator.rs
  • docs/redoc/master/openapi.json
📚 Learning: 2025-08-10T18:26:33.017Z
Learnt from: generall
Repo: qdrant/qdrant PR: 7006
File: docs/redoc/master/openapi.json:10393-10401
Timestamp: 2025-08-10T18:26:33.017Z
Learning: In qdrant, docs/redoc/master/openapi.json is auto-generated. Any description changes must be made in the source (e.g., lib/api/src/rest/schema.rs and/or lib/api/src/grpc/proto/points.proto), not edited directly in the JSON.

Applied to files:

  • docs/redoc/master/openapi.json
🧬 Code graph analysis (1)
src/common/telemetry_ops/distributed_telemetry.rs (2)
lib/collection/src/telemetry.rs (1)
  • transfers (367-370)
lib/collection/src/shards/shard_holder/mod.rs (1)
  • get_transfers (1022-1032)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (11)
  • GitHub Check: test-consensus-compose
  • GitHub Check: Test Python bindings
  • GitHub Check: lint
  • GitHub Check: integration-tests
  • GitHub Check: integration-tests-consensus
  • GitHub Check: test-shard-snapshot-api-s3-minio
  • GitHub Check: e2e-tests
  • GitHub Check: test-consistency
  • GitHub Check: rust-tests (macos-latest)
  • GitHub Check: rust-tests (windows-latest)
  • GitHub Check: rust-tests (ubuntu-latest)
🔇 Additional comments (7)
src/common/telemetry_ops/distributed_telemetry.rs (5)

15-75: LGTM!

The struct definitions are well-designed with:

  • Appropriate use of Option for fields that may not always be available
  • Good use of skip_serializing_if to keep JSON output clean
  • Helpful documentation comments explaining each field's purpose

77-110: LGTM!

The resolve_telemetries method has sound logic:

  • Proper error handling when no valid telemetry is available
  • Efficient peer lookup map construction
  • Safe fallback with unwrap_or_default() for collections

169-177: Good use of explicit field ignoring.

The destructuring pattern correctly uses : _ to explicitly ignore unused fields (sync, method, comment), which aligns with the coding guidelines and ensures new fields added to ShardTransferInfo will be explicitly handled.


330-345: LGTM!

The newest_telemetry function correctly selects the telemetry with the highest (term, commit) tuple. The Option comparison semantics ensure that telemetries with valid consensus status are preferred over those without.


203-241: LGTM!

The resharding aggregation logic correctly prioritizes data from the peer mentioned in resharding.peer_id, falling back to the base telemetry when unavailable. This aligns with the PR objective of selecting the best source for each piece of information.

src/common/telemetry_ops/mod.rs (1)

5-5: LGTM!

The new module is correctly exported following the existing pattern.

src/schema_generator.rs (1)

34-34: LGTM!

The import and field addition correctly integrate DistributedTelemetryData into the schema generator, following the existing naming convention for fields.

Also applies to: 102-102

@coszio coszio merged commit e12fb6f into dev Dec 23, 2025
15 checks passed
@coszio coszio deleted the aggregate-telemetries branch December 23, 2025 18:10
generall pushed a commit that referenced this pull request Feb 9, 2026
* aggregate telemetries

* upd openapi

* expect dead code

* Pick just the highest telemetry

* Tweak transfer selection

---------

Co-authored-by: timvisee <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants