feat: use nats queue for cluster coordination #8413
Conversation
PR Reviewer Guide 🔍
Here are some key observations to aid the review process:
Greptile Summary
This PR implements a significant architectural change by replacing NATS key-value (KV) watch functionality with NATS stream queues for cluster coordination. The changes introduce a new event-driven coordination system centered around a new cluster_coordinator::events module that handles publishing and consuming coordination events through NATS streams.
The core architectural shift involves:
Event System Implementation: A new events.rs file introduces CoordinatorEvent and MetaEvent structures with comprehensive event publishing/subscribing using NATS streams. This replaces the previous approach of directly watching KV operations with a more controlled event distribution system.
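As a rough illustration of what such an event pair might look like (the names CoordinatorEvent and MetaEvent come from the PR; the fields, subject name, and publish helper below are assumptions, not the actual code in events.rs):

```rust
use serde::{Deserialize, Serialize};

// Hypothetical event shapes; the actual fields in events.rs may differ.
#[derive(Debug, Serialize, Deserialize)]
pub struct MetaEvent {
    pub key: String,            // e.g. a node, schema, or alert key
    pub value: Option<Vec<u8>>, // payload for puts, None for deletes
}

#[derive(Debug, Serialize, Deserialize)]
pub enum CoordinatorEvent {
    Put(MetaEvent),
    Delete(MetaEvent),
}

// Publish one event onto the coordination stream (subject is assumed).
pub async fn publish_event(
    js: &async_nats::jetstream::Context,
    event: &CoordinatorEvent,
) -> Result<(), Box<dyn std::error::Error>> {
    let payload = serde_json::to_vec(event)?;
    // First await sends the message, second awaits the JetStream ack.
    js.publish("coordinator.events", payload.into()).await?.await?;
    Ok(())
}
```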
Queue Interface Enhancement: The Queue trait is extended with RetentionPolicy support, allowing different retention behaviors - Interest-based retention for coordination events (messages kept only while consumers exist) versus Limits-based retention for other use cases.
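A minimal sketch of how the extended trait might look, assuming an anyhow error type and using the create_with_retention_policy name mentioned in the file walkthrough; the actual signatures in the PR may differ:

```rust
use async_trait::async_trait;

// Retention variants mirroring NATS JetStream semantics; the variant
// set shown here is an assumption based on the PR description.
pub enum RetentionPolicy {
    Limits,   // keep messages until stream size/age limits are hit
    Interest, // keep messages only while consumers exist
}

#[async_trait]
pub trait Queue: Send + Sync {
    // Existing callers keep the default Limits behavior.
    async fn create(&self, topic: &str) -> anyhow::Result<()> {
        self.create_with_retention_policy(topic, RetentionPolicy::Limits)
            .await
    }

    // New entry point so coordination streams can opt into Interest retention.
    async fn create_with_retention_policy(
        &self,
        topic: &str,
        policy: RetentionPolicy,
    ) -> anyhow::Result<()>;
}
```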
Database Layer Changes: The NATS database implementation (nats.rs) removes complex KV watching logic and instead emits explicit events through the cluster coordinator when need_watch is enabled during put/delete operations.
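A sketch of that put path, reusing the event types from the earlier snippet; the function shape and the placement of the need_watch flag are assumptions:

```rust
use bytes::Bytes;

// Sketch: write to the KV bucket as before, then publish an explicit
// coordinator event instead of relying on KV watchers to observe the
// change.
pub async fn put(
    kv: &async_nats::jetstream::kv::Store,
    js: &async_nats::jetstream::Context,
    key: &str,
    value: Bytes,
    need_watch: bool,
) -> Result<(), Box<dyn std::error::Error>> {
    kv.put(key, value.clone()).await?;
    if need_watch {
        let event = CoordinatorEvent::Put(MetaEvent {
            key: key.to_string(),
            value: Some(value.to_vec()),
        });
        publish_event(js, &event).await?;
    }
    Ok(())
}
```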
Key Centralization: Multiple modules extract hardcoded path strings into constants (NODES_KEY, SCHEMA_KEY, ALERT_WATCHER_PREFIX, OFGA_KEY_PREFIX) improving maintainability and preparing for the new coordination system.
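Illustratively, the centralization amounts to something like the following (the constant names are from the PR; the literal path values below are placeholders, not the exact strings):

```rust
// Shared key prefixes extracted from previously hardcoded strings.
pub const NODES_KEY: &str = "/nodes/";
pub const SCHEMA_KEY: &str = "/schema/";
pub const ALERT_WATCHER_PREFIX: &str = "/alerts/";
pub const OFGA_KEY_PREFIX: &str = "/ofga/";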
The new architecture maintains the same API surface (watch, put_event, delete_event) while switching from passive KV observation to active stream-based event distribution. This provides better durability, ordering guarantees, and acknowledgment semantics for cluster coordination operations. The system uses Interest retention policy for coordination streams, ensuring events are only kept while there are active consumers, which is appropriate for coordination events that don't require long-term storage.
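A sketch of the consuming side under those semantics, using the async-nats JetStream API with an Interest-retention stream and explicit acks; the stream, subject, and consumer names are assumptions, and CoordinatorEvent is the shape sketched above:

```rust
use async_nats::jetstream::{self, consumer, stream};
use futures::StreamExt;

pub async fn watch_events(js: jetstream::Context) -> Result<(), Box<dyn std::error::Error>> {
    // Interest retention: messages are dropped once all consumers ack them.
    let coord_stream = js
        .get_or_create_stream(stream::Config {
            name: "coordinator_events".to_string(),
            subjects: vec!["coordinator.events".to_string()],
            retention: stream::RetentionPolicy::Interest,
            ..Default::default()
        })
        .await?;

    let consumer = coord_stream
        .create_consumer(consumer::pull::Config {
            durable_name: Some("coordinator-watcher".to_string()),
            ..Default::default()
        })
        .await?;

    let mut messages = consumer.messages().await?;
    while let Some(msg) = messages.next().await {
        let msg = msg?;
        let event: CoordinatorEvent = serde_json::from_slice(&msg.payload)?;
        // Dispatch to the watcher registered for this key prefix here.
        log::debug!("coordinator event: {event:?}");
        msg.ack().await?; // ack only after dispatch, for at-least-once delivery
    }
    Ok(())
}
```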
Confidence score: 3/5
- This PR introduces complex architectural changes with potential reliability concerns including task panic handling, message acknowledgment on failures, and global state management issues
- Score reflects significant complexity in the new event coordination system with multiple potential failure points including Arc reference handling and task spawning without error recovery
- Pay close attention to src/infra/src/cluster_coordinator/events.rs and src/infra/src/db/nats.rs for potential race conditions and error handling issues
Context used:
- Avoid using expect with potentially failing operations; instead, handle the None case to prevent panics.
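For instance, the guideline distinguishes between these two patterns (the registry shape here is hypothetical):

```rust
use std::collections::HashMap;

struct Watcher; // stand-in for whatever the registry actually stores
impl Watcher {
    fn notify(&self) {}
}

fn dispatch(registry: &HashMap<String, Watcher>, key: &str) {
    // Flagged pattern: panics the whole consumer task when the key is absent.
    // let w = registry.get(key).expect("watcher must exist");

    // Handling the None case keeps the event loop alive instead:
    match registry.get(key) {
        Some(w) => w.notify(),
        None => eprintln!("no watcher registered for {key}, dropping event"),
    }
}
```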
13 files reviewed, 5 comments
PR Code Suggestions ✨
Explore these optional code suggestions:
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: Hengfei Yang <[email protected]>
User description
Currently we use the NATS KV watch function to listen for various KV events from different nodes. This PR replaces KV watch with a NATS stream queue for event distribution when NATS is used as the cluster coordinator.
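For context, the mechanism being replaced looks roughly like this KV watcher loop (sketched with async-nats; bucket setup is omitted):

```rust
use futures::StreamExt;

pub async fn kv_watch_loop(
    kv: async_nats::jetstream::kv::Store,
) -> Result<(), Box<dyn std::error::Error>> {
    // Passively observe every change in the bucket.
    let mut watch = kv.watch_all().await?;
    while let Some(entry) = watch.next().await {
        let entry = entry?;
        // entry.operation is Put, Delete, or Purge
        println!("kv {:?} on key {}", entry.operation, entry.key);
    }
    Ok(())
}
```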
PR Type
Enhancement
Description
Introduce NATS-based coordinator events stream
Replace KV watch with queue watchers
Add constants for shared key prefixes
Improve enrichment table debug logging
File Walkthrough
13 files
- Add nodes key constant and use in watchers
- Initialize coordinator events during register
- Prefix constant and export key parser; watch with constant
- New coordinator events stream and watcher registry
- Expose events module and tighten Db path
- Publish coordinator events on put/delete; delegate watch
- Add retention policy and API for create
- Support retention policy and robust consumer loop
- Stub create_with_retention_policy method
- Introduce schema key constant and use it
- Use namespaced alerts coordinator functions
- Add OFGA key prefix constant; unify logs
- Add detailed debug logs for fetch path