-
Notifications
You must be signed in to change notification settings - Fork 715
feat: use nats queue for cluster coordination (#8413) #8452
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR implements a significant architectural change to replace the existing cluster coordination mechanism with a NATS-based queue system. The changes introduce a new event-driven coordination layer that uses NATS JetStream for distributed messaging and cluster synchronization.
Key changes include:
-
New Event Coordination System: Added
src/infra/src/cluster_coordinator/events.rswhich implements a publish-subscribe mechanism using NATS streams. This system allows nodes to publish database events (Put/Delete) and subscribe to changes matching specific key prefixes. -
Queue Interface Extensions: Enhanced the
Queuetrait insrc/infra/src/queue/mod.rswithRetentionPolicyenum andcreate_with_retention_policymethod, allowing more granular control over NATS stream retention behavior (Interest vs Limits based). -
NATS Implementation Updates: Refactored
src/infra/src/queue/nats.rsandsrc/infra/src/db/nats.rsto integrate with the new cluster coordinator. The NATS database layer now publishes events to the coordinator whenneed_watchis enabled, and watching is delegated to the centralized event system. -
Service Layer Integration: Updated alert management (
src/service/db/alerts/alert.rs) and enrichment services to use the new cluster coordinator paths, while the enrichment module adds debug logging for better observability. -
Code Consistency Improvements: Extracted hardcoded path strings into constants across multiple modules (
NODES_KEY,SCHEMA_KEY,OFGA_KEY_PREFIX,ALERT_WATCHER_PREFIX) to improve maintainability. -
Cache Management Refactoring: Removed the old NATS event-driven cache refresh mechanism from
src/common/infra/config.rsand simplified NATS client initialization in various components.
The new architecture centralizes event coordination through a dedicated queue system rather than direct NATS callbacks, providing better separation of concerns between data storage and event coordination. This should improve scalability, reliability, and maintainability of the distributed cluster coordination system.
Confidence score: 2/5
- This PR introduces significant architectural changes with potential breaking issues that require careful review
- Score lowered due to critical problems including panic usage, missing cache refresh mechanism, and potential race conditions in the event coordination system
- Pay close attention to
src/infra/src/cluster_coordinator/events.rs,src/common/infra/config.rs, andsrc/infra/src/db/nats.rs
Context used:
Context - Avoid using expect with potentially failing operations; instead, handle the None case to prevent panics. (link)
Context - Return an Err from main instead of calling std::process::exit to allow for proper cleanup. (link)
14 files reviewed, 6 comments
Co-authored-by: Hengfei Yang <[email protected]>
No description provided.