feat: use nats queue for cluster coordination (#8413) #8452

Subhra264 · 2025-09-15T14:38:43Z

No description provided.

greptile-apps

Greptile Summary

This PR implements a significant architectural change to replace the existing cluster coordination mechanism with a NATS-based queue system. The changes introduce a new event-driven coordination layer that uses NATS JetStream for distributed messaging and cluster synchronization.

Key changes include:

New Event Coordination System: Added src/infra/src/cluster_coordinator/events.rs which implements a publish-subscribe mechanism using NATS streams. This system allows nodes to publish database events (Put/Delete) and subscribe to changes matching specific key prefixes.
Queue Interface Extensions: Enhanced the Queue trait in src/infra/src/queue/mod.rs with RetentionPolicy enum and create_with_retention_policy method, allowing more granular control over NATS stream retention behavior (Interest vs Limits based).
NATS Implementation Updates: Refactored src/infra/src/queue/nats.rs and src/infra/src/db/nats.rs to integrate with the new cluster coordinator. The NATS database layer now publishes events to the coordinator when need_watch is enabled, and watching is delegated to the centralized event system.
Service Layer Integration: Updated alert management (src/service/db/alerts/alert.rs) and enrichment services to use the new cluster coordinator paths, while the enrichment module adds debug logging for better observability.
Code Consistency Improvements: Extracted hardcoded path strings into constants across multiple modules (NODES_KEY, SCHEMA_KEY, OFGA_KEY_PREFIX, ALERT_WATCHER_PREFIX) to improve maintainability.
Cache Management Refactoring: Removed the old NATS event-driven cache refresh mechanism from src/common/infra/config.rs and simplified NATS client initialization in various components.
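As a rough illustration of the Queue interface extension described in the list above, the sketch below shows a retention-aware create method. Only the names `RetentionPolicy`, `create_with_retention_policy`, and the Interest/Limits distinction come from the summary. The method signature, the synchronous trait, the `create` default, and the `InMemoryQueue` type are assumptions made for illustration and are not taken from the PR.

```rust
use std::collections::HashMap;

/// Retention modes named in the summary; the real enum may differ.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RetentionPolicy {
    /// Keep messages only while some consumer is still interested in them.
    Interest,
    /// Keep messages until the configured stream limits are reached.
    Limits,
}

/// Simplified, synchronous stand-in for the Queue trait (assumed signature).
pub trait Queue {
    fn create_with_retention_policy(
        &mut self,
        topic: &str,
        policy: RetentionPolicy,
    ) -> Result<(), String>;

    /// Hypothetical convenience method: falling back to Limits is an assumption.
    fn create(&mut self, topic: &str) -> Result<(), String> {
        self.create_with_retention_policy(topic, RetentionPolicy::Limits)
    }
}

/// Hypothetical in-memory queue that only records each stream's policy.
#[derive(Default)]
pub struct InMemoryQueue {
    streams: HashMap<String, RetentionPolicy>,
}

impl InMemoryQueue {
    /// Returns the policy a stream was created with, if it exists.
    pub fn policy_of(&self, topic: &str) -> Option<RetentionPolicy> {
        self.streams.get(topic).copied()
    }
}

impl Queue for InMemoryQueue {
    fn create_with_retention_policy(
        &mut self,
        topic: &str,
        policy: RetentionPolicy,
    ) -> Result<(), String> {
        match self.streams.get(topic) {
            // Re-creating a stream with the same policy is treated as a no-op.
            Some(existing) if *existing == policy => Ok(()),
            // Changing the policy of an existing stream is rejected in this sketch.
            Some(existing) => Err(format!(
                "stream {topic:?} already exists with policy {existing:?}"
            )),
            None => {
                self.streams.insert(topic.to_string(), policy);
                Ok(())
            }
        }
    }
}
```

Passing the policy at creation time keeps the choice next to the stream definition, so a caller that needs interest-based retention for coordination events does not have to change how other streams are created.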

The new architecture centralizes event coordination through a dedicated queue system rather than direct NATS callbacks, providing better separation of concerns between data storage and event coordination. This should improve scalability, reliability, and maintainability of the distributed cluster coordination system.
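The centralized coordination idea can be sketched without NATS at all. The example below is a minimal in-process model, assuming a hypothetical `Coordinator` type that fans Put/Delete events out to subscribers by key prefix. It uses standard-library channels in place of JetStream, and the type names, method names, and key strings in the usage are illustrative and do not reflect the contents of `events.rs`.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Database events named in the summary (Put/Delete); the fields are assumed.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Event {
    Put { key: String },
    Delete { key: String },
}

impl Event {
    /// Key the event refers to, regardless of variant.
    fn key(&self) -> &str {
        match self {
            Event::Put { key } | Event::Delete { key } => key.as_str(),
        }
    }
}

/// Hypothetical coordinator: one publish path, many prefix subscribers.
#[derive(Default)]
pub struct Coordinator {
    subscribers: Vec<(String, Sender<Event>)>,
}

impl Coordinator {
    /// Registers interest in every key that starts with `prefix`.
    pub fn subscribe(&mut self, prefix: &str) -> Receiver<Event> {
        let (tx, rx) = channel();
        self.subscribers.push((prefix.to_string(), tx));
        rx
    }

    /// Delivers the event to matching subscribers and returns how many got it.
    /// Subscribers whose receiving end was dropped are removed on the way.
    pub fn publish(&mut self, event: Event) -> usize {
        let mut delivered = 0;
        self.subscribers.retain(|(prefix, tx)| {
            if !event.key().starts_with(prefix.as_str()) {
                return true; // not interested, keep the subscription
            }
            match tx.send(event.clone()) {
                Ok(()) => {
                    delivered += 1;
                    true
                }
                Err(_) => false, // receiver is gone, drop the subscription
            }
        });
        delivered
    }
}
```

In this model the storage layer only calls `publish`, and watchers only hold a `Receiver`, which mirrors the separation between data storage and event coordination that the summary describes.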

Confidence score: 2/5

This PR introduces significant architectural changes with potential breaking issues that require careful review
Score lowered due to critical problems including panic usage, missing cache refresh mechanism, and potential race conditions in the event coordination system
Pay close attention to src/infra/src/cluster_coordinator/events.rs, src/common/infra/config.rs, and src/infra/src/db/nats.rs

Context used:

Context - Avoid using expect with potentially failing operations; instead, handle the None case to prevent panics. (link)
Context - Return an Err from main instead of calling std::process::exit to allow for proper cleanup. (link)
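The two review rules above can be illustrated with a small sketch. It is not code from this PR: `ConfigError`, `lookup`, `run`, and the `nats_url` key are hypothetical names chosen only to show the pattern of turning a `None` into an error and propagating it to the caller instead of panicking or exiting.

```rust
use std::collections::HashMap;
use std::error::Error;
use std::fmt;

/// Hypothetical error type for a missing setting.
#[derive(Debug, PartialEq, Eq)]
pub enum ConfigError {
    Missing(String),
}

impl fmt::Display for ConfigError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            ConfigError::Missing(key) => write!(f, "missing setting: {}", key),
        }
    }
}

impl Error for ConfigError {}

/// Handles the None case explicitly. The panicking alternative would be
/// `settings.get(key).expect("setting must exist")`.
pub fn lookup(settings: &HashMap<String, String>, key: &str) -> Result<String, ConfigError> {
    settings
        .get(key)
        .cloned()
        .ok_or_else(|| ConfigError::Missing(key.to_string()))
}

/// Propagates failures with `?` so the caller decides how to shut down.
pub fn run(settings: &HashMap<String, String>) -> Result<String, Box<dyn Error>> {
    let url = lookup(settings, "nats_url")?;
    Ok(format!("connecting to {}", url))
}
```

A `main` declared as `fn main() -> Result<(), Box<dyn Error>>` could call `run(&settings)?` and return `Ok(())`. Returning the error lets values in scope be dropped normally, whereas `std::process::exit` terminates the process without running destructors.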

14 files reviewed, 6 comments

src/common/infra/config.rs

src/infra/src/db/nats.rs

src/infra/src/cluster_coordinator/events.rs

src/infra/src/queue/nats.rs

Co-authored-by: Hengfei Yang <[email protected]>

feat: use nats queue for cluster coordination (#8413)

10d2b77

Subhra264 requested review from Loaki07, hengfeiyang, oasisk and uddhavdave September 15, 2025 14:38

github-actions bot added the ✏️ Feature label Sep 15, 2025

greptile-apps bot reviewed Sep 15, 2025

src/common/infra/config.rs (outdated, resolved)

src/infra/src/db/nats.rs (resolved)

src/infra/src/db/nats.rs (resolved)

src/infra/src/cluster_coordinator/events.rs (resolved)

hengfeiyang requested changes Sep 15, 2025

src/infra/src/cluster_coordinator/events.rs (resolved)

src/infra/src/queue/nats.rs (outdated, resolved)

fix: improve nats queue with limit policy (#8436)

ac71241

hengfeiyang approved these changes Sep 15, 2025

fix: unit tests

814912e

Subhra264 merged commit 323f022 into branch-v0.14.6-rc9 Sep 16, 2025
47 of 50 checks passed

Subhra264 deleted the nats_stream_watcher_rc9 branch September 16, 2025 04:49

hengfeiyang added a commit that referenced this pull request Sep 16, 2025

feat: use nats queue for cluster coordination (#8413) (#8452)

0a0dc4c

Co-authored-by: Hengfei Yang <[email protected]>

3 participants
