
Conversation

@Subhra264 (Contributor)

No description provided.


@greptile-apps bot left a comment


Greptile Summary

This PR implements a significant architectural change, replacing the existing cluster coordination mechanism with a NATS-based queue system. It introduces a new event-driven coordination layer that uses NATS JetStream for distributed messaging and cluster synchronization.

Key changes include:

  1. New Event Coordination System: Added src/infra/src/cluster_coordinator/events.rs, which implements a publish-subscribe mechanism over NATS streams. Nodes publish database events (Put/Delete) and subscribe to changes matching specific key prefixes (see the first sketch after this list).

  2. Queue Interface Extensions: Enhanced the Queue trait in src/infra/src/queue/mod.rs with a RetentionPolicy enum and a create_with_retention_policy method, allowing more granular control over NATS stream retention (interest-based vs. limits-based); see the second sketch after this list.

  3. NATS Implementation Updates: Refactored src/infra/src/queue/nats.rs and src/infra/src/db/nats.rs to integrate with the new cluster coordinator. The NATS database layer now publishes events to the coordinator when need_watch is enabled, and watching is delegated to the centralized event system (also illustrated in the first sketch below).

  4. Service Layer Integration: Updated alert management (src/service/db/alerts/alert.rs) and enrichment services to use the new cluster coordinator paths, while the enrichment module adds debug logging for better observability.

  5. Code Consistency Improvements: Extracted hardcoded path strings into constants across multiple modules (NODES_KEY, SCHEMA_KEY, OFGA_KEY_PREFIX, ALERT_WATCHER_PREFIX) to improve maintainability.

  6. Cache Management Refactoring: Removed the old NATS event-driven cache refresh mechanism from src/common/infra/config.rs and simplified NATS client initialization in various components.
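To ground items 1 and 3, here is a minimal, self-contained sketch of the publish-subscribe pattern the summary describes. It is an illustration under stated assumptions, not the PR's actual code: ClusterEvent, EventCoordinator, watch_prefix, and Db are hypothetical names, and an in-memory tokio broadcast channel stands in for the NATS JetStream stream.

```rust
use tokio::sync::broadcast;

/// Database events published to the coordinator. The Put/Delete shape is
/// an assumption based on the summary above.
#[derive(Clone, Debug)]
pub enum ClusterEvent {
    Put { key: String, value: Vec<u8> },
    Delete { key: String },
}

impl ClusterEvent {
    fn key(&self) -> &str {
        match self {
            ClusterEvent::Put { key, .. } | ClusterEvent::Delete { key } => key,
        }
    }
}

/// Hypothetical coordinator; the broadcast channel plays the role that the
/// NATS JetStream stream plays in the real implementation.
pub struct EventCoordinator {
    tx: broadcast::Sender<ClusterEvent>,
}

impl EventCoordinator {
    pub fn new(capacity: usize) -> Self {
        let (tx, _) = broadcast::channel(capacity);
        Self { tx }
    }

    /// Publish a database event to all current subscribers.
    pub fn publish(&self, event: ClusterEvent) {
        // A send error only means there are no active subscribers.
        let _ = self.tx.send(event);
    }

    /// Subscribe to the full event stream; callers filter by key prefix.
    pub fn subscribe(&self) -> broadcast::Receiver<ClusterEvent> {
        self.tx.subscribe()
    }
}

/// Example watcher loop: react only to changes under a key prefix.
pub async fn watch_prefix(coordinator: &EventCoordinator, prefix: &str) {
    let mut rx = coordinator.subscribe();
    while let Ok(event) = rx.recv().await {
        if event.key().starts_with(prefix) {
            println!("change under {prefix}: {event:?}");
        }
    }
}

/// Sketch of the put path from item 3: persist first, then publish to the
/// coordinator when need_watch is set. Struct and method names are assumptions.
pub struct Db {
    coordinator: EventCoordinator,
}

impl Db {
    pub async fn put(&self, key: &str, value: Vec<u8>, need_watch: bool) {
        // ... write the value to the underlying NATS-backed store here ...
        if need_watch {
            self.coordinator.publish(ClusterEvent::Put {
                key: key.to_string(),
                value,
            });
        }
    }
}
```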
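Item 2 might look roughly like the following. The Interest/Limits variants come straight from the summary; the trait shape, the anyhow error type, and the default implementation of create are assumptions.

```rust
use async_trait::async_trait;

/// Retention behavior for a NATS stream, per the summary's description.
#[derive(Clone, Copy, Debug, Default)]
pub enum RetentionPolicy {
    /// Keep messages only while at least one consumer needs them.
    Interest,
    /// Keep messages until stream limits (age/size/count) are reached.
    #[default]
    Limits,
}

#[async_trait]
pub trait Queue {
    /// Existing creation entry point, assumed here to default to
    /// limits-based retention so current callers keep their behavior.
    async fn create(&self, topic: &str) -> Result<(), anyhow::Error> {
        self.create_with_retention_policy(topic, RetentionPolicy::Limits)
            .await
    }

    /// New entry point that lets callers pick the retention policy.
    async fn create_with_retention_policy(
        &self,
        topic: &str,
        policy: RetentionPolicy,
    ) -> Result<(), anyhow::Error>;
}
```

Routing create through create_with_retention_policy keeps a single code path while making interest-based streams an explicit opt-in.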

The new architecture centralizes event coordination through a dedicated queue system rather than direct NATS callbacks, providing better separation of concerns between data storage and event coordination. This should improve scalability, reliability, and maintainability of the distributed cluster coordination system.

Confidence score: 2/5

  • This PR introduces significant architectural changes with potential breaking issues that require careful review
  • Score lowered due to critical problems including panic usage, missing cache refresh mechanism, and potential race conditions in the event coordination system
  • Pay close attention to src/infra/src/cluster_coordinator/events.rs, src/common/infra/config.rs, and src/infra/src/db/nats.rs

Context used:

Context - Avoid using expect with potentially failing operations; instead, handle the None case to prevent panics. (link)
Context - Return an Err from main instead of calling std::process::exit to allow for proper cleanup. (link)
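A generic illustration of both rules (an assumed example, not code from this PR): handle the None case instead of calling expect, and return Err from main rather than calling std::process::exit so cleanup still runs.

```rust
use std::collections::HashMap;

fn lookup(map: &HashMap<String, String>, key: &str) -> anyhow::Result<String> {
    // Surface the missing-key case as a recoverable error instead of
    // panicking via `.expect(...)`.
    map.get(key)
        .cloned()
        .ok_or_else(|| anyhow::anyhow!("missing key: {key}"))
}

fn main() -> anyhow::Result<()> {
    let map = HashMap::from([("a".to_string(), "1".to_string())]);
    // Returning Err from main sets a nonzero exit code while still running
    // destructors, unlike std::process::exit, which skips them.
    println!("{}", lookup(&map, "a")?);
    Ok(())
}
```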

14 files reviewed, 6 comments


@Subhra264 merged commit 323f022 into branch-v0.14.6-rc9 on Sep 16, 2025.
47 of 50 checks passed.
@Subhra264 deleted the nats_stream_watcher_rc9 branch on September 16, 2025 at 04:49.
hengfeiyang added a commit that referenced this pull request on Sep 16, 2025.