-
Notifications
You must be signed in to change notification settings - Fork 715
feat: alert dedup #8884
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: alert dedup #8884
Conversation
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨Explore these optional code suggestions:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Overview
Greptile Summary
Implemented comprehensive alert deduplication feature (enterprise-only) with fingerprint-based duplicate detection, time-window tracking, and automatic cleanup.
Key Changes
- Core deduplication logic (
src/service/alerts/deduplication.rs): Fingerprint calculation using SHA-256 hash of alert name, query context, and dimensional fields. Supports auto-detection for SQL (GROUP BY), PromQL (labels), and Custom queries (conditions) - Database schema (
m20251024_000001_add_alert_deduplication.rs): Addedalert_dedup_statetable with proper indexes and foreign keys, plus 3 new columns toalertstable - Alert scheduler integration (
handlers.rs): Deduplication applied before sending notifications, with graceful error handling that falls back to original behavior - Background cleanup job (
alert_manager.rs): Hourly cleanup of dedup state records older than 24 hours - Frontend UI (
DeduplicationConfig.vue): User-friendly configuration interface for fingerprint fields, time windows, and grouping options
Issues Found
- Silent error handling in deduplication state updates: Errors when saving dedup state are ignored using
let _ =, which could lead to inconsistent state or duplicate notifications
Confidence Score: 4/5
- This PR is mostly safe to merge with minor error handling improvements needed
- The implementation is well-architected with proper enterprise feature gating, comprehensive tests, and correct database schema. However, silent error handling in the deduplication state updates (lines 666-678 and 693-705) could lead to inconsistent state where notifications are sent/suppressed without proper dedup tracking. These errors should be logged at minimum.
- src/service/alerts/deduplication.rs requires attention for error handling improvements in the
apply_deduplication_enterprisefunction
Important Files Changed
File Analysis
| Filename | Score | Overview |
|---|---|---|
| src/service/alerts/deduplication.rs | 4/5 | Implemented core deduplication logic with fingerprinting, state management, and enterprise feature gating |
| src/infra/src/table/migration/m20251024_000001_add_alert_deduplication.rs | 5/5 | Added database migration for deduplication with proper indexes and foreign key constraints |
| src/job/alert_manager.rs | 5/5 | Added cleanup job for old deduplication state records with enterprise feature gating |
| src/service/alerts/scheduler/handlers.rs | 4/5 | Integrated deduplication into alert trigger handling with proper error handling |
| src/infra/src/table/alerts/mod.rs | 5/5 | Updated alert database operations to handle deduplication configuration fields |
Sequence Diagram
sequenceDiagram
participant Scheduler as Alert Scheduler
participant Handler as Alert Handler
participant DedupService as Deduplication Service
participant DB as Database (ORM)
participant AlertEngine as Alert Engine
participant Notification as Notification Service
Scheduler->>Handler: handle_alert_triggers(trigger)
Handler->>AlertEngine: Execute alert query
AlertEngine-->>Handler: Return result_rows
alt Deduplication Enabled (Enterprise)
Handler->>DedupService: apply_deduplication(alert, result_rows)
loop For each result row
DedupService->>DedupService: calculate_fingerprint(alert, row)
Note over DedupService: Fingerprint = SHA256(alert_name + query_context + dimensional_fields)
DedupService->>DB: get_dedup_state(fingerprint)
alt State exists and within time window
DedupService->>DB: save_dedup_state(update occurrence_count)
Note over DedupService: Suppress notification
else State expired or doesn't exist
DedupService->>DB: save_dedup_state(new state)
Note over DedupService: Allow notification
end
end
DedupService-->>Handler: Return deduplicated_rows
alt All rows deduplicated
Handler->>DB: update_trigger(timing only)
Note over Handler: Skip notification
end
end
Handler->>Notification: send_notification(deduplicated_rows)
Handler->>DB: update_trigger(complete state)
Note over Scheduler,DB: Background Cleanup Job (Hourly)
Scheduler->>DedupService: cleanup_expired_state()
DedupService->>DB: Delete records older than 24 hours
20 files reviewed, 2 comments
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 4 tests failed | 365 | 339 | 4 | 19 | 3 | 93% | 10m 48s |
Test Failure Analysis
- alerts-ui-operations.spec.js: Multiple timeout issues related to element visibility and clicks
- Alerts UI Operations Alert Module UI Validations and Filters Check: Element 'Alert saved successfully.' not found within timeout.
- Alerts UI Operations Create and Delete Scheduled Alert with SQL Query: Timeout waiting for 'keyboard_arrow_down' button click.
- alerts-import.spec.js: Timeout issue with alert visibility after creation
- Alerts Import/Export Import/Export Alert Functionality: 'Alert saved successfully.' not visible after alert creation.
- alerts-e2e-flow.spec.js: Consistent timeout on alert creation confirmation
- Alerts E2E Flow Alerts E2E Flow - Create, Update, Move, Clone, Delete, Pause, Resume: 'Alert saved successfully.' not visible after alert creation.
Root Cause Analysis
- The code changes in src/.../alerts/deduplication.rs indicate potential issues with saving alert states, which may affect UI feedback.
Recommended Actions
- Investigate the timing and conditions under which 'Alert saved successfully.' appears to ensure it is displayed promptly after alert creation.
- Review the click handling for the 'keyboard_arrow_down' element to confirm it is available and visible before attempting to click.
- Increase timeout settings for critical UI interactions in alerts-ui-operations.spec.js to accommodate slower responses.
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 4 tests failed | 365 | 340 | 4 | 19 | 2 | 93% | 10m 47s |
Test Failure Analysis
- alerts-ui-operations.spec.js: Multiple timeout issues related to element visibility and clicks
- Alerts UI Operations Alert Module UI Validations and Filters Check: Timeout waiting for 'Alert saved successfully.' to be visible.
- Alerts UI Operations Create and Delete Scheduled Alert with SQL Query: Timeout waiting for 'keyboard_arrow_down' to be clickable.
- alerts-import.spec.js: Timeout issue on alert visibility check
- Alerts Import/Export Import/Export Alert Functionality: Timeout waiting for 'Alert saved successfully.' to be visible.
- alerts-e2e-flow.spec.js: Timeout issue on alert visibility check
- Alerts E2E Flow Alerts E2E Flow - Create, Update, Move, Clone, Delete, Pause, Resume: Timeout waiting for 'Alert saved successfully.' to be visible.
Root Cause Analysis
- The code changes in alertsPage.js indicate potential issues with alert success message rendering, leading to timeouts.
Recommended Actions
- Investigate the rendering logic for the success message in alertsPage.js to ensure it appears as expected. 2. Increase the timeout duration for visibility checks in alerts-ui-operations.spec.js if the UI is slow to respond. 3. Ensure that the 'keyboard_arrow_down' element is present and visible before attempting to click in alerts-ui-operations.spec.js.
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 4 tests failed | 365 | 339 | 4 | 19 | 3 | 93% | 10m 46s |
Test Failure Analysis
- alerts-ui-operations.spec.js: Timeout issues related to element visibility and clicks
- Alerts UI Operations Alert Module UI Validations and Filters Check: Element 'Alert saved successfully.' not found within timeout.
- Alerts UI Operations Create and Delete Scheduled Alert with SQL Query: Timeout waiting for 'keyboard_arrow_down' click action.
- alerts-import.spec.js: Visibility timeout for alert success message
- Alerts Import/Export Import/Export Alert Functionality: 'Alert saved successfully.' not visible after alert creation.
- alerts-e2e-flow.spec.js: Visibility timeout for alert success message
- Alerts E2E Flow Alerts E2E Flow - Create, Update, Move, Clone, Delete, Pause, Resume: 'Alert saved successfully.' not found after alert operations.
Root Cause Analysis
- The recent changes in alertsPage.js may have affected the visibility of alert success messages.
Recommended Actions
- Investigate the timing of the alert success message display in alertsPage.js to ensure it appears as expected. 2. Review the click action for 'keyboard_arrow_down' in alertsPage.js to confirm the element is present before interaction. 3. Increase the timeout duration for visibility checks in alertsPage.js to accommodate potential delays.
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 4 tests failed | 365 | 338 | 4 | 19 | 4 | 93% | 10m 48s |
Test Failure Analysis
- alerts-ui-operations.spec.js: Multiple timeout issues related to element visibility and clicks
- Alerts UI Operations Alert Module UI Validations and Filters Check: Element not found for success message visibility.
- Alerts UI Operations Create and Delete Scheduled Alert with SQL Query: Timeout waiting for dropdown click action.
- alerts-import.spec.js: Timeout issues on alert success message visibility
- Alerts Import/Export Import/Export Alert Functionality: Element not found for success message visibility.
- alerts-e2e-flow.spec.js: Timeout issues on alert success message visibility
- Alerts E2E Flow Alerts E2E Flow - Create, Update, Move, Clone, Delete, Pause, Resume: Element not found for success message visibility.
Root Cause Analysis
- The code changes in alerts/deduplication.rs do not directly relate to the timeout issues observed in the UI tests.
Recommended Actions
- Increase the timeout duration for visibility checks in alerts-ui-operations.spec.js to accommodate slower responses.
- Ensure the success message 'Alert saved successfully.' is correctly rendered before the visibility check in alerts-import.spec.js and alerts-e2e-flow.spec.js.
- Verify the presence of the 'keyboard_arrow_down' element in alerts-ui-operations.spec.js before attempting to click.
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 4 tests failed | 365 | 340 | 4 | 19 | 2 | 93% | 10m 47s |
Test Failure Analysis
- alerts-ui-operations.spec.js: Multiple timeout issues related to element visibility and clicks
- Alerts UI Operations Alert Module UI Validations and Filters Check: Element 'Alert saved successfully.' not found within timeout.
- Alerts UI Operations Create and Delete Scheduled Alert with SQL Query: Timeout waiting for 'keyboard_arrow_down' click.
- alerts-import.spec.js: Timeout issue with alert visibility after creation
- Alerts Import/Export Import/Export Alert Functionality: 'Alert saved successfully.' not visible after alert creation.
- alerts-e2e-flow.spec.js: Visibility timeout for alert success message
- Alerts E2E Flow Alerts E2E Flow - Create, Update, Move, Clone, Delete, Pause, Resume: 'Alert saved successfully.' not found after alert creation.
Root Cause Analysis
- The recent changes in migration files do not directly relate to the timeout issues observed in the tests.
Recommended Actions
- Increase the timeout duration in alerts-ui-operations.spec.js to allow more time for elements to appear.
- Verify that the success message 'Alert saved successfully.' is correctly rendered in the DOM after alert creation in both alerts-ui-operations.spec.js and alerts-import.spec.js.
- Ensure that the element 'keyboard_arrow_down' is present and visible before attempting to click it in alerts-ui-operations.spec.js.
Testdino Test Results
|
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 365 | 341 | 0 | 19 | 5 | 93% | 4m 40s |
When super_cluster is enabled but the system is running in local_mode, attempting to emit alert events to the super_cluster queue causes a panic because the queue initialization explicitly rejects local_mode. This fix adds a check for `!config.common.local_mode` alongside the existing `config.super_cluster.enabled` check in all alert event emission functions: - emit_create_event - emit_update_event - emit_delete_event - _emit_delete_all_event This allows alerts to function properly in test environments and local deployments where super_cluster is configured but local_mode is true. Fixes the 500 Internal Server Error in alert creation E2E tests. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
The previous fix incorrectly tried to access local_mode from the o2-enterprise config's Common struct, but local_mode is actually part of the openobserve config. This fix: - Imports config::get_config() alongside the enterprise get_o2_config() - Uses oss_config.common.local_mode for the local mode check - Uses o2_config.super_cluster.enabled for the super cluster check This properly checks both configs to determine if super_cluster events should be emitted.
|
| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 1 test failed | 288 | 267 | 1 | 16 | 4 | 93% | 5m 25s |
Test Failure Analysis
- dashboard.spec.js: Timeout issues during page load state verification
- dashboard UI testcases should update the line chart correctly when used camel case in custom sql query: Page load timed out waiting for network to be idle.
Root Cause Analysis
- The changes in src/.../alerts/alert.rs may have introduced delays affecting the dashboard loading time.
Recommended Actions
- Increase the timeout duration in the test to accommodate longer loading times. 2. Investigate the performance of the SQL query used in the test to ensure it executes efficiently. 3. Optimize the dashboard rendering logic to reduce load times.
PR Type
Enhancement, Tests
Description
Introduce alert deduplication feature
Persist dedup state and migration
API/UI support for dedup config
Scheduled cleanup job added
Diagram Walkthrough
File Walkthrough
3 files
Bump DB schema version to 11Wire enterprise-only dedup migrationGate deduplication module behind enterprise feature15 files
Add optional deduplication to Alert modelExport deduplication moduleExpose deduplication in HTTP Alert model mappingsMap dedup fields to DB and serialize configNew entity for deduplication stateAdd dedup columns to alerts entityRegister alert_dedup_state entityMigration: add dedup columns and state tableAdd scheduled dedup state cleanup jobImplement dedup logic, fingerprinting, cleanupApply deduplication before sending alertsAdd deduplication to Alert interfaceUX: expand/collapse area clickable with hoverNew UI for configuring alert deduplicationIntegrate dedup config into scheduled alerts (enterprise)1 files
New deduplication and grouping config with tests1 files
i18n strings for deduplication UI