@ByteBaker

Implements alert correlation that groups related alerts into incidents based on semantic field matching and temporal proximity, plus deduplication to prevent alert storms.

Core Capabilities:

  • Semantic field groups: Map field variations (hostname/host/node) to canonical dimensions for consistent matching across different data sources
  • Correlation strategies: Match alerts by dimensions (all/any), with temporal fallback for proximity-based grouping
  • Incident lifecycle: Track incidents (open/acknowledged/resolved) with confidence scoring based on match quality
  • Deduplication: Fingerprint-based suppression using alert name, query context, and semantic dimensions
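A minimal sketch of how the all/any dimension strategies and the temporal fallback could fit together. The names `MatchStrategy`, `MatchKind`, `classify`, and the `window_secs` parameter are hypothetical and chosen for illustration; the PR's actual types and matching rules may differ.

```rust
use std::collections::BTreeMap;

/// Hypothetical strategy names mirroring the "all/any" options above.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum MatchStrategy {
    All,
    Any,
}

/// How an alert was related to an existing incident (illustrative only).
#[derive(Debug, PartialEq)]
pub enum MatchKind {
    Dimensions,
    Temporal,
    None,
}

/// Canonical dimension name -> value, e.g. "host" -> "srv01".
pub type Dimensions = BTreeMap<String, String>;

/// Returns true when `alert` matches `incident` on the configured dimensions.
fn dimensions_match(
    strategy: MatchStrategy,
    required: &[&str],
    alert: &Dimensions,
    incident: &Dimensions,
) -> bool {
    // A dimension counts only if both sides carry it with the same value.
    let same = |dim: &&str| match (alert.get(*dim), incident.get(*dim)) {
        (Some(a), Some(b)) => a == b,
        _ => false,
    };
    match strategy {
        MatchStrategy::All => !required.is_empty() && required.iter().all(same),
        MatchStrategy::Any => required.iter().any(same),
    }
}

/// Tries a dimension match first, then falls back to temporal proximity:
/// the alert fired within `window_secs` of the incident's last activity.
pub fn classify(
    strategy: MatchStrategy,
    required: &[&str],
    alert: &Dimensions,
    incident: &Dimensions,
    alert_ts: i64,
    incident_last_ts: i64,
    window_secs: i64,
) -> MatchKind {
    if dimensions_match(strategy, required, alert, incident) {
        MatchKind::Dimensions
    } else if (alert_ts - incident_last_ts).abs() <= window_secs {
        MatchKind::Temporal
    } else {
        MatchKind::None
    }
}
```

In this sketch a dimension match always wins over the temporal fallback, which is one plausible basis for assigning a higher confidence to dimension-based matches.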

API Endpoints:

  • GET/POST/DELETE /{org_id}/alerts/correlation/config
  • GET/POST/DELETE /{org_id}/alerts/deduplication/config
  • GET /{org_id}/alerts/incidents (list with status filter)
  • GET /{org_id}/alerts/incidents/{id} (details with alert list)
  • PUT /{org_id}/alerts/incidents/{id}/status
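As a rough illustration of the incident endpoints, the sketch below models the three lifecycle states and builds the request paths listed above. The lowercase wire values and the `status` query parameter name are assumptions, not confirmed by the PR; the helper names are hypothetical.

```rust
use std::str::FromStr;

/// Incident states named in the PR description.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IncidentStatus {
    Open,
    Acknowledged,
    Resolved,
}

impl IncidentStatus {
    /// Assumed wire representation (lowercase).
    pub fn as_str(self) -> &'static str {
        match self {
            Self::Open => "open",
            Self::Acknowledged => "acknowledged",
            Self::Resolved => "resolved",
        }
    }
}

impl FromStr for IncidentStatus {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "open" => Ok(Self::Open),
            "acknowledged" => Ok(Self::Acknowledged),
            "resolved" => Ok(Self::Resolved),
            other => Err(format!("unknown incident status: {other}")),
        }
    }
}

/// Path for `PUT /{org_id}/alerts/incidents/{id}/status`.
pub fn incident_status_path(org_id: &str, incident_id: &str) -> String {
    format!("/{org_id}/alerts/incidents/{incident_id}/status")
}

/// Path for `GET /{org_id}/alerts/incidents`, with an optional status filter.
/// The query parameter name `status` is an assumption.
pub fn incident_list_path(org_id: &str, status: Option<IncidentStatus>) -> String {
    match status {
        Some(s) => format!("/{org_id}/alerts/incidents?status={}", s.as_str()),
        None => format!("/{org_id}/alerts/incidents"),
    }
}
```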

Database Schema:

  • alert_incidents: Incident records with correlation metadata
  • alert_incident_alerts: Many-to-many mapping of alerts to incidents
  • alert_dedup_state: Deduplication fingerprint tracking
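To make the many-to-many mapping concrete, here is a small in-memory sketch of the `alert_incident_alerts` join table. The column names (`incident_id`, `alert_id`) and the integer ID type are assumptions for illustration; the real schema is defined by the PR's migration and may differ.

```rust
/// One row of the `alert_incident_alerts` join table.
/// Column names and types here are illustrative assumptions.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct IncidentAlertLink {
    pub incident_id: u64,
    pub alert_id: u64,
}

/// All alert ids attached to one incident.
pub fn alerts_for_incident(links: &[IncidentAlertLink], incident_id: u64) -> Vec<u64> {
    links
        .iter()
        .filter(|l| l.incident_id == incident_id)
        .map(|l| l.alert_id)
        .collect()
}

/// All incident ids that a given alert belongs to.
pub fn incidents_for_alert(links: &[IncidentAlertLink], alert_id: u64) -> Vec<u64> {
    links
        .iter()
        .filter(|l| l.alert_id == alert_id)
        .map(|l| l.incident_id)
        .collect()
}
```

Because the relation lives in its own table, one alert can appear in several incidents and one incident can hold several alerts without duplicating either record.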

Implementation:

  • Business logic: Pure algorithms for matching, classification, fingerprinting
  • Service layer: Orchestrates DB operations with algorithm delegation
  • HTTP handlers: Feature-gated dual implementations for OSS/enterprise builds
  • Config types: Shared data structures with validation in config crate
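The "feature-gated dual implementations" pattern can be sketched with Cargo feature flags: two functions share one signature and exactly one is compiled. The feature name `enterprise`, the `HandlerOutcome` type, and the message text are assumptions for illustration, and the enterprise body is a stub rather than the PR's handler.

```rust
/// Outcome of a handler call, kept as a plain enum so the sketch has no
/// web framework dependency.
#[derive(Debug, PartialEq)]
pub enum HandlerOutcome {
    Success(String),
    NotSupported(&'static str),
}

/// Enterprise build: would delegate to the service layer (stubbed here).
#[cfg(feature = "enterprise")]
pub fn list_incidents(org_id: &str) -> HandlerOutcome {
    HandlerOutcome::Success(format!("incidents for {org_id}"))
}

/// OSS build: same signature, but the feature is reported as unavailable.
#[cfg(not(feature = "enterprise"))]
pub fn list_incidents(_org_id: &str) -> HandlerOutcome {
    HandlerOutcome::NotSupported("incidents are not available in this build")
}
```

Callers reference `list_incidents` the same way in both builds, so routing code does not need its own feature checks.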

@github-actions

Failed to generate code suggestions for PR

@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: 22c4354

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 371 | 345 | 0 | 22 | 4 | 93% | 8m 38s |


@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: ce94bd9

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 2 tests failed | 371 | 344 | 2 | 22 | 3 | 93% | 8m 6s |

Test Failure Analysis

  1. pipeline-core.spec.js: Failures due to timeout waiting for button click
    1. Core Pipeline Tests should add source, condition & destination node and then delete the pipeline: Timeout waiting for 'Explore' button click.
    2. Core Pipeline Tests should add source & destination node and then delete the pipeline: Timeout waiting for 'Explore' button click.

Root Cause Analysis

  • The failures are likely related to the recent changes in the UI that may have affected the button's visibility or availability.

Recommended Actions

  1. Investigate the UI changes in pipeline-core.spec.js to ensure the 'Explore' button is present and interactable.
  2. Increase the timeout duration for button clicks in the tests if the UI is slow to respond.
  3. Add checks to confirm the button's visibility before attempting to click it.


@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: 30d5562

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 2 tests failed | 371 | 343 | 2 | 22 | 4 | 92% | 8m 5s |

Test Failure Analysis

  1. pipeline-core.spec.js: Tests failing due to timeout errors while clicking buttons
    1. Core Pipeline Tests should add source, condition & destination node and then delete the pipeline: Timeout while waiting for 'Explore' button click.
    2. Core Pipeline Tests should add source & destination node and then delete the pipeline: Timeout while waiting for 'Explore' button click.

Root Cause Analysis

  • The failures are likely related to recent changes in the UI that may have affected element visibility or loading times.

Recommended Actions

  1. Increase the timeout duration for button clicks in pipeline-core.spec.js.
  2. Ensure the 'Explore' button is visible and enabled before the click action in pipeline-core.spec.js.
  3. Add explicit wait conditions for the button to be present before attempting to click in pipeline-core.spec.js.


@ByteBaker force-pushed the feat/correlation branch 2 times, most recently from ad3194a to 1a516ad (November 11, 2025 01:59)
@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: 1a516ad

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 371 | 345 | 0 | 24 | 2 | 93% | 5m 30s |


@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: 379ca47

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 1 test failed | 242 | 220 | 1 | 20 | 1 | 91% | 6m 56s |

Test Failure Analysis

  1. logsqueries.spec.js: Timeout issues while interacting with UI elements
    1. Logs Queries testcases should redirect to logs after clicking on stream explorer via stream page: Timeout waiting for locator '[data-test="logs-search-bar-delete-streamslogpagtxtg5-saved-view-btn"]'.

Root Cause Analysis

  • The timeout errors are likely related to recent changes in the logs page interaction logic in logsPage.js.

Recommended Actions

  1. Investigate the visibility and loading time of the element '[data-test="logs-search-bar-delete-streamslogpagtxtg5-saved-view-btn"]' in logsPage.js.
  2. Increase the timeout duration for the click action in the clickDeleteSavedViewButton method.
  3. Ensure that the element is present and visible before attempting to click.


@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: bc6ded0

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| All tests passed | 371 | 344 | 0 | 24 | 3 | 93% | 5m 32s |


@testdino-playwright-reporter

⚠️ Test Run Unstable


Author: ByteBaker | Branch: feat/correlation | Commit: e2edbf3

Testdino Test Results

| Status | Total | Passed | Failed | Skipped | Flaky | Pass Rate | Duration |
|---|---|---|---|---|---|---|---|
| 1 test failed | 285 | 260 | 1 | 21 | 3 | 91% | 3m 26s |

Test Failure Analysis

  1. changeOrg.spec.js: Locator issues causing strict mode violations
    1. Change Organisation Alerts Page default validation: Locator resolved to multiple elements, causing click failure.

Root Cause Analysis

  • The recent changes in AlertList.vue introduced new elements that conflict with existing locators.

Recommended Actions

  1. Update the locator in HomePage.clickDefaultOrg to be more specific to avoid ambiguity.
  2. Consider using a different selector method that targets the intended element directly.
  3. Review the changes in AlertList.vue to ensure no unintended interactions with existing dropdowns.


@ByteBaker force-pushed the feat/correlation branch 6 times, most recently from 9c7a11a to 2f624cd (November 19, 2025 10:23)
…eduplication

This commit addresses CI failures from GitHub Actions runs:
- Fixed 5 clippy uninlined_format_args warnings in Rust code
- Added 140+ comprehensive unit tests for new Vue components

Backend fixes:
- correlation.rs: Inline format string variables (2 fixes)
- deduplication.rs: Inline format string variables (2 fixes)
- correlation.rs (service): Inline format string variable (1 fix)
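For context, `clippy::uninlined_format_args` flags format strings whose arguments are plain variables that could be captured inline. The functions below are hypothetical stand-ins, not the code touched by this commit; they only show the before and after shape of such a fix.

```rust
/// Before: positional arguments trigger `clippy::uninlined_format_args`.
#[allow(clippy::uninlined_format_args)]
pub fn describe_before(org_id: &str, incident_id: u64) -> String {
    format!("incident {} in org {}", incident_id, org_id)
}

/// After: the variables are captured inline in the format string.
pub fn describe_after(org_id: &str, incident_id: u64) -> String {
    format!("incident {incident_id} in org {org_id}")
}
```

Both functions produce the same string; inline capture only works for plain identifiers, so expressions such as method calls still need positional arguments.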

Frontend tests added (71.96% coverage, up from 71.54%):
- TagInput.spec.ts: 19 tests for tag input component
- SemanticGroupItem.spec.ts: 19 tests for semantic group editing
- IncidentList.spec.ts: 30 tests for incident list display
- DeduplicationConfig.spec.ts: 8 tests for dedup configuration
- OrganizationDeduplicationSettings.spec.ts: 8 tests for org settings
- SemanticFieldGroupsConfig.spec.ts: 61 tests for field group management

The new tests provide solid coverage of core functionality including:
- Component rendering and structure
- User interactions (clicks, inputs, form submissions)
- Data validation and formatting
- State management and prop updates
- Preset loading and configuration
- Edge cases and error handling

Prevents a TypeError when viewing SQL-based alerts, where the conditions field is null rather than an object with a length property.

Separates deduplication configuration into two distinct levels:
- Organization-level: Semantic field groups + default time window (global)
- Per-alert level: Fingerprint fields + time window override (per-alert)
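A minimal sketch of how the two levels could combine at evaluation time: the per-alert override wins when present, otherwise the org default applies, and an alert is suppressed if its fingerprint was already sent inside that window. `OrgLevel`, `PerAlert`, and `DedupState` are simplified hypothetical stand-ins for `OrganizationDeduplicationConfig`, `DeduplicationConfig`, and the dedup state table; the field names are assumptions.

```rust
use std::collections::HashMap;

/// Stand-in for the org-level config (field name is an assumption).
pub struct OrgLevel {
    pub default_window_secs: u64,
}

/// Stand-in for the per-alert config (field names are assumptions).
pub struct PerAlert {
    pub fingerprint_fields: Vec<String>,
    pub window_override_secs: Option<u64>,
}

/// The per-alert override takes precedence over the org default.
pub fn effective_window_secs(org: &OrgLevel, alert: &PerAlert) -> u64 {
    alert.window_override_secs.unwrap_or(org.default_window_secs)
}

/// In-memory stand-in for dedup state: fingerprint -> last time it was sent.
#[derive(Default)]
pub struct DedupState {
    last_sent: HashMap<String, u64>,
}

impl DedupState {
    /// Returns true if the alert should be sent, and records the send time.
    /// Suppressed alerts do not extend the window in this sketch.
    pub fn should_send(&mut self, fingerprint: &str, now: u64, window_secs: u64) -> bool {
        let suppressed = matches!(
            self.last_sent.get(fingerprint),
            Some(&prev) if now.saturating_sub(prev) < window_secs
        );
        if !suppressed {
            self.last_sent.insert(fingerprint.to_string(), now);
        }
        !suppressed
    }
}
```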

Backend Changes:
- Add OrganizationDeduplicationConfig for org-wide semantic groups
- Keep DeduplicationConfig (from main) for per-alert fingerprint fields
- Update API handlers to use OrganizationDeduplicationConfig
- Update service layer for proper type separation
- All type references updated across codebase

Frontend Changes:
- OrganizationDeduplicationSettings.vue: Remove fingerprint fields UI
- SemanticFieldGroupsConfig.vue: Add showFingerprintFields prop
- DeduplicationConfig.vue: Unchanged (per-alert, matches main branch)
- Updated descriptions to clarify config separation

Testing:
- Add test_dedup_correlation.sh for API integration testing
- Add TESTING_DEDUP_CORRELATION.md comprehensive test guide
- Cargo check passes ✅

How it works:
1. Org-level defines semantic groups: {"host": ["host", "hostname", "node"]}
2. Per-alert specifies actual fields: ["hostname", "service_name"]
3. Enterprise module uses semantic groups for reverse lookup/mapping
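The reverse lookup in step 3 can be sketched as an index from each raw field name to its canonical dimension. This is an illustration under assumptions, not the enterprise module's code: the function names are hypothetical, and passing unmapped fields through unchanged is an assumed policy.

```rust
use std::collections::{BTreeMap, HashMap};

/// Builds a reverse index: raw field name -> canonical dimension.
/// If a field appears in several groups, the first group (in key order) wins.
pub fn reverse_index(groups: &BTreeMap<String, Vec<String>>) -> HashMap<String, String> {
    let mut index = HashMap::new();
    for (dimension, fields) in groups {
        for field in fields {
            index
                .entry(field.clone())
                .or_insert_with(|| dimension.clone());
        }
    }
    index
}

/// Maps per-alert fields to canonical dimensions.
/// Fields without a semantic group are kept as-is (an assumed policy).
pub fn canonical_dimensions(index: &HashMap<String, String>, fields: &[&str]) -> Vec<String> {
    fields
        .iter()
        .map(|f| index.get(*f).cloned().unwrap_or_else(|| f.to_string()))
        .collect()
}
```

With the example values above, `["hostname", "service_name"]` would map to `["host", "service_name"]`, since only `hostname` belongs to a semantic group.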

Implements cross-alert deduplication, which suppresses alerts from different
alert rules when they share semantic dimensions with recently fired alerts.

Key Changes:
- Add cross_alert_dedup flag to OrganizationDeduplicationConfig
- Add semantic dimension extraction helpers to org config
- Update deduplication service to fetch and pass org config to enterprise
- Add find_matching_semantic_fingerprints() for cross-alert lookups

Behavior:
- cross_alert_dedup=false (default): Per-alert dedup only (backward compatible)
  * Alert A: fingerprint="alert_A:srv01:api"
  * Alert B: fingerprint="alert_B:srv01:region" → Both sent

- cross_alert_dedup=true (new): Cross-alert semantic dedup
  * Alert A fires: semantic_dims={host:srv01, service:api}
  * Alert B fires 30s later: semantic_dims={host:srv01, region:us-east}
  * Result: Alert B suppressed (shares host=srv01 dimension)

Enterprise Module Requirements:
- Updated calculate_fingerprint() signature with org_config parameter
- Semantic fingerprint format: "dim1=val1,dim2=val2" (no alert ID)
- fingerprint_matches_dimensions() for overlap detection
- See CROSS_ALERT_DEDUP_SPEC.md for full specification
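The two fingerprint shapes described above can be sketched as follows. These helpers are hypothetical and are not the enterprise module's `calculate_fingerprint()` or `fingerprint_matches_dimensions()`; sorting dimensions by name and treating a single shared `dim=value` pair as a match are assumptions drawn from the example in the Behavior section.

```rust
use std::collections::BTreeMap;

/// Canonical dimension name -> value; a BTreeMap keeps the keys sorted.
pub type Dims = BTreeMap<String, String>;

/// Per-alert fingerprint: alert name plus field values, colon-separated,
/// in the shape of "alert_A:srv01:api".
pub fn per_alert_fingerprint(alert_name: &str, values: &[&str]) -> String {
    let mut parts = vec![alert_name];
    parts.extend_from_slice(values);
    parts.join(":")
}

/// Semantic fingerprint in the "dim1=val1,dim2=val2" format (no alert ID).
pub fn semantic_fingerprint(dims: &Dims) -> String {
    dims.iter()
        .map(|(k, v)| format!("{k}={v}"))
        .collect::<Vec<_>>()
        .join(",")
}

/// True when the fingerprint shares at least one dim=value pair with `dims`.
/// Values containing ',' or '=' are not handled by this sketch.
pub fn shares_dimension(fingerprint: &str, dims: &Dims) -> bool {
    fingerprint
        .split(',')
        .filter_map(|pair| pair.split_once('='))
        .any(|(k, v)| dims.get(k).map(String::as_str) == Some(v))
}
```

Because the semantic fingerprint omits the alert ID, two different alert rules that report `host=srv01` can be compared directly, which is what allows Alert B to be suppressed in the example above.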

Benefits:
- Prevents alert storms from related issues across different monitors
- Semantic grouping allows flexible field name variations
- Opt-in feature with backward compatibility

Fixes tests broken by the separation of org-level and per-alert configs.

Changes:
- Rename test_deduplication_config_validation → test_organization_deduplication_config_validation
- Add test_per_alert_deduplication_config_validation for per-alert config
- Split test_deduplication_config_serialization into org and per-alert versions
- Update test_deduplication_config_default to test both config types separately
- All tests now use correct config types (OrganizationDeduplicationConfig vs DeduplicationConfig)

Test results:
- 11 tests in config crate: ✅ all passing
- Tests properly validate both config levels independently

Fixes the correlation service to fetch semantic groups from the org-level
deduplication config instead of per-alert config.

Changes:
- Update scheduler/handlers.rs to fetch semantic groups from org config
- Remove duplicate claim_parser_function block (compilation error)
- Semantic groups now correctly sourced from OrganizationDeduplicationConfig

Behavior:
- Correlation uses org-wide semantic field groups
- Consistent with deduplication's use of org-level groups
- Per-alert config no longer has semantic_field_groups field

ByteBaker added a commit that referenced this pull request on Nov 25, 2025:
Implements alert correlation that groups related alerts into incidents based on semantic field matching and temporal proximity.

**Backend:**
- Add `correlation.rs` config with validation for correlation dimensions and matching strategies
- Add `alert_incidents` and `alert_incident_alerts` database entities with SeaORM
- Add database migration `m20251107_000003_create_alert_correlation_schema`
- Add `correlation.rs` service with transaction-safe incident creation and matching
- Add `incidents.rs` HTTP handlers for incident CRUD operations (6 endpoints)
- Integrate correlation into alert scheduler to auto-correlate on alert firing
- Add 7 correlation metrics for observability (incidents created, alerts matched, confidence distribution, processing duration, MTTR)
- Update `org_config.rs` with correlation config persistence functions
- Update organization settings to include deduplication config in response

**Frontend:**
- Add `IncidentList.vue` component with status filtering and sortable table
- Add `IncidentDetailsDrawer.vue` for viewing incident details and associated alerts
- Add Incidents tab to `AlertList.vue` for accessing incident management UI
- Add 6 incident API methods to `alerts.ts` service (list, get, update status, config CRUD)

**Fixes:**
- Fix `OrganizationSettingResponse` test to include `deduplication_config` field
- Fix metering init call signature (remove extra argument)
- Comment out data retention usage code pending enterprise module update

Migrated from PR #9011, separated from deduplication feature (PR #9209).
@ByteBaker marked this pull request as draft on November 25, 2025 12:36

@ByteBaker

Closing in favour of separate dev track.

@ByteBaker closed this on Dec 2, 2025