Flaky tests
Introduction
This page describes GitLab’s organizational process for detecting, reporting, and managing flaky tests. For technical guidance on debugging and fixing flaky tests, see Unhealthy Tests (Developer Docs). For quarantine procedures and syntax, see Quarantining Tests (Developer Docs) and Quarantine Process.
A flaky test is an unreliable test that occasionally fails but passes eventually if you retry it enough times. Flaky tests can be a result of brittle tests, unstable test infrastructure, or an unstable application. We should try to identify the cause and remove the instability to improve quality and build trust in test results.
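As a concrete illustration, the hypothetical test below is brittle rather than wrong: it usually passes, but can fail when the machine is under load, which produces exactly the retry-until-green behavior described above. This is a minimal sketch and not taken from the GitLab test suite.

```python
import time
import unittest


class ExpiringTokenTest(unittest.TestCase):
    """Hypothetical example of a brittle, timing-dependent test."""

    def test_token_is_used_before_it_expires(self):
        token_issued_at = time.time()

        # Simulated work; its real duration varies with machine load.
        time.sleep(0.05)

        # Asserting against a tight wall-clock window ties the result to
        # scheduler timing instead of application behavior, so the test
        # passes most of the time but fails sporadically under load.
        self.assertLess(time.time() - token_issued_at, 0.06)


if __name__ == "__main__":
    unittest.main()
```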
Manual flow to detect flaky tests
When a flaky test fails in an MR, the author might follow the flow below:
graph LR
A[Test fails in an MR] --> C{Does the failure look related to the MR?}
C -->|Yes| D[Try to reproduce and fix the test locally]
C -->|No| E{Does a flaky test issue exist?}
E -->|Yes| F[Retry the job and hope that it will pass this time]
E -->|No| G[Wonder if this is flaky and retry the job]
Why is flaky test management important?
- Flaky tests undermine trust in test results, leading engineers to disregard genuine failures as flaky.
- Manual retries to get flaky tests to pass, and the effort needed to investigate whether a failure is flaky or genuine, are a significant waste of time.
- Managing flaky tests by quickly fixing the cause or removing the test from the test suite allows test time and costs to be used where they add value.
Urgency Tiers and Response Timelines
Flaky tests are categorized by urgency based on their impact on pipeline stability:
- 🔴 Critical: 48 hours - Tests blocking critical workflows or affecting multiple teams
- 🟠 High: 1 week - Tests with significant pipeline impact
- 🟡 Medium: 2 weeks - Tests with moderate impact
These timelines guide when a test should be quarantined if it cannot be fixed. For quarantine procedures and technical implementation, see Quarantine Process and Quarantining Tests (Developer Docs).
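For teams that want to track these deadlines in their own tooling, the sketch below simply encodes the tiers above as data. The dictionary and function names are illustrative assumptions for this page, not part of any GitLab automation.

```python
from datetime import datetime, timedelta

# Illustrative mapping of the urgency tiers above to response windows.
# Names and structure are assumptions for this sketch only.
URGENCY_TIERS = {
    "critical": timedelta(hours=48),  # blocking critical workflows or multiple teams
    "high": timedelta(weeks=1),       # significant pipeline impact
    "medium": timedelta(weeks=2),     # moderate impact
}


def fix_or_quarantine_by(detected_at: datetime, urgency: str) -> datetime:
    """Date by which the flaky test should be fixed or quarantined."""
    return detected_at + URGENCY_TIERS[urgency]
```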
Automated Reporting of Top Flaky Test Files
GitLab uses custom tooling to automatically identify and report the most impactful flaky test files that block CI/CD pipelines. The ci-alerts automation creates issues for test files causing repeated pipeline failures, which are then triaged and assigned to Engineering Managers for resolution.
View all top flaky test file issues: automation:top-flaky-test-file label
How It Works
The ci-alerts system analyzes test failure data from ClickHouse to identify test files with the highest impact on pipeline stability. It classifies test files into three categories:
- Flaky: Failures spread over 3+ days, still actively failing (≤3 days since last failure)
- Master-broken: High-volume incidents (≥30 in 12h with 40%+ concentration OR 60+ absolute)
- Unclear: Failures that don’t meet either of the above criteria
For detailed information about the classification algorithm and configuration, see the ci-alerts flaky tests reporting documentation.
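To make the criteria above concrete, here is a minimal sketch of the classification heuristic. It assumes a plain list of failure timestamps per test file; the actual algorithm, thresholds, and data model are defined in the ci-alerts flaky tests reporting documentation, so treat this purely as an illustration.

```python
from datetime import datetime, timedelta


def classify_test_file(failure_times: list[datetime], now: datetime) -> str:
    """Rough sketch of the flaky / master-broken / unclear split described above."""
    if not failure_times:
        return "unclear"

    days_with_failures = {t.date() for t in failure_times}
    days_since_last_failure = (now - max(failure_times)).days

    # Flaky: failures spread over 3+ distinct days and still failing recently.
    if len(days_with_failures) >= 3 and days_since_last_failure <= 3:
        return "flaky"

    # Master-broken: a high-volume incident, i.e. >=30 failures within a
    # 12-hour window holding 40%+ of all failures, or 60+ failures outright.
    window = timedelta(hours=12)
    total = len(failure_times)
    for start in failure_times:
        in_window = sum(1 for t in failure_times if start <= t < start + window)
        if (in_window >= 30 and in_window / total >= 0.4) or in_window >= 60:
            return "master-broken"

    return "unclear"


# Example: a file that failed on four different days, most recently yesterday,
# would be classified as "flaky".
```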
Triage Process
Issues created by the automation are triaged by the Development Analytics team and dispatched to the responsible Engineering Managers. The complete triage workflow is documented in the ci-alerts TRIAGE.md.
Key steps:
- Initial triage to verify genuine flakiness
- Dispatch to responsible product group with EM mention
- 14-day follow-up with quarantine option if no action taken
For Engineering Managers
If you’ve been assigned a top flaky test file issue:
- Review the issue description - Contains impact metrics, Grafana dashboard link, and recommended actions
- Assess the situation - Use the Grafana dashboard to understand failure patterns
- Take action within 14 days:
  - Fix the root cause, or
  - Merge the provided quarantine MR to unblock pipelines while investigating, or
  - Request more time if actively working on a fix
For guidance on quarantining tests, see the Quarantine Process and Quarantining Tests (Developer Docs).
What About Other Flaky Test Reporting Systems?
You may have noticed older flaky test issues with flakiness::* labels (for example, flakiness::1 or flakiness::2). These come from a previous reporting system that GitLab still runs in parallel with the top flaky test file automation.
Key differences:
| Old System (flakiness::* labels) | New System (automation:top-flaky-test-file) |
|---|---|
| Only shows failures on master branch | Includes both MRs and master |
| Only shows tests that failed then succeeded in the same job (via retry) | Shows tests that actually blocked pipelines |
| Doesn’t impact pipeline stability | Directly measures pipeline blocking impact |
The old system has a significant blind spot: because it only counts failures that recovered on retry within the same job, it misses the tests that actually block pipelines.
Recommendation: Prioritize issues with the automation:top-flaky-test-file label, as these represent tests actively blocking engineers from shipping features and fixes.
The two systems will eventually be merged into a unified reporting mechanism.
Technical Details
For developers and automation maintainers:
- Source code: gitlab-org/quality/analytics/ci-alerts
- Classification algorithm: doc/flaky_tests_reporting.md
- Triage workflow: TRIAGE.md
- Schedule: Runs weekly on Sundays at 10:00 UTC
Getting Help
For questions or support:
- Slack: #g_development_analytics
Additional resources
- Detailed Quarantine Process - Overall process for quarantined tests at GitLab
- Unhealthy Tests (Developer Docs) - Technical reference for debugging and reproducing flaky tests
- Quarantining Tests (Developer Docs) - Technical reference for quarantine syntax and implementation
- Flaky tests dashboard
