Flaky tests
Introduction
This page describes GitLab’s organizational process for detecting, reporting, and managing flaky tests. For technical guidance on debugging and fixing flaky tests, see Unhealthy Tests (Developer Docs). For quarantine procedures and syntax, see Quarantining Tests (Developer Docs) and Quarantine Process.
A flaky test is an unreliable test that occasionally fails but passes eventually if you retry it enough times. Flaky tests can be a result of brittle tests, unstable test infrastructure, or an unstable application. We should try to identify the cause and remove the instability to improve quality and build trust in test results.
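As a concrete illustration, the hypothetical test below is brittle rather than wrong: it usually passes, but can fail when the machine is under load, which produces exactly the retry-until-green behavior described above. This is a minimal sketch and not taken from the GitLab test suite.

```python
import time
import unittest


class ExpiringTokenTest(unittest.TestCase):
    """Hypothetical example of a brittle, timing-dependent test."""

    def test_token_is_used_before_it_expires(self):
        token_issued_at = time.time()

        # Simulated work; its real duration varies with machine load.
        time.sleep(0.05)

        # Asserting against a tight wall-clock window ties the result to
        # scheduler timing instead of application behavior, so the test
        # passes most of the time but fails sporadically under load.
        self.assertLess(time.time() - token_issued_at, 0.06)


if __name__ == "__main__":
    unittest.main()
```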
Manual flow to detect flaky tests
When a flaky test fails in an MR, the author might follow the flow below:
graph LR
A[Test fails in an MR] --> C{Does the failure look related to the MR?}
C -->|Yes| D[Try to reproduce and fix the test locally]
C -->|No| E{Does a flaky test issue exist?}
E -->|Yes| F[Retry the job and hope that it will pass this time]
E -->|No| G[Wonder if this is flaky and retry the job]
Why is flaky test management important?
- Flaky tests undermine trust in test results, leading engineers to disregard genuine failures as flaky.
- Manual retries to get flaky tests to pass, and the effort needed to investigate whether a failure is flaky or genuine, are a significant waste of time.
- Managing flaky tests by quickly fixing the cause or removing the test from the test suite allows test time and costs to be used where they add value.
Urgency Tiers and Response Timelines
Flaky tests are categorized by urgency based on their impact on pipeline stability:
- 🔴 Critical: 48 hours - Tests blocking critical workflows or affecting multiple teams
- 🟠 High: 1 week - Tests with significant pipeline impact
- 🟡 Medium: 2 weeks - Tests with moderate impact
These timelines guide when a test should be quarantined if it cannot be fixed. For quarantine procedures and technical implementation, see Quarantine Process and Quarantining Tests (Developer Docs).
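For teams that want to track these deadlines in their own tooling, the sketch below simply encodes the tiers above as data. The dictionary and function names are illustrative assumptions for this page, not part of any GitLab automation.

```python
from datetime import datetime, timedelta

# Illustrative mapping of the urgency tiers above to response windows.
# Names and structure are assumptions for this sketch only.
URGENCY_TIERS = {
    "critical": timedelta(hours=48),  # blocking critical workflows or multiple teams
    "high": timedelta(weeks=1),       # significant pipeline impact
    "medium": timedelta(weeks=2),     # moderate impact
}


def fix_or_quarantine_by(detected_at: datetime, urgency: str) -> datetime:
    """Date by which the flaky test should be fixed or quarantined."""
    return detected_at + URGENCY_TIERS[urgency]
```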
Automated Reporting of Top Flaky Test Files
GitLab uses custom tooling to automatically identify and report the most impactful flaky test files that block CI/CD pipelines. The ci-alerts automation creates issues for test files causing repeated pipeline failures, which are then triaged and assigned to Engineering Managers for resolution.
View all top flaky test file issues: automation:top-flaky-test-file label
How It Works
The ci-alerts system analyzes test failure data from ClickHouse to identify test files with the highest impact on pipeline stability. It classifies test files into three categories:
- Flaky: Failures spread over 3+ days, still actively failing (≤3 days since last failure)
- Master-broken: High-volume incidents (≥30 in 12h with 40%+ concentration OR 60+ absolute)
- Unclear: Failures that don’t meet either of the above criteria
For detailed information about the classification algorithm and configuration, see the ci-alerts flaky tests reporting documentation.
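To make the criteria above concrete, here is a minimal sketch of the classification heuristic. It assumes a plain list of failure timestamps per test file; the actual algorithm, thresholds, and data model are defined in the ci-alerts flaky tests reporting documentation, so treat this purely as an illustration.

```python
from datetime import datetime, timedelta


def classify_test_file(failure_times: list[datetime], now: datetime) -> str:
    """Rough sketch of the flaky / master-broken / unclear split described above."""
    if not failure_times:
        return "unclear"

    days_with_failures = {t.date() for t in failure_times}
    days_since_last_failure = (now - max(failure_times)).days

    # Flaky: failures spread over 3+ distinct days and still failing recently.
    if len(days_with_failures) >= 3 and days_since_last_failure <= 3:
        return "flaky"

    # Master-broken: a high-volume incident, i.e. >=30 failures within a
    # 12-hour window holding 40%+ of all failures, or 60+ failures outright.
    window = timedelta(hours=12)
    total = len(failure_times)
    for start in failure_times:
        in_window = sum(1 for t in failure_times if start <= t < start + window)
        if (in_window >= 30 and in_window / total >= 0.4) or in_window >= 60:
            return "master-broken"

    return "unclear"


# Example: a file that failed on four different days, most recently yesterday,
# would be classified as "flaky".
```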
Triage Process
Issues created by the automation are triaged by the Development Analytics team and dispatched to the responsible Engineering Managers. The complete triage workflow is documented in the ci-alerts TRIAGE.md.
Key steps:
- Initial triage to verify genuine flakiness
- Dispatch to responsible product group with EM mention
- 14-day follow-up with quarantine option if no action taken
For Engineering Managers
If you’ve been assigned a top flaky test file issue:
- Review the issue description - Contains impact metrics, Grafana dashboard link, and recommended actions
- Assess the situation - Use the Grafana dashboard to understand failure patterns
- Take action within 14 days:
  - Fix the root cause, or
  - Merge the provided quarantine MR to unblock pipelines while investigating, or
  - Request more time if actively working on a fix
For guidance on quarantining tests, see the Quarantine Process and Quarantining Tests (Developer Docs).
What About Other Flaky Test Reporting Systems?
You may have noticed older flaky test issues with flakiness::* labels (for example, flakiness::1 or flakiness::2). These come from a previous reporting system that GitLab still runs in parallel with the top flaky test file automation.
Key differences:
| Old System (flakiness::* labels) | New System (automation:top-flaky-test-file) |
|---|---|
| Only shows failures on master branch | Includes both MRs and master |
| Only shows tests that failed then succeeded in the same job (via retry) | Shows tests that actually blocked pipelines |
| Doesn’t impact pipeline stability | Directly measures pipeline blocking impact |
The old system has a significant blind spot: because it only counts failures that recovered on retry within the same job, it misses the tests that actually block pipelines.
Recommendation: Prioritize issues with the automation:top-flaky-test-file label, as these represent tests actively blocking engineers from shipping features and fixes.
The two systems will eventually be merged into a unified reporting mechanism.
Technical Details
For developers and automation maintainers:
- Source code: gitlab-org/quality/analytics/ci-alerts
- Classification algorithm: doc/flaky_tests_reporting.md
- Triage workflow: TRIAGE.md
- Schedule: Runs weekly on Sundays at 10:00 UTC
Getting Help
For questions or support:
- Slack: #g_development_analytics
Additional resources
- Detailed Quarantine Process - Overall process for quarantined tests at GitLab
- Unhealthy Tests (Developer Docs) - Technical reference for debugging and reproducing flaky tests
- Quarantining Tests (Developer Docs) - Technical reference for quarantine syntax and implementation
- Flaky tests dashboard
