Rate Limiting Observability Improvements
Problem
There is currently limited visibility into when someone is being rate limited. It's possible to dig into Cloudflare logs, use BigQuery to search HAProxy logs, or use Kibana to look into when and how a customer has been rate limited, but this usually happens only after they have already been rate limited and contacted us. This isn't a good experience for customers, Support, or SREs.
For example: gitlab-org/gitlab#461608
Proposal
We should surface rate limiting metrics in a standardised way (that is ideally accessible by both Support and SREs). These metrics should illustrate rate limiting trends, such as:
- At what layer are most requests being rate limited?
- Has there been a spike in rate limited requests (for example, off the back of an event like the tokening)?
Metrics themselves should not be too high cardinality (e.g. labelled by user), so there should be supplementary links to logging to support further investigation.
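As a rough illustration of the split between low-cardinality metrics and detailed logs, here is a minimal Python sketch using only the standard library. It is not an existing implementation: the names `record_rate_limited`, `rate_limited_total`, and `busiest_layer` are hypothetical, the `application` layer value is an assumption (only Cloudflare and HAProxy are named above), and the `Counter` merely stands in for whatever metrics client would actually be used.

```python
import json
import logging
from collections import Counter
from typing import Optional

# Hypothetical, bounded label values. "application" is an assumption;
# only Cloudflare and HAProxy are named in this issue.
ALLOWED_LAYERS = {"cloudflare", "haproxy", "application"}

# Stand-in for a real metrics counter, keyed by the low-cardinality label only.
rate_limited_total = Counter()

logger = logging.getLogger("rate_limit")


def record_rate_limited(layer: str, user_id: str, path: str) -> None:
    """Count the event by layer; send per-user detail to logs, not metrics."""
    # Collapse unexpected values so the label set stays bounded.
    label = layer if layer in ALLOWED_LAYERS else "unknown"
    rate_limited_total[label] += 1

    # High-cardinality fields (user, path) live in a structured log line,
    # which a dashboard could link to for further investigation.
    logger.info(json.dumps({
        "event": "rate_limited",
        "layer": label,
        "user_id": user_id,
        "path": path,
    }))


def busiest_layer() -> Optional[str]:
    """Return the layer with the most rate limited requests, if any."""
    if not rate_limited_total:
        return None
    return rate_limited_total.most_common(1)[0][0]
```

Keeping the user identifier out of the counter key means the number of metric series is bounded by the number of layers (plus `unknown`), while the log line still carries enough detail to answer which customer was affected and on which path.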
This issue is designed to capture the work involved in improving our rate limiting observability overall, with a big focus on metrics, as part of the larger epic.