PyTorch CI exceeding our GitHub API rate limit
Current Status
Status: Resolved
Error looks like
- GitHub self-hosted runners being terminated unexpectedly.
- GitHub API Rate Limit reached
Incident timeline (all times Pacific)
- Began around Friday Nov 15th @ 9 am
- GitHub notified Monday Nov 18th @ 8:36 am
- GitHub resolved Monday Nov 18th @ 1:45 pm
User impact
GitHub self-hosted runners may intermittently terminate mid-job.
Root cause
A bug was introduced into the repository-level list runners API which resulted in pagination logic incorrectly being applied twice. Due to this, results were not returned beyond the first page. The change with the bug was intended to be fully feature flagged and disabled, but the offending logic was accidentally added outside of the feature flag block and was missed in reviews. A test case for pagination was missing from this API.
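To make the failure mode concrete, here is a minimal sketch of how applying pagination twice drops every page after the first. This is not GitHub's code: the function names (`paginate`, `list_runners_buggy`, `list_runners_fixed`) and the offset-based slicing are assumptions made purely for illustration.

```python
def paginate(items, page, per_page):
    """Return one page of items using 1-indexed, offset-based pagination."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


def list_runners_buggy(runners, page, per_page):
    """Hypothetical handler that applies pagination twice."""
    first_pass = paginate(runners, page, per_page)
    # Second application: for page >= 2 the offset is at or past the end
    # of the already-sliced page, so the result is empty.
    return paginate(first_pass, page, per_page)


def list_runners_fixed(runners, page, per_page):
    """Hypothetical handler that applies pagination exactly once."""
    return paginate(runners, page, per_page)


# 250 made-up runner names, 100 per page.
runners = [f"runner-{i}" for i in range(250)]

page1_buggy = list_runners_buggy(runners, page=1, per_page=100)  # 100 items
page2_buggy = list_runners_buggy(runners, page=2, per_page=100)  # empty list
page2_fixed = list_runners_fixed(runners, page=2, per_page=100)  # 100 items
```

Page 1 is unaffected because slicing the first 100 items twice is a no-op, which may be part of why a bug of this shape can slip through when no test requests a second page.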
Mitigation
Split infra load between Meta fleet and LF fleet to spread the API usage across both accounts.
Prevention/followups
This issue was caused by GitHub rolling out changes to their API. We have monitoring in place that can help us troubleshoot this issue and escalate if it happens again.
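As an illustration of what a rate-limit check could look like, the sketch below polls GitHub's `GET /rate_limit` REST endpoint and flags when the remaining budget falls below a threshold. It does not describe the monitoring actually in place: the helper names, the 10% threshold, and the choice of the `core` resource are assumptions.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def fetch_rate_limit(token):
    """Fetch the current rate-limit status for the given token."""
    request = urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Accept": "application/vnd.github+json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def remaining_fraction(payload, resource="core"):
    """Fraction of the rate-limit budget still available for a resource."""
    bucket = payload["resources"][resource]
    limit = bucket["limit"]
    return bucket["remaining"] / limit if limit else 0.0


def should_escalate(payload, threshold=0.1, resource="core"):
    """Hypothetical alert rule: escalate when less than `threshold` remains."""
    return remaining_fraction(payload, resource) < threshold
```

A caller would pass the result of `fetch_rate_limit(token)` to `should_escalate`. Keeping the decision logic separate from the network call lets the rule be tested against canned payloads.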