Failure to deploy Self-Hosted runners - High Queue Times and canceled jobs #140958

@zxiiro

Description

PyTorch CI is consuming GitHub API calls beyond our rate limit.

Current Status

Status: Resolved

Error looks like

  • GitHub self-hosted runners being terminated unexpectedly.
  • GitHub API Rate Limit reached

Incident timeline (all times Pacific)

  • Began around Friday Nov 15th @ 9 am
  • GitHub notified Monday Nov 18th @ 8:36 am
  • GitHub resolved Monday Nov 18th @ 1:45 pm

User impact

GitHub self-hosted runners may intermittently terminate mid-job, leading to high queue times and canceled jobs.

Root cause

A bug was introduced into the repository-level list runners API which resulted in pagination logic incorrectly being applied twice. Due to this, results were not returned beyond the first page. The change with the bug was intended to be fully feature flagged and disabled, but the offending logic was accidentally added outside of the feature flag block and was missed in reviews. A test case for pagination was missing from this API.
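
To illustrate how applying pagination twice can drop everything past the first page, here is a minimal sketch. It is not GitHub's actual code: the function names (`paginate`, `list_runners_buggy`, `list_runners_fixed`), the offset-based slicing, and the runner names are all hypothetical and assumed only for illustration.

```python
def paginate(items, page, per_page):
    """Return the slice of items for a 1-indexed page (hypothetical helper)."""
    start = (page - 1) * per_page
    return items[start:start + per_page]


def list_runners_buggy(runners, page, per_page):
    """Assumed shape of the bug: the page slice is taken twice."""
    first_pass = paginate(runners, page, per_page)
    # Second application: first_pass holds at most per_page items, so any
    # offset for page >= 2 falls past its end and the result is empty.
    return paginate(first_pass, page, per_page)


def list_runners_fixed(runners, page, per_page):
    """Pagination applied exactly once."""
    return paginate(runners, page, per_page)


# Hypothetical fleet of 250 runners, listed 100 per page.
runners = [f"runner-{i}" for i in range(250)]

page_1_buggy = list_runners_buggy(runners, page=1, per_page=100)  # 100 runners
page_2_buggy = list_runners_buggy(runners, page=2, per_page=100)  # empty list
page_2_fixed = list_runners_fixed(runners, page=2, per_page=100)  # 100 runners
```

Under these assumptions, page 1 looks healthy because slicing from offset 0 twice is a no-op, which is consistent with the bug only showing up for callers that list more than one page of runners. A pagination test that requests a second page would catch this class of error.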

Mitigation

Split the infra load between the Meta fleet and the LF fleet to spread API usage across both accounts.

Prevention/followups

This issue was caused by GitHub rolling out changes to their API. We have monitoring in place that can help us troubleshoot this issue and escalate if it happens again.
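
The issue does not describe the monitoring itself, so the following is only a sketch of one possible check, assuming a token with access to GitHub's REST `GET /rate_limit` endpoint. The helper names (`fetch_core_rate_limit`, `should_escalate`) and the 10% threshold are hypothetical.

```python
import json
import urllib.request

RATE_LIMIT_URL = "https://api.github.com/rate_limit"


def fetch_core_rate_limit(token):
    """Fetch the 'core' rate limit bucket for the given token."""
    request = urllib.request.Request(
        RATE_LIMIT_URL,
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        payload = json.load(response)
    # The response groups limits by resource; 'core' covers most REST calls.
    return payload["resources"]["core"]


def should_escalate(core, min_remaining_fraction=0.1):
    """Return True when the remaining quota drops below the threshold.

    The threshold is an assumed value, not one taken from the incident.
    """
    limit = core["limit"]
    if limit == 0:
        # A zero limit leaves no usable quota; treat it as an alert.
        return True
    return core["remaining"] / limit < min_remaining_fraction
```

A scheduled job could call `should_escalate(fetch_core_rate_limit(token))` for each account and page the on-call when it returns `True`.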

Metadata

Labels

ci: sev (critical failure affecting PyTorch CI)