Failed EphemeralRunners block launching new pods #3685

@igaskin

Description

Checks

Controller Version

0.8.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. Trigger a `FailedScheduling` event.
2. Wait for 5 failures in pod scheduling.
3. Recover the cluster.
4. New ephemeral runner pods will not be scheduled to meet capacity.

Describe the bug

When EphemeralRunners enter the Failed state, they stay stuck there, which prevents new pods from being launched. This issue has previously been noted in the discussions linked below.

status:
  currentRunners: 17
  failedEphemeralRunners: 16
  pendingEphemeralRunners: 0
  runningEphemeralRunners: 1 

https://github.com/actions/actions-runner-controller/discussions/3300
https://github.com/actions/actions-runner-controller/discussions/3610
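To make the numbers in the status above concrete, here is a minimal sketch of the arithmetic, assuming only pending and running runners can actually pick up jobs. The helper names are hypothetical and are not part of the controller. The target of 10 replicas is borrowed from the Runner Pod Logs line purely for illustration, since the issue does not say which scale set the status belongs to.

```python
def usable_runners(status: dict) -> int:
    """Count runners that can still accept jobs: pending plus running."""
    return (status.get("pendingEphemeralRunners", 0)
            + status.get("runningEphemeralRunners", 0))


def capacity_shortfall(status: dict, desired_replicas: int) -> int:
    """Return how many usable runners are missing relative to the target."""
    return max(0, desired_replicas - usable_runners(status))


# Values copied from the status block in this issue.
status = {
    "currentRunners": 17,
    "failedEphemeralRunners": 16,
    "pendingEphemeralRunners": 0,
    "runningEphemeralRunners": 1,
}

# desired_replicas=10 is an illustrative assumption, see the note above.
print(capacity_shortfall(status, desired_replicas=10))  # prints 9
```

With 16 of the 17 current runners in the Failed state, only one runner is usable even though `currentRunners` is already above the illustrative target.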

Describe the expected behavior

Failed EphemeralRunners should be cleared so that scheduling can be retried.
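As a rough illustration of what "clearing" would have to select, the sketch below picks out failed runners from a JSON listing of EphemeralRunner objects. It assumes each object reports its state in `status.phase` with the value `Failed`; that field name is an assumption here and should be checked against the installed CRD. The function name and the sample objects are hypothetical.

```python
import json


def failed_runner_names(listing: dict) -> list[str]:
    """Return the names of listed objects whose status.phase is "Failed".

    Assumption: the phase lives at status.phase; verify against your CRD.
    """
    names = []
    for item in listing.get("items", []):
        phase = item.get("status", {}).get("phase")
        if phase == "Failed":
            names.append(item["metadata"]["name"])
    return names


# Hypothetical sample data shaped like a Kubernetes list response.
sample = json.loads("""
{
  "items": [
    {"metadata": {"name": "runner-a"}, "status": {"phase": "Failed"}},
    {"metadata": {"name": "runner-b"}, "status": {"phase": "Running"}},
    {"metadata": {"name": "runner-c"}, "status": {"phase": "Failed"}},
    {"metadata": {"name": "runner-d"}}
  ]
}
""")

print(failed_runner_names(sample))  # prints ['runner-a', 'runner-c']
```

Objects without a status, such as `runner-d` in the sample, are skipped rather than treated as failed.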

Additional Context

https://github.com/actions/actions-runner-controller/discussions/3610
https://github.com/actions/actions-runner-controller/discussions/3300

Controller Logs

2024-06-20T19:18:03Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "my-scaleset-ns", "name": "my-runner-6pzbd", "replicas": 3}
2024-06-20T19:18:03Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}
2024-06-20T19:18:11Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 14}
2024-06-20T19:18:53Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 11}
2024-06-20T19:19:01Z	INFO	listener-app.listener	Getting next message	{"lastMessageID": 14}

Runner Pod Logs

2024-06-21T16:22:44Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "my-scaleset", "name": "my-runner-rpvp2", "replicas": 10}

Metadata

Labels

  • bug: Something isn't working
  • gha-runner-scale-set: Related to the gha-runner-scale-set mode
  • needs triage: Requires review from the maintainers
