Add Customizable Failure Threshold for Ephemeral Runner Retries #3700

@ali-kafel

Controller Version

0.9.3

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202

If an ephemeral runner fails to start more than 5 times, it is marked as failed. If multiple runners fail to start, the failed runners count against the maximum runner limit and block new runners from starting.

1. Create a runner scale set with any maximum number of runners.
2. Cause runners to fail at startup until enough of them are marked as failed to approach the runner maximum.
3. Try to spin up new runners. The failed runners still take up space, so new runners are either blocked from starting or capped at a lower number than expected.
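
The effect of the steps above can be illustrated with a small model. This is only a sketch of the behavior described in this issue, not the controller's actual code: the function `availableSlots` and the numbers used are hypothetical, and it assumes failed runners are counted toward the maximum exactly like running ones.

```go
package main

import "fmt"

// availableSlots models the behavior reported in this issue: failed
// ephemeral runners still count toward the scale set maximum, so they
// reduce how many new runners can be created.
// Hypothetical helper for illustration only.
func availableSlots(maxRunners, running, failed int) int {
	free := maxRunners - running - failed
	if free < 0 {
		return 0
	}
	return free
}

func main() {
	// Hypothetical numbers: a maximum of 10 with 2 running and 8 failed runners.
	fmt.Println(availableSlots(10, 2, 8)) // prints 0: no room for new runners

	// After the failed runners are deleted, capacity is available again.
	fmt.Println(availableSlots(10, 2, 0)) // prints 8
}
```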

Describe the bug

Related to this issue: https://github.com/actions/actions-runner-controller/discussions/3300

Related to this line of code: https://github.com/actions/actions-runner-controller/blob/master/controllers/actions.github.com/ephemeralrunner_controller.go#L202

If an ephemeral runner fails to start more than 5 times, it is marked as failed. If multiple runners fail to start, the failed runners count against the maximum runner limit and block new runners from starting. We need this threshold to be configurable, and we need some way to clean up the failed runners after some time.

Describe the expected behavior

We would like to be able to set the failure threshold ourselves, so that we have more time to catch these failing ephemeral runners. Something like this would be great:

case len(ephemeralRunner.Status.Failures) > failedRetryLimit:
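
A minimal, self-contained sketch of how such a condition could be driven by a configurable limit. Everything here is an assumption for illustration: the environment variable name `EPHEMERAL_RUNNER_FAILED_RETRY_LIMIT`, the helpers `retryLimitFromEnv` and `shouldMarkFailed`, and modeling the recorded failures as a `map[string]bool`. The controller does not necessarily expose any of these.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// defaultFailedRetryLimit mirrors the threshold of 5 described in this issue.
const defaultFailedRetryLimit = 5

// retryLimitFromEnv reads the limit from the named environment variable and
// falls back to the default when the variable is unset, not a number, or negative.
// The variable name is chosen by the caller; none is defined by the controller.
func retryLimitFromEnv(name string) int {
	v, err := strconv.Atoi(os.Getenv(name))
	if err != nil || v < 0 {
		return defaultFailedRetryLimit
	}
	return v
}

// shouldMarkFailed applies the same comparison as the proposed case:
// the runner is marked failed once it has more failures than the limit.
func shouldMarkFailed(failures map[string]bool, failedRetryLimit int) bool {
	return len(failures) > failedRetryLimit
}

func main() {
	// Hypothetical variable name, used only for this sketch.
	limit := retryLimitFromEnv("EPHEMERAL_RUNNER_FAILED_RETRY_LIMIT")

	// Hypothetical failure records keyed by pod name.
	failures := map[string]bool{"pod-a": true, "pod-b": true}
	fmt.Println(limit, shouldMarkFailed(failures, limit))
}
```

With a setup like this, the chart would only need to pass one value through to the controller's environment or flags.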

We should be able to set it in the Helm chart for the actions runner controller. It would also be great if the controller automatically cleaned up the failed runners, perhaps once a day.

Additional Context

N/A

Controller Logs

N/A

Runner Pod Logs

N/A

Metadata

Labels

  • bug: Something isn't working
  • gha-runner-scale-set: Related to the gha-runner-scale-set mode
  • needs triage: Requires review from the maintainers
