introducing ephemeral-runners #1122

mganter · 2025-11-05T09:47:37Z

mganter commented

2025-11-05 09:47:37 +00:00

This PR introduces a flag for runner registration that tells Forgejo to invalidate the runner token after executing one task. To prevent the runner from looping on unauthorized requests in daemon mode, it now terminates after one task when running in daemon mode.

Rel: https://codeberg.org/forgejo/forgejo/pulls/9962
Rel: https://codeberg.org/forgejo/forgejo/issues/9407

Big thanks to @ChristopherHX for implementing this in gitea


👍 3

mganter commented

2025-11-05 09:51:13 +00:00

Rel: https://codeberg.org/forgejo/docs/pulls/1575/files

earl-warren added the Kind/Feature label

2025-11-05 13:33:00 +00:00

earl-warren reviewed

2025-11-05 13:39:46 +00:00

act/artifactcache/mock_caches.go Outdated

					
				@ -215,4 +215,2 @@

					mock.TestingT

					Cleanup(func())

				},

				) *mockCaches {

This is a cosmetic change that does not belong.

oh yeah sry :O removed another file format as well

The pipeline reformats this when executing make fmt, which is enforced by the pipeline:

https://code.forgejo.org/forgejo/runner/actions/runs/11777/jobs/0/attempt/1

mganter marked this conversation as resolved

earl-warren commented

2025-11-05 13:43:17 +00:00

It looks good 👍 It needs careful review but overall this is great.

👍 1

mganter force-pushed ephemeral-runners from 04692452e1 to 64150d7ecf

2025-11-05 15:10:03 +00:00

mfenniak commented

2025-11-05 16:45:26 +00:00

Can someone describe to me how this feature is intended to be used... in what system architecture you'd be doing this? 🤔 All of the documentation is very technical about how it works, great to have... but I don't understand what the motivation is behind the attention in this area.

mganter commented

2025-11-05 17:02:41 +00:00

@mfenniak wrote in #1122 (comment):

Can someone describe to me how this feature is intended to be used... in what system architecture you'd be doing this? 🤔 All of the documentation is very technical about how it works, great to have... but I don't understand what the motivation is behind the attention in this area.

The problem is runner token hijacking. Depending on the level the runner is registered at (Forgejo instance, organization, repository), a hijacked runner token allows an attacker to fetch new tasks and thereby receive the secrets and code of tasks that are currently pending.

By restricting the access for a given runner token to a single task, the token can only be used to fetch data related to this task.

This also implies that if you want to create new ephemeral runners, you need to register them first to retrieve a runner token and provide it to the runner.

So in contrast to one-job runners, ephemeral runners can only do one job with their token.

Maybe this comment might help you understand the problem a little more:
https://gitea.com/gitea/act_runner/issues/19#issuecomment-739221
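
The difference between a regular and an ephemeral runner token described above can be illustrated with a toy model. This is only a sketch: `ToyForge`, `register`, `fetch_task` and `tasks_visible_to` are hypothetical names, not Forgejo's implementation, and the model simplifies "invalidated after executing one task" to "invalidated once one task has been handed out".

```python
import secrets


class ToyForge:
    """Toy stand-in for the server side; NOT Forgejo's real implementation."""

    def __init__(self, pending_tasks):
        self.pending = list(pending_tasks)
        self.tokens = {}  # runner token -> {"ephemeral": bool, "used": bool}

    def register(self, ephemeral):
        # Hypothetical registration: returns a fresh runner token.
        token = secrets.token_hex(8)
        self.tokens[token] = {"ephemeral": ephemeral, "used": False}
        return token

    def fetch_task(self, token):
        state = self.tokens.get(token)
        # An ephemeral token is rejected once it has received a task.
        if state is None or (state["ephemeral"] and state["used"]):
            raise PermissionError("401: token unknown or already consumed")
        if not self.pending:
            return None
        state["used"] = True
        return self.pending.pop(0)


def tasks_visible_to(forge, token, attempts=3):
    """Return every task a holder of `token` can pull in `attempts` fetches."""
    seen = []
    for _ in range(attempts):
        try:
            task = forge.fetch_task(token)
        except PermissionError:
            break
        if task is not None:
            seen.append(task)
    return seen
```

In this model, a stolen regular token can drain every pending task, while a stolen ephemeral token exposes at most the single task it was used for.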


mfenniak commented

2025-11-05 17:16:05 +00:00

I understand that. But why are you using one-job?

Just to be clear, I'm not asking this because I intend to negatively review this change. I understand the security motivation. But I don't understand, as I said, "in what system architecture you'd be doing this?"


mganter commented

2025-11-06 08:14:23 +00:00

We are currently using autoscaling runners on k8s using garm and host mode, because we don't have access to the privileges required by dind.
To build containers, we have a rootless buildkitd sidecar which acts as the builder for docker buildx.

earl-warren commented

2025-11-06 08:26:32 +00:00

That sounds interesting. Could you point to a description / infrastructure-as-code of how you do that? I'm also interested to know about the concrete problem you have in this particular context and how you plan to deploy the ephemeral feature once it is implemented.

mfenniak commented

2025-11-06 15:18:03 +00:00

The security improvement provided by ephemeral runners makes sense to me. But if it's a security enhancement, then should something be done to follow-up on preventing any ongoing usage of the insecure one-job capability?


mganter commented

2025-11-06 16:29:03 +00:00

@mfenniak wrote in #1122 (comment):

The security improvement provided by ephemeral runners makes sense to me. But if it's a security enhancement, then should something be done to follow-up on preventing any ongoing usage of the insecure one-job capability?

We could deprecate the feature or enforce the ephemeral registration for one-job executions.

Maybe people distribute runner tokens but not registration tokens, which would require an adaptation on their side. Although I don't know anyone with such an environment.

mganter commented

2025-11-06 16:38:05 +00:00

@earl-warren wrote in #1122 (comment):

That sounds interesting. Could you point to a description / infrastructure-as-code of how you do that? I'm also interested to know about the concrete problem you have in this particular context and how you plan to deploy the ephemeral feature once it is implemented.

As mentioned above, we are using garm (https://github.com/cloudbase/garm) and k8s to autoscale Forgejo runners. But we provide them across teams and don't want the teams to interfere with each other. Also, we want to prevent code or secrets required for deployments from leaking from other repositories through this hole.

The plan is to use an adapted version of the garm-k8s-provider (https://github.com/mercedes-benz/garm-provider-k8s) and garm's Gitea integration to provision Forgejo runners for our Forgejo instance. Garm's k8s-provider will deploy runner pods in host mode with a buildkitd sidecar. Sadly, I cannot share the code for this, as it is still closed source.

At some point we might tackle the Forgejo integration into garm, but that's out of scope for now.

earl-warren commented

2025-11-07 12:05:22 +00:00

As mentioned above, we are using garm and k8s to autoscale forgejo runners.

Is there documentation or infrastructure as code I could read to understand how you are doing that? It is something the Forgejo infrastructure itself could benefit from, actually. It is k8s based and 100% Infrastructure as Code https://codeberg.org/forgejo/k8s-cluster, which is presumably very similar to what you are doing.

I realize this may seem out of scope but it will go a long way to give substance to the concrete need for this ephemeral feature. Without such an example to look at, it feels like a solution for an abstract problem. I trust you when you write that it is concrete for you. And I'm just looking for a way to observe that concrete example.


danielsy commented

2025-11-07 15:57:54 +00:00

First-time contributor

@earl-warren wrote in #1122 (comment):

As mentioned above, we are using garm and k8s to autoscale forgejo runners.

Is there a documentation or infrastructure as code I could read to understand how you are doing that? It is something the Forgejo infrastructure itself could benefit from actually. It is k8s based and 100% Infrastructure as Code https://codeberg.org/forgejo/k8s-cluster, which is presumably very similar to what you are doing.

I realize this may seem out of scope but it will go a long way to give substance to the concrete need for this ephemeral feature. Without such an example to look at, it feels like a solution for an abstract problem. I trust you when you write that it is concrete for you. And I'm just looking for a way to observe that concrete example.

This is a rather incomplete example of what it could look like. It is still missing either dind or buildkitd, and proper images:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: runner
spec:
  initContainers:
  # Init container: registers the runner as ephemeral and writes the
  # result to the shared volume.
  - image: forgejo/forgejo-runner:latest
    name: init-register-runner
    command: ['sh', '-c', 'forgejo-runner register --ephemeral --instance $INSTANCE_URL --token $RUNNER_TOKEN --name $(hostname) --no-interactive --labels k8s:host']
    resources: {}
    env:
    - name: GITEA_RUNNER_FILE
      value: /forgejo-runner/runner.json
    # Placeholder for the registration token.
    - name: RUNNER_TOKEN
      value: "token"
    - name: INSTANCE_URL
      value: "http://forgejo:3000/"
    volumeMounts:
    - name: runner-config
      mountPath: /forgejo-runner
  containers:
  # Runner container: only receives the runner file, not the registration token.
  - image: forgejo/forgejo-runner:latest
    name: runner
    resources: {}
    env:
    - name: GITEA_RUNNER_FILE
      value: /forgejo-runner/runner.json
    volumeMounts:
    - name: runner-config
      mountPath: /forgejo-runner
  restartPolicy: Never
  volumes:
  - name: runner-config
    emptyDir: {}
```

Hope this helps a little!
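
For an orchestrator that assembles the registration call programmatically, the `register` invocation from the pod example can be sketched as a small helper. It uses only the flags that appear in that example; the function name `build_register_command` is hypothetical, and joining several labels with commas is an assumption about the `--labels` format.

```python
def build_register_command(instance_url, registration_token, name, labels, ephemeral=True):
    """Assemble the argv for `forgejo-runner register` (flags taken from the pod example)."""
    argv = ["forgejo-runner", "register"]
    if ephemeral:
        argv.append("--ephemeral")
    argv += [
        "--instance", instance_url,
        "--token", registration_token,
        "--name", name,
        "--no-interactive",
        # Assumption: multiple labels are passed as one comma-separated value.
        "--labels", ",".join(labels),
    ]
    return argv
```

Returning a list rather than a shell string avoids quoting problems when the token or name contains special characters.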

👍 3 🚀 1

mganter commented

2025-11-07 17:31:05 +00:00

@earl-warren wrote in #1122 (comment):

As mentioned above, we are using garm and k8s to autoscale forgejo runners.

Is there a documentation or infrastructure as code I could read to understand how you are doing that? It is something the Forgejo infrastructure itself could benefit from actually. It is k8s based and 100% Infrastructure as Code https://codeberg.org/forgejo/k8s-cluster, which is presumably very similar to what you are doing.

I realize this may seem out of scope but it will go a long way to give substance to the concrete need for this ephemeral feature. Without such an example to look at, it feels like a solution for an abstract problem. I trust you when you write that it is concrete for you. And I'm just looking for a way to observe that concrete example.

I just took a look at the repo and couldn't find a runner config there. But I really like the flux setup there 👍. Thanks @danielsy for the nice example. If I can find some time during the weekend, I can provide a slightly more sophisticated example with autoscaling and buildkitd or dind.

Is there a contribution guide for the infrastructure repo? If not, I will just provide some flux definitions and steps to configure it.

earl-warren commented

2025-11-07 17:47:46 +00:00

@mganter wrote in #1122 (comment):

I just took a look into the repo, couldn't find a runner config there.

There is none. Hence my interest 😁 It would be a matter of getting it right the first time instead of reworking something that's not quite right.


earl-warren commented

2025-11-07 17:50:39 +00:00

@danielsy would you also have a snippet in the same spirit for how the token is obtained?

earl-warren commented

2025-11-07 17:52:07 +00:00

And also... what would create this? Would a script polling the Forgejo API have to be created?

aahlenst commented

2025-11-07 18:34:52 +00:00

I'd like to know whether you've considered emulating GitHub's just-in-time runner for a repository (https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-configuration-for-a-just-in-time-runner-for-a-repository).

aahlenst commented

2025-11-08 10:18:55 +00:00

Disclaimer, because it has caused confusion in the past: I'm not affiliated with anyone involved in this PR. I just happen to have similar interests.

While I'm happy about the introduction of ephemeral runners, I'm not sure whether the overall workflow is good. There does not seem to be a central tracking issue that discusses it (I'd appreciate any pointers). So it's very well possible that I am missing something.

What worries me is that this PR exists. It should not, because whether a runner is ephemeral or not should be controlled only by Forgejo. There should be zero involvement of the runner.

On a high level, my ideal workflow would look like this:

1. Workflow run is triggered.
2. Autoscaler becomes aware of the pending workflow run.
3. Autoscaler asks Forgejo for the creation of an ephemeral runner for a specific single pending workflow run. Arguments: run ID, runner name, labels.
4. Forgejo creates a runner in the corresponding repository with the given name and labels. The runner is marked as ephemeral and restricted to the run ID specified in step 3. Forgejo returns: Authentication token for the runner.
5. Autoscaler creates container/VM/whatever, configures the Forgejo Runner (roughly forgejo-runner register), passing the authentication token and the labels (to allow rewriting "trixie" to "trixie:host" or something else), optionally the name. Then, it starts Forgejo Runner (roughly forgejo-runner one-job). (Note: Label rewriting is incompatible with GitHub's JIT config.)
6. Forgejo Runner connects to Forgejo and asks for a job, similarly to today's one-job. Forgejo returns the previously registered run with the ID from step 3. Forgejo does not hand out the run to any other runner and prevents reuse of the authentication token.

Advantages:

* Runs can no longer end up on a runner they are not supposed to. Today, and with the proposed implementation, it's possible.
* Binding runners to a specific run makes cancellation and dealing with all kinds of faults a lot easier.
* Forgejo alone is in charge of enforcing the rules. Important when building PRs and running forgejo-runner in host mode which seems to be an objective of this PR.
* The runner does not have to know whether it's ephemeral or not. It reduces the potential for shenanigans. For example, what happens with the PR if I omit the --ephemeral option?

Disadvantages:

* Less compatibility with GitHub, Gitea.
* No ephemeral runners that can process multiple jobs (I consider that a win).
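
The six-step workflow proposed in this comment can be sketched as a toy server model. Everything here is hypothetical: `ToyForgejo`, `trigger_run`, `create_ephemeral_runner` and `fetch_job` do not exist in Forgejo; they only illustrate binding a single-use authentication token to one run ID.

```python
import secrets


class ToyForgejo:
    """Hypothetical server model for the proposed workflow; not a real Forgejo API."""

    def __init__(self):
        self.pending_runs = {}  # run ID -> job payload
        self.runners = {}       # auth token -> {"run_id", "name", "labels"}

    def trigger_run(self, run_id, payload):
        # Step 1: a workflow run is triggered and becomes pending.
        self.pending_runs[run_id] = payload

    def create_ephemeral_runner(self, run_id, name, labels):
        # Steps 3 and 4: create a runner restricted to exactly one pending run.
        if run_id not in self.pending_runs:
            raise KeyError(f"run {run_id} is not pending")
        if any(r["run_id"] == run_id for r in self.runners.values()):
            raise ValueError(f"run {run_id} already has a runner")
        token = secrets.token_hex(8)
        self.runners[token] = {"run_id": run_id, "name": name, "labels": list(labels)}
        return token

    def fetch_job(self, token):
        # Step 6: the token is single use and only ever yields its bound run.
        runner = self.runners.pop(token, None)
        if runner is None:
            raise PermissionError("401: unknown or already used token")
        run_id = runner["run_id"]
        return run_id, self.pending_runs.pop(run_id)
```

In this model the runner never learns whether it is ephemeral; the server alone decides which run a token may receive and refuses any second use.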

dharsanb commented

2025-11-08 15:12:08 +00:00

First-time contributor

Hey,
Just another interested user working on autoscaling CI.

Autoscaler becomes aware of the pending workflow run.

I don't think this part exists yet.
There are two ways that GitHub does this:

1. GitHub has webhooks for queued jobs (docs: https://docs.github.com/en/webhooks/webhook-events-and-payloads?actionType=queued#workflow_job).
2. Action Runner Controller has an internal long-polling endpoint that holds the connection open until a job is queued and then returns a response. If the connection is terminated or a response is received, the ARC controller opens a new connection.
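
The long-polling pattern described for the Action Runner Controller can be sketched with an in-process queue standing in for the remote endpoint. This is a minimal sketch under that assumption; `long_poll` and `watch_for_jobs` are hypothetical names, not ARC or Forgejo APIs.

```python
import queue


def long_poll(job_queue, timeout):
    """Toy long-poll endpoint: block until a job is queued or the connection times out."""
    try:
        return job_queue.get(timeout=timeout)
    except queue.Empty:
        return None  # models a connection that ended without a job


def watch_for_jobs(job_queue, on_job, max_connections, timeout=0.01):
    """Open a new connection after every response or timeout."""
    for _ in range(max_connections):
        job = long_poll(job_queue, timeout)
        if job is not None:
            on_job(job)  # e.g. ask the autoscaler to start a runner
```

A real controller would loop indefinitely; `max_connections` only bounds the loop so the sketch terminates.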

earl-warren commented

2025-11-08 19:47:59 +00:00

I made a note to check where this PR is at when I get back from vacation (yeah!) in two weeks.

danielsy commented

2025-11-10 13:58:44 +00:00

First-time contributor

@earl-warren wrote in #1122 (comment):

@danielsy would you also have a snippet in the same spirit for how the token is obtained?

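In a garm environment, the runner calls a garm-specific endpoint (https://github.com/cloudbase/garm/blob/326204d1a684fbd1143a5f35178b56bd6c9c7170/apiserver/routers/routers.go#L166) and garm calls one of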

POST /repos/{owner}/{repo}/actions/runners/registration-token
POST /orgs/{org}/actions/runners/registration-token
POST /admin/actions/runners/registration-token

authenticated via personal access token. The retrieved token can be used by the init container to perform the registration and the runner can act upon this without having the registration token or any other secret than the runner token.
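
Choosing between the three registration-token endpoints listed above can be sketched as follows. The helper names are hypothetical, the request is only constructed and never sent, and both the API base URL passed by the caller and the `Authorization: token ...` header format are assumptions to verify against the instance's API documentation.

```python
import urllib.request


def registration_token_path(owner=None, repo=None, org=None):
    """Pick the endpoint for the scope the runner should be registered at."""
    if owner and repo:
        return f"/repos/{owner}/{repo}/actions/runners/registration-token"
    if owner or repo:
        raise ValueError("repository scope needs both owner and repo")
    if org:
        return f"/orgs/{org}/actions/runners/registration-token"
    return "/admin/actions/runners/registration-token"


def build_registration_token_request(api_base, access_token, **scope):
    """Build (but do not send) the POST request for a registration token."""
    url = api_base.rstrip("/") + registration_token_path(**scope)
    return urllib.request.Request(
        url,
        method="POST",
        # Assumption: personal access tokens are sent in this header format.
        headers={"Authorization": f"token {access_token}"},
    )
```

Keeping the personal access token in the orchestrator means the runner pod only ever sees the short-lived registration token or the runner file derived from it.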

@earl-warren wrote in #1122 (comment):

And also... what would create this? Would a script polling the Forgejo API have to be created?

As soon as https://codeberg.org/forgejo/forgejo/pulls/9803 is integrated, forgejo can notify the orchestration tool (e.g. garm) to scale up workers.

@dharsanb wrote in #1122 (comment):

I don't think this part exists yet. There are two ways that GitHub does this

1. GitHub has WebHooks for queued jobs. [docs link](https://docs.github.com/en/webhooks/webhook-events-and-payloads?actionType=queued#workflow_job)

I created a PR solving this https://codeberg.org/forgejo/forgejo/pulls/9803


👍 1

mganter commented

2025-11-10 14:57:22 +00:00

@aahlenst wrote in #1122 (comment):

* Runs can no longer end up on a runner they are not supposed to. Today, and with the proposed implementation, it's possible.

I guess we just move the problem from the runner to the autoscaler.
As Forgejo is a multi-tenant environment, we need to take into account that there may be multiple autoscalers for a specific repo.
E.g. there is a global autoscaler for the whole platform, and there are possibly org autoscalers and repo autoscalers, which sometimes serve runners with the same tags and sometimes with different ones.

I think your idea is a valuable thing to achieve, but I would rather introduce an attribute in the workflow file to assign tasks to the different runner types (global/org/repo).

Advantages:

* Binding runners to a specific run makes cancellation and dealing with all kinds of faults a lot easier.

I don't think so, as jobs are bound to a runner as soon as the runner fetches a job (this was already the case before ephemeral runners).

* Forgejo alone is in charge of enforcing the rules. Important when building PRs and running `forgejo-runner` in `host` mode which seems to be an objective of this PR.

I don't really know what you mean by that, as Forgejo always enforces all the rules.

* The runner does not have to know whether it's ephemeral or not. It reduces the potential for shenanigans. For example, what happens with the PR if I omit the `--ephemeral` option?

In my mind, I distinguish between runner registration and job execution. They don't need to be done by the same entity. In your example (step 3), registration can also be done by the autoscaler.
As you can see in @danielsy's comment, only the GITEA_RUNNER_FILE needs to be exposed to the job execution.

When the runner does not know about its ephemeral state, as of now it will be stuck in a 401 loop when run in daemon mode.
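
The 401 loop mentioned here can be illustrated with a toy daemon loop. This is a sketch, not the runner's actual code: `Unauthorized`, `make_fetch` and `run_daemon` are hypothetical names, and the toy server rejects every fetch after the first task has been handed out.

```python
class Unauthorized(Exception):
    """Stands in for an HTTP 401 response from the server."""


def make_fetch(task):
    """Toy server: hands out one task, then rejects the invalidated ephemeral token."""
    state = {"served": False}

    def fetch():
        if state["served"]:
            raise Unauthorized()
        state["served"] = True
        return task

    return fetch


def run_daemon(fetch, execute, ephemeral_aware, max_polls=5):
    """Poll for tasks and return how many polls were rejected with a 401."""
    rejected = 0
    for _ in range(max_polls):
        try:
            task = fetch()
        except Unauthorized:
            rejected += 1
            continue  # a daemon unaware of its ephemeral state keeps polling
        execute(task)
        if ephemeral_aware:
            break  # terminate after one task instead of polling again
    return rejected
```

With `ephemeral_aware=True` the loop exits cleanly after the single task; otherwise every remaining poll is rejected, which is the looping behaviour that terminating the daemon after one task avoids.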

@aahlenst wrote in #1122 (comment):

> No ephemeral runners that can process multiple jobs (I consider that a win).

👍 that's a win for sure

@aahlenst wrote in #1122 (comment):

> I'd like to know whether you've considered emulating GitHub's [just-in-time runner for a repository](https://docs.github.com/en/rest/actions/self-hosted-runners?apiVersion=2022-11-28#create-configuration-for-a-just-in-time-runner-for-a-repository).

I just looked it up and, to be honest, I don't really understand this endpoint. `Generates a configuration that can be passed to the runner application at startup.` is kind of ambiguous. Did you already try it out? Do you have a nice resource for getting me up to speed?


mganter commented

2025-11-10 15:07:34 +00:00

This one is about ephemeral runners, maybe we can continue the discussion there.

forgejo/forgejo-actions-feature-requests#43


aahlenst commented

2025-11-10 16:40:33 +00:00

> I guess we just move the problem from the runner to the autoscaler.

The runner is on a potentially attacker-controlled machine. Therefore, moving the problem somewhere else sounds good to me.

> I think your idea is a valuable thing to achieve, but I would rather introduce an attribute in the workflow file to assign tasks to the different runner types (global/org/repo).

I believe we're talking about different things here. Let's say two jobs with equal labels are pending in a single repository, 1 and 2. In response, the autoscaler provisions two containers/VMs/whatever, A (for 1) and B (for 2). Right now, you cannot guarantee that 1 will actually end up on A and 2 on B.

If job 1 has a superset of labels of job 2, it might even happen that 2 ends up on A and 1 cannot run because B is missing labels. A workaround would be to ensure that label sets are unique, but that shifts the responsibility to the workflow author and allows PR authors to cause havoc.
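
The scenario can be played through with a small Python sketch. It assumes a deliberately simplified model in which each polling runner is handed the first queued job whose labels it covers; this is a hypothetical stand-in and not how Forgejo is known to schedule. The label names and the `assign` helper are made up for the illustration.

```python
from typing import Optional

Job = tuple[str, frozenset[str]]     # (job name, labels the job requires)
Runner = tuple[str, frozenset[str]]  # (runner name, labels the runner offers)


def assign(queue: list[Job], runners: list[Runner]) -> dict[str, Optional[str]]:
    """Hand each polling runner the first queued job whose labels it covers."""
    pending = list(queue)
    result: dict[str, Optional[str]] = {}
    for runner_name, offered in runners:
        # A job fits if every label it requires is offered by the runner.
        match = next((job for job in pending if job[1] <= offered), None)
        result[runner_name] = match[0] if match else None
        if match:
            pending.remove(match)
    return result


job1 = ("1", frozenset({"linux", "gpu"}))      # labels are a superset of job 2's
job2 = ("2", frozenset({"linux"}))
runner_a = ("A", frozenset({"linux", "gpu"}))  # provisioned for job 1
runner_b = ("B", frozenset({"linux"}))         # provisioned for job 2

# Job 2 happens to be ahead in the queue and A polls first.
outcome = assign([job2, job1], [runner_a, runner_b])
print(outcome)
```

In this model A takes job 2, B cannot take job 1 because it lacks the `gpu` label, and job 1 stays pending although a machine was provisioned for it.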

> > Binding runners to a specific run makes cancellation and dealing with all kinds of faults a lot easier.
>
> I don't think so, as the jobs are bound to a runner as soon as the runner fetches a job. (already pre-ephemeral)

If the Forgejo API exposes that information, it would be possible to figure out on which container/VM/whatever a job has ended up by matching runner names. But for that to work, a job must be successfully assigned to a runner. If something goes wrong earlier (see previous example), you're out of luck.

> In my mind, I distinguish between runner registration and job execution. They don't need to be performed by the same entity. In your example (step 3), the registration can also be done by the autoscaler.

Is there currently an API to do this? And if there is, why should the runner also be able to register itself as ephemeral runner?

> As you can see in @danielsy's comment, only the GITEA_RUNNER_FILE needs to be exposed to the job execution.

That is a very nice solution for the tool that you're using. For somebody using, let's say, OpenStack, that would require launching multiple VMs one after another and transferring data between them. That wastes a lot of resources and takes forever.

> > I'd like to know whether you've considered emulating GitHub's just-in-time runner for a repository.
>
> I just looked it up and, to be honest, I don't really understand this endpoint. `Generates a configuration that can be passed to the runner application at startup.` is kind of ambiguous. Did you already try it out? Do you have a nice resource for getting me up to speed?

I don't have a link ready, but I'm using it with GitHub. In short: It registers an ephemeral runner and returns Base64-encoded JSON. That JSON contains all configuration information for the runner. You can pass that directly to actions-runner/run.sh --jitconfig which immediately starts running the job. No separate configuration is necessary and it avoids all quoting problems. But it might be tricky to do that with Forgejo's label rewriting.
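
To illustrate the "Base64-encoded JSON" idea, here is a minimal Python sketch of the round trip. The field names in the example configuration are invented for the illustration; they are neither GitHub's actual JIT payload nor anything Forgejo provides today.

```python
import base64
import json


def encode_jit_config(config: dict) -> str:
    """Serialize a runner configuration to JSON and wrap it in Base64."""
    raw = json.dumps(config, sort_keys=True).encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def decode_jit_config(blob: str) -> dict:
    """Reverse of encode_jit_config: Base64 text back to a configuration dict."""
    return json.loads(base64.b64decode(blob).decode("utf-8"))


# Hypothetical payload; a real one would be produced by the server.
example = {"name": "forgejo-vm", "labels": ["debian"], "ephemeral": True}
blob = encode_jit_config(example)
print(blob)
print(decode_jit_config(blob))
```

Because the blob uses only the Base64 alphabet, it can be passed as a single command-line argument or placed in instance metadata without shell quoting, which is the convenience described above.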

> > This one is about ephemeral runners, maybe we can continue the discussion there.
> >
> > forgejo/forgejo-actions-feature-requests#43

Might be difficult now. We'd lose all people that are following this issue.

aahlenst commented

2025-11-19 14:25:27 +00:00

I've thought some more about it.

I believe that using `forgejo-runner register` isn't beneficial in this case due to its reliance on the runner registration token. If the token leaks, adversaries can create new runners. Preventing it from leaking is not always easy. For example, people that wanted to start an ephemeral runner using [cloud-init](https://cloud-init.io/) couldn't do so safely because cloud-init data sources are usually accessible during the lifetime of a machine.

Using the [offline registration token](https://forgejo.org/docs/v13.0/admin/actions/runner-installation/#offline-registration) with `forgejo-runner create-runner-file` looks more promising because it is bound to a single runner. Therefore, it could be invalidated immediately after the runner has connected to Forgejo. `forgejo-runner one-job` or even `forgejo-runner daemon` would still work and no changes to Forgejo Runner itself are necessary. I'm not opposed to changing Forgejo Runner if there's a compelling reason.

I'm much less clear on the changes required in Forgejo itself and what APIs to introduce. Endpoints for runner creation based on the offline registration token would certainly be necessary.

I have compiled a lengthy document about ephemeral runners that I use for research. It could serve as a basis for a design discussion. I'm willing to share it if anyone wants to read it. Just tell me where I should put it.


mganter commented

2025-11-20 10:29:04 +00:00

First, I guess we still don't understand each other regarding the registration procedure.

In this graphic, you can see that whatever performs the registration (in your case, it might be the thing that provisions a VM) can register the runner. After that, the component that made the registration can share the `runner.json` with the runner.

@aahlenst wrote in #1122 (comment):

> I believe we're talking about different things here. Let's say two jobs with equal labels are pending in a single repository, 1 and 2. In response, the autoscaler provisions two containers/VMs/whatever, A (for 1) and B (for 2). Right now, you cannot guarantee that 1 will actually end up on A and 2 on B.
>
> If job 1 has a superset of labels of job 2, it might even happen that 2 ends up on A and 1 cannot run because B is missing labels. A workaround would be to ensure that label sets are unique, but that shifts the responsibility to the workflow author and allows PR authors to cause havoc.

Fair point, but I would consider this a problem of the autoscaler, as the autoscaler should be able to see that one job is still pending and does not have a matching runner.

@aahlenst wrote in #1122 (comment):

> That is a very nice solution for the tool that you're using. For somebody using, let's say, OpenStack, that would require launching multiple VMs one after another and transferring data between them. That wastes a lot of resources and takes forever.

> Is there currently an API to do this? And if there is, why should the runner also be able to register itself as ephemeral runner?

As mentioned above, you need something that scales your VMs and can also register the runner. The API call is currently implemented as a [gRPC](https://code.forgejo.org/forgejo/actions-proto/src/branch/main/runner/v1/runnerv1connect/services.connect.go#L37) call with this [message](https://code.forgejo.org/forgejo/actions-proto/src/branch/main/runner/v1/messages.pb.go#L134-L148).

@aahlenst wrote in #1122 (comment):

> I'm much less clear on the changes required in Forgejo itself and what APIs to introduce. Endpoints for runner creation based on the offline registration token would certainly be necessary.

In my opinion, offline registration and remote runner registration (i.e. the gRPC call executed somewhere other than the runner host) are equivalent.
So there is actually already an endpoint doing that.


image.png

72 KiB

kindlerw commented

2025-11-20 10:48:30 +00:00

First-time contributor

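Garm does have a [provider for OpenStack](https://github.com/cloudbase/garm-provider-openstack) and [LXD](https://github.com/cloudbase/garm-provider-lxd) that you could use as an inspiration.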

In all the cases the registration token is not exposed to the actual runner.


👀 1

aahlenst commented

2025-11-20 13:09:51 +00:00

@mganter wrote in #1122 (comment):

> In this graphic, you can see that the registration process, in your case it might be the thing that provisions a VM, can register the runner. After that, the one that made the registration can share the `runner.json` with the runner.

I understand that this is a possibility. It works well for GARM with containers. It might work well for some of the other tools out there.

But: Why not make it easier, more accessible, and safer by default for everybody? Why require running a separate binary somewhere else instead of an API call? What are the advantages of your solution? What would you lose by not using forgejo-runner register?

> Fair point, but I would consider this a problem of the autoscaler, as the autoscaler should be able to see that one job is still pending and does not have a matching runner.

The autoscaler has provisioned something useless, compute time was wasted, and some precious resource like a GPU, which the assigned job does not need (but the waiting one does), is suddenly occupied.

Why not prevent it from happening in the first place? Why not make it more efficient and foolproof by default?

All problems I pointed out can somehow be solved without changing the current proposal. But we are talking about a new feature. So, why don't we try to come up with the absolute best possible solution as long as it does not require a rewrite of Forgejo? Why settle for less? It's unlikely we get a do-over soon. Also, keep in mind that people are great at using tools in unforeseen ways. Therefore, the number of problems with any solution will only go up.

To be clear, I don't expect that you implement any of the improvements we come up with.

@kindlerw wrote in #1122 (comment):

> Garm does have a provider for OpenStack and LXD that you could use as an inspiration.
>
> In all the cases the registration token is not exposed to the actual runner.

I scrolled over the OpenStack provider. I can see how machines are provisioned and configuration is provided to the metadata service. I don't see there how runners are started. I found https://github.com/cloudbase/garm-provider-openstack/blob/main/vendor/github.com/cloudbase/garm-provider-common/cloudconfig/templates.go in there which does pretty much what I would expect: Fetch credentials from the metadata service, including the JIT config. Alternatively, it uses a normal runner token and configures the runner. If that script isn't executed on the same VM as the runner, where does it happen?


aahlenst commented

2025-11-22 12:35:14 +00:00

After extensive discussions with @mganter and colleagues, our conclusion¹ is:

* There are setups that benefit from the workflow proposed by this PR. They avoid the risks thanks to their architecture (for example, by running `forgejo-runner register` in a separate container) and `forgejo-runner register` does not require additional plumbing in that particular case.
* There are setups where replacing `forgejo-runner register` with an HTTP API call and using the offline registration token greatly simplifies developing integration while removing the risks posed by `forgejo-runner register`.

A new feature request will be filed for the second option.

Then, there's the problem of job binding. I have filed [a separate feature request for that](https://code.forgejo.org/forgejo/forgejo-actions-feature-requests/issues/76). It is slightly different than what was initially discussed because I discovered some flaws (the original proposal is still listed as an alternative).

From my side, all questions are resolved.

¹ Please correct me if I am misrepresenting anything.


👍 2

mganter force-pushed ephemeral-runners from e47d3908a7 to 32ea44cab3

2025-12-03 08:22:56 +00:00

aahlenst commented

2025-12-04 15:12:30 +00:00

I gave it a whirl and encountered some problems.

$ forgejo-runner register --instance http://192.168.178.62:3000 --name forgejo-vm --token 6hakOEiBDH5UqV5nodTHbgSIh9rdK1rCICT8aGhm --labels "debian:docker://node:24-trixie" --ephemeral --no-interactive
INFO Registering runner, arch=amd64, os=linux, version=v12.1.0+13-g32ea44ca. 
WARN Runner in user-mode.                         
DEBU Successfully pinged the Forgejo instance server 
ERRO poller: cannot register new runner as ephemeral upgrade Forgejo to gain security, one-job will be used  automatically 
INFO Runner registered successfully.

The error "cannot register new runner as ephemeral upgrade Forgejo to gain security, one-job will be used automatically" packs too many different messages into one. While "cannot register new runner as ephemeral" is indeed an error, "upgrade Forgejo to gain security" seems unrelated. "one-job will be used automatically" (there is one space too many between "used" and "automatically" in the actual output) should be WARN or even INFO. I also find the message "one-job will be used automatically" confusing.

The resulting .runner file:

{
  "WARNING": "This file is automatically generated by act-runner. Do not edit it manually unless you know what you are doing. Removing this file will cause act runner to re-register as a new runner.",
  "id": 4,
  "uuid": "3ed7652f-ea66-4a08-9fce-0fe9e7fe2171",
  "name": "forgejo-vm",
  "token": "189d5fed8dbc911167a8c30b6881cd8cff7e3c82",
  "address": "http://192.168.178.62:3000",
  "labels": [
    "debian:docker://node:24-trixie"
  ],
  "ephemeral": true
}

Unfortunately, Forgejo does not recognize the runner as ephemeral:

sqlite> select id, ephemeral from action_runner where id = 4;
4|0

Forgejo Runner is now in a weird state. I think the registration error should be fatal.
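
The mismatch can be expressed as a small consistency check. The Python sketch below uses an in-memory SQLite table that mirrors only the two columns from the query above and a trimmed copy of the `.runner` file; the helper names and the reduced schema are assumptions for the illustration, not Forgejo's real schema or code.

```python
import json
import sqlite3
from typing import Optional


def server_ephemeral(conn: sqlite3.Connection, runner_id: int) -> Optional[bool]:
    """Return the ephemeral flag stored for the runner, or None if it is unknown."""
    row = conn.execute(
        "select ephemeral from action_runner where id = ?", (runner_id,)
    ).fetchone()
    return None if row is None else bool(row[0])


def is_consistent(runner_file: dict, conn: sqlite3.Connection) -> bool:
    """True if the runner file and the server agree on the ephemeral flag."""
    local = bool(runner_file.get("ephemeral", False))
    return server_ephemeral(conn, runner_file["id"]) == local


# Stand-in for the server database: only the columns used in the query above.
conn = sqlite3.connect(":memory:")
conn.execute("create table action_runner (id integer primary key, ephemeral integer)")
conn.execute("insert into action_runner (id, ephemeral) values (4, 0)")

# Trimmed copy of the generated .runner file.
runner_file = json.loads('{"id": 4, "name": "forgejo-vm", "ephemeral": true}')
print(is_consistent(runner_file, conn))
```

With the values observed above the check reports a disagreement, which is the state a fatal registration error would prevent.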

When I switch the runner manually to be ephemeral in the database, it works. However, the jobs do not terminate cleanly:

$ forgejo-runner one-job
INFO[2025-12-04T14:46:52Z] Starting job                                 
INFO[2025-12-04T14:46:52Z] runner: forgejo-vm, with version: v12.1.0+13-g32ea44ca, with labels: [debian], declared successfully 
INFO[2025-12-04T14:46:52Z] task 10 repo is andreas/test https://data.forgejo.org http://192.168.178.62:3000 
INFO[2025-12-04T14:46:53Z] Cleaning up network for job test, and network name is: WORKFLOW-44e71cf8a3776b23a0054a93a70216b9 
WARN[2025-12-04T14:46:54Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:46:54Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:46:54Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:46:54Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:46:55Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:46:57Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:47:00Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:47:07Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:47:19Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:47:45Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
ERRO[2025-12-04T14:47:45Z] unable to send final job logs and status: All attempts fail:
#1: unauthenticated: unregistered runner
#2: unauthenticated: unregistered runner
#3: unauthenticated: unregistered runner
#4: unauthenticated: unregistered runner
#5: unauthenticated: unregistered runner
#6: unauthenticated: unregistered runner
#7: unauthenticated: unregistered runner
#8: unauthenticated: unregistered runner
#9: unauthenticated: unregistered runner
#10: unauthenticated: unregistered runner
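
One conceivable way to avoid ten identical retries is to treat an authentication failure as final while still retrying transient errors. The Python sketch below is purely illustrative: `Unauthenticated`, `send_with_retry` and the delay values are hypothetical, it is not how Forgejo Runner implements its retries, and whether the real fix belongs in the runner or in Forgejo is not settled in this thread.

```python
import time
from typing import Callable


class Unauthenticated(Exception):
    """Stand-in for the 'unauthenticated: unregistered runner' error."""


def send_with_retry(send: Callable[[], None],
                    attempts: int = 10,
                    base_delay: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> int:
    """Call send() until it succeeds and return the number of calls made.

    Transient errors are retried with exponential backoff. An Unauthenticated
    error is re-raised immediately because retrying cannot fix it.
    """
    for attempt in range(1, attempts + 1):
        try:
            send()
            return attempt
        except Unauthenticated:
            raise  # the token is gone; further attempts are pointless
        except Exception:
            if attempt == attempts:
                raise  # out of attempts, surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise ValueError("attempts must be at least 1")


def rejected() -> None:
    raise Unauthenticated("unregistered runner")


try:
    send_with_retry(rejected, sleep=lambda seconds: None)
except Unauthenticated as err:
    print(f"gave up after the first attempt: {err}")
```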

The output of forgejo-runner daemon is confusing when the runner is configured to be ephemeral:

$ forgejo-runner daemon
INFO[2025-12-04T14:49:54Z] Starting runner daemon                       
INFO[2025-12-04T14:49:54Z] runner: forgejo-vm, with version: v12.1.0+13-g32ea44ca, with labels: [debian], declared successfully 
INFO[2025-12-04T14:50:06Z] task 11 repo is andreas/test https://data.forgejo.org http://192.168.178.62:3000 
INFO[2025-12-04T14:50:07Z] Cleaning up network for job test, and network name is: WORKFLOW-11c8b17f1f1ad67a3bb0dd296a60545a 
WARN[2025-12-04T14:50:07Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:07Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:07Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:08Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:09Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:10Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:14Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:20Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:33Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-04T14:50:59Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
ERRO[2025-12-04T14:50:59Z] unable to send final job logs and status: All attempts fail:
#1: unauthenticated: unregistered runner
#2: unauthenticated: unregistered runner
#3: unauthenticated: unregistered runner
#4: unauthenticated: unregistered runner
#5: unauthenticated: unregistered runner
#6: unauthenticated: unregistered runner
#7: unauthenticated: unregistered runner
#8: unauthenticated: unregistered runner
#9: unauthenticated: unregistered runner
#10: unauthenticated: unregistered runner 
INFO[2025-12-04T14:50:59Z] runner: forgejo-vm shutdown initiated, waiting [runner].shutdown_timeout=0s for running jobs to complete before shutting down

The last message should state why the runner is shutting down.

Offline registration works well 👍

The help message for the `--ephemeral` flag could be improved. It states: "Configure the runner to be ephemeral and only ever be able to pick a single job." What about "instruct Forgejo to delete this runner after it has run one job"?

mfenniak commented

2025-12-04 16:18:07 +00:00

@aahlenst Thanks for doing a functional review on this.

I've had this item sitting in my inbox with the intent to review it for a while, and I apologize for not getting to it. I'm a little skeptical of the overall architecture and that's caused me to drag my feet... but honestly it's not a problem area that I've tackled for the runner, so my skepticism isn't warranted. So I'll offer some clarity on my thoughts: if it gets a ✅ functional and code review from aahlenst, then I'm happy to get this merged and supported.

aahlenst commented

2025-12-04 21:32:48 +00:00

@mfenniak wrote in #1122 (comment):

> I'm a little skeptical of the overall architecture and that's caused me to drag my feet... but honestly it's not a problem area that I've tackled for the runner, so my skepticism isn't warranted.

I'd still like to read it. When I started to think about the feature a couple of months back, my ideas looked very different than today.

> if it gets a ✅ functional and code review from aahlenst, then I'm happy to get this merged and supported.

Thanks a lot for your confidence, but I'll need some help. For example, my Go skills aren't good enough yet to assess any substantial PR.

aahlenst commented

2025-12-04 21:39:27 +00:00

@mganter I forgot to ask: Is there a particular reason for the presence of the `ephemeral` bit in the runner file and for altering the behaviour of `daemon` in case of its presence? `ephemeral` together with `daemon` feels wrong and causes complexity. There's also the risk of Forgejo and Runner disagreeing about the value of `ephemeral`. So I'd like to explore whether we could live without it or, if necessary, replace it.

mfenniak commented

2025-12-05 15:41:36 +00:00

@aahlenst wrote in #1122 (comment):

> Thanks a lot for your confidence, but I'll need some help. For example, my Go skills aren't good enough yet to assess any substantial PR.

No problem, I'll be available for a detailed code review when appropriate, just let me know.

shuppy referenced this pull request from forgejo/forgejo-actions-feature-requests

2025-12-06 13:14:31 +00:00

HTTP Endpoint for forgejo runner registration #78

mganter commented

2025-12-09 16:11:44 +00:00

@aahlenst and @mfenniak thanks for reviewing

@aahlenst wrote in #1122 (comment):

> Unfortunately, Forgejo does not recognize the runner as ephemeral:

I missed a bug where the registration was not sent properly. Fixed it in !1122 (commit 46db91068c). Sorry about that.

@aahlenst wrote in #1122 (comment):

> @mganter I forgot to ask: Is there a particular reason for the presence of the `ephemeral` bit in the runner file and for altering the behaviour of `daemon` in case of its presence? `ephemeral` together with `daemon` feels wrong and causes complexity. There's also the risk of Forgejo and Runner disagreeing about the value of `ephemeral`. So I'd like to explore whether we could live without it or, if necessary, replace it.

This was intended, as it allows users to run the `daemon` command even though they registered their runner as ephemeral.
If this were not implemented, the runner would loop on a 403 error.
Similar to this (even though I executed the `daemon` command twice in this example):

> go run main.go -c config daemon
INFO[2025-12-09T17:10:16+01:00] Starting runner daemon
INFO[2025-12-09T17:10:16+01:00] runner: test, with version: dev, with labels: [asd], declared successfully
INFO[2025-12-09T17:10:16+01:00] [poller 0] launched
INFO[2025-12-09T17:10:32+01:00] task 2 repo is giteaAdmin/asd https://data.forgejo.org http://localhost:3000
ERRO[2025-12-09T17:10:34+01:00] failed to fetch task                          error="unauthenticated: unregistered runner"
ERRO[2025-12-09T17:10:36+01:00] failed to fetch task                          error="unauthenticated: unregistered runner"
ERRO[2025-12-09T17:10:38+01:00] failed to fetch task                          error="unauthenticated: unregistered runner"
ERRO[2025-12-09T17:10:40+01:00] failed to fetch task                          error="unauthenticated: unregistered runner"
...
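The loop in the log above could also be avoided on the runner side without a local `ephemeral` bit, by treating the "unregistered runner" response as terminal. The following is only a minimal sketch of that idea, not the runner's actual poller: `errUnregistered`, `pollTasks` and `scriptedFetch` are hypothetical names, and the fetch function merely stands in for whatever call asks Forgejo for a task.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnregistered is a hypothetical sentinel for the
// "unauthenticated: unregistered runner" response shown in the log.
var errUnregistered = errors.New("unauthenticated: unregistered runner")

// pollTasks calls fetch up to maxPolls times and counts the tasks it got.
// It stops as soon as Forgejo reports that the runner is no longer
// registered, instead of polling again. Any other error is treated as
// transient and the loop simply continues.
func pollTasks(fetch func() (int64, error), maxPolls int) (int, error) {
	fetched := 0
	for i := 0; i < maxPolls; i++ {
		_, err := fetch()
		switch {
		case err == nil:
			fetched++
		case errors.Is(err, errUnregistered):
			return fetched, fmt.Errorf("stopping poller: %w", err)
		default:
			// Transient failure: a real poller would wait before retrying.
		}
	}
	return fetched, nil
}

// scriptedFetch returns a fake fetch function that replays the given
// results in order (nil means "a task was handed out") and reports
// errUnregistered once the script is exhausted.
func scriptedFetch(script ...error) func() (int64, error) {
	next := 0
	return func() (int64, error) {
		if next >= len(script) {
			return 0, errUnregistered
		}
		err := script[next]
		next++
		if err != nil {
			return 0, err
		}
		return int64(next), nil
	}
}

func main() {
	// One task is handed out, then the ephemeral runner is deleted server-side.
	fetched, err := pollTasks(scriptedFetch(nil), 10)
	fmt.Println(fetched, err)
}
```

Whether exiting on this error is acceptable for non-ephemeral runners (for example, after an admin deletes a runner by mistake) is a separate design question that the sketch does not answer.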

mganter commented

2025-12-09 16:24:38 +00:00

@aahlenst wrote in #1122 (comment):

> WARN[2025-12-04T14:50:07Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner

That's an interesting error. Do you have an example of a matching workflow file?

@aahlenst wrote in #1122 (comment):

> The error "cannot register new runner as ephemeral upgrade Forgejo to gain security, one-job will be used automatically" contains too many different messages.

You're right, I'll update the message.

mganter commented

2025-12-09 16:53:37 +00:00

@aahlenst wrote in #1122 (comment):

> The last message should state why the runner is shutting down.

Not quite the very last message, but I hope that helps (f0f24ec15a):

> go run main.go -c config daemon
INFO[2025-12-09T17:48:40+01:00] Starting runner daemon
INFO[2025-12-09T17:48:41+01:00] runner: test, with version: dev, with labels: [asd], declared successfully
^CINFO[2025-12-09T17:48:44+01:00] runner: received shutdown signal
INFO[2025-12-09T17:48:44+01:00] runner: test shutdown initiated, waiting [runner].shutdown_timeout=3h0m0s for running jobs to complete before shutting down

> go run main.go -c config daemon
INFO[2025-12-09T17:48:51+01:00] Starting runner daemon
INFO[2025-12-09T17:48:51+01:00] runner: test, with version: dev, with labels: [asd], declared successfully
INFO[2025-12-09T17:48:59+01:00] task 6 repo is giteaAdmin/asd https://data.forgejo.org http://localhost:3000
INFO[2025-12-09T17:48:59+01:00] runner: ephemeral runner shutting down after job has completed
INFO[2025-12-09T17:48:59+01:00] runner: test shutdown initiated, waiting [runner].shutdown_timeout=3h0m0s for running jobs to complete before shutting down

mganter force-pushed ephemeral-runners from abc79b83c8

to f0f24ec15a

2025-12-09 16:54:54 +00:00

aahlenst commented

2025-12-10 15:24:43 +00:00

@mganter wrote in #1122 (comment):

> I missed a bug where the registration was not sent properly. Fixed it in forgejo/runner@!1122 (commit 46db91068c) sorry about that

No worries. Works great now. Thanks a lot.

> This was intended as it allows users to run the daemon command, even though they registered their runner as ephemeral.
> If this is not implemented, the runner would loop in a 403 error.

I see two problems with the `ephemeral` bit as it's currently implemented:

* We create another situation where Forgejo Runner and Forgejo can disagree about a runner's configuration.
* We trade one usage error (`forgejo-runner daemon` with an ephemeral runner) for another (forgetting or accidentally adding `--ephemeral`). The former only impacts people with ephemeral runners and might even be easier to recognize than the latter because the `ephemeral` flag is in the somewhat obscure `.runner` file.

If avoiding the 403 error in a loop is the only reason for the `ephemeral` bit, I'd like to forgo it for the time being. Adding it later is easier than taking it out. If we recognize that it is a real problem, we should try to come up with a more robust solution where a disagreement between Forgejo Runner and Forgejo results in a hard error.

However, if there's another reason, for example, turning `daemon` into `one-job` that waits for a job to run, we should rather consider extending `one-job` with a `--wait` flag.
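To make the "hard error on disagreement" idea concrete, here is a rough sketch of what such a check could look like. Everything in it is an assumption for illustration: the `registration` type is a trimmed-down, hypothetical view of the `.runner` file, and the thread does not describe any API through which the runner would learn Forgejo's value of `ephemeral`, so `serverEphemeral` is simply passed in.

```go
package main

import "fmt"

// registration is a hypothetical, trimmed-down view of the .runner file.
type registration struct {
	Name      string
	Ephemeral bool
}

// checkEphemeral compares the local flag with the value Forgejo holds and
// returns an error when they differ. How the runner would obtain
// serverEphemeral is an assumption of this sketch.
func checkEphemeral(local registration, serverEphemeral bool) error {
	if local.Ephemeral == serverEphemeral {
		return nil
	}
	return fmt.Errorf(
		"runner %q: .runner file has ephemeral=%t but Forgejo has ephemeral=%t",
		local.Name, local.Ephemeral, serverEphemeral,
	)
}

func main() {
	// Mirrors the mismatch reported in this thread: the file says true,
	// while the database row says false.
	local := registration{Name: "forgejo-vm", Ephemeral: true}
	if err := checkEphemeral(local, false); err != nil {
		fmt.Println("fatal:", err)
	}
}
```

A check like this would turn a silent mismatch into a startup failure, at the cost of requiring the server to expose the flag in the first place.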

@mganter wrote in #1122 (comment):

> That's an interesting error. Do you have an example of a matching workflow file?

on:
  workflow_dispatch:
jobs:
  test:
    runs-on: "debian"
    steps:
      - name: Greet
        run: echo 'Hello World'

The label definition: debian:docker://node:24-trixie

Logs from Ubuntu 24.04 with the latest Docker:

$ ./forgejo-runner daemon
INFO[2025-12-10T15:12:47Z] Starting runner daemon                       
INFO[2025-12-10T15:12:47Z] runner: ephemeral, with version: v12.1.0+17-gf0f24ec1, with labels: [debian], declared successfully 
INFO[2025-12-10T15:14:38Z] task 1 repo is andreas/test https://data.forgejo.org http://192.168.178.62:3000 
INFO[2025-12-10T15:14:39Z] Cleaning up network for job test, and network name is: WORKFLOW-f315fe9928a983253749cc6cc9e5b48d 
WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:40Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:41Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:42Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:45Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:14:52Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:15:05Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T15:15:30Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
ERRO[2025-12-10T15:15:30Z] unable to send final job logs and status: All attempts fail:
#1: unauthenticated: unregistered runner
#2: unauthenticated: unregistered runner
#3: unauthenticated: unregistered runner
#4: unauthenticated: unregistered runner
#5: unauthenticated: unregistered runner
#6: unauthenticated: unregistered runner
#7: unauthenticated: unregistered runner
#8: unauthenticated: unregistered runner
#9: unauthenticated: unregistered runner
#10: unauthenticated: unregistered runner 
INFO[2025-12-10T15:15:30Z] runner: ephemeral runner shutting down after job has completed 
INFO[2025-12-10T15:15:30Z] runner: ephemeral shutdown initiated, waiting [runner].shutdown_timeout=0s for running jobs to complete before shutting down

Logs from Fedora 43 with rootless Podman:

$ DOCKER_HOST=unix:///run/user/1000/podman/podman.sock ./forgejo-runner daemon
INFO[2025-12-10T16:17:42+01:00] Starting runner daemon                       
INFO[2025-12-10T16:17:42+01:00] runner: ephemeral-local, with version: v12.1.0+17-gf0f24ec1, with labels: [debian], declared successfully 
INFO[2025-12-10T16:17:42+01:00] task 2 repo is andreas/test https://data.forgejo.org http://localhost:3000 
WARN[2025-12-10T16:17:46+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:46+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:47+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:47+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:48+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:48+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:49+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:49+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:50+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:50+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:51+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:51+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:52+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:52+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:53+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:53+01:00] ReportState error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:54+01:00] ReportLog error: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:54+01:00] ReportState error: unauthenticated: unregistered runner 
INFO[2025-12-10T16:17:54+01:00] Cleaning up network for job test, and network name is: WORKFLOW-ea10c47ff2da78dae16d3eec4462372a 
WARN[2025-12-10T16:17:54+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:55+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:55+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:55+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:56+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:17:58+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:18:01+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:18:07+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:18:20+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
WARN[2025-12-10T16:18:46+01:00] uploading final logs failed, but will be retried: unauthenticated: unregistered runner 
ERRO[2025-12-10T16:18:46+01:00] unable to send final job logs and status: All attempts fail:
#1: unauthenticated: unregistered runner
#2: unauthenticated: unregistered runner
#3: unauthenticated: unregistered runner
#4: unauthenticated: unregistered runner
#5: unauthenticated: unregistered runner
#6: unauthenticated: unregistered runner
#7: unauthenticated: unregistered runner
#8: unauthenticated: unregistered runner
#9: unauthenticated: unregistered runner
#10: unauthenticated: unregistered runner 
INFO[2025-12-10T16:18:46+01:00] runner: ephemeral runner shutting down after job has completed 
INFO[2025-12-10T16:18:46+01:00] runner: ephemeral-local shutdown initiated, waiting [runner].shutdown_timeout=0s for running jobs to complete before shutting down 
INFO[2025-12-10T16:18:46+01:00] forcing the jobs to shutdown                 
INFO[2025-12-10T16:18:46+01:00] all jobs have been shutdown                  
WARN[2025-12-10T16:18:46+01:00] runner: ephemeral-local cancelled in progress jobs during shutdown

However, it works flawlessly when forgejo-runner operates in host mode.
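In both container logs above, the final upload is attempted ten times with the same "unregistered runner" answer before the runner gives up (roughly 50 seconds in the Ubuntu run). As a side note, and not something proposed in this thread, a retry helper could distinguish errors that cannot succeed on retry from transient ones. The sketch below uses only the standard library; `retry`, `isPermanent`, `errUnregistered` and `errTimeout` are hypothetical names, backoff is left out, and it would only shorten the noise, not fix the lost final logs.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors: one that retrying cannot fix, one that it can.
var (
	errUnregistered = errors.New("unauthenticated: unregistered runner")
	errTimeout      = errors.New("timeout")
)

// isPermanent reports whether retrying is pointless for this error.
func isPermanent(err error) bool {
	return errors.Is(err, errUnregistered)
}

// retry runs op up to maxAttempts times and returns how many attempts were
// made. It stops early on success or on a permanent error.
func retry(maxAttempts int, op func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		lastErr = op()
		if lastErr == nil {
			return attempt, nil
		}
		if isPermanent(lastErr) {
			return attempt, fmt.Errorf("giving up after attempt %d: %w", attempt, lastErr)
		}
		// A real implementation would sleep with backoff here.
	}
	return maxAttempts, fmt.Errorf("all %d attempts failed: %w", maxAttempts, lastErr)
}

func main() {
	attempts, err := retry(10, func() error { return errUnregistered })
	fmt.Println(attempts, err)
}
```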

mganter commented

2025-12-11 15:34:43 +00:00

WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner

As far as I can tell, this is a race condition between RunDaemon() and Close(), which both call ReportLog() and ReportState() in internal/pkg/report/reporter.go. ReportState() takes the state it reports from the struct it is attached to (Reporter). If RunDaemon executes ReportState() at a point where the Reporter already considers the task finished, it tells Forgejo to deregister the runner. The ReportLog() call made from Close() afterwards then fails with the error above.

The log below comes from the fixed version with additional informational logging.
Notice that ReportState: result: RESULT_SUCCESS appears twice, with ReportLog: TaskId: 34, Index: 12, NoMore: true, Rows: 0 in between (NoMore: true indicates that it is called from the Close function).

INFO[2025-12-11T16:20:41+01:00] ReportLog: TaskId: 34, Index: 6, NoMore: false, Rows: 0
INFO[2025-12-11T16:20:41+01:00] ReportState: result: RESULT_UNSPECIFIED
INFO[2025-12-11T16:20:41+01:00] ReportState: state: id:34 started_at:{seconds:1765466420 nanos:52767000} steps:{}, outputs: map[]
INFO[2025-12-11T16:20:43+01:00] ReportLog: TaskId: 34, Index: 12, NoMore: false, Rows: 0
INFO[2025-12-11T16:20:43+01:00] ReportState: result: RESULT_SUCCESS
INFO[2025-12-11T16:20:43+01:00] Skipping ReportState to ensure ReportLog can be executed
INFO[2025-12-11T16:20:43+01:00] Cleaning up network for job deploy, and network name is: WORKFLOW-413bce77284d76012cbbd9fb355b9de1
INFO[2025-12-11T16:20:43+01:00] ReportLog: TaskId: 34, Index: 12, NoMore: true, Rows: 0
INFO[2025-12-11T16:20:43+01:00] ReportState: result: RESULT_SUCCESS
INFO[2025-12-11T16:20:43+01:00] ReportState: state: id:34 result:RESULT_SUCCESS started_at:{seconds:1765466420 nanos:52767000} stopped_at:{seconds:1765466443 nanos:184271000} steps:{result:RESULT_SUCCESS started_at:{seconds:1765466442 nanos:952339000} stopped_at:{seconds:1765466443 nanos:3085000} log_index:10 log_length:1}, outputs: map[]
INFO[2025-12-11T16:20:43+01:00] runner: ephemeral runner shutting down after job has completed
INFO[2025-12-11T16:20:43+01:00] runner: test shutdown initiated, waiting [runner].shutdown_timeout=0s for running jobs to complete before shutting down

This should solve the issue. !1122 (commit db9ad857f0)
Can you test this fix?

mganter commented

2025-12-11 15:41:47 +00:00

@aahlenst wrote in #1122 (comment):

$ DOCKER_HOST=unix:///run/user/1000/podman/podman.sock ./forgejo-runner daemon
INFO[2025-12-10T16:17:42+01:00] Starting runner daemon
INFO[2025-12-10T16:17:42+01:00] runner: ephemeral-local, with version: v12.1.0+17-gf0f24ec1, with labels: [debian], declared successfully
INFO[2025-12-10T16:17:42+01:00] task 2 repo is andreas/test https://data.forgejo.org http://localhost:3000
WARN[2025-12-10T16:17:46+01:00] ReportLog error: unauthenticated: unregistered runner
WARN[2025-12-10T16:17:46+01:00] ReportState error: unauthenticated: unregistered runner
WARN[2025-12-10T16:17:47+01:00] ReportLog error: unauthenticated: unregistered runner
WARN[2025-12-10T16:17:47+01:00] ReportState error: unauthenticated: unregistered runner

I haven't looked into this error yet, but the other fix might resolve it as well.

mganter commented

2025-12-12 10:28:33 +00:00

@aahlenst wrote in #1122 (comment):

I see two problems with the ephemeral bit as it's currently implemented:
* We create another situation where Forgejo Runner and Forgejo can disagree about a runner's configuration.

* We trade one usage error (`forgejo-runner daemon` with an ephemeral runner) for another (forgetting or accidentally adding `--ephemeral`). The former only impacts people with ephemeral runners and might even be easier to recognize than the latter because the `ephemeral` flag is in the somewhat obscure `.runner` file.
If avoiding the 403 error in a loop is the only reason for the ephemeral bit, I'd like to forego it for the time being. Adding it later is easier than taking it out. If we recognize that it is a real problem, we should try to come up with a more robust solution where a disagreement between Forgejo Runner and Forgejo results in a hard error.

I just saw that the response of Declare, which is executed before FetchTask, contains the Runner object as it exists on the Forgejo side. So I propose that we treat the server state as the source of truth and use it to configure the runner at execution time, without needing the ephemeral flag in .runner.
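A rough sketch of that proposal, under stated assumptions: `declareResponse` and `runnerInfo` are hypothetical stand-ins for the Declare response and its Runner object, reduced to the one field that matters here, and `run` is an invented helper rather than the runner's actual loop.

```go
package main

import "fmt"

// runnerInfo is a hypothetical stand-in for the Runner object that the
// server returns, reduced to the fields used in this sketch.
type runnerInfo struct {
	Name      string
	Ephemeral bool
}

// declareResponse is a hypothetical stand-in for the Declare response.
type declareResponse struct {
	Runner *runnerInfo
}

// isEphemeral reports the server's view. A missing Runner object (for
// example from an older server) is treated as "not ephemeral".
func isEphemeral(resp *declareResponse) bool {
	return resp != nil && resp.Runner != nil && resp.Runner.Ephemeral
}

// run executes the given tasks in order and stops after the first one
// if the server considers the runner ephemeral. It returns the number
// of tasks that were executed.
func run(resp *declareResponse, tasks []string) int {
	executed := 0
	for _, task := range tasks {
		fmt.Println("running", task)
		executed++
		if isEphemeral(resp) {
			break // the server invalidates the runner after one task
		}
	}
	return executed
}

func main() {
	resp := &declareResponse{Runner: &runnerInfo{Name: "test", Ephemeral: true}}
	fmt.Println("executed:", run(resp, []string{"task 1", "task 2"}))
}
```

Because the decision is derived from the server's answer on every start, nothing has to be stored in the `.runner` file and the two sides cannot drift apart.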

@aahlenst wrote in #1122 (comment):

However, if there's another reason, for example, turning daemon into one-job that waits for a job to run, we should rather consider extending one-job with a --wait flag.

This is a good idea, I will implement that.

aahlenst commented

2025-12-12 12:23:02 +00:00

@mganter wrote in #1122 (comment):

Can you test this fix?

Yep. Much better now, thanks a lot.

$ DOCKER_HOST=unix:///run/user/1000/podman/podman.sock ./forgejo-runner one-job
INFO[2025-12-12T13:10:12+01:00] Starting job                                 
INFO[2025-12-12T13:10:12+01:00] runner: test-runner, with version: v12.1.0+19-g4ab10261, with labels: [debian], declared successfully 
INFO[2025-12-12T13:10:12+01:00] task 4 repo is andreas/test https://data.forgejo.org http://192.168.178.62:3000 
INFO[2025-12-12T13:10:14+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:15+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:16+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:17+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:18+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:19+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:20+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:21+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:22+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:23+01:00] Skipping ReportState to ensure ReportLog can be executed 
INFO[2025-12-12T13:10:24+01:00] Cleaning up network for job test, and network name is: WORKFLOW-a465fb2e17fe2307f2d6f94589cd822f

It might make sense to downgrade Skipping ReportState to ensure ReportLog can be executed to debug because I don't think it's useful for the average user. It will still appear until Forgejo Runner no longer enables debug logging by default.

@mganter wrote in #1122 (comment):

I just saw that the response of Declare, which is executed before FetchTask, contains the Runner object present on Forgejo side.

That's great.

So I propose that we check the .runner file to be consistent with this, and crash otherwise. That way we rule out any user misconfiguration or accidental behaviour.

Is it not possible to store the ephemeral bit in the context and act accordingly without having to change the runner file?

mganter force-pushed ephemeral-runners from 8a9970d710 to b37dcc1776

2025-12-12 13:04:01 +00:00

mganter force-pushed ephemeral-runners from b37dcc1776 to 0c7c2c379f

2025-12-12 13:04:42 +00:00

mganter commented

2025-12-12 13:05:09 +00:00

@aahlenst wrote in #1122 (comment):

Is it not possible to store the ephemeral bit in the context and act accordingly without having to change the runner file?

I changed my comment just as you answered :D I implemented it in 0c7c2c379f so that the server side is taken as the source of truth.

mganter commented

2025-12-12 13:32:05 +00:00

@aahlenst wrote in #1122 (comment):

However, if there's another reason, for example, turning daemon into one-job that waits for a job to run, we should rather consider extending one-job with a --wait flag.

Implemented in !1122 (commit 539fba8315).

aahlenst commented

2025-12-12 14:18:33 +00:00

@mganter wrote in #1122 (comment):

I changed my comment just as you answered :D i implemented it in forgejo/runner@0c7c2c379f to take the server side as truth

Splendid! --wait works great, too.

aahlenst requested changes

2025-12-12 15:14:10 +00:00

internal/app/cmd/cmd.go Outdated

					
				@ -37,6 +37,7 @@ func Execute(ctx context.Context) {

					registerCmd.Flags().StringVar(&regArgs.Token, "token", "", "Runner token")

					registerCmd.Flags().StringVar(&regArgs.RunnerName, "name", "", "Runner name")

					registerCmd.Flags().StringVar(&regArgs.Labels, "labels", "", "Runner tags, comma separated")

					registerCmd.Flags().BoolVar(&regArgs.Ephemeral, "ephemeral", false, "Configure the runner to be ephemeral and only ever be able to pick a single job")

What about "instruct Forgejo to delete this runner after it has run one job"? Explains what ephemeral is about.

fixed

mganter marked this conversation as resolved

internal/app/cmd/cmd.go Outdated

					
				@ -56,2 +58,3 @@

						RunE:  runJob(ctx, &configFile),

						RunE:  runJob(ctx, &configFile, &runJobArgs),

					}

					jobCmd.Flags().BoolVarP(&runJobArgs.wait, "wait", "w", false, "waits until task has been assigned")

What about "wait for a task to run"?

fixed

mganter marked this conversation as resolved

internal/app/cmd/create-runner-file.go Outdated

					
				@ -41,6 +42,7 @@ func createRunnerFileCmd(ctx context.Context, configFile *string) *cobra.Command

					cmd.Flags().StringVar(&argsVar.Secret, "secret", "", "secret shared with the Forgejo instance via forgejo-cli actions register")

					_ = cmd.MarkFlagRequired("secret")

					cmd.Flags().StringVar(&argsVar.Name, "name", "", "Runner name")

					cmd.Flags().BoolVar(&argsVar.Ephemeral, "ephemeral", false, "Configure the runner to be ephemeral and only ever be able to pick a single job")

That's no longer needed, is it?

fixed

mganter marked this conversation as resolved

internal/app/cmd/create-runner-file.go Outdated

					
				@ -125,3 +128,2 @@

						//

						if err := ping(cfg, reg); err != nil {

							return err

						if args.Connect {

Can we do this separately and add a test for it? If you don't want to do it, that's okay, just let me know.

Removed the change; I'll do it in a separate PR.

mganter marked this conversation as resolved

internal/app/cmd/daemon_test.go Outdated

					
				@ -118,0 +122,4 @@

					mockPoller := mock_poller.NewPoller(t)

					mockPoller.On("PollOnce").Return(nil)

					pollTask(context.TODO(), mockPoller, true)

I have a question because I don't know and want to learn something: Doesn't t.Context() or context.Background() work here? If they don't, why?

That's something I missed; t.Context() is the correct one.

fixed

mganter marked this conversation as resolved

internal/app/cmd/daemon_test.go Outdated

					
				@ -118,0 +208,4 @@

					ctx := context.Background()

					tempDir := t.TempDir()

					runnerFile := tempDir + "/.runner"

filepath.Join(tempDir, ".runner")

fixed

mganter marked this conversation as resolved

internal/app/cmd/register.go Outdated

					
				@ -350,0 +355,4 @@

					reg.Ephemeral = resp.Msg.Runner.Ephemeral

					if inputs.Ephemeral != resp.Msg.Runner.Ephemeral {

						log.Error("poller: cannot register new runner as ephemeral upgrade Forgejo to enable this feature. The runner has been registered as not ephemeral.")

As --ephemeral can be seen as a security feature, this should be a hard error and not only a log message.

The message could be clearer: "cannot register runner as ephemeral because the Forgejo instance does not support it"

There should be an automated test, too.

I changed it to be an error. The only thing bothering me is that there is currently no way to deregister a runner with a registration token.

@mganter wrote in #1122 (comment):

The only thing bothering me is that there is currently no way to Deregister a runner with a registration token.

Is that related to the code in question?

The REST API can (soon) create and remove runners. Would that be an option for you?

#182 might be relevant.

I assume that a runner cannot interact with the HTTP endpoints using a runner registration token. So I would prefer a self-deregistration endpoint in the actions-proto spec to fix this in the future.

@mganter wrote in https://code.forgejo.org/forgejo/runner/pulls/1122#issuecomment-70222:

> I assume that a runner registration token cannot interact with the HTTP endpoints using a runner registration token.

That's correct. And it's also not something I'd like to add.

> So i would rather prefer a Self Deregistration endpoint in the [actions-proto spec](https://code.forgejo.org/forgejo/actions-proto), to fix it in the future.

I agree.

Do you have any opinions on the behaviour? I've started collecting requirements in https://code.forgejo.org/forgejo/runner/issues/182#issuecomment-69892.

aahlenst marked this conversation as resolved

internal/app/job/job.go Outdated

					
				@ -37,2 +38,3 @@

				func (j *Job) Run(ctx context.Context) error {

				func (j *Job) Run(ctx context.Context, wait bool) error {

					if wait {

Is there a test for it?

introduced tests for the Run function

aahlenst marked this conversation as resolved

internal/pkg/config/registration.go Outdated

					
				@ -26,0 +23,4 @@

					Token     string   `json:"token"`

					Address   string   `json:"address"`

					Labels    []string `json:"labels"`

					Ephemeral bool     `json:"-"`

`Ephemeral` is no longer needed, is it?

it is used on `forgejo-runner register`

aahlenst marked this conversation as resolved

internal/pkg/report/reporter_test.go Outdated

					
				@ -419,3 +419,3 @@

							testCase.fixture(t, reporter, client)

							err = reporter.ReportState()

							err = reporter.ReportState(true)

Is there a test that covers the `false` branch?

added test

aahlenst marked this conversation as resolved

mganter force-pushed ephemeral-runners from b3bb05ab3a to 628ddd6dcd

2026-02-09 10:46:06 +00:00

mganter force-pushed ephemeral-runners from 628ddd6dcd to a9582141ca

2026-02-09 10:48:39 +00:00

mganter commented

2026-02-09 10:50:44 +00:00

@aahlenst could you review [internal/app/poll/poller.go](https://code.forgejo.org/forgejo/runner/pulls/1122/files#diff-3192db22b957fdbf4d3b9063598cd41fd28eaacf) as this needed to be changed during rebase due to https://code.forgejo.org/forgejo/runner/commit/4b4e1dd75b8dc2cc44a72786bd4dbe950f559d87

mganter added 1 commit

2026-02-09 10:54:08 +00:00

fixed function call, fixed mock function

6be9b5a446

aahlenst approved these changes

2026-02-10 15:31:22 +00:00

aahlenst left a comment

Approved from a functional point of view. [Functional testing I performed today](https://codeberg.org/forgejo/forgejo/pulls/9962#issuecomment-10497110).

The comments have to be resolved before I hit the "Approve" button.

internal/app/cmd/cmd.go Outdated

					
				@ -37,6 +37,7 @@ func Execute(ctx context.Context) {

					registerCmd.Flags().StringVar(&regArgs.Token, "token", "", "Runner token")

					registerCmd.Flags().StringVar(&regArgs.RunnerName, "name", "", "Runner name")

					registerCmd.Flags().StringVar(&regArgs.Labels, "labels", "", "Runner tags, comma separated")

					registerCmd.Flags().BoolVar(&regArgs.Ephemeral, "ephemeral", false, "Instructs Forgejo to delete this runner after it has run one job")

`instruct Forgejo to delete this runner after it has run one job` (lowercase i, no s at the end of instruct)

mganter marked this conversation as resolved

internal/app/cmd/register.go Outdated

					
				@ -351,0 +356,4 @@

					reg.Ephemeral = resp.Msg.Runner.Ephemeral

					if inputs.Ephemeral != resp.Msg.Runner.Ephemeral {

						return fmt.Errorf("poller: cannot register new runner as ephemeral upgrade Forgejo to enable this feature. The runner has been registered as not ephemeral")

Is poller the right prefix here?

A bit more explicit: "poller: aborting because this Forgejo instance does not support ephemeral runners; requires Forgejo 15 or newer. Attention: Forgejo has created a normal runner."

Not great that Forgejo 14 still creates a normal runner, but I don't think we can do anything about it.

I would rather go for `poller: This Forgejo instance does not support ephemeral runners; requires Forgejo 15 or newer. Attention: Forgejo has created a normal runner.`

What do you think about that? I thought about a WARNING in front of it, but we already error-log this. And I would drop the word "aborting", because it's not really aborting the registration.

When I tried it (before reading the code), I wasn't sure what the outcome was. Has the registration been updated or not? I had to check the .runner file to see that it has not been updated.

I'm not attached to my message. I'm sure it can be improved. Whatever the improved message is, it should make it clear that nothing has been changed.

Currently, it results in the same behaviour as not being able to save the file, which is:

- the runner has been registered in forgejo server
- the configuration file has NOT been written, as we exit early out of that function
- the registration has a non-zero exit code

Not sure what the most reasonable approach is, as we cannot deregister a runner using the gRPC endpoints. We can either:

- write the config and exit normally, which would mislead careless users
- do not write the config and exit with an error code, which will leave a runner in forgejo behind

In my opinion, neither of these options is ideal.

The current approach ("do not write the config and exit with an error code, which will leave a runner in forgejo behind") is fine with me. We "only" need a message that is super clear about what happened.

is this ok?

> "This Forgejo instance does not support ephemeral runners; requires Forgejo 15 or newer. The runner was registered as a non-ephemeral runner instead. Please manually delete the runner '%s' from the Forgejo UI to avoid a stale runner entry"

👍 1

aahlenst marked this conversation as resolved

internal/app/poll/poller.go Outdated

					
				@ -102,0 +109,4 @@

					wg := &sync.WaitGroup{}

					// When we start a FetchTask, we'll be requesting (capacity - inProgressTasks) tasks from a remote and may receive

					// up to that number.  We can't perform multiple fetches simulanteously or else we could be overprovisioned for

`simultaneously` instead of `simulanteously`

😆 1

mganter marked this conversation as resolved

internal/app/poll/poller.go Outdated

					
				@ -102,0 +130,4 @@

				func (p *poller) pollForClient(limiter *rate.Limiter, client client.Client, capacity int64, fetchMutex chan any, taskVersions, inProgressTasks *atomic.Int64, wg *sync.WaitGroup, ephemeral bool) {

					if ephemeral && capacity > 1 {

						log.Infof("[poller] connot run ephemeral runner with more than 1 capacity")

cannot run ephemeral runner with capacity greater than 1

That's an invalid configuration. Would be good if we could catch that situation earlier and exit, perhaps in pollTask?

Ephemeral runner and multiple connections are mutually exclusive and a usage error. However, I don't know where that should be handled.

@mfenniak What do you think?

when running this in ephemeral mode, capacity is enforced by PollOnce to be 1

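either here (for daemon, https://code.forgejo.org/forgejo/runner/src/commit/1e2a8702c6d859ac584c42a77e66d1e4227f5a2a/internal/app/cmd/daemon.go#L89-L95):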

 func pollTask(ctx context.Context, poller poll.Poller, ephemeral bool) {
 	if ephemeral {
 		done := make(chan struct{})
 		go func() {
 			defer close(done)
 			poller.PollOnce()
 		}()

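or here (for one-job --wait, https://code.forgejo.org/forgejo/runner/src/commit/1e2a8702c6d859ac584c42a77e66d1e4227f5a2a/internal/app/job/job.go#L41-L46):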

 func (j *Job) Run(ctx context.Context, wait bool) error {
 	if wait {
 		poller := NewPoller(ctx, j.cfg, []client.Client{j.client}, j.runner)
 		poller.PollOnce()
 		return nil
 	}
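
The reviewer's suggestion to catch the invalid combination earlier could look something like the following pre-flight check. It is a sketch under assumptions: `validateEphemeral` and its parameters are hypothetical and not part of the runner, and where such a check would actually be called is exactly the open question in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// validateEphemeral is a hypothetical pre-flight check. It rejects
// configurations that the thread describes as usage errors for ephemeral
// runners, so the process can exit before any polling starts.
func validateEphemeral(ephemeral bool, capacity int64, connections int) error {
	if !ephemeral {
		return nil // normal runners may use any capacity
	}
	if capacity > 1 {
		return errors.New("cannot run ephemeral runner with capacity greater than 1")
	}
	if connections > 1 {
		return errors.New("cannot run ephemeral runner with more than one connection")
	}
	return nil
}

func main() {
	fmt.Println(validateEphemeral(true, 4, 1))  // rejected: capacity
	fmt.Println(validateEphemeral(true, 1, 1))  // accepted
	fmt.Println(validateEphemeral(false, 4, 2)) // accepted: not ephemeral
}
```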

aahlenst requested reviews from aahlenst, mfenniak

2026-02-10 15:31:32 +00:00

mganter added 1 commit

2026-02-10 15:49:14 +00:00

introduced feedback from PR

30d2072bf0

mganter added 1 commit

2026-02-10 15:57:28 +00:00

go vet

1e2a8702c6

mganter added 1 commit

2026-02-10 16:21:30 +00:00

updated error message

fb02eb3aa4

mfenniak reviewed

2026-02-11 03:08:12 +00:00

internal/app/cmd/register.go Outdated

					
				@ -348,6 +353,11 @@ func doRegister(ctx context.Context, cfg *config.Config, inputs *registerInputs)

					reg.UUID = resp.Msg.GetRunner().GetUuid()

					reg.Name = resp.Msg.GetRunner().GetName()

					reg.Token = resp.Msg.GetRunner().GetToken()

					reg.Ephemeral = resp.Msg.Runner.Ephemeral

`GetRunner().GetEphemeral()` should be used for `nil` safe access, to avoid panics from unexpected responses from the server.

fixed in https://code.forgejo.org/forgejo/runner/pulls/1122/commits/685940a99b3affbaf6859925aed4004099fc4e67

mganter marked this conversation as resolved
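
To illustrate why the getter chain is safer than direct field access, here is a small self-contained sketch. `Runner` and `RegisterResponse` are hand-written stand-ins that imitate the nil-receiver pattern of protobuf-generated Go getters; they are not the real `runnerv1` types.

```go
package main

import "fmt"

// Runner and RegisterResponse are hand-written stand-ins that imitate the
// shape of protobuf-generated Go types; they are not the real runnerv1 types.
type Runner struct {
	Ephemeral bool
}

type RegisterResponse struct {
	Runner *Runner
}

// GetRunner tolerates a nil receiver, as generated getters do.
func (r *RegisterResponse) GetRunner() *Runner {
	if r == nil {
		return nil
	}
	return r.Runner
}

// GetEphemeral returns the zero value when the receiver is nil.
func (r *Runner) GetEphemeral() bool {
	if r == nil {
		return false
	}
	return r.Ephemeral
}

func main() {
	resp := &RegisterResponse{} // the server omitted the runner field

	// resp.Runner.Ephemeral would panic here with a nil pointer dereference.
	fmt.Println(resp.GetRunner().GetEphemeral()) // prints "false"
}
```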

internal/app/poll/poller.go Outdated

					
				@ -99,3 +100,3 @@

				}

				func (p *poller) pollForClient(limiter *rate.Limiter, client client.Client, capacity int64, fetchMutex chan any, taskVersions, inProgressTasks *atomic.Int64, wg *sync.WaitGroup) {

				func (p *poller) PollOnce() {

This method is a lot of code to duplicate just to call pollForClient with slightly different arguments. Can you make this into a common implementation between Poll and PollOnce? eg. both invoke CommonPoll(...) and provide the overridden capacity and overridden ephemeral flag.

I'd also like to see unit tests implemented for the PollOnce capability here in poller.go please.

Refactored and created a test in https://code.forgejo.org/forgejo/runner/pulls/1122/commits/e101ab3f77d74852764cbaeac0950b7992f9b2e5

Would you mind double-checking?

mfenniak marked this conversation as resolved
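
The shape of the requested refactoring, two thin entry points delegating to one shared implementation, might be sketched as follows. This is a toy model, not the code from the commit: `commonPoll`, the `calls` field and the fixed capacity of 1 for `PollOnce` are assumptions based on the comments in this thread.

```go
package main

import "fmt"

// poller is a toy stand-in for the runner's poller; its fields are hypothetical.
type poller struct {
	capacity int64
	calls    []string // records how the shared implementation was invoked
}

// commonPoll holds the logic shared by both entry points. In the real runner
// this is where the per-client polling would be set up.
func (p *poller) commonPoll(capacity int64, ephemeral bool) {
	p.calls = append(p.calls, fmt.Sprintf("capacity=%d ephemeral=%t", capacity, ephemeral))
}

// Poll uses the configured capacity and runs as a normal runner.
func (p *poller) Poll() { p.commonPoll(p.capacity, false) }

// PollOnce overrides the capacity to 1 and marks the run as ephemeral.
func (p *poller) PollOnce() { p.commonPoll(1, true) }

func main() {
	p := &poller{capacity: 4}
	p.Poll()
	p.PollOnce()
	for _, c := range p.calls {
		fmt.Println(c)
	}
}
```

Recording the arguments, as the toy does, is also one way a unit test for `PollOnce` could assert that the overrides are applied.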

internal/app/poll/poller.go Outdated

					
				@ -102,0 +131,4 @@

				func (p *poller) pollForClient(limiter *rate.Limiter, client client.Client, capacity int64, fetchMutex chan any, taskVersions, inProgressTasks *atomic.Int64, wg *sync.WaitGroup, ephemeral bool) {

					if ephemeral && capacity > 1 {

						log.Infof("[poller] connot run ephemeral runner with more than 1 capacity")

						wg.Done()

`wg.Done()` doesn't belong here -- it's not the responsibility of `pollForClient` to call `wg.Done()`, but rather just to return when done.

I missed that wg.Go() also calls wg.Done() internally.

fixed

mganter marked this conversation as resolved

internal/pkg/report/reporter.go Outdated

					
				@ -413,2 +414,4 @@

					r.stateMu.RUnlock()

					if !noMore && state.Result != runnerv1.Result_RESULT_UNSPECIFIED {

						// skip final ReportState so ReportLog called from reporter.Close() can send its log before the job finishes

I don't understand this change and how it relates to the ephemeral runner concept. Can you explain it in more detail?

On Forgejo, we invalidate the runner once it sends a status with a specified result. Without that change, the runner still sends logs AFTER it has sent the final status report. This causes the behaviour you can see below.

To ensure all the data is sent before the final call, we skip all final status reports in Reporter.RunDaemon() and rely on the [Reporter.Close()](https://code.forgejo.org/forgejo/runner/src/commit/fb02eb3aa48822a026dc584407c8f6b230e91de1/internal/pkg/report/reporter.go#L320-L330) function.

@aahlenst found the issue in https://code.forgejo.org/forgejo/runner/pulls/1122#issuecomment-69605:

$ ./forgejo-runner daemon
INFO[2025-12-10T15:12:47Z] Starting runner daemon
INFO[2025-12-10T15:12:47Z] runner: ephemeral, with version: v12.1.0+17-gf0f24ec1, with labels: [debian], declared successfully
INFO[2025-12-10T15:14:38Z] task 1 repo is andreas/test https://data.forgejo.org http://192.168.178.62:3000
INFO[2025-12-10T15:14:39Z] Cleaning up network for job test, and network name is: WORKFLOW-f315fe9928a983253749cc6cc9e5b48d
WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner
WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner
WARN[2025-12-10T15:14:39Z] uploading final logs failed, but will be retried: unauthenticated: unregistered runner

Got it; I was concerned about this on the Forgejo side of reviewing this. 🤔 I am a little worried about this change, but I can't see a problem with it.

Just as reference for other ppl: https://codeberg.org/forgejo/forgejo/pulls/9962/files#issuecomment-10507690

mfenniak marked this conversation as resolved
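
The ordering problem explained in this conversation can be reproduced with a toy model. Everything below is hypothetical (`fakeServer`, `reporter` and their methods are invented for illustration); it only mimics the described behaviour, where the server rejects every call after it has received a final result, and the guard from the diff, where periodic state reports skip the final result so that closing the reporter can flush logs first.

```go
package main

import (
	"errors"
	"fmt"
)

// fakeServer imitates the described server behaviour: once a final result
// arrives, the ephemeral runner is invalidated and further calls fail.
type fakeServer struct {
	invalidated bool
	logs        []string
}

func (s *fakeServer) UpdateLog(lines []string) error {
	if s.invalidated {
		return errors.New("unauthenticated: unregistered runner")
	}
	s.logs = append(s.logs, lines...)
	return nil
}

func (s *fakeServer) UpdateTask(final bool) error {
	if s.invalidated {
		return errors.New("unauthenticated: unregistered runner")
	}
	if final {
		s.invalidated = true
	}
	return nil
}

// reporter is a toy model of the runner-side reporter.
type reporter struct {
	srv     *fakeServer
	pending []string // log lines not yet uploaded
	done    bool     // the job has a final result
}

// ReportState mirrors the guard from the diff: periodic calls (noMore=false)
// skip the report once the job has a final result.
func (r *reporter) ReportState(noMore bool) error {
	if !noMore && r.done {
		return nil
	}
	return r.srv.UpdateTask(r.done)
}

// Close flushes the remaining logs first and only then sends the final state.
func (r *reporter) Close() error {
	if err := r.srv.UpdateLog(r.pending); err != nil {
		return err
	}
	r.pending = nil
	return r.ReportState(true)
}

func main() {
	srv := &fakeServer{}
	r := &reporter{srv: srv, pending: []string{"last line"}, done: true}

	fmt.Println(r.ReportState(false)) // periodic tick, skipped: prints <nil>
	fmt.Println(r.Close())            // logs, then final state: prints <nil>
	fmt.Println(len(srv.logs))        // prints 1
}
```

Without the guard, the periodic tick would send the final result first and the log upload in `Close` would fail with the same "unregistered runner" error shown in the log excerpt above.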

mfenniak commented

2026-02-11 03:24:21 +00:00

Also of note -- the integration tests are failing with... well... a lack of useful error message. 👎 Boo, bad integration tests.

mganter added 2 commits

2026-02-13 09:48:07 +00:00

fixed: #1122/files \#issuecomment-77771 685940a99b

introduced common poll function between Poll and PollOnce

e101ab3f77

mfenniak commented

2026-02-14 01:33:46 +00:00

Needs a rebase/merge after other runner changes today, and the integration test failure needs to be resolved. Then I'll trigger the end-to-end tests to ensure no functional regression in those tests, but that's unlikely.

Otherwise I think this is good to go. 👍 At least one of the integration test failures (Open(/home/debian/.cache/actcache/bolt.db): timeout) may be fixed by the merge as #1373 encountered this error.

👀 1

mganter force-pushed ephemeral-runners from e101ab3f77 to 4cd390cf65

2026-02-16 10:37:54 +00:00

mganter force-pushed ephemeral-runners from 4cd390cf65 to ecbfef30fb

2026-02-16 11:08:41 +00:00

mganter commented

2026-02-16 11:10:42 +00:00

Btw, in act/container/docker_run_test.go, TestMergeJobOptions/Ignore fails on darwin, but I could not add an ignore condition, as importing goos is forbidden by golangci-lint typecheck.

mfenniak added the

run-end-to-end-tests

label

2026-02-16 18:11:04 +00:00

cascading-pr referenced this pull request from actions/setup-forgejo

2026-02-16 18:11:15 +00:00

cascading-pr from https://code.forgejo.org/forgejo/runner refs/pull/1122/head to forgejo/runner-1122 #888

cascading-pr commented

2026-02-16 18:11:16 +00:00

cascading-pr updated at https://code.forgejo.org/actions/setup-forgejo/pulls/888

mfenniak approved these changes