introducing ephemeral-runners #1122
Labels
No labels
FreeBSD
Kind/Breaking
Kind/Bug
Kind/Chore
Kind/DependencyUpdate
Kind/Documentation
Kind/Enhancement
Kind/Feature
Kind/Security
Kind/Testing
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
Windows
linux-powerpc64le
linux-riscv64
linux-s390x
run-end-to-end-tests
run-forgejo-tests
No milestone
No project
No assignees
9 participants
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo/runner!1122
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "mganter/runner:ephemeral-runners"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This PR introduces a flag for runner registration to tell forgejo that the runner token should be invalidated after executing one task. To prevent unauthorized looping in daemon mode, the runner will now terminated in deamon mode after one task.
Rel: https://codeberg.org/forgejo/forgejo/pulls/9962
Rel: https://codeberg.org/forgejo/forgejo/issues/9407
Big thanks to @ChristopherHX for implementing this in gitea
Rel: https://codeberg.org/forgejo/docs/pulls/1575/files
@ -215,4 +215,2 @@mock.TestingTCleanup(func())},) *mockCaches {This is a cosmetic change that does not belong.
oh yeah sry :O removed another file format as well
pipeline reformats this when executing make fmt. which is enforced by the pipeline
https://code.forgejo.org/forgejo/runner/actions/runs/11777/jobs/0/attempt/1
It looks good 👍 It needs careful review but overall this is great.
04692452e164150d7ecfCan someone describe to me how this feature is intended to be used... in what system architecture you'd be doing this? 🤔 All of the documentation is very technical about how it works, great to have... but I don't understand what the motivation is behind the attention in this area.
@mfenniak wrote in #1122 (comment):
The problem resides with hijacking runner tokens. Depending on the mapped level (forgejo, orga, repo), a hijacked runner token would allow to receive secrets and code from tasks that are currently pending by fetching new tasks.
By restricting the access for a given runner token to a single task, the token can only be used to fetch data related to this task.
This also implies that if you want to create new ephemeral runners, you need to register them first to retrieve a runner token and provide it to the runner.
So in contrast to one-job runners, ephemeral runners can only do one job with their token.
Maybe this comment might help you understand the problem a little more:
https://gitea.com/gitea/act_runner/issues/19#issuecomment-739221
I understand that. But why are you using
one-job?Just to be clear, I'm not asking this because I intend to negatively review this change. I understand the security motivation. But I don't understand, as I said, "in what system architecture you'd be doing this?"
We are currently using autoscaling runners on k8s using garm and host mode, because we dont have access to the privileges requried by dind.
To build containers, we have a rootless buildkitd sidecar which acts as builder for docker buildx.
That sounds interesting. Could you point to a description / infrastructure-as-code of how you do that? I'm also interested to know about the concrete problem you have in this particular context and how you plan to deploy the ephemeral feature once it is implemented.
The security improvement provided by ephemeral runners makes sense to me. But if it's a security enhancement, then should something be done to follow-up on preventing any ongoing usage of the insecure
one-jobcapability?@mfenniak wrote in #1122 (comment):
We could deprecate the feature or enforce the ephemeral registration for one-job executions.
Maybe ppl distribute runner tokens but no registration tokens. Which would require an adaption from their side. Altough i dont know anyone with such an environment.
@earl-warren wrote in #1122 (comment):
As mentioned above, we are using garm and k8s to autoscale forgejo runners. But we are providing them cross teams and dont want the teams to interfere with each other. Also, we want to prevent leakage of code or secrets required for deployments from other repositories through this hole.
The plan is that we use an adapted version of the garm-k8s-provider and garm's gitea integration to provision forgejo runner for our forgejo instance. Garm's k8s-provider will deploy runner pods in host mode with a buildkitd sidecar. Sadly i cannot share the code for this with you, as it is still closed source.
At some point we might tackle the forgejo integration into garm, but thats out of scope yet.
Is there a documentation or infrastructure as code I could read to understand how you are doing that? It is something the Forgejo infrastructure itself could benefit from actually. It is k8s based and 100% Infrastructure as Code https://codeberg.org/forgejo/k8s-cluster, which is presumably very similar to what you are doing.
I realize this may seem out of scope but it will go a long way to give substance to the concrete need for this ephemeral feature. Without such an example to look at, it feels like a solution for an abstract problem. I trust you when you write that it is concrete for you. And I'm just looking for a way to observe that concrete example.
@earl-warren wrote in #1122 (comment):
This is a rather incomplete example how it could look like, but it is still missing either dind or buildkitd and proper images:
Hope this helps a little!
@earl-warren wrote in #1122 (comment):
I just took a look into the repo, couldn't find a runner config there. But i really like the flux setup there 👍. Thanks @danielsy for the nice example. If i can find a some time during theweekend, i can provide a little more suffisticated example with autoscaling and buildkitd or dind.
Is there any contribution guide to the infrastructure repo? If not i will just provide some flux defintions and steps to configure it.
@mganter wrote in #1122 (comment):
There is none. Hence my interest 😁 It would be a matter of getting it right the first time instead of reworking something that's not quite right.
@danielsy would you also have a snippet in the same spirit for how the token is obtained?
And also... what would create this? Would a script polling the Forgejo API have to be created?
I'd like to know whether you've considered emulating GitHub's just-in-time runner for a repository.
Disclaimer, because it has caused confusion in the past: I'm not affiliated with anyone involved in this PR. I just happen to have similar interests.
While I'm happy about the introduction of ephemeral runners, I'm not sure whether the overall workflow is good. There does not seem to be a central tracking issue that discusses it (I'd appreciate any pointers). So it's very well possible that I am missing something.
What worries me is that this PR exist. It should not because whether a runner is ephemeral or not should only be controlled by Forgejo. There should be zero involvement of the runner.
On a high level, my ideal workflow would look like this:
forgejo-runner register), passing the authentication token and the labels (to allow rewriting "trixie" to "trixie:host" or something else), optionally the name. Then, it starts Forgejo Runner (roughlyforgejo-runner one-job). (Note: Label rewriting is incompatible with GitHub's JIT config.)one-job. Forgejo returns the previously registered run with the ID from step 3. Forgejo does not hand out the run to any other runner and prevents reuse of the authentication token.Advantages:
forgejo-runnerinhostmode which seems to be an objective of this PR.--ephemeraloption?Disadvantages:
Hey,
Just another interested user working on autoscaling CI.
I don't think this part exists yet.
There are two ways that GitHub does this
I made a note to check where this PR is at when I get back from vacation (yeah!) in two weeks.
@earl-warren wrote in #1122 (comment):
In Garm environment, the runner calls a garm specific endpoint and garm calls
POST /repos/{owner}/{repo}/actions/runners/registration-tokenPOST /orgs/{org}/actions/runners/registration-tokenPOST /admin/actions/runners/registration-tokenauthenticated via personal access token. The retrieved token can be used by the init container to perform the registration and the runner can act upon this without having the registration token or any other secret than the runner token.
@earl-warren wrote in #1122 (comment):
As soon as https://codeberg.org/forgejo/forgejo/pulls/9803 is integrated, forgejo can notify the orchestration tool (e.g. garm) to scale up workers.
@dharsanb wrote in #1122 (comment):
I created a PR solving this https://codeberg.org/forgejo/forgejo/pulls/9803
@aahlenst wrote in #1122 (comment):
I guess we just move the problem from the runner to the autoscaler.
As forgejo is a multi tenant environment, we need to take into account that we have multiple autoscalers for a specific repo.
E.g. there is a global autoscaler the whole platform and there are possibly org autoscalers and repo autoscalers, which serve sometimes runners with the same tags, sometimes with different ones.
I think your idea a valuable thing to achieve, but i would rather introduce a attribute to the workflow file, to assign tasks to the different runner types (global/org/repo).
I dont think so, as the jobs are bound to a runner as soon as the runner fetches a job. (already pre-ephmeral)
I dont really know what you mean by that, as forgejo always enforces all the rules.
In my mind I differ between runner registration and job execution. They dont need to be the same entity. In your example (step 3), this can also be done by the autoscaler.
As you can see in @danielsy comment, only the GITEA_RUNNER_FILE needs to be exposed to the job execution.
When the runner does not know about its ephemeral state, as of now it will be stuck in a 401 loop when run in daemon mode.
@aahlenst wrote in #1122 (comment):
👍 thats a win for sure
@aahlenst wrote in #1122 (comment):
I just looked it up, and to be honest, i don't really understand this endpoint.
Generates a configuration that can be passed to the runner application at startup.is kind of ambigous. Did you already try it out? Do you have a nice resource for getting me up to speed?This one is about ephemeral runners, maybe we can continue the discussion there.
forgejo/forgejo-actions-feature-requests#43
The runner is on a potentially attacker-controlled machine. Therefore, moving the problem somewhere else sounds good to me.
I believe we're talking about different things here. Let's say two jobs with equal labels are pending in a single repository, 1 and 2. In response, the autoscaler provisions two containers/VMs/whatever, A (for 1) and B (for 2). Right now, you cannot guarantee that 1 will actually end up on A and 2 on B.
If job 1 has a superset of labels of job 2, it might even happen that 2 ends up on A and 1 cannot run because B is missing labels. A workaround would be to ensure that label sets are unique, but that shifts the responsibility to the workflow author and allows PR authors to cause havoc.
If the Forgejo API exposes that information, it would be possible to figure out on which container/VM/whatever a job has ended up by matching runner names. But for that to work, a job must be successfully assigned to a runner. If something goes wrong earlier (see previous example), you're out of luck.
Is there currently an API to do this? And if there is, why should the runner also be able to register itself as ephemeral runner?
That is very nice solution for the tool that you're using. For somebody using, let's say OpenStack, that would require launching multiple VMs after another and transferring data between them. That wastes a lot of resources and takes forever.
I don't have a link ready, but I'm using it with GitHub. In short: It registers an ephemeral runner and returns Base64-encoded JSON. That JSON contains all configuration information for the runner. You can pass that directly to
actions-runner/run.sh --jitconfigwhich immediately starts running the job. No separate configuration is necessary and it avoids all quoting problems. But it might be tricky to do that with Forgejo's label rewriting.Might be difficult now. We'd lose all people that are following this issue.
I've thought some more about it.
I believe that using
forgejo-runner registerisn't beneficial in this case due to its reliance on the runner registration token. If the token leaks, adversaries can create new runners. Preventing it from leaking is not always easy. For example, people that wanted to start an ephemeral runner using cloud-init couldn't do so safely because cloud-init data sources are usually accessible during the lifetime of a machine.Using the offline registration token with
forgejo-runner create-runner-filelooks more promising because it is bound to a single runner. Therefore, it could be invalidated immediately after the runner has connected to Forgejo.forgejo-runner one-jobor evenforgejo-runner daemonwould still work and no changes to Forgejo Runner itself are necessary. I'm not opposed to changing Forgejo Runner if there's a compelling reason.I'm much less clear on the changes required in Forgejo itself and what APIs to introduce. Endpoints for runner creation based on the offline registration token would certainly be necessary.
I have compiled a lengthy document about ephemeral runners that I use for research. It could serve as a basis for a design discussion. I'm willing to share it if anyone wants to read it. Just tell me where I should put it.
First, i guess we still dont understand each other with the registration procedure.
In this graphic, you can see that the registration process, in your case it might be the thing that provisions a VM, can register the runner. After that, the one that made the registration can share the
runner.jsonwith the runner.@aahlenst wrote in #1122 (comment):
Fair point, but I would consider this a problem of the autoscaler, as the autoscaler should be able to see that one job is still pending and has not a matching runner.
@aahlenst wrote in #1122 (comment):
As mentioned above, you need something that scales your VMs that can also register the runner. The API call is currently implemented as GRPC call with this message.
@aahlenst wrote in #1122 (comment):
In my opinion, offline registration and remote runner registration that is not executed on the runner host and the grpc call, are equivalent.
So there is actually already an endpoint doing that.
Garm does have an provider for Openstack and LXD that you could use as an inspiration.
In all the cases the registration token is not exposed to the actual runner.
@mganter wrote in #1122 (comment):
I understand that this is a possibility. It works well for GARM with containers. It might work well for some of the other tools out there.
But: Why not make it easier, more accessible, and safer by default for everybody? Why require running a separate binary somewhere else instead of an API call? What are the advantages of your solution? What would you lose by not using
forgejo-runner register?The autoscaler has provisioned something useless, compute time was wasted, and some precious resource like a GPU that the assigned job does not need (but the waiting one) is suddenly occupied.
Why not prevent it from happening in the first place? Why not make it more efficient and foolproof by default?
All problems I pointed out can somehow be solved without changing the current proposal. But we are talking about a new feature. So, why don't we try to come up with the absolute best possible solution as long as it does not require a rewrite of Forgejo? Why settle for less? It's unlikely we get a do-over soon. Also, keep in mind that people are great at using tools in unforeseen ways. Therefore, the number of problems with any solution will only go up.
To be clear, I don't expect that you implement any of the improvements we come up with.
@kindlerw wrote in #1122 (comment):
I scrolled over the OpenStack provider. I can see how machines are provisioned and configuration is provided to the metadata service. I don't see there how runners are started. I found https://github.com/cloudbase/garm-provider-openstack/blob/main/vendor/github.com/cloudbase/garm-provider-common/cloudconfig/templates.go in there which does pretty much what I would expect: Fetch credentials from the metadata service, including the JIT config. Alternatively, it uses a normal runner token and configures the runner. If that script isn't executed on the same VM as the runner, where does it happen?
After extensive discussions with @mganter and colleagues, our conclusion1 is:
forgejo-runner registerin a separate container) andforgejo-runner registerdoes not require additional plumbing in that particular case.forgejo-runner registerwith an HTTP API call and using the offline registration token greatly simplifies developing integration while removing the risks posed byforgejo-runner register.A new feature request will be filed for the second option.
Then, there's the problem of job binding. I have filed a separate feature request for that. It is slightly different than what was initially discussed because I discovered some flaws (the original proposal is still listed as an alternative).
From my side, all questions are resolved.
Please correct me if I am misrepresenting anything. ↩︎
e47d3908a732ea44cab3I gave it a whirl and encountered some problems.
The error "cannot register new runner as ephemeral upgrade Forgejo to gain security, one-job will be used automatically" contains too many different messages. While "cannot register new runner as ephemeral" is indeed an error, "upgrade Forgejo to gain security" seems unrelated. one-job will be used automatically" (one space too many between used and automatically) should be
WARNor evenINFO. I also find the message "one-job will be used automatically" confusing.The resulting
.runnerfile:Unfortunately, Forgejo does not recognize the runner as ephemeral:
Forgejo Runner is now in a weird state. I think the registration error should be fatal.
When I switch the runner manually to be ephemeral in the database, it works. However, the jobs do not terminate cleanly:
The output of
forgejo-runner daemonis confusing when the runner is configured to be ephemeral:The last message should state why the runner is shutting down.
Offline registration works well 👍
The help messages for the
--ephemeralflag could be improved. It states: "Configure the runner to be ephemeral and only ever be able to pick a single job." What about "instruct Forgejo to delete this runner after it has run one job"?@aahlenst Thanks for doing a functional review on this.
I've had this item sitting in my inbox with the intent to review it for a while, and I apologize for not getting to it. I'm a little skeptical of the overall architecture and that's caused me to drag my feet... but honestly it's not an problem area that I've tackled for the runner, so my skepticism isn't warranted. So I'll offer some clarity on my thoughts: if it gets a ✅ functional and code review from aahlenst, then I'm happy to get this merged and supported.
@mfenniak wrote in #1122 (comment):
I'd still like to read it. When I started to think about the feature a couple of months back, my ideas looked very different than today.
Thanks a lot for your confidence, but I'll need some help. For example, my Go skills aren't good enough yet to asses any substantial PR.
@mganter I forgot to ask: Is there a particular reason for the presence of the
ephemeralbit in the runner file and for altering the behaviour ofdaemonin case of its presence?ephemeraltogether withdaemonfeels wrong and causes complexity. There's also the risk of Forgejo and Runner disagreeing about the value ofephemeral. So I'd like to explore whether we could live without it or, if necessary, replace it.@aahlenst wrote in #1122 (comment):
No problem, I'll be available for a detailed code review when appropriate, just let me know.
@aahlenst and @mfenniak thanks for reviewing
@aahlenst wrote in #1122 (comment):
I missed a bug where the registration was not sent properly. Fixed it in !1122 (commit
46db91068c) sorry about that@aahlenst wrote in #1122 (comment):
This was intended as it allows users to run the deamon command, even though they registered their runner as ephemeral.
If this is not implemented, the runner would loop in a 403 error.
Similiar to this (even though i executed the daemon command twice in this example):
@aahlenst wrote in #1122 (comment):
thats an interesting error, do you have an example for a matching workflow file?
@aahlenst wrote in #1122 (comment):
You're right, I update the message
@aahlenst wrote in #1122 (comment):
Not quite the last last message but i hope that helps (
f0f24ec15a)abc79b83c8f0f24ec15a@mganter wrote in #1122 (comment):
No worries. Works great now. Thanks a lot.
I see two problems with the
ephemeralbit as it's currently implemented:forgejo-runner daemonwith an ephemeral runner) for another (forgetting or accidentally adding--ephemeral). The former only impacts people with ephemeral runners and might even be easier to recognize than the latter because theephemeralflag is in the somewhat obscure.runnerfile.If avoiding the 403 error in a loop is the only reason for the
ephemeralbit, I'd like to forego it for the time being. Adding it later is easier than taking it out. If we recognize that it is a real problem, we should try to come up with a more robust solution where a disagreement between Forgejo Runner and Forgejo results in an hard error.However, if there's another reason, for example, turning
daemonintoone-jobthat waits for a job to run, we should rather consider extendingone-jobwith a--waitflag.@mganter wrote in #1122 (comment):
The label definition:
debian:docker://node:24-trixieLogs from Ubuntu 24.04 with the latest Docker:
Logs from Fedora 43 with rootless Podman:
However, it works flawlessly when
forgejo-runneroperates inhostmode.In my latest insights, this is a race condition between RunDaemon() and Close() calling ReportLog() and ReportState() in
internal/pkg/report/reporter.go. ReportState() draws its to be reported state from the struct it is attached to (Reporter). If RunDaemon executes ReportState() when the Reporter would consider the task finished, it will tell forgejo to deregister the runner. After that, the ReportLog() called from the Close() function, resulting in the error above.The log below originates from the fix version with more informational logs.
Notice, that
ReportState: result: RESULT_SUCCESSappears twice, and in betweenReportLog: TaskId: 34, Index: 12, NoMore: true, Rows: 0(NoMore: true indicates that it is called from the Close function)This should solve the issue. !1122 (commit
db9ad857f0)Can you test this fix?
@aahlenst wrote in #1122 (comment):
I didnt dive into this error, yet. But the other fix might fix this as well.
@aahlenst wrote in #1122 (comment):
I just saw that the response of
Declare, which is executed beforeFetchTask, contains theRunnerobject present on Forgejo side. So I propose that we should use the server state as truth, and use it to configure the runner during execution time, without needing theephemeralflag in.runner.@aahlenst wrote in #1122 (comment):
This is a good idea, I will implement that.
@mganter wrote in #1122 (comment):
Yep. Much better now, thanks a lot.
It might make sense to downgrade
Skipping ReportState to ensure ReportLog can be executedto debug because I don't think it's useful for the average user. It will still appear until Forgejo Runner no longer enables debug logging by default.@mganter wrote in #1122 (comment):
That's great.
Is it not possible to store the ephemeral bit in the context and act accordingly without having to change the runner file?
8a9970d710b37dcc1776b37dcc17760c7c2c379f@aahlenst wrote in #1122 (comment):
I changed my comment just as you answered :D i implemented it in
0c7c2c379fto take the server side as truth@aahlenst wrote in #1122 (comment):
implemented in !1122 (commit
539fba8315)@mganter wrote in #1122 (comment):
Splendid!
--waitworks great, too.@ -37,6 +37,7 @@ func Execute(ctx context.Context) {registerCmd.Flags().StringVar(®Args.Token, "token", "", "Runner token")registerCmd.Flags().StringVar(®Args.RunnerName, "name", "", "Runner name")registerCmd.Flags().StringVar(®Args.Labels, "labels", "", "Runner tags, comma separated")registerCmd.Flags().BoolVar(®Args.Ephemeral, "ephemeral", false, "Configure the runner to be ephemeral and only ever be able to pick a single job")What about "instruct Forgejo to delete this runner after it has run one job"? Explains what
ephemeralis about.fixed
@ -56,2 +58,3 @@RunE: runJob(ctx, &configFile),RunE: runJob(ctx, &configFile, &runJobArgs),}jobCmd.Flags().BoolVarP(&runJobArgs.wait, "wait", "w", false, "waits until task has been assigned")What about "wait for a task to run?
fixed
@ -41,6 +42,7 @@ func createRunnerFileCmd(ctx context.Context, configFile *string) *cobra.Commandcmd.Flags().StringVar(&argsVar.Secret, "secret", "", "secret shared with the Forgejo instance via forgejo-cli actions register")_ = cmd.MarkFlagRequired("secret")cmd.Flags().StringVar(&argsVar.Name, "name", "", "Runner name")cmd.Flags().BoolVar(&argsVar.Ephemeral, "ephemeral", false, "Configure the runner to be ephemeral and only ever be able to pick a single job")That's no longer needed, is it?
fixed
@ -125,3 +128,2 @@//if err := ping(cfg, reg); err != nil {return errif args.Connect {Can we do this separately and add a test for it? If you don't want to do it, that's okay, just let me know.
removed the change to do it in other pr
@ -118,0 +122,4 @@mockPoller := mock_poller.NewPoller(t)mockPoller.On("PollOnce").Return(nil)pollTask(context.TODO(), mockPoller, true)I have a question because I don't know and want to learn something: Doesn't
t.Context()orcontext.Background()work here? If they don't, why?thats something i missed
t.Context()is the correct one.fixed
@ -118,0 +208,4 @@ctx := context.Background()tempDir := t.TempDir()runnerFile := tempDir + "/.runner"filepath.Join(tempDir, ".runner")fixed
@ -350,0 +355,4 @@reg.Ephemeral = resp.Msg.Runner.Ephemeralif inputs.Ephemeral != resp.Msg.Runner.Ephemeral {log.Error("poller: cannot register new runner as ephemeral upgrade Forgejo to enable this feature. The runner has been registered as not ephemeral.")As
--ephemeralcan be seen as a security feature, this should be a hard error and not only a log message.The message could be clearer: "cannot register runner as ephemeral because the Forgejo instance does not support it"
There should be an automated test, too.
i changed it to be an error. The only thing bothering me is that there is currently no way to Deregister a runner with a registration token.
@mganter wrote in #1122 (comment):
Is that related to the code in question?
The REST API can (soon) create and remove runners. Would that be an option for you?
#182 might be relevant.
I assume that a runner registration token cannot interact with the HTTP endpoints using a runner registration token. So i would rather prefer a Self Deregistration endpoint in the actions-proto spec, to fix it in the future.
@mganter wrote in #1122 (comment):
That's correct. And it's also not something I'd like to add.
I agree.
Do you have any opinions on the behaviour? I've started collecting requirements in #182 (comment).
@ -37,2 +38,3 @@func (j *Job) Run(ctx context.Context) error {func (j *Job) Run(ctx context.Context, wait bool) error {if wait {Is there a test for it?
intoduced tests for Run function
@ -26,0 +23,4 @@Token string `json:"token"`Address string `json:"address"`Labels []string `json:"labels"`Ephemeral bool `json:"-"`Ephemeralis no longer needed, is it?it is used on
forgejo-runner register@ -419,3 +419,3 @@testCase.fixture(t, reporter, client)err = reporter.ReportState()err = reporter.ReportState(true)Is there a test that covers the
falsebranch?added test
b3bb05ab3a628ddd6dcd628ddd6dcda9582141ca@aahlenst could you review internal/app/poll/poller.go as this needed to be changed during rebase due to
4b4e1dd75bApproved from a functional point of view. Functional testing I performed today.
The comments have to be resolved before I hit the "Approve" button.
@ -37,6 +37,7 @@ func Execute(ctx context.Context) {registerCmd.Flags().StringVar(®Args.Token, "token", "", "Runner token")registerCmd.Flags().StringVar(®Args.RunnerName, "name", "", "Runner name")registerCmd.Flags().StringVar(®Args.Labels, "labels", "", "Runner tags, comma separated")registerCmd.Flags().BoolVar(®Args.Ephemeral, "ephemeral", false, "Instructs Forgejo to delete this runner after it has run one job")instruct Forgejo to delete this runner after it has run one job(lowercase i, no at the end of instruct)@ -351,0 +356,4 @@reg.Ephemeral = resp.Msg.Runner.Ephemeralif inputs.Ephemeral != resp.Msg.Runner.Ephemeral {return fmt.Errorf("poller: cannot register new runner as ephemeral upgrade Forgejo to enable this feature. The runner has been registered as not ephemeral")Is
pollerthe right prefix here?A bit more explicit: "poller: aborting because this Forgejo instance does not support ephemeral runners; requires Forgejo 15 or newer. Attention: Forgejo has created a normal runner."
Not great that Forgejo 14 still creates a normal runner, but I don't think we can do anything about it.
i would rather go for
poller: This Forgejo instance does not support ephemeral runners; requires Forgejo 15 or newer. Attention: Forgejo has created a normal runner.what do you think about that? I thought about a WARNING in front of it, but we already error log this. And i would drop the word aborting, because it's not really aborting the registration
When I tried it (before reading the code), I wasn't sure what the outcome was. Has the registration been updated or not? I had to check the
.runnerfile to see that it has not been updated.I'm not attached to my message. I'm sure it can be improved. Whatever the improved message is, it should make it clear that nothing has been changed.
currently, it results in the same behaviour as not being able to save the file, which is.
not sure what the most reasonable approach is, as we cannot deregister a runner using the grpc endpoints. we can either:
in my opinion, neither one of these options is ideal
The current approach ("do not write the config and exit with an error code, which will leave a runner in forgejo behind") is fine with me. We "only" need a message that is super clear about what happened.
is this ok?
@ -102,0 +109,4 @@wg := &sync.WaitGroup{}// When we start a FetchTask, we'll be requesting (capacity - inProgressTasks) tasks from a remote and may receive// up to that number. We can't perform multiple fetches simulanteously or else we could be overprovisioned forsimultaneouslyinstead ofsimulanteously@ -102,0 +130,4 @@func (p *poller) pollForClient(limiter *rate.Limiter, client client.Client, capacity int64, fetchMutex chan any, taskVersions, inProgressTasks *atomic.Int64, wg *sync.WaitGroup, ephemeral bool) {if ephemeral && capacity > 1 {log.Infof("[poller] connot run ephemeral runner with more than 1 capacity")cannot run ephemeral runner with capacity greater than 1That's an invalid configuration. Would be good if we could catch that situation earlier and exit, perhaps in
pollTask?Ephemeral runner and multiple connections are mutually exclusive and a usage error. However, I don't know where that should be handled.
@mfenniak What do you think?
when running this in ephemeral mode, capacity is enforced by PollOnce to be 1
either here (for daemon):
func pollTask(ctx context.Context, poller poll.Poller, ephemeral bool) {if ephemeral {done := make(chan struct{})go func() {defer close(done)poller.PollOnce()}()or here (for one-job --wait):
func (j *Job) Run(ctx context.Context, wait bool) error {if wait {poller := NewPoller(ctx, j.cfg, []client.Client{j.client}, j.runner)poller.PollOnce()return nil}@ -348,6 +353,11 @@ func doRegister(ctx context.Context, cfg *config.Config, inputs *registerInputs)reg.UUID = resp.Msg.GetRunner().GetUuid()reg.Name = resp.Msg.GetRunner().GetName()reg.Token = resp.Msg.GetRunner().GetToken()reg.Ephemeral = resp.Msg.Runner.EphemeralGetRunner().GetEphemeral()should be used fornilsafe access, to avoid panics from unexpected responses from the server.fixed in !1122 (commit
685940a99b)@ -99,3 +100,3 @@}func (p *poller) pollForClient(limiter *rate.Limiter, client client.Client, capacity int64, fetchMutex chan any, taskVersions, inProgressTasks *atomic.Int64, wg *sync.WaitGroup) {func (p *poller) PollOnce() {This method is a lot of code to duplicate just to call
pollForClientwith slightly different arguments. Can you make this into a common implementation betweenPollandPollOnce? eg. both invokeCommonPoll(...)and provide the overridden capacity and overridden ephemeral flag.I'd also like to see unit tests implemented for the
PollOncecapability here inpoller.goplease.refactored and created test !1122 (commit
e101ab3f77)would you mind to double check
@ -102,0 +131,4 @@func (p *poller) pollForClient(limiter *rate.Limiter, client client.Client, capacity int64, fetchMutex chan any, taskVersions, inProgressTasks *atomic.Int64, wg *sync.WaitGroup, ephemeral bool) {if ephemeral && capacity > 1 {log.Infof("[poller] connot run ephemeral runner with more than 1 capacity")wg.Done()wg.Done()doesn't belong here -- it's not the responsibility ofpollForClientto callwg.Done(), but rather just to return when done.I missed that wg.Go() also calls wg.Done() internally.
fixed
@ -413,2 +414,4 @@r.stateMu.RUnlock()if !noMore && state.Result != runnerv1.Result_RESULT_UNSPECIFIED {// skip final ReportState so ReportLog called from reporter.Close() can send its log before the job finishesI don't understand this change and how it relates to the ephemeral runner concept. Can you explain it in more detail?
On Forgejo, we do remove invalidate the runner, once it sends a status with a specified result. Without that change, the runner still sends logs AFTER it send the final status report. This causes the behaviour you can see below.
To ensure all the data is sent before sending the final call. We skip all final status reports in Reporter.RunDaemon() and we rely on the Reporter.Close() function.
@aahlenst found the issue in #1122 (comment):
Got it; I was concerned about this on the Forgejo side of reviewing this. 🤔 I am a little worried about this change, but I can't see a problem with it.
Just as reference for other ppl: https://codeberg.org/forgejo/forgejo/pulls/9962/files#issuecomment-10507690
Also of note -- the integration tests are failing with... well... a lack of useful error message. 👎 Boo, bad integration tests.
Needs a rebase/merge after other runner changes today, and the integration test failure needs to be resolved. Then I'll trigger the end-to-end tests to ensure no functional regression in those tests, but that's unlikely.
Otherwise I think this is good to go. 👍 At least one of the integration test failures (
Open(/home/debian/.cache/actcache/bolt.db): timeout) may be fixed by the merge as #1373 encountered this error.e101ab3f774cd390cf654cd390cf65ecbfef30fbbtw in act/container/docker_run_test.go TestMergeJobOptions/Ignore fails on darwin, but i could not add a ignore condition as importing goos is forbidden by golangci-lint typecheck
cascading-pr updated at actions/setup-forgejo#888
Thanks a lot @mganter for your great work and your patience.