bug: jobs are assigned to a Forgejo Runner, and fail without any logs being reported

mfenniak commented

2026-01-17 02:20:30 +00:00

Owner

Roughly starting today (Jan 16th), there have been multiple cases where Forgejo shows a job was in the running state but stuck in "Set up job", no logs were reported, and the job later fails after slightly longer than 1 hour. Examples:

The earliest incident noted on forgejo/forgejo is [this run](https://codeberg.org/forgejo/forgejo/actions/runs/131098/jobs/5/attempt/1), which is reported by the step action at a timestamp 2026-01-15 21:20:24 UTC; it's not clear if that's a start or an end time. Forgejo v14 was deployed on Codeberg at 2026-01-15 20:00:00 UTC, which is a pretty tight correlation with the beginning of these incidents.

Regarding the "job later fails after slightly longer than 1 hour" -- this is Codeberg's configured ZOMBIE_TASK_TIMEOUT. ~~This isn't clearly aligned with any known timeout. The forgejo/forgejo runners have a 2 hour job timeout, and a task often fails at a couple minutes past 2hr -- for example this [test-sqlite](https://codeberg.org/forgejo/forgejo/actions/runs/131532/jobs/10/attempt/1) failure appears to be an example of a known [Forgejo testing bug](https://codeberg.org/forgejo/forgejo/issues/10633), unrelated to this problem, but demonstrates a 2 hour job timeout. I'm not currently aware of anything that would trigger a 1 hour timeout.~~

Attached as runner-logs.txt is a set of runner logs related to the failure https://codeberg.org/forgejo/forgejo/actions/runs/131451/jobs/0/attempt/1 and possibly others.

Examples of affected jobs:

- https://codeberg.org/forgejo/forgejo/actions/runs/131358/jobs/11/attempt/2 marked as cancelled rather than failed
- https://codeberg.org/forgejo/forgejo/actions/runs/131451/jobs/11/attempt/2 -- more typical, "Set up job" ends up reported as success, but all other steps are fail and no logs
- Multiple jobs in the same run:
  - https://codeberg.org/forgejo/forgejo/actions/runs/131566/jobs/3/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131566/jobs/8/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131566/jobs/9/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131566/jobs/10/attempt/1
- Multiple jobs in the same run:
  - https://codeberg.org/forgejo/forgejo/actions/runs/131554/jobs/4/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131554/jobs/5/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131554/jobs/6/attempt/1
- https://codeberg.org/forgejo/forgejo/actions/runs/131534/jobs/11/attempt/1
- https://codeberg.org/forgejo/forgejo/actions/runs/131532/jobs/11/attempt/1
- Multiple jobs in the same run:
  - https://codeberg.org/forgejo/forgejo/actions/runs/131530/jobs/2/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131530/jobs/3/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131530/jobs/8/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131530/jobs/9/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131530/jobs/10/attempt/1
- Multiple jobs in the same run:
  - https://codeberg.org/forgejo/forgejo/actions/runs/131506/jobs/2/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131506/jobs/3/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131506/jobs/8/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131506/jobs/9/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131506/jobs/10/attempt/1
- Multiple jobs in the same run:
  - https://codeberg.org/forgejo/forgejo/actions/runs/131504/jobs/2/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131504/jobs/3/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131504/jobs/8/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131504/jobs/9/attempt/1
  - https://codeberg.org/forgejo/forgejo/actions/runs/131504/jobs/10/attempt/1
- Multiple jobs in the same run:
  - https://codeberg.org/forgejo/forgejo/actions/runs/131445/jobs/4/attempt/2
  - https://codeberg.org/forgejo/forgejo/actions/runs/131445/jobs/5/attempt/2
- https://codeberg.org/forgejo/forgejo/actions/runs/131436/jobs/2/attempt/1
- Entire run, all jobs: https://codeberg.org/forgejo/forgejo/actions/runs/131430/jobs/0/attempt/1
- Entire run, all jobs: https://codeberg.org/forgejo/forgejo/actions/runs/131353/jobs/0/attempt/1
- https://codeberg.org/forgejo/forgejo/actions/runs/131351/jobs/0/attempt/1
- https://codeberg.org/forgejo/forgejo/actions/runs/131350/jobs/0/attempt/1
- Multiple jobs in this run: https://codeberg.org/forgejo/forgejo/actions/runs/131253/jobs/4/attempt/1
- Multiple jobs in this run: https://codeberg.org/forgejo/forgejo/actions/runs/131225/jobs/2/attempt/1
- Multiple jobs in this run: https://codeberg.org/forgejo/forgejo/actions/runs/131215/jobs/4/attempt/1
- https://codeberg.org/forgejo/forgejo/actions/runs/131208/jobs/11/attempt/1
- Multiple jobs in this run: https://codeberg.org/forgejo/forgejo/actions/runs/131098/jobs/5/attempt/1

runner-logs.txt

8.9 KiB

mfenniak commented

2026-01-17 02:21:33 +00:00

Author

Owner

#1299 is additional diagnostics in the reporter that were identified as missing when reviewing the attached logs.

Although the logs indicate sporadic connection failures to Codeberg, (a) the number of incidents is far too high to be explained by connection failures, and (b) attempts to reproduce similar errors with an offline Forgejo did not result in any similar long-term impact or timed-out runs. The network disconnect errors feel like a red herring unrelated to the issue.


mfenniak commented

2026-01-17 02:56:35 +00:00

Author

Owner

time="2026-01-16T16:02:30Z" level=error msg="failed to fetch task" error="internal: pick task: CreateTaskForRunner: update run 2358339: run has changed"

This error would originate from...

- internal/app/poll/poller.go, line 163 in b8b64aa: log.WithError(err).Error("failed to fetch task")
  - Runner is attempting to fetch a task; received a server error. It will log, return false to indicate no jobs, and go into another polling loop.
- codeberg.org/forgejo/forgejo@3252fd5134/routers/api/actions/runner/runner.go (L162)
  - Forgejo attempted to PickTask and got an internal error. It will return the error. This must occur on the first picked task (for this specific error) of a potential multi-task pick, and so error handling is appropriate here to return internal server error.
  - If Forgejo had picked a second task, this error would be logged server-side and an internal server error would not be returned. Any previously picked tasks would be returned.
- codeberg.org/forgejo/forgejo@3252fd5134/services/actions/task.go (L32)
  - While trying to search for any available task for a runner, an error occurred. Error is returned.
  - db.WithTx is used here. The error that occurs in picking a task should cause a database transaction rollback, leaving the task available to be picked up by another runner.
  - PickTask (above) -> CreateTaskForRunner -> UpdateRunJobWithoutNotification (below) codeberg.org/forgejo/forgejo@3252fd5134/models/actions/task.go (L409)
    - Note: CreateTaskForRunner uses the old db.TxContext(ctx) transaction management.
- codeberg.org/forgejo/forgejo@3252fd5134/models/actions/run_job.go (L197)
  - CreateTaskForRunner -> UpdateRunJobWithoutNotification
  - The status, started, and stopped fields are being updated on the run because the job is being started.
  - Errors here are going to leave an action in the database which is assigned to a runner, but if the error prevents that from getting back to the runner it will be a zombie task.
- codeberg.org/forgejo/forgejo@3252fd5134/models/actions/run.go (L498)
  - This is an unconditional update on a record that already exists in the DB, so, the only reason that this will return 0 records updated is because ActionRun uses xorm's optimistic lock control (https://xorm.io/docs/chapter-06/1.lock/), which is implemented here: codeberg.org/forgejo/forgejo@3252fd5134/models/actions/run.go (L62)
  - So if two runners get jobs from the same run at the same time, that can cause the update to ActionRun to fail.
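To make the "run has changed" failure mode concrete, here is a minimal sketch of a version-based optimistic lock. It is a toy in-memory model, not Forgejo's or xorm's code: the `run`, `store`, `update`, and `errRunChanged` names are all hypothetical, and the only assumption taken from the analysis above is that an update affecting zero rows is reported as "run has changed".

```go
package main

import (
	"errors"
	"fmt"
)

// run is a toy stand-in for an ActionRun row. Version plays the role of the
// optimistic-lock column (hypothetical field name).
type run struct {
	ID      int64
	Status  string
	Version int
}

// store is a hypothetical in-memory table keyed by run ID.
type store map[int64]run

var errRunChanged = errors.New("run has changed")

// update emulates "UPDATE ... SET ..., version = version + 1
// WHERE id = ? AND version = ?". When the version no longer matches, zero rows
// would be affected, which is reported here as errRunChanged.
func (s store) update(r run) error {
	cur, ok := s[r.ID]
	if !ok || cur.Version != r.Version {
		return errRunChanged
	}
	r.Version++
	s[r.ID] = r
	return nil
}

func main() {
	s := store{1: {ID: 1, Status: "waiting", Version: 1}}

	// Two runners load the same run before either of them writes.
	a, b := s[1], s[1]

	a.Status = "running"
	fmt.Println("first update:", s.update(a)) // first update: <nil>

	b.Status = "running"
	fmt.Println("second update:", s.update(b)) // second update: run has changed
}
```

The second writer fails even though both writers intended the same change, which is the situation described for two runners picking jobs from the same run at the same time.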

So based upon this analysis... the "run has changed" error could be related to the problem, if:

- Codeberg's zombie task timeout is 1 hour. Forgejo's default configuration is 10 minutes.
  - ✅ Codeberg's zombie task timeout is 1 hour. codeberg.org/Codeberg-Infrastructure/build-deploy-forgejo@680f3a4222/etc/forgejo/conf/base.ini (L125-L126)
- We can experimentally verify that an error of this nature causes the observed outcome -- a zombie termination, with a "set up job" that is successful, every other step failed
  - ✅ This matches the observed behaviour. I've reproduced this by changing the runner so that it fetches a task in the poller and then never runs it -- it's not the same thing that is occurring, but it reproduces Forgejo assigning a job to a runner in the DB only. When Forgejo detects it as a zombie task, it performs a cancellation that looks identical to the observed cancellations.
- We can somehow explain how db.WithTx is affecting the database despite returning an error. A context escape (eg. using context.Background() inside the tx), a more serious problem like db.WithTx leaking data, or an infrastructure-level problem like Codeberg's MySQL not working with transactions (which could be a side-effect of something like Galera HA if there's a misconfiguration, maybe).
  - PickTask uses db.WithTx, and CreateTaskForRunner uses db.TxContext(). Both appear to be used correctly from code analysis. It isn't clear to me that the two of them interact and work together correctly.
  - There are changes here in Forgejo 14: https://codeberg.org/forgejo/forgejo/issues/10130 changed some of the transaction code in order to introduce AfterTx().
  - Experimental integration testing with MariaDB indicates that the combination of WithTx and TxContext works correctly to rollback the database. This doesn't eliminate the possibility that there's a problem here, but eliminates a straightforward "well it doesn't work".

If we can establish reason to believe it's the problem, there could be multiple fixes related:

- Whatever would be causing WithTx to not work as expected
- It would make sense for the entire "update the run started/stopped/status" section of code to retry on this concurrent modification state, refetching the run, performing the same conditional changes, and update statement. codeberg.org/forgejo/forgejo@3252fd5134/models/actions/run_job.go (L178-L198) Proposed patch: https://codeberg.org/forgejo/forgejo/pulls/10893

I'll dig into some of this a little further and see if I can provide some evidence to support or contradict this.
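The retry idea (refetch the run, reapply the same change, update again) could look roughly like the sketch below. This is an illustration under assumptions, not the content of the proposed patch: `run`, `store`, `update`, `updateWithRetry`, and the attempt limit are hypothetical names on a toy in-memory model.

```go
package main

import (
	"errors"
	"fmt"
)

// run and store are toy stand-ins for the run row and its table (hypothetical).
type run struct {
	ID      int64
	Status  string
	Version int
}

type store map[int64]run

var errRunChanged = errors.New("run has changed")

// update succeeds only if the caller saw the latest version of the row.
func (s store) update(r run) error {
	cur, ok := s[r.ID]
	if !ok || cur.Version != r.Version {
		return errRunChanged
	}
	r.Version++
	s[r.ID] = r
	return nil
}

// updateWithRetry refetches the run, reapplies mutate, and retries the update
// whenever a concurrent modification is detected, up to maxAttempts times.
func updateWithRetry(s store, id int64, maxAttempts int, mutate func(*run)) error {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		r, ok := s[id] // refetch the current state on every attempt
		if !ok {
			return fmt.Errorf("run %d not found", id)
		}
		mutate(&r)
		err := s.update(r)
		if err == nil {
			return nil
		}
		if !errors.Is(err, errRunChanged) {
			return err
		}
	}
	return fmt.Errorf("update run %d: gave up after %d attempts: %w", id, maxAttempts, errRunChanged)
}

func main() {
	s := store{1: {ID: 1, Status: "waiting", Version: 1}}
	calls := 0
	err := updateWithRetry(s, 1, 3, func(r *run) {
		calls++
		if calls == 1 {
			// Simulate another runner updating the run between our read and write.
			other := s[1]
			other.Status = "running"
			_ = s.update(other)
		}
		r.Status = "running"
	})
	fmt.Println(err, calls, s[1].Version) // <nil> 2 3
}
```

The first attempt loses the race and the second succeeds against the refetched row, so the caller never sees the conflict unless it persists for every attempt.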


image.png

15 KiB

mfenniak commented

2026-01-17 04:04:09 +00:00

Author

Owner

Although the analysis above has a gap or two, I'm proposing a patch to Forgejo that should eliminate the error: https://codeberg.org/forgejo/forgejo/pulls/10893. I think in the absence of more diagnostically relevant information, at the moment at least, it would be very useful if it was possible to deploy this patch to Codeberg to see if there is any change in the zombie tasks.

@Gusted Could you evaluate the linked PR for being cherry-picked into Codeberg to see if it addresses the issue documented here? It has only manual testing, and of course Forgejo's existing regression test suite (running). But the change isn't very complicated either. I'll continue work on this either way, but the information for or against it fixing the problem would be really nice. 🙂

(FYI, tagging @viceice since this is the issue he was raising today and I haven't tagged him on it yet -- lots more information and research here than we had earlier, still no conclusions)


mfenniak commented

2026-01-17 17:50:22 +00:00

Author

Owner

I've identified a small bug that corresponds to everything observed here, but it has such a small window of possibility that I'm not sure it's a likely culprit.

- FetchTask invokes PickTask once for the "first task" to be returned. If no tasks are found, then task remains nil.
  - codeberg.org/forgejo/forgejo@fdf4dfd2a5/routers/api/actions/runner/runner.go (L160)
- FetchTask continues through the number of capacity entries, picking more tasks.
  - Bug: if the first task to be picked was nil, it should not proceed through more capacity entries
  - codeberg.org/forgejo/forgejo@fdf4dfd2a5/routers/api/actions/runner/runner.go (L167-L170)
- The runner's poller will ignore any tasks in AdditionalTasks if the Task entry is nil
  - internal/app/poll/poller.go, lines 176 to 178 in b8b64aa:

        if resp.Msg.Task == nil {
            return nil, false
        }

This would perfectly correspond to the problem, as all the returned tasks would end up as zombies since the runner doesn't act on them. But it requires that all the impacted tasks appear in the narrow window between the first PickTask function call and the second PickTask function call. I'll prepare a patch for this, regardless. But it feels unlikely.

Patch in Forgejo: https://codeberg.org/forgejo/forgejo/pulls/10899
Patch in Runner: #1303
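The interaction described above can be sketched with a small model. Everything here is hypothetical and simplified (the `task` and `fetchResponse` types, the `fetchBuggy`/`fetchFixed` functions, and both poller functions are illustrative names, not the real Forgejo or Runner code); it only assumes the behaviour stated in this comment: the server keeps picking tasks after a nil first pick, and the old poller drops the whole response when `Task` is nil.

```go
package main

import "fmt"

// task and fetchResponse are simplified stand-ins for the real messages.
type task struct{ ID int64 }

type fetchResponse struct {
	Task            *task
	AdditionalTasks []*task
}

// fetchBuggy keeps picking for the remaining capacity even if the first pick
// found nothing, so tasks can end up only in AdditionalTasks.
func fetchBuggy(pick func() *task, capacity int) fetchResponse {
	resp := fetchResponse{Task: pick()}
	for i := 1; i < capacity; i++ {
		if t := pick(); t != nil {
			resp.AdditionalTasks = append(resp.AdditionalTasks, t)
		}
	}
	return resp
}

// fetchFixed stops immediately when the first pick found nothing.
func fetchFixed(pick func() *task, capacity int) fetchResponse {
	resp := fetchResponse{Task: pick()}
	if resp.Task == nil {
		return resp
	}
	for i := 1; i < capacity; i++ {
		if t := pick(); t != nil {
			resp.AdditionalTasks = append(resp.AdditionalTasks, t)
		}
	}
	return resp
}

// oldPoller ignores the whole response when Task is nil.
func oldPoller(resp fetchResponse) []*task {
	if resp.Task == nil {
		return nil
	}
	return append([]*task{resp.Task}, resp.AdditionalTasks...)
}

// tolerantPoller also runs AdditionalTasks when Task is nil.
func tolerantPoller(resp fetchResponse) []*task {
	var out []*task
	if resp.Task != nil {
		out = append(out, resp.Task)
	}
	return append(out, resp.AdditionalTasks...)
}

// lateQueue returns a pick function whose first call finds nothing and whose
// later calls hand out the given task IDs, modelling jobs that become
// available between the first and second pick.
func lateQueue(ids ...int64) func() *task {
	calls := 0
	return func() *task {
		calls++
		if calls == 1 || len(ids) == 0 {
			return nil
		}
		t := &task{ID: ids[0]}
		ids = ids[1:]
		return t
	}
}

func main() {
	resp := fetchBuggy(lateQueue(7, 8), 3)
	fmt.Println("assigned by server:", len(resp.AdditionalTasks))   // 2
	fmt.Println("run by old poller:", len(oldPoller(resp)))          // 0
	fmt.Println("run by tolerant poller:", len(tolerantPoller(resp))) // 2
	fixed := fetchFixed(lateQueue(7, 8), 3)
	fmt.Println("assigned with server fix:", len(fixed.AdditionalTasks)) // 0
}
```

In the buggy path the server has assigned two tasks that the old poller never runs, which is how they would become zombies; either side of the fix closes the gap on its own.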


mfenniak referenced this issue

2026-01-17 18:03:26 +00:00

fix: support Forgejo returning AdditionalTasks but not Task #1303

mfenniak commented

2026-01-17 18:17:47 +00:00

Author

Owner

There are two places where CreateTaskForRunner breaks the TxContext "contract" ("Always call Commit() before returning if there are no errors") in a way that is a little suspect:

If no jobs are found that match the runner's labels, then we return no error but don't call Commit: codeberg.org/forgejo/forgejo@fdf4dfd2a5/models/actions/task.go (L346-L348)

If the job can't be assigned due to concurrency conflict with another runner trying to pick up the job, then we return no error but don't call Commit: codeberg.org/forgejo/forgejo@fdf4dfd2a5/models/actions/task.go (L409-L413)

I think this behaviour is dangerous, but haven't put together all the pieces yet.

- CreateTaskForRunner will attempt to close the transaction without committing it. But it's a halfCommitter because it's within a nested transaction...
- CreateTaskForRunner will not return an error, so PickTask will not return an error to WithTx, which will attempt to commit the transaction.

Based upon a reading of the db/context.go all of this seems to be handled... but I'm working on building a test case to exercise it thoroughly. It seems to be in the realm of "something that can cause the transaction to commit when it should rollback".


mfenniak referenced this issue from a commit

2026-01-17 18:51:15 +00:00

fix: support Forgejo returning AdditionalTasks but not Task (#1303)

mfenniak commented

2026-01-17 22:21:01 +00:00

Author

Owner

Early indications are that Runner v12.5.3 fixes this problem with PR #1303, and almost immediately after upgrade @viceice noted a logged warning indicating that the behaviour had been triggered:

time="2026-01-17T21:51:07Z" level=warning msg="FetchTask received tasks in AdditionalTasks field but not Task field; this is unexpected but runner will run them"

I think based upon this experimental evidence, it's more likely now that this is the source of the problem. My efforts to recreate the theorized transaction problems (https://codeberg.org/mfenniak/forgejo/pulls/4/files#diff-ad94ee6a091601c54a1e1429e4ce0a419038e5e7) have all been met with Forgejo handling the faults with complete success.

Following the deployment of v12.5.3 runner, Codeberg has been upgraded with PR 10893 and PR 10899, the latter of which will prevent any more of these warnings.

I'll leave this issue open for a day or two to observe some Forgejo runs and keep an eye out for stalled tasks.

A compatibility warning has been added to the runner v12.4 - v12.5.2 release notes.
