bug: jobs are assigned to a Forgejo Runner, and fail without any logs being reported #1302
Labels
No labels
FreeBSD
Kind/Breaking
Kind/Bug
Kind/Chore
Kind/DependencyUpdate
Kind/Documentation
Kind/Enhancement
Kind/Feature
Kind/Security
Kind/Testing
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
Windows
linux-powerpc64le
linux-riscv64
linux-s390x
run-end-to-end-tests
run-forgejo-tests
No milestone
No project
No assignees
1 participant
Notifications
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo/runner#1302
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Roughly starting today (Jan 16th), there have been multiple cases where Forgejo shows a job was in the running state but stuck in "Set up job", no logs were reported, and the job later fails after slightly longer than 1 hour. Examples:
The earliest incident noted on forgejo/forgejo is this run, which is reported by the step action at a timestamp 2026-01-15 21:20:24 UTC; it's not clear if that's a start or an end time. Forgejo v14 was deployed on Codeberg at 2026-01-15 20:00:00 UTC, which is a pretty tight correlation with the beginning of these incidents.
Regarding the "job later fails after slightly longer than 1 hour" -- this is Codeberg's configured
ZOMBIE_TASK_TIMEOUT.this isn't clearly aligned with any known timeout. The forgejo/forgejo runners have a 2 hour job timeout, and a task often fails at a couple minutes past 2hr -- for example this test-sqlite failure appears to be an example of a known Forgejo testing bug, unrelated to this problem, but demonstrates a 2 hour job timeout. I'm not currently aware of anything that would trigger a 1 hour timeout.Attached as
runner-logs.txtis a set of runner logs related to the failure https://codeberg.org/forgejo/forgejo/actions/runs/131451/jobs/0/attempt/1 and possibly others.#1299 is additional diagnostics in the reporter that were identified as missing when reviewing the attached logs.
Although the logs indicate sporadic connection failures to Codeberg; (a) the number of incidents of this occurring is far too high to explain by connection failures, and (b) attempts to reproduce similar errors of an offline Forgejo did not result in any similar long-term impact and timed out runs. The network disconnect errors feel like a red-herring unrelated to the issue.
This error would originate from...
log.WithError(err).Error("failed to fetch task")codeberg.org/forgejo/forgejo@3252fd5134/routers/api/actions/runner/runner.go (L162)PickTaskand got an internal error. It will return the error. This must occur on the first picked task (for this specific error) of a potential multi-task pick, and so error handling is appropriate here to return internal server error.codeberg.org/forgejo/forgejo@3252fd5134/services/actions/task.go (L32)db.WithTxis used here. The error that occurs in picking a task should cause a database transaction rollback, leaving the task available to be picked up by another runner.PickTask(above) ->CreateTaskForRunner->UpdateRunJobWithoutNotification(below)codeberg.org/forgejo/forgejo@3252fd5134/models/actions/task.go (L409)db.TxContext(ctx)transaction management.codeberg.org/forgejo/forgejo@3252fd5134/models/actions/run_job.go (L197)CreateTaskForRunner->UpdateRunJobWithoutNotificationstatus,started, andstoppedfields are being updated on the run because the job is being started.codeberg.org/forgejo/forgejo@3252fd5134/models/actions/run.go (L498)ActionRunuses xorm's optimistic lock control, which is implemented hereActionRunto fail.So based upon this analysis... the "run has changed" error could be related to the problem, if:
codeberg.org/Codeberg-Infrastructure/build-deploy-forgejo@680f3a4222/etc/forgejo/conf/base.ini (L125-L126)db.WithTxis affecting the database despite returning an error. A context escape (eg. usingcontext.Background()inside the tx), a more serious problem likedb.WithTxleaking data, or an infrastructure-level problem like Codeberg's MySQL not working with transactions (which could be a side-effect of something like Galera HA if there's a misconfiguration, maybe).PickTaskusesdb.WithTx, andCreateTaskForRunnerusesdb.TxContext(). Both appear to be used correctly from code analysis. It isn't clear to me that the two of them interact and work together correctly.AfterTx().WithTxandTxContextworks correctly to rollback the database. This doesn't eliminate the possibility that there's a problem here, but eliminates a straightforward "well it doesn't work".If we can establish reason to believe it's the problem, there could be multiple fixes related:
WithTxto not work as expectedcodeberg.org/forgejo/forgejo@3252fd5134/models/actions/run_job.go (L178-L198)Proposed patch: https://codeberg.org/forgejo/forgejo/pulls/10893I'll dig into some of this a little further and see if I can provide some evidence to support or contradict this.
Although the analysis above has a gap or two, I'm proposing a patch to Forgejo that should eliminate the error: https://codeberg.org/forgejo/forgejo/pulls/10893. I think in the absence of more diagnostically relevant information, at the moment at least, it would be very useful if it was possible to deploy this patch to Codeberg to see if there is any change in the zombie tasks.
@Gusted Could you evaluate the linked PR for being cherry-picked into Codeberg to see if it addresses the issue documented here? It has only manual testing, and of course Forgejo's existing regression test suite (running). But the change isn't very complicated either. I'll continue work on this either way, but the information for or against it fixing the problem would be really nice. 🙂
(FYI, tagging @viceice since this is the issue he was raising today and I haven't tagged him on it yet -- lots more information and research here than we had earlier, still no conclusions)
I've identified a small bug that corresponds to everything observed here, but, has such a small window of possibility that I'm not sure it's a likely culprit.
FetchTaskinvokesPickTaskonce for the "first task" to be returned. If no tasks are found, thentaskremainsnil.codeberg.org/forgejo/forgejo@fdf4dfd2a5/routers/api/actions/runner/runner.go (L160)FetchTaskcontinues through the number of capacity entries, picking more tasks.nil, it should not proceed through more capacity entriescodeberg.org/forgejo/forgejo@fdf4dfd2a5/routers/api/actions/runner/runner.go (L167-L170)AdditionalTasksif theTaskentry is nilif resp.Msg.Task == nil {return nil, false}This would perfectly correspond to the problem, as all the returned tasks would end up as zombies since the runner doesn't act on them. But it requires that all the impacted tasks appear in the narrow window between the first
PickTaskfunction call and the secondPickTaskfunction call. I'll prepare a patch for this, regardless. But it feels unlikely.Patch in Forgejo: https://codeberg.org/forgejo/forgejo/pulls/10899
Patch in Runner: #1303
There are two places where
CreateTaskForRunnerbreaks theTxContext"contract" ("Always callCommit()before returning if there are no errors") in a way that is a little suspect:If no jobs are found that match the runner's labels, then we return no error but don't call
Commit:codeberg.org/forgejo/forgejo@fdf4dfd2a5/models/actions/task.go (L346-L348)If the job can't be assigned due to concurrency conflict with another runner trying to pick up the job, then we return no error but don't call
Commit:codeberg.org/forgejo/forgejo@fdf4dfd2a5/models/actions/task.go (L409-L413)I think this behaviour is dangerous, but haven't put together all the pieces yet.
CreateTaskForRunnerwill attempt to close the transaction without committing it. But it's ahalfCommitterbecause it's within a nested transaction...CreateTaskForRunnerwill not return an error, soPickTaskwill not return an error toWithTx, which will attempt to commit the transaction.Based upon a reading of the
db/context.goall of this seems to be handled... but I'm working on building a test case to exercise it thoroughly. It seems to be in the realm of "something that can cause the transaction to commit when it should rollback".Early indications are that Runner v12.5.3 fixes this problem with PR #1303, and almost immediately after upgrade @viceice noted a logged warning indicating that the behaviour had been triggered:
I think based upon this experimental evidence, it's more likely now that this is the source of the problem. My efforts to recreate the theorized transaction problems (https://codeberg.org/mfenniak/forgejo/pulls/4/files#diff-ad94ee6a091601c54a1e1429e4ce0a419038e5e7) have all been met with Forgejo handling the faults with complete success.
Following the deployment of v12.5.3 runner, Codeberg has been upgraded with PR 10893 and PR 10899, the later of which will prevent any more of these warnings.
I'll leave this issue open for a day or two to observe some Forgejo runs and keep an eye out for stalled tasks.
A compatibility warning has been added to the runner v12.4 - v12.5.2 release notes.
Closing as completed, as the issue has not reoccurred to my knowledge.