fix: reduce the time during which the database stays open #109

Merged
Kwonunn merged 3 commits from earl-warren/act:wip-proxy-lock into main 2025-03-24 19:31:10 +00:00
Contributor

This fixes a race condition introduced in 6.3.0 by keeping the bolt database open and locked while an upload is in progress.

bolt `Open` has a lock to ensure there is only [one read-write lock at any given time](https://github.com/etcd-io/bbolt/blob/main/db.go#L248) and is configured to [fail after waiting 5 seconds](https://code.forgejo.org/forgejo/act/src/commit/110b6bd72c607fb189d78ebc7b62285b21f20387/pkg/artifactcache/handler.go#L154).

The following race is likely to happen:

  • goroutine A opens and locks the database and starts uploading
  • goroutine B waits for the lock, waking up to check every 50ms by default
  • goroutine A finishes and releases the lock
  • goroutine C opens and locks the database and starts uploading
  • goroutine B wakes up to find the lock is still held; it missed its shot and will time out after 5 seconds if that happens a few times and/or the uploads are slow enough

---

  • During get/upload, close the database while reading/writing so
    it does not stay open for longer than necessary. This may be helpful
    when uploads run in parallel.
  • Be more informative when returning error 500
  • Make useCache handle errors
  • Return 500 immediately when writing the cache fails instead of falling
    through to 200

Refs: forgejo/runner#509

Author
Contributor

@Kwonunn please let me know if that makes any difference in the podman setup you have 🙏 I'm hoping it does but I'm not convinced it will.

earl-warren changed title from fix: reduce the time during which the database stays open [skip cascade] to fix: reduce the time during which the database stays open 2025-03-23 21:40:39 +00:00
Member

I think I'll have time to test it tomorrow.

earl-warren force-pushed wip-proxy-lock from ca3b3776ae
All checks were successful
/ cascade (pull_request_target) Has been skipped
checks / unit (pull_request) Successful in 1m33s
checks / integration (pull_request) Successful in 1m13s
to a07128b75c
All checks were successful
checks / unit (pull_request) Successful in 47s
checks / integration (pull_request) Successful in 37s
/ cascade (pull_request_target) Successful in 38m15s
2025-03-23 21:40:49 +00:00
Compare
Contributor

cascading-pr updated at forgejo/runner#511

Author
Contributor

I wish I could do it myself, but I'll have to learn podman from nothing. Maybe it is time to have a podman based integration test in end-to-end.

viceice approved these changes 2025-03-23 21:43:43 +00:00
Dismissed
viceice left a comment
Owner

otherwise LGTM

@ -279,3 +279,3 @@
// Should not happen
if cache.Repo != repo {
h.responseJSON(w, r, 500, ErrValidation)
h.responseJSON(w, r, 500, fmt.Errorf("cache %s != %s", cache.Repo, repo))

I wouldn't disclose this information, could be used for malicious stuff

Author
Contributor

I think not: if a request is sent with a repo different from the expected one, it will show which repo was expected. But it won't allow the caller to guess the MAC used for communication between the proxy and the cache.

@Kwonunn knows best though.

The only malicious thing I can see coming from this is that it could leak the name of a private repository.

Author
Contributor

I replaced it with a message that is specific but with no data displayed.

earl-warren marked this conversation as resolved
@ -333,3 +337,3 @@
// Should not happen
if cache.Repo != repo {
h.responseJSON(w, r, 500, ErrValidation)
h.responseJSON(w, r, 500, fmt.Errorf("cache %s != %s", cache.Repo, repo))

same

earl-warren marked this conversation as resolved
First-time contributor

@earl-warren wrote in #109 (comment):

I wish I could do it myself, but I'll have to learn podman from nothing. Maybe it is time to have a podman based integration test in end-to-end.

testing podman as a one-off is easy, on debian 12 just run:

    sudo apt install podman
    podman system service -t 0 &
    DOCKER_HOST=unix://${XDG_RUNTIME_DIR}/podman/podman.sock ./forgejo-runner daemon

as described here: https://forgejo.org/docs/latest/admin/runner-installation/#setting-up-the-container-environment

Author
Contributor

Looking at https://github.com/etcd-io/bbolt and the [open function](https://github.com/etcd-io/bbolt/blob/main/db.go#L178), there seems to be some potential for undesirable side effects if multiple read-write opens are in flight at the same time.

Open has a lock to ensure there is only [one read-write lock at any given time](https://github.com/etcd-io/bbolt/blob/main/db.go#L248) and is configured to fail after waiting 5 seconds:

    Timeout: 5 * time.Second,

So... when uploading a significant amount of data, which may block an open for more than 5 seconds, it can lead to failure.

Keeping the database open for the duration of the upload therefore cancels the benefit of running multiple uploads in parallel.

The following pathological case is more likely to happen as well:

  • goroutine A opens and locks the database and starts uploading
  • goroutine B waits for the lock, waking up to check every 50ms by default
  • goroutine A finishes and releases the lock
  • goroutine C opens and locks the database and starts uploading
  • goroutine B wakes up to find the lock is still held, it missed its shot
Author
Contributor
> as described here: https://forgejo.org/docs/latest/admin/runner-installation/#setting-up-the-container-environment

I've been procrastinating for a long time.
earl-warren force-pushed wip-proxy-lock from a07128b75c
All checks were successful
checks / unit (pull_request) Successful in 47s
checks / integration (pull_request) Successful in 37s
/ cascade (pull_request_target) Successful in 38m15s
to e3b67f8794
Some checks failed
checks / unit (pull_request) Successful in 1m13s
checks / integration (pull_request) Successful in 26s
/ cascade (pull_request_target) Failing after 16m7s
2025-03-23 22:25:25 +00:00
Compare
earl-warren dismissed viceice's review 2025-03-23 22:25:25 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

Contributor

cascading-pr updated at forgejo/runner#511

Member

@earl-warren

After testing 3 times with podman, it is looking like this has solved the problem.

Author
Contributor

Updated the description with an account of the race for forensic analysis.
@ -389,3 +393,3 @@
return
}
defer db.Close()
db.Close()

This should still be a defer, otherwise you're immediately closing the DB after opening it.

earl-warren marked this conversation as resolved
@ -481,1 +488,3 @@
func (h *Handler) useCache(db *bolthold.Store, cache *Cache) {
func (h *Handler) useCache(id uint64) error {
db, err := h.openDB()
if err != nil {

When this function is called from the `get` function, even after fixing the `defer` comment above, this consistently results in a timeout.
earl-warren marked this conversation as resolved
fix: do not immediately close the db after opening it
All checks were successful
checks / unit (pull_request) Successful in 46s
checks / integration (pull_request) Successful in 26s
/ cascade (pull_request_target) Successful in 37m3s
e741fd2145
Contributor

cascading-pr updated at forgejo/runner#511

@ -392,3 +390,1 @@
if err := db.Get(id, cache); err != nil {
if errors.Is(err, bolthold.ErrNotFound) {
h.responseJSON(w, r, 404, fmt.Errorf("cache %d: not reserved", id))
{
Author
Contributor

With this scoping, `db` will be out of scope before the end of the function, which is when the `defer` will be called. I'm not sure what the side effect of this is. Can you try without adding this scope?
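For what it's worth, in Go a `defer` always runs when the surrounding function returns, not when the enclosing block ends, even though the deferred variable's name goes out of scope at the end of the inner block. A small self-contained demonstration:

```go
package main

import "fmt"

// run records the order of events to show when a defer declared inside
// an inner block actually executes. A named return value is used so the
// deferred append is visible in the returned slice.
func run() (order []string) {
	{
		defer func() { order = append(order, "deferred close") }()
		order = append(order, "end of inner block")
	}
	order = append(order, "end of function body")
	return
}

func main() {
	fmt.Println(run()) // the deferred call runs last, at function return
}
```

So a `defer db.Close()` inside the extra scope still holds the database open until the handler returns, which is exactly the behaviour this PR is trying to avoid.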
earl-warren marked this conversation as resolved
@ -403,3 +409,3 @@
h.responseJSON(w, r, 500, ErrValidation)
h.responseJSON(w, r, 500, fmt.Errorf("cache repo is not valid"))
return
}
Author
Contributor

there must be a `db.Close()` here so the database is closed before starting the read. It will be opened again to update the last used data when it is completed. Similar to upload.
fix: move reading cache to separate function
All checks were successful
checks / unit (pull_request) Successful in 46s
checks / integration (pull_request) Successful in 25s
/ cascade (pull_request_target) Successful in 1m18s
c51c6ab2bc
Contributor

cascading-pr updated at forgejo/runner#511

Contributor

cascading-pr updated at forgejo/runner#511

Author
Contributor

Reviewed the latest version and it looks good. Introducing the read function was a good idea: it is simpler to read and maintain while also fixing the problem 👍 @Kwonunn

Author
Contributor

@programmerjake FYI: the broken version that you manually tested was undetected in the CI because of a regression introduced in the test suite two days ago. The outcome was a false positive instead of a failure. It is [now fixed](https://code.forgejo.org/actions/setup-forgejo/commit/c866283b9e9e45e93878278dda7d000ce30dcac8) and the tests have been verified to run against the patched version.
Kwonunn approved these changes 2025-03-24 19:30:59 +00:00
Kwonunn left a comment
Member

has been tested by programmerjake, solves runner issue #509
