fix: reduce the time during which the database stays open #109

Merged
Kwonunn merged 3 commits from earl-warren/act:wip-proxy-lock into main 2025-03-24 19:31:10 +00:00
Contributor

This fixes a race condition introduced in 6.3.0 by keeping the bolt database open and locked while an upload is in progress.

bolt `Open` has a lock to ensure there is only [one read-write lock at any given time](https://github.com/etcd-io/bbolt/blob/main/db.go#L248) and is configured to [fail after waiting 5 seconds](https://code.forgejo.org/forgejo/act/src/commit/110b6bd72c607fb189d78ebc7b62285b21f20387/pkg/artifactcache/handler.go#L154).

The following race is likely to happen:

  • goroutine A opens and locks the database and starts uploading
  • goroutine B waits for the lock, waking up to check every 50ms by default
  • goroutine A finishes and releases the lock
  • goroutine C opens and locks the database and starts uploading
  • goroutine B wakes up to find the lock is still held; it missed its shot and will time out after 5 seconds if that happens a few times and/or the uploads are slow enough

---

  • During get/upload, close the database while reading/writing so
    it does not stay open for longer than necessary. This may be helpful
    when uploads run in parallel.
  • Be more informative when returning error 500
  • Make useCache handle errors
  • Return 500 immediately when writing the cache fails instead of falling
    through to 200

Refs: forgejo/runner#509

Author
Contributor

@Kwonunn please let me know if that makes any difference in the podman setup you have 🙏 I'm hoping it does but I'm not convinced it will.

earl-warren changed title from fix: reduce the time during which the database stays open [skip cascade] to fix: reduce the time during which the database stays open 2025-03-23 21:40:39 +00:00
Member

I think I'll have time to test it tomorrow.

earl-warren force-pushed wip-proxy-lock from ca3b3776ae
All checks were successful
/ cascade (pull_request_target) Has been skipped
checks / unit (pull_request) Successful in 1m33s
checks / integration (pull_request) Successful in 1m13s
to a07128b75c
All checks were successful
checks / unit (pull_request) Successful in 47s
checks / integration (pull_request) Successful in 37s
/ cascade (pull_request_target) Successful in 38m15s
2025-03-23 21:40:49 +00:00
Compare
Contributor

cascading-pr updated at forgejo/runner#511

Author
Contributor

I wish I could do it myself, but I'll have to learn podman from nothing. Maybe it is time to have a podman based integration test in end-to-end.

viceice approved these changes 2025-03-23 21:43:43 +00:00
Dismissed
viceice left a comment
Owner

otherwise LGTM

@ -279,3 +279,3 @@
// Should not happen
if cache.Repo != repo {
h.responseJSON(w, r, 500, ErrValidation)
h.responseJSON(w, r, 500, fmt.Errorf("cache %s != %s", cache.Repo, repo))

I wouldn't disclose this information, could be used for malicious stuff

Author
Contributor

I think not: if a request is sent with a repo different from the expected one, it will show which repo was expected. But it won't allow the caller to guess the MAC used for communication between the proxy and the cache.

@Kwonunn knows best though.

The only malicious thing I can see coming from this is that it could leak the name of a private repository.

Author
Contributor

I replaced it with a message that is specific but with no data displayed.

earl-warren marked this conversation as resolved
@ -333,3 +337,3 @@
// Should not happen
if cache.Repo != repo {
h.responseJSON(w, r, 500, ErrValidation)
h.responseJSON(w, r, 500, fmt.Errorf("cache %s != %s", cache.Repo, repo))

same

earl-warren marked this conversation as resolved
First-time contributor

@earl-warren wrote in #109 (comment):

I wish I could do it myself, but I'll have to learn podman from nothing. Maybe it is time to have a podman based integration test in end-to-end.

testing podman as a one-off is easy, on debian 12 just run:

    sudo apt install podman
    podman system service -t 0 &
    DOCKER_HOST=unix://${XDG_RUNTIME_DIR}/podman/podman.sock ./forgejo-runner daemon

as described here: https://forgejo.org/docs/latest/admin/runner-installation/#setting-up-the-container-environment

Author
Contributor

Looking at https://github.com/etcd-io/bbolt and the [open function](https://github.com/etcd-io/bbolt/blob/main/db.go#L178), there seems to be some potential for undesirable side effects if multiple read-write opens are in flight at the same time.

Open has a lock to ensure there is only [one read-write lock at any given time](https://github.com/etcd-io/bbolt/blob/main/db.go#L248) and is configured to fail after waiting 5 seconds:

    Timeout: 5 * time.Second,

So... when uploading a significant amount of data, which may block an open for more than 5 seconds, it can lead to failure.

Keeping the database open for the duration of the upload therefore cancels the benefit of running multiple uploads in parallel.

The following pathological case is more likely to happen as well:

  • goroutine A opens and locks the database and starts uploading
  • goroutine B waits for the lock, waking up to check every 50ms by default
  • goroutine A finishes and releases the lock
  • goroutine C opens and locks the database and starts uploading
  • goroutine B wakes up to find the lock is still held, it missed its shot
Author
Contributor
> as described here: https://forgejo.org/docs/latest/admin/runner-installation/#setting-up-the-container-environment

I've been procrastinating for a long time.
earl-warren force-pushed wip-proxy-lock from a07128b75c
All checks were successful
checks / unit (pull_request) Successful in 47s
checks / integration (pull_request) Successful in 37s
/ cascade (pull_request_target) Successful in 38m15s
to e3b67f8794
Some checks failed
checks / unit (pull_request) Successful in 1m13s
checks / integration (pull_request) Successful in 26s
/ cascade (pull_request_target) Failing after 16m7s
2025-03-23 22:25:25 +00:00
Compare
earl-warren dismissed viceice's review 2025-03-23 22:25:25 +00:00
Reason:

New commits pushed, approval review dismissed automatically according to repository settings

Contributor

cascading-pr updated at forgejo/runner#511

Member

@earl-warren

After testing 3 times with podman, it is looking like this has solved the problem.

Author
Contributor

Updated the description with an account of the race for forensic analysis.
@ -389,3 +393,3 @@
return
}
defer db.Close()
db.Close()

This should still be a defer, otherwise you're immediately closing the DB after opening it.

earl-warren marked this conversation as resolved
@ -481,1 +488,3 @@
func (h *Handler) useCache(db *bolthold.Store, cache *Cache) {
func (h *Handler) useCache(id uint64) error {
db, err := h.openDB()
if err != nil {

When this function is called from the `get` function, even after fixing the `defer` comment above, this consistently results in a timeout.
earl-warren marked this conversation as resolved
fix: do not immediately close the db after opening it
All checks were successful
checks / unit (pull_request) Successful in 46s
checks / integration (pull_request) Successful in 26s
/ cascade (pull_request_target) Successful in 37m3s
e741fd2145
Contributor

cascading-pr updated at forgejo/runner#511

@ -392,3 +390,1 @@
if err := db.Get(id, cache); err != nil {
if errors.Is(err, bolthold.ErrNotFound) {
h.responseJSON(w, r, 404, fmt.Errorf("cache %d: not reserved", id))
{
Author
Contributor

With this scoping, `db` will be out of scope before the end of the function, which is when the `defer` will be called. I'm not sure what the side effect of this is. Can you try without adding this scope?
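For what it's worth, in Go a `defer` always runs when the surrounding function returns, not when the enclosing block ends, even though the deferred variable's name goes out of scope at the end of the inner block. A small self-contained demonstration:

```go
package main

import "fmt"

// run records the order of events to show when a defer declared inside
// an inner block actually executes. A named return value is used so the
// deferred append is visible in the returned slice.
func run() (order []string) {
	{
		defer func() { order = append(order, "deferred close") }()
		order = append(order, "end of inner block")
	}
	order = append(order, "end of function body")
	return
}

func main() {
	fmt.Println(run()) // the deferred call runs last, at function return
}
```

So a `defer db.Close()` inside the extra scope still holds the database open until the handler returns, which is exactly the behaviour this PR is trying to avoid.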
earl-warren marked this conversation as resolved
@ -403,3 +409,3 @@
h.responseJSON(w, r, 500, ErrValidation)
h.responseJSON(w, r, 500, fmt.Errorf("cache repo is not valid"))
return
}
Author
Contributor

there must be a `db.Close()` here so the database is closed before starting the read. It will be opened again to update the last used data when it is completed. Similar to upload.
fix: move reading cache to separate function
All checks were successful
checks / unit (pull_request) Successful in 46s
checks / integration (pull_request) Successful in 25s
/ cascade (pull_request_target) Successful in 1m18s
c51c6ab2bc
Contributor

cascading-pr updated at forgejo/runner#511

Contributor

cascading-pr updated at forgejo/runner#511

Author
Contributor

Reviewed the latest version and it looks good. Introducing the read function was a good idea: it is simpler to read and maintain while also fixing the problem 👍 @Kwonunn

Author
Contributor

@programmerjake FYI: the broken version that you manually tested was undetected in the CI because of a regression introduced in the test suite two days ago. The outcome was a false positive instead of a failure. It is [now fixed](https://code.forgejo.org/actions/setup-forgejo/commit/c866283b9e9e45e93878278dda7d000ce30dcac8) and the tests have been verified to run against the patched version.
Kwonunn approved these changes 2025-03-24 19:30:59 +00:00
Kwonunn left a comment
Member

has been tested by programmerjake, solves runner issue #509
