fix: reduce the time during which the database stays open #109
Labels
No labels
Compat/Breaking
Kind/Bug
Kind
Chore
Kind/Documentation
Kind/Enhancement
Kind/Feature
Kind/Security
Kind/Testing
Priority
Critical
Priority
High
Priority
Low
Priority
Medium
Reviewed
Confirmed
Reviewed
Duplicate
Reviewed
Invalid
Reviewed
Won't Fix
Status
Abandoned
Status
Blocked
Status
Need More Info
No milestone
No project
No assignees
5 participants
Due date
No due date set.
Dependencies
No dependencies set.
Reference
forgejo/act!109
Loading…
Add table
Add a link
Reference in a new issue
No description provided.
Delete branch "earl-warren/act:wip-proxy-lock"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
This fixes a race condition introduced in 6.3.0 by keeping the bolt database open and locked while an upload is in progress.
bolt
Openhas a lock to ensure there only is one read-write lock at any given time and is configured to fail after waiting 5 seconds.The following race is likely to happen:
it does not stay open for longer than necessary. This may be helpful
when uploads run in parallel.
through to 200
Refs: forgejo/runner#509
@Kwonunn please let me know if that makes any difference in the podman setup you have 🙏 I'm hoping it does but I'm not convinced it will.
fix: reduce the time during which the database stays open [skip cascade]to fix: reduce the time during which the database stays openI think I'll have time to test it tomorrow.
ca3b3776aea07128b75ccascading-pr updated at forgejo/runner#511
I wish I could do it myself, but I'll have to learn podman from nothing. Maybe it is time to have a podman based integration test in end-to-end.
otherwise LGTM
@ -279,3 +279,3 @@// Should not happenif cache.Repo != repo {h.responseJSON(w, r, 500, ErrValidation)h.responseJSON(w, r, 500, fmt.Errorf("cache %s != %s", cache.Repo, repo))I wouldn't disclose this information, could be used for malicious stuff
I think not: if a request with a repo different than the expected one is set, it will show which repo was expected. But it won't allow the caller to guess the MAC used for communication between the proxy and the cache.
@Kwonunn knows best though.
The only malicious thing I can see coming from this is that it could leak the name of a private repository.
I replaced it with a message that is specific but with no data displayed.
@ -333,3 +337,3 @@// Should not happenif cache.Repo != repo {h.responseJSON(w, r, 500, ErrValidation)h.responseJSON(w, r, 500, fmt.Errorf("cache %s != %s", cache.Repo, repo))same
@earl-warren wrote in #109 (comment):
testing podman as a one-off is easy, on debian 12 just run:
as described here: https://forgejo.org/docs/latest/admin/runner-installation/#setting-up-the-container-environment
Looking at https://github.com/etcd-io/bbolt and the open function, there seem to be some potential for undesirable side effects if multiple read-write open are in flight at the same time.
Open has a lock to ensure there only is one read-write lock at any given time and is configured to fail after waiting 5 seconds.
Timeout: 5 * time.Second,So... when uploading a significant amount of data which may block an open for more than 5 seconds, it can lead to failure.
Keeping the database open for the duration of the upload is therefore canceling the benefit of running multiple uploads in parallel.
The following pathological case is more likely to happen as well:
I've been procrastinating for a long time.
a07128b75ce3b67f8794New commits pushed, approval review dismissed automatically according to repository settings
cascading-pr updated at forgejo/runner#511
@earl-warren
After testing 3 times with podman, it is looking like this has solved the problem.
Updated the description with a description of the race for forensic analysis.
@ -389,3 +393,3 @@return}defer db.Close()db.Close()This should still be a defer, otherwise you're immediately closing the DB after opening it.
@ -481,1 +488,3 @@func (h *Handler) useCache(db *bolthold.Store, cache *Cache) {func (h *Handler) useCache(id uint64) error {db, err := h.openDB()if err != nil {When this function is called from the
getfunction, even after fixing thedefercomment above, this consistently results in a timeout.cascading-pr updated at forgejo/runner#511
@ -392,3 +390,1 @@if err := db.Get(id, cache); err != nil {if errors.Is(err, bolthold.ErrNotFound) {h.responseJSON(w, r, 404, fmt.Errorf("cache %d: not reserved", id)){with this scoping db will be out of scope before the end of the function, which is when defer will be called. I'm not sure what the side effect of this is. Can you try without adding this scope?
@ -403,3 +409,3 @@h.responseJSON(w, r, 500, ErrValidation)h.responseJSON(w, r, 500, fmt.Errorf("cache repo is not valid"))return}there must be a
db.Close()here so the database is closed before starting the read. It will be opened again to update the last used data when it is completed. Similar to upload.cascading-pr updated at forgejo/runner#511
cascading-pr updated at forgejo/runner#511
Reviewed the latest version and it looks good. Introducing the read function was a good idea: it is simpler to read and maintain while also fixing the problem 👍 @Kwonunn
@programmerjake FYI: the broken version that you manually tested was undetected in the CI because of a regression introduced in the test suite two days ago. The outcome was false positive instead of a failure. It is now fixed and the tests have been verified to run against the patched version.
has been tested by programmerjake, solves runner issue #509