
[generic worker] credential refresh blocked by queueMux during artifact uploads #8289

@matt-boris


Summary

The background goroutine that refreshes task credentials via ReclaimTask can be blocked from swapping the refreshed client into task.Queue while many artifacts are being uploaded concurrently. Uploads then keep using the old credentials until they expire, resulting in 401 errors and worker panics during artifact uploads.

Root Cause

task.queueMux (a sync.RWMutex) is read-locked (RLock) for the entire duration of HTTP calls like CreateArtifact, including httpbackoff retries (up to 15 minutes). When the reclaim goroutine obtains fresh credentials, it needs the write lock on queueMux to swap in the new task.Queue client. That write lock cannot be acquired until every in-flight CreateArtifact call has released its RLock.

Under high artifact load:

  1. Multiple upload goroutines hold queueMux.RLock() during CreateArtifact calls
  2. If the queue is under load, httpbackoff retries extend the time these RLocks are held
  3. The reclaim goroutine successfully calls ReclaimTask and gets new credentials, but blocks on queueMux.Lock() waiting for upload goroutines to finish
  4. Upload goroutines continue using old credentials until they expire
  5. Expired credentials cause 401 (ext.certificate.expiry < now), which panics the worker

Additionally, FinishArtifact and DownloadArtifactToFile read task.Queue without any lock at all, which is a data race with the reclaim goroutine's write.
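
One possible direction, offered only as a sketch and not as the project's actual fix, is to hold queueMux just long enough to copy the client pointer and then make the HTTP call without the lock. All names below (`queueClient`, `taskRun`, `getQueue`, `setQueue`, `createArtifact`) are hypothetical and do not come from the worker's code.

```go
package main

import (
	"fmt"
	"sync"
)

// queueClient is a hypothetical stand-in for the real queue client.
type queueClient struct{ credentials string }

// createArtifact stands in for a slow, retried HTTP call.
func (q *queueClient) createArtifact(name string) string {
	return name + " uploaded with " + q.credentials
}

// taskRun is a hypothetical stand-in for the worker's task struct.
type taskRun struct {
	queueMux sync.RWMutex
	queue    *queueClient
}

// getQueue holds the read lock only while copying the pointer.
func (t *taskRun) getQueue() *queueClient {
	t.queueMux.RLock()
	defer t.queueMux.RUnlock()
	return t.queue
}

// setQueue is what the reclaim goroutine would call after ReclaimTask.
func (t *taskRun) setQueue(q *queueClient) {
	t.queueMux.Lock()
	defer t.queueMux.Unlock()
	t.queue = q
}

func main() {
	t := &taskRun{queue: &queueClient{credentials: "old"}}

	q := t.getQueue() // an upload takes its snapshot, then releases the lock

	// The swap no longer waits for the upload to finish.
	t.setQueue(&queueClient{credentials: "new"})

	fmt.Println(q.createArtifact("a.log"))            // still uses the snapshot
	fmt.Println(t.getQueue().createArtifact("b.log")) // sees the new client
}
```

Routing every read of the client through an accessor like this would also cover the unlocked reads in FinishArtifact and DownloadArtifactToFile. A call that took its snapshot before the swap still uses the old client, so a retry loop would need to re-read the client on each attempt to pick up refreshed credentials.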

Related

Those PRs reduced the likelihood of this occurring by limiting concurrency, but didn't fix the underlying locking issue.
