Summary
The background goroutine that refreshes task credentials via ReclaimTask can be blocked from updating task.Queue when many artifacts are being uploaded concurrently. This causes credentials to expire, resulting in 401 errors and worker panics during artifact uploads.
Root Cause
task.queueMux (a sync.RWMutex) is held as an RLock for the entire duration of HTTP calls like CreateArtifact, including httpbackoff retries (up to 15 minutes). When the reclaim goroutine obtains fresh credentials, it needs a write lock on queueMux to swap in the new task.Queue client. That write lock is blocked until all in-flight CreateArtifact RLocks are released.
Under high artifact load:
- Multiple upload goroutines hold queueMux.RLock() during CreateArtifact calls
- If the queue is under load, httpbackoff retries extend the time these RLocks are held
- The reclaim goroutine successfully calls ReclaimTask and gets new credentials, but blocks on queueMux.Lock() waiting for upload goroutines to finish
- Upload goroutines continue using old credentials until they expire
- Expired credentials cause 401 (ext.certificate.expiry < now), which panics the worker
Additionally, FinishArtifact and DownloadArtifactToFile read task.Queue without any lock at all, which is a data race with the reclaim goroutine's write.
Related
- Micro-ddos when uploading 100k+ artifacts at once #8023 — original issue about micro-DDoS from 100k+ artifact uploads
- fix(generic-worker): limit concurrent artifact uploads to 100 #8032 — limited concurrent uploads to 100
- fix(generic-worker): limit concurrent artifact uploads to 10 #8167 — reduced limit to 10
Those PRs reduced the likelihood of this occurring by limiting concurrency, but didn't fix the underlying locking issue.