Handle AlreadyExists race in cluster image build (#337)
Conversation
`build_cluster_image` does a check-then-build that can race when parallel test sessions try to build the same image. The `images.get()` check passes for both, then one build succeeds and the other gets a `BuildError` with `AlreadyExists`. This showed up as every test failing in a CI cluster job: https://github.com/chrisguidry/docket/actions/runs/22020705636/job/63639829973

Co-Authored-By: Claude Opus 4.6 <[email protected]>
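The check-then-build race can be reproduced in miniature with an in-memory stand-in for the Docker image store. This is an illustrative sketch, not docket's actual code: `FakeImageStore`, the barrier, and the tag are all hypothetical, and the barrier deterministically forces both workers past the existence check before either builds.

```python
import threading


class AlreadyExists(Exception):
    """Stand-in for the BuildError containing 'AlreadyExists' (hypothetical)."""


class FakeImageStore:
    """In-memory stand-in for the Docker daemon's image store (hypothetical)."""

    def __init__(self) -> None:
        self._images: set[str] = set()
        self._lock = threading.Lock()

    def get(self, tag: str) -> None:
        if tag not in self._images:
            raise KeyError(tag)  # plays the role of docker.errors.ImageNotFound

    def build(self, tag: str) -> None:
        with self._lock:
            if tag in self._images:
                raise AlreadyExists(tag)  # only one build can win the tag
            self._images.add(tag)


barrier = threading.Barrier(2)
errors: list[Exception] = []


def build_cluster_image(store: FakeImageStore, tag: str) -> None:
    try:
        store.get(tag)  # check: is the image already built?
        return
    except KeyError:
        pass
    barrier.wait()  # force both workers past the check, exposing the race
    try:
        store.build(tag)  # build: the loser sees AlreadyExists
    except AlreadyExists as exc:
        errors.append(exc)


store = FakeImageStore()
threads = [
    threading.Thread(target=build_cluster_image, args=(store, "docket-cluster:test"))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(errors))  # → 1: exactly one worker loses the race
```

With real parallel pytest-xdist sessions the interleaving is nondeterministic, which is why the failure only surfaced intermittently in CI.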
📚 Documentation has been built for this PR! You can download the documentation directly here:
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
- Coverage   98.68%   98.63%   -0.05%
==========================================
  Files         103      103
  Lines       10386    10391       +5
  Branches      496      497       +1
==========================================
  Hits        10249    10249
- Misses        121      126       +5
  Partials       16       16
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02922d06b8
```python
if "AlreadyExists" in str(e):
    pass
```
**Wait for image availability after `AlreadyExists`**

When `build_cluster_image` catches a `BuildError` containing `AlreadyExists`, it immediately returns the tag without confirming the image is actually present. The race occurs while another process is still building, so in parallel test runs with a cold cache (where a build or pull can take longer than ~3s), `containers.run` can still fail with `ImageNotFound` despite this change, because the retry window in `with_image_retry` is short. You should block/poll on `client.images.get(tag)` (or retry the build) before returning from this branch.
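The blocking wait the review suggests could be a generic poll helper along these lines. This is a sketch, not docket's code: `wait_for` and its parameters are illustrative, and in the `AlreadyExists` branch the predicate would wrap `client.images.get(tag)`, returning false while the other worker's build is still in flight.

```python
import time


def wait_for(predicate, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `predicate` until it returns truthy or `timeout` expires.

    A generic sketch of the suggested blocking wait; returns True as soon
    as the predicate succeeds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: a condition that only becomes true on the third poll.
polls: list[int] = []
ready = wait_for(lambda: polls.append(1) or len(polls) >= 3,
                 timeout=5.0, interval=0.01)
print(ready, len(polls))  # → True 3
```

A bounded timeout matters here: if the other worker's build fails outright, the waiter should fall through to retrying the build rather than hang forever.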
…ob (#338)

Two changes to improve CI reliability:

**Serialize cluster image builds with file lock**

The `AlreadyExists` fix in #337 handled one symptom of parallel xdist workers racing to build the same cluster image, but there's a second failure mode showing up in CI: https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()` completes successfully in the daemon, then tries to inspect the resulting image by its short ID. If another worker's build re-tagged the image in the meantime, the first image ID gets orphaned and the inspect 404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the builds with `fcntl.flock` so only one worker builds at a time. The others wait and find it already built. Eliminates both the `AlreadyExists` and `ImageNotFound` races structurally.

**Split CLI tests into separate CI job**

Cluster CI jobs consistently run right at the 4-minute timeout, and when any test runs slightly slow the whole job gets cancelled. This has been showing up in roughly a third of recent CI runs: https://github.com/chrisguidry/docket/actions/runs/22025359927/job/63641245074

The 91 CLI tests are subprocess-based and don't exercise backend-specific behavior — they spawn `python -m docket ...` processes and check output. Running them against every Python x Backend combination (30 matrix entries) is wasted effort. This moves CLI tests to their own job that varies by Python version but uses a single Redis backend (8.0). The main test matrix now passes `--ignore=tests/cli` so cluster/valkey/memory jobs only run the tests that actually care about the backend. Local `pytest` runs are unaffected.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
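The flock approach can be sketched in miniature. All names here are hypothetical (docket's actual fixture, lock path, and build call differ), and note that `fcntl` is Unix-only; the key point is that the existence check and the build both happen while holding an exclusive lock, so losers of the race find the image already built instead of racing the build.

```python
import fcntl
import os
import tempfile


def build_serialized(tag: str, built: set[str], build_fn, lock_path: str) -> str:
    """Check-then-build under an exclusive flock (sketch).

    `built` stands in for the Docker daemon's image cache and `build_fn`
    for the actual image build; both are hypothetical.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we hold the lock
        try:
            if tag not in built:  # losers of the race find it already built
                build_fn(tag)
                built.add(tag)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return tag


lock_path = os.path.join(tempfile.gettempdir(), "docket-image-build.lock")
built: set[str] = set()
build_calls: list[str] = []

# Three "workers" attempt the build; only the first actually builds.
for _ in range(3):
    build_serialized("docket-cluster:test", built, build_calls.append, lock_path)
print(len(build_calls))  # → 1: the image is built exactly once
```

Because `flock` locks are tied to the open file description, this excludes separate xdist worker processes on the same host, which is exactly the scope of the race in CI.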