Handle AlreadyExists race in cluster image build (#337)
Conversation
`build_cluster_image` does a check-then-build that can race when parallel test sessions try to build the same image. The `images.get()` check passes for both, then one build succeeds and the other gets a `BuildError` with `AlreadyExists`. This showed up as every test failing in a CI cluster job: https://github.com/chrisguidry/docket/actions/runs/22020705636/job/63639829973

Co-Authored-By: Claude Opus 4.6 <[email protected]>
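The check-then-build race can be reproduced in miniature with an in-memory stand-in for the Docker image store. This is an illustrative sketch, not docket's actual code: `FakeImageStore`, the barrier, and the tag are all hypothetical, and the barrier deterministically forces both workers past the existence check before either builds.

```python
import threading


class AlreadyExists(Exception):
    """Stand-in for the BuildError containing 'AlreadyExists' (hypothetical)."""


class FakeImageStore:
    """In-memory stand-in for the Docker daemon's image store (hypothetical)."""

    def __init__(self) -> None:
        self._images: set[str] = set()
        self._lock = threading.Lock()

    def get(self, tag: str) -> None:
        if tag not in self._images:
            raise KeyError(tag)  # plays the role of docker.errors.ImageNotFound

    def build(self, tag: str) -> None:
        with self._lock:
            if tag in self._images:
                raise AlreadyExists(tag)  # only one build can win the tag
            self._images.add(tag)


barrier = threading.Barrier(2)
errors: list[Exception] = []


def build_cluster_image(store: FakeImageStore, tag: str) -> None:
    try:
        store.get(tag)  # check: is the image already built?
        return
    except KeyError:
        pass
    barrier.wait()  # force both workers past the check, exposing the race
    try:
        store.build(tag)  # build: the loser sees AlreadyExists
    except AlreadyExists as exc:
        errors.append(exc)


store = FakeImageStore()
threads = [
    threading.Thread(target=build_cluster_image, args=(store, "docket-cluster:test"))
    for _ in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(errors))  # → 1: exactly one worker loses the race
```

With real parallel pytest-xdist sessions the interleaving is nondeterministic, which is why the failure only surfaced intermittently in CI.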
📚 Documentation has been built for this PR! You can download the documentation directly here:
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
- Coverage   98.68%   98.63%   -0.05%
==========================================
  Files         103      103
  Lines       10386    10391       +5
  Branches      496      497       +1
==========================================
  Hits        10249    10249
- Misses        121      126       +5
  Partials       16       16
```
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 02922d06b8
```python
if "AlreadyExists" in str(e):
    pass
```
**Wait for image availability after `AlreadyExists`**

When `build_cluster_image` catches a `BuildError` containing `AlreadyExists`, it immediately returns the tag without confirming the image is actually present. The race occurs while another process is still building, so in parallel test runs with a cold cache (where a build or pull can take longer than ~3s), `containers.run` can still fail with `ImageNotFound` despite this change, because the retry window in `with_image_retry` is short. You should block/poll on `client.images.get(tag)` (or retry the build) before returning from this branch.
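The blocking wait the review suggests could be a generic poll helper along these lines. This is a sketch, not docket's code: `wait_for` and its parameters are illustrative, and in the `AlreadyExists` branch the predicate would wrap `client.images.get(tag)`, returning false while the other worker's build is still in flight.

```python
import time


def wait_for(predicate, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Poll `predicate` until it returns truthy or `timeout` expires.

    A generic sketch of the suggested blocking wait; returns True as soon
    as the predicate succeeds, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Example: a condition that only becomes true on the third poll.
polls: list[int] = []
ready = wait_for(lambda: polls.append(1) or len(polls) >= 3,
                 timeout=5.0, interval=0.01)
print(ready, len(polls))  # → True 3
```

A bounded timeout matters here: if the other worker's build fails outright, the waiter should fall through to retrying the build rather than hang forever.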
…ob (#338)

Two changes to improve CI reliability:

**Serialize cluster image builds with file lock**

The `AlreadyExists` fix in #337 handled one symptom of parallel xdist workers racing to build the same cluster image, but there's a second failure mode showing up in CI: https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()` completes successfully in the daemon, then tries to inspect the resulting image by its short ID. If another worker's build re-tagged the image in the meantime, the first image ID gets orphaned and the inspect 404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the builds with `fcntl.flock` so only one worker builds at a time. The others wait and find it already built. Eliminates both the `AlreadyExists` and `ImageNotFound` races structurally.

**Split CLI tests into separate CI job**

Cluster CI jobs consistently run right at the 4-minute timeout, and when any test runs slightly slow the whole job gets cancelled. This has been showing up in roughly a third of recent CI runs: https://github.com/chrisguidry/docket/actions/runs/22025359927/job/63641245074

The 91 CLI tests are subprocess-based and don't exercise backend-specific behavior — they spawn `python -m docket ...` processes and check output. Running them against every Python x Backend combination (30 matrix entries) is wasted effort. This moves CLI tests to their own job that varies by Python version but uses a single Redis backend (8.0). The main test matrix now passes `--ignore=tests/cli` so cluster/valkey/memory jobs only run the tests that actually care about the backend. Local `pytest` runs are unaffected.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>
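The flock approach can be sketched in miniature. All names here are hypothetical (docket's actual fixture, lock path, and build call differ), and note that `fcntl` is Unix-only; the key point is that the existence check and the build both happen while holding an exclusive lock, so losers of the race find the image already built instead of racing the build.

```python
import fcntl
import os
import tempfile


def build_serialized(tag: str, built: set[str], build_fn, lock_path: str) -> str:
    """Check-then-build under an exclusive flock (sketch).

    `built` stands in for the Docker daemon's image cache and `build_fn`
    for the actual image build; both are hypothetical.
    """
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until we hold the lock
        try:
            if tag not in built:  # losers of the race find it already built
                build_fn(tag)
                built.add(tag)
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
    return tag


lock_path = os.path.join(tempfile.gettempdir(), "docket-image-build.lock")
built: set[str] = set()
build_calls: list[str] = []

# Three "workers" attempt the build; only the first actually builds.
for _ in range(3):
    build_serialized("docket-cluster:test", built, build_calls.append, lock_path)
print(len(build_calls))  # → 1: the image is built exactly once
```

Because `flock` locks are tied to the open file description, this excludes separate xdist worker processes on the same host, which is exactly the scope of the race in CI.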