Handle AlreadyExists race in cluster image build #337

Merged
chrisguidry merged 1 commit into main from fix-cluster-image-race
Feb 14, 2026

@chrisguidry
Owner

`build_cluster_image` does a check-then-build that can race when parallel
test sessions try to build the same image. The `images.get()` check passes
for both, then one build succeeds and the other gets a `BuildError` with
`AlreadyExists`. This showed up as every test failing in a CI cluster job:

https://github.com/chrisguidry/docket/actions/runs/22020705636/job/63639829973
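
A minimal sketch of the tolerant check-then-build described above, not the actual change in `tests/_container.py`. The function name `build_cluster_image_tolerant` and the `images` parameter are hypothetical, and the two exception classes are local stand-ins for `docker.errors.ImageNotFound` and `docker.errors.BuildError` so the example runs without a Docker daemon.

```python
class ImageNotFound(Exception):
    """Stand-in for docker.errors.ImageNotFound (assumption for this sketch)."""


class BuildError(Exception):
    """Stand-in for docker.errors.BuildError (assumption for this sketch)."""


def build_cluster_image_tolerant(images, tag, context_path):
    """Build `tag` unless it exists, tolerating a concurrent build of the same tag.

    `images` is expected to look like a Docker SDK image collection,
    exposing get(tag) and build(path=..., tag=...).
    """
    try:
        images.get(tag)  # the "check" half of check-then-build
        return tag
    except ImageNotFound:
        pass

    try:
        images.build(path=context_path, tag=tag)
    except BuildError as e:
        # Another session won the race and created the image first.
        # Any other build failure is still a real error.
        if "AlreadyExists" not in str(e):
            raise
    return tag
```

Matching on the error string is fragile, but it keeps the handling narrow: only the race outcome is swallowed, and every other build failure still propagates.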

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@github-actions

📚 Documentation has been built for this PR!

You can download the documentation directly here:
https://github.com/chrisguidry/docket/actions/runs/22024942386/artifacts/5513014403

@codecov-commenter

codecov-commenter commented Feb 14, 2026

Codecov Report

❌ Patch coverage is 0% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 98.63%. Comparing base (61bc093) to head (02922d0).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tests/_container.py | 0.00% | 6 Missing ⚠️ |

@@            Coverage Diff             @@
##             main     #337      +/-   ##
==========================================
- Coverage   98.68%   98.63%   -0.05%     
==========================================
  Files         103      103              
  Lines       10386    10391       +5     
  Branches      496      497       +1     
==========================================
  Hits        10249    10249              
- Misses        121      126       +5     
  Partials       16       16              
| Flag | Coverage Δ |
|---|---|
| python-3.10 | 98.63% <0.00%> (-0.05%) ⬇️ |
| python-3.11 | 97.30% <0.00%> (-0.04%) ⬇️ |
| python-3.12 | 98.63% <0.00%> (-0.05%) ⬇️ |
| python-3.13 | 98.63% <0.00%> (-0.05%) ⬇️ |
| python-3.14 | 98.62% <0.00%> (-0.05%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Files with missing lines | Coverage Δ |
|---|---|
| tests/_container.py | 25.69% <0.00%> (-0.93%) ⬇️ |
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 02922d06b8

Comment on lines +187 to +188
    if "AlreadyExists" in str(e):
        pass

P1: Wait for image availability after `AlreadyExists`

When `build_cluster_image` catches a `BuildError` containing `AlreadyExists`, it immediately returns the tag without confirming that the image is actually present. The error can be raised while another process is still building, so in parallel test runs with a cold cache (where a build or pull can take longer than ~3s), `containers.run` can still fail with `ImageNotFound` despite this change, because the retry window in `with_image_retry` is short. Block or poll on `client.images.get(tag)` (or retry the build) before returning from this branch.
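
One way the suggested polling could look, as a rough sketch rather than code from this repository. The helper name `wait_for_image`, its `timeout` and `interval` defaults, and the `images` parameter are all assumptions; `ImageNotFound` is a local stand-in for `docker.errors.ImageNotFound` so the example runs without Docker installed.

```python
import time


class ImageNotFound(Exception):
    """Stand-in for docker.errors.ImageNotFound (assumption for this sketch)."""


def wait_for_image(images, tag, timeout=60.0, interval=0.5):
    """Poll images.get(tag) until the image appears or `timeout` seconds pass.

    Returns whatever images.get(tag) returns; re-raises ImageNotFound
    once the deadline has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            return images.get(tag)
        except ImageNotFound:
            if time.monotonic() >= deadline:
                raise
            time.sleep(interval)
```

Called from the `AlreadyExists` branch, a helper like this would hold the caller until the competing build has finished tagging the image, instead of relying on the short retry window mentioned above.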

@chrisguidry chrisguidry merged commit aace0e1 into main Feb 14, 2026
40 checks passed
@chrisguidry chrisguidry deleted the fix-cluster-image-race branch February 14, 2026 22:13
chrisguidry added a commit that referenced this pull request Feb 14, 2026
The `AlreadyExists` fix in #337 handled one symptom of parallel xdist
workers racing to build the same cluster image, but there's a second
failure mode showing up in CI:

https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()`
completes successfully in the daemon, then tries to inspect the resulting
image by its short ID. If another worker's build re-tagged the image in
the meantime, the first image ID gets orphaned and the inspect 404s.
This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the
builds with `fcntl.flock` so only one worker builds at a time. The
others wait and find it already built. Eliminates both the
`AlreadyExists` and `ImageNotFound` races structurally.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
chrisguidry added a commit that referenced this pull request Feb 15, 2026
…ob (#338)

Two changes to improve CI reliability:

**Serialize cluster image builds with file lock**

The `AlreadyExists` fix in #337 handled one symptom of parallel xdist
workers racing to build the same cluster image, but there's a second
failure mode showing up in CI:


https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()`
completes successfully in the daemon, then tries to inspect the
resulting image by its short ID. If another worker's build re-tagged the
image in the meantime, the first image ID gets orphaned and the inspect
404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the
builds with `fcntl.flock` so only one worker builds at a time. The
others wait and find it already built. Eliminates both the
`AlreadyExists` and `ImageNotFound` races structurally.
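
The serialization described above can be sketched roughly as follows. This is an illustration, not the code from the commit: `exclusive_file_lock`, `build_once`, the `image_exists` and `build` callbacks, and the lock path are hypothetical names, and `fcntl` is only available on Unix-like systems.

```python
import fcntl
from contextlib import contextmanager


@contextmanager
def exclusive_file_lock(path):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, "w") as lock_file:
        # Blocks until every other holder of the lock has released it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def build_once(tag, image_exists, build, lock_path):
    """Run the check-then-build under the lock so only one worker builds.

    Workers that were waiting on the lock re-run the check after acquiring
    it, find the image already present, and skip the build.
    """
    with exclusive_file_lock(lock_path):
        if not image_exists(tag):
            build(tag)
    return tag
```

Because the existence check happens inside the lock, the check and the build are atomic with respect to other workers using the same lock file, which is what removes the race rather than handling its symptoms.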

**Split CLI tests into separate CI job**

Cluster CI jobs consistently run right at the 4-minute timeout, and when
any test runs slightly slow the whole job gets cancelled. This has been
showing up in roughly a third of recent CI runs:


https://github.com/chrisguidry/docket/actions/runs/22025359927/job/63641245074

The 91 CLI tests are subprocess-based and don't exercise
backend-specific behavior — they spawn `python -m docket ...` processes
and check output. Running them against every Python x Backend
combination (30 matrix entries) is wasted effort.

This moves CLI tests to their own job that varies by Python version but
uses a single Redis backend (8.0). The main test matrix now passes
`--ignore=tests/cli` so cluster/valkey/memory jobs only run the tests
that actually care about the backend. Local `pytest` runs are
unaffected.

---------

Co-authored-by: Claude Opus 4.6 <[email protected]>