Serialize cluster image builds and split CLI tests into separate CI job#338
chrisguidry merged 11 commits into main
The `AlreadyExists` fix in #337 handled one symptom of parallel xdist workers racing to build the same cluster image, but there's a second failure mode showing up in CI:

https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()` completes successfully in the daemon, then tries to inspect the resulting image by its short ID. If another worker's build re-tagged the image in the meantime, the first image ID gets orphaned and the inspect 404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the builds with `fcntl.flock` so only one worker builds at a time. The others wait and find it already built. Eliminates both the `AlreadyExists` and `ImageNotFound` races structurally.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
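The serialization described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `build_lock`, `ensure_image`, `DEFAULT_LOCK`, and the `image_exists` / `build_image` callables are hypothetical names standing in for whatever the test helpers actually use to query and build the cluster image. The key point is that the existence check happens inside the lock, so a worker that waited sees the finished image instead of starting a competing build.

```python
import fcntl
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Callable, Iterator

# Hypothetical lock location; any path shared by all xdist workers would do.
DEFAULT_LOCK = Path(tempfile.gettempdir()) / "cluster-image-build.lock"


@contextmanager
def build_lock(lock_path: Path = DEFAULT_LOCK) -> Iterator[None]:
    """Hold an exclusive advisory lock on lock_path for the duration of the block."""
    with open(lock_path, "w") as lock_file:
        # Blocks until every other process holding the lock has released it.
        fcntl.flock(lock_file, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)


def ensure_image(
    tag: str,
    image_exists: Callable[[str], bool],
    build_image: Callable[[str], None],
    lock_path: Path = DEFAULT_LOCK,
) -> bool:
    """Build the image at most once; return True if this caller built it."""
    with build_lock(lock_path):
        # Check inside the lock: a worker that waited finds the image here
        # and skips the build entirely.
        if image_exists(tag):
            return False
        build_image(tag)
        return True
```

With this shape, the first worker to take the lock performs the build and every later worker returns without touching the Docker daemon's build path, which is why neither race can occur.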
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@             Coverage Diff              @@
##             main      #338       +/-   ##
============================================
+ Coverage   98.63%   100.00%    +1.36%
============================================
  Files         103        99        -4
  Lines       10391      3090     -7301
  Branches      497        28      -469
============================================
- Hits        10249      3090     -7159
+ Misses        126         0      -126
+ Partials       16         0       -16
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1538e9f043
including single-node Redis, Redis Cluster, and Valkey variants.
"""

import fcntl
Guard Unix-only fcntl import for cross-platform tests
This introduces a hard dependency on fcntl, which is unavailable on Windows; because tests/conftest.py imports tests._container unconditionally, pytest collection now fails immediately on Windows even for non-cluster or memory-backend runs. If contributors are expected to run tests cross-platform (the project metadata is OS-independent), this should be behind a platform guard or use a portable locking fallback.
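One way to address this, sketched under assumptions, is to choose the locking primitive at import time so the module still imports on Windows. Nothing here is taken from the PR: `exclusive_lock`, `_lock`, and `_unlock` are hypothetical names, and the Windows branch (which uses the standard library's `msvcrt.locking` on a single byte) is an illustrative fallback rather than something the project adopted.

```python
import os
import sys
from contextlib import contextmanager
from typing import Iterator

if sys.platform == "win32":
    import msvcrt

    def _lock(fd: int) -> None:
        # LK_LOCK retries once per second and raises OSError after 10 attempts.
        msvcrt.locking(fd, msvcrt.LK_LOCK, 1)

    def _unlock(fd: int) -> None:
        msvcrt.locking(fd, msvcrt.LK_UNLCK, 1)

else:
    import fcntl

    def _lock(fd: int) -> None:
        # Blocks until the exclusive advisory lock is available.
        fcntl.flock(fd, fcntl.LOCK_EX)

    def _unlock(fd: int) -> None:
        fcntl.flock(fd, fcntl.LOCK_UN)


@contextmanager
def exclusive_lock(path: str) -> Iterator[None]:
    """Hold a cross-process exclusive lock on path while the block runs."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT)
    try:
        _lock(fd)
        try:
            yield
        finally:
            _unlock(fd)
    finally:
        os.close(fd)
```

The two branches are not fully equivalent: `fcntl.flock` waits indefinitely, while `msvcrt.locking` with `LK_LOCK` gives up after about ten seconds, so a long image build on Windows would need a retry loop around `_lock`.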
Cluster CI jobs consistently run right at the 4-minute timeout, and when any test runs slightly slow the whole job gets cancelled. This has been showing up in roughly a third of recent CI runs:

https://github.com/chrisguidry/docket/actions/runs/22025359927/job/63641245074

The 91 CLI tests are subprocess-based and don't exercise backend-specific behavior: they spawn `python -m docket ...` processes and check output. Running them against every Python x Backend combination (30 matrix entries) is wasted effort.

This moves CLI tests to their own job that varies by Python version but uses a single Redis backend (8.0). The main test matrix now passes `--ignore=tests/cli` so cluster/valkey/memory jobs only run the tests that actually care about the backend. Local `pytest` runs are unaffected: they still use the `pyproject.toml` coverage config, run all tests, and cover everything.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
The old .coveragerc-memory was missing three files from its omit list that pyproject.toml had (tests/_container.py, tests/_key_leak_checker.py, src/docket/_prometheus_exporter.py). Memory backend uploads included those files with partial coverage, dragging the Codecov project total down to 98.63%. The new .coveragerc-core has a consistent omit list, but this commit also adds an ignore section to codecov.yml so that Codecov never counts these files, regardless of what coverage.py reports.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
The first CI run of the split showed core tests failing at 99% because `tests/cli/run.py` and `tests/cli/waiting.py` leaked through the `test_*.py` glob, and `register_collection` in `docket.py` was only tested via CLI tests.

Changes:
- Add `--cov-fail-under=100` to pyproject.toml as the local default
- Scope CLI job coverage to just `cli.py`, `_cli_support.py`, and `tests/cli/` so 100% is achievable for that slice
- Widen coveragerc-core omit to `tests/cli/*.py` (not just `test_*.py`)
- Add `test_register_collection` to core tests
- Remove stale `.coveragerc-memory` reference in `tests/cli/run.py`

Co-Authored-By: Claude Opus 4.6 <[email protected]>
--cov expects packages/directories, not individual files, so scoping coverage to just cli.py and _cli_support.py doesn't work with subprocess coverage. Instead, the CLI job uploads coverage for everything and Codecov merges it with core test uploads to enforce 100% across the board. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The previous attempt used --cov with file paths, but --cov only accepts packages/directories. Now using --cov=src/docket --cov=tests/cli with an explicit omit list in .coveragerc-cli that excludes all non-CLI source. Only cli.py, _cli_support.py, and tests/cli/ helper modules are measured. Co-Authored-By: Claude Opus 4.6 <[email protected]>
--cov accepts package/module names, not file paths. Using --cov=docket.cli --cov=docket._cli_support scopes coverage to just those modules without needing a .coveragerc-cli omit list to maintain. Co-Authored-By: Claude Opus 4.6 <[email protected]>
The `--cov=docket.cli` approach triggered a beartype circular import because coverage.py imports the module early to find its path. Moving CLI into `src/docket/cli/` lets us use `--cov=src/docket/cli` instead, which is just a directory lookup — no imports, no beartype drama. Also sets the stage for splitting CLI into multiple files down the road. Co-Authored-By: Claude Opus 4.6 <[email protected]>
These tests cover `cli/_support.py` code that's now excluded from core coverage. Without moving them, the `StopAsyncIteration` break path wasn't covered by the CLI job (it was covered by core tests, but core omits CLI source). Co-Authored-By: Claude Opus 4.6 <[email protected]>
Two changes to improve CI reliability:

Serialize cluster image builds with file lock

The `AlreadyExists` fix in #337 handled one symptom of parallel xdist workers racing to build the same cluster image, but there's a second failure mode showing up in CI:

https://github.com/chrisguidry/docket/actions/runs/22025132964/job/63640478732

When concurrent builds target the same tag, the Docker SDK's `build()` completes successfully in the daemon, then tries to inspect the resulting image by its short ID. If another worker's build re-tagged the image in the meantime, the first image ID gets orphaned and the inspect 404s. This knocked out 485 of 686 tests in the cluster job.

Rather than catching yet another exception type, this serializes the builds with `fcntl.flock` so only one worker builds at a time. The others wait and find it already built. Eliminates both the `AlreadyExists` and `ImageNotFound` races structurally.

Split CLI tests into separate CI job

Cluster CI jobs consistently run right at the 4-minute timeout, and when any test runs slightly slow the whole job gets cancelled. This has been showing up in roughly a third of recent CI runs:

https://github.com/chrisguidry/docket/actions/runs/22025359927/job/63641245074

The 91 CLI tests are subprocess-based and don't exercise backend-specific behavior: they spawn `python -m docket ...` processes and check output. Running them against every Python x Backend combination (30 matrix entries) is wasted effort.

This moves CLI tests to their own job that varies by Python version but uses a single Redis backend (8.0). The main test matrix now passes `--ignore=tests/cli` so cluster/valkey/memory jobs only run the tests that actually care about the backend. Local `pytest` runs are unaffected.