fix: self-install pinned spaCy model at runtime with SHA256 verification#4258
fix: self-install pinned spaCy model at runtime with SHA256 verification#4258
Conversation
Replace the en-core-web-sm direct URL dependency in pyproject.toml with the installer library. The spaCy model is now downloaded and installed on first use with hash verification, removing the need for [tool.uv.sources] and making the install more portable. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Use installer.utils.get_launcher_kind() instead of hardcoded "win-ia32" for correct 64-bit Windows support - Wrap file handle in with statement for SHA256 computation - Call importlib.invalidate_caches() after runtime model install so Python's FileFinder discovers the new package Co-Authored-By: Claude Opus 4.6 <[email protected]>
Code reviewNo issues found. Checked for bugs and CLAUDE.md compliance. 🤖 Generated with Claude Code |
- Catch FileExistsError during wheel install so parallel processes (e.g. pytest-xdist workers) don't crash when another process already installed the model - Replace urlretrieve with urlopen + 120s socket timeout to prevent indefinite hangs in network-restricted environments - Trigger spaCy model self-install during Docker build so offline deployments (HF_HUB_OFFLINE=1) have the model available at runtime Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ondition Install the wheel into a temp staging directory first, then move items into site-packages. This prevents the incomplete-install race where one process sees a partial model (e.g. __init__.py exists but model data is missing) and fails with OSError: [E050] Can't find model. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
@claude review |
|
Claude finished @badGarnet's task —— View job Code Review for spaCy Model Self-InstallationReviewing the changes that replace build-time spaCy model dependency with runtime self-installation. Todo List:
Review SummaryThis PR implements a significant architectural change from build-time spaCy model dependency to runtime self-installation. The implementation is generally well-designed with good concurrency handling and error messages, but has several areas that need attention. Critical Issues1. Docker Offline Mode Conflict
|
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.
|
@claude: the hf offline mode is for huggingface, not for spacy models; please review with that in mind |
|
Claude finished @badGarnet's task —— View job Updated Code Review for spaCy Model Self-InstallationRe-reviewing the implementation with clarification that Todo List:
Updated Review SummaryThank you for the clarification about Critical Issues Resolved ✅1. Docker Integration (Previously Flagged - Now Resolved)With the understanding that
2. Bugbot Issues - Most Addressed ✅Looking at the current implementation, I can see that most Bugbot concerns have been resolved:
Current Implementation Strengths ✅Excellent Design Decisions
Minor Areas for Enhancement1. Error Recovery EnhancementConsider adding verification that the installed model version matches expectations: # After successful spacy.load(), verify version
if nlp.meta.get("version") != _SPACY_MODEL_VERSION:
logger.warning("Installed model version mismatch: expected %s, got %s",
_SPACY_MODEL_VERSION, nlp.meta.get("version"))2. Timeout ConfigurationThe 120-second timeout is reasonable but could be made configurable: _DOWNLOAD_TIMEOUT_SECONDS: Final[int] = int(os.getenv("SPACY_MODEL_DOWNLOAD_TIMEOUT", "120"))3. Debug LoggingConsider adding debug-level logs for troubleshooting: logger.debug("Attempting to install spaCy model to %s", site_packages)
logger.debug("Downloaded model hash: %s", sha256)Remaining Bugbot AssessmentValid Concerns (Addressed)
Invalid/Low Priority Concerns
Final AssessmentThis is now a solid, production-ready implementation. The architectural change from build-time to runtime installation is well-executed with:
Recommendation: APPROVEThe implementation addresses all major concerns and follows best practices. The minor enhancements suggested above are optional improvements that don't block merging. |
Summary
en-core-web-smdirect URL dependency inpyproject.tomlwith theinstallerlibrary[tool.uv.sources]section, making the install more portable across package managersTest plan
tokenize.pydownloads and installs the spaCy model on first use🤖 Generated with Claude Code
Note
Medium Risk
Introduces runtime download-and-install behavior into the NLP path and writes into
site-packages, which can fail under restricted networking/permissions or in unusual multi-process environments despite locking and hash checks.Overview
Updates NLP tokenization to lazy-load and self-install the pinned
en_core_web_smspaCy model on first use, downloading the wheel from GitHub and verifying it via SHA256, with a cross-processFileLockto avoid concurrent installs.Removes the
en-core-web-smwheel URL dependency and[tool.uv.sources]override, addinginstaller(for wheel installation) andfilelock(for install locking) to dependencies;uv.lockis updated accordingly and the version is bumped to0.21.2.Adjusts the
Dockerfileto trigger model installation during image build (viauv runimporting_get_nlp) so the model is present beforeHF_HUB_OFFLINE=1is set.Written by Cursor Bugbot for commit df62a9c. This will update automatically on new commits. Configure here.