Skip to content

fix: self-install pinned spaCy model at runtime with SHA256 verification#4258

Merged
badGarnet merged 9 commits intomainfrom
fix/spacy-model-self-install
Feb 23, 2026
Merged

fix: self-install pinned spaCy model at runtime with SHA256 verification#4258
badGarnet merged 9 commits intomainfrom
fix/spacy-model-self-install

Conversation

@lawrence-u10d
Copy link
Copy Markdown
Contributor

@lawrence-u10d lawrence-u10d commented Feb 22, 2026

Summary

  • Replace en-core-web-sm direct URL dependency in pyproject.toml with the installer library
  • spaCy model is now downloaded and installed on first use with SHA256 hash verification
  • Removes [tool.uv.sources] section, making the install more portable across package managers

Test plan

  • Verify tokenize.py downloads and installs the spaCy model on first use
  • Verify SHA256 hash check rejects tampered wheels
  • Verify existing NLP tokenization tests pass

🤖 Generated with Claude Code


Note

Medium Risk
Introduces runtime download-and-install behavior into the NLP path and writes into site-packages, which can fail under restricted networking/permissions or in unusual multi-process environments despite locking and hash checks.

Overview
Updates NLP tokenization to lazy-load and self-install the pinned en_core_web_sm spaCy model on first use, downloading the wheel from GitHub and verifying it via SHA256, with a cross-process FileLock to avoid concurrent installs.

Removes the en-core-web-sm wheel URL dependency and [tool.uv.sources] override, adding installer (for wheel installation) and filelock (for install locking) to dependencies; uv.lock is updated accordingly and the version is bumped to 0.21.2.

Adjusts the Dockerfile to trigger model installation during image build (via uv run importing _get_nlp) so the model is present before HF_HUB_OFFLINE=1 is set.

Written by Cursor Bugbot for commit df62a9c. This will update automatically on new commits. Configure here.

Replace the en-core-web-sm direct URL dependency in pyproject.toml with
the installer library. The spaCy model is now downloaded and installed on
first use with hash verification, removing the need for [tool.uv.sources]
and making the install more portable.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
lawrence-u10d and others added 3 commits February 22, 2026 15:59
Co-Authored-By: Claude Opus 4.6 <[email protected]>
- Use installer.utils.get_launcher_kind() instead of hardcoded
  "win-ia32" for correct 64-bit Windows support
- Wrap file handle in with statement for SHA256 computation
- Call importlib.invalidate_caches() after runtime model install
  so Python's FileFinder discovers the new package

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@lawrence-u10d
Copy link
Copy Markdown
Contributor Author

Code review

No issues found. Checked for bugs and CLAUDE.md compliance.

🤖 Generated with Claude Code

lawrence-u10d and others added 2 commits February 22, 2026 16:20
- Catch FileExistsError during wheel install so parallel processes
  (e.g. pytest-xdist workers) don't crash when another process already
  installed the model
- Replace urlretrieve with urlopen + 120s socket timeout to prevent
  indefinite hangs in network-restricted environments
- Trigger spaCy model self-install during Docker build so offline
  deployments (HF_HUB_OFFLINE=1) have the model available at runtime

Co-Authored-By: Claude Opus 4.6 <[email protected]>
…ondition

Install the wheel into a temp staging directory first, then move items
into site-packages. This prevents the incomplete-install race where one
process sees a partial model (e.g. __init__.py exists but model data is
missing) and fails with OSError: [E050] Can't find model.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@badGarnet
Copy link
Copy Markdown
Collaborator

@claude review

@claude
Copy link
Copy Markdown

claude bot commented Feb 22, 2026

Claude finished @badGarnet's task —— View job


Code Review for spaCy Model Self-Installation

Reviewing the changes that replace build-time spaCy model dependency with runtime self-installation.

Todo List:

  • Read and analyze main implementation in tokenize.py
  • Review Docker integration changes
  • Examine dependency changes in pyproject.toml and lockfile
  • Assess security implications of runtime downloads
  • Evaluate error handling and edge cases
  • Consider concurrency and race condition handling
  • Review existing Bugbot feedback and user responses
  • Provide comprehensive review feedback

Review Summary

This PR implements a significant architectural change from build-time spaCy model dependency to runtime self-installation. The implementation is generally well-designed with good concurrency handling and error messages, but has several areas that need attention.

Critical Issues

1. Docker Offline Mode Conflict ⚠️ HIGH PRIORITY

The Docker setup triggers spaCy model installation at build time (line 76), but then sets HF_HUB_OFFLINE=1 for runtime. This creates an inconsistency:

  • If the build-time load fails, the container will ship without the model
  • Runtime downloads will fail in offline mode if the model wasn't installed during build

Recommendation: Either remove HF_HUB_OFFLINE=1 or ensure the build definitively fails if the model can't be installed.

2. Module Import Blocking

The current design with _get_nlp() being called from module-level code means importing tokenize.py can block for up to 2+ minutes on first use. This affects:

  • Application startup time
  • Error handling during import failures
  • Testing and development workflows

Recommendation: Consider implementing true lazy loading where the model is only loaded when tokenization functions are actually called.

3. Error Handling Gaps

While the recent commits added better error handling, there are still edge cases:

Permission Errors: Installation failures due to read-only site-packages or temp directory issues could leave users with unclear error messages.

Network Failures: The error handling for network issues is good, but the user experience could be improved with more specific guidance (firewall, proxy, DNS issues).

Design Assessment

Strengths

Excellent concurrency handling: The file locking mechanism with double-check pattern is well-implemented
Security: SHA256 verification prevents tampering
Staging installation: Atomic installation prevents corruption
Good error messages: Recent improvements provide helpful guidance
Proper cleanup: Temporary directories and file handles are managed correctly

Areas for Improvement

1. Error Recovery

Currently, a failed installation can leave remnants that prevent subsequent attempts. Consider:

# Better cleanup on failure
try:
    shutil.move(src, dst)
except OSError:
    # Clean up any partial installations before raising
    self._cleanup_partial_install()
    raise

2. Installation Verification

The current verification only checks that spacy.load() succeeds after installation. Consider adding:

  • Verification that the installed model version matches expected version
  • Basic functionality test (e.g., tokenize a simple sentence)

3. Timeout Configuration

The 120-second timeout is hardcoded. Consider making it configurable via environment variable for restricted networks.

Bugbot Feedback Assessment

Reviewing the Cursor Bugbot findings and your responses:

Addressed Issues ✅

  • Import cache invalidation: Fixed with importlib.invalidate_caches()
  • File handle leaks: Fixed with proper with statements
  • Windows launcher kind: Fixed using get_launcher_kind()
  • Download timeout: Implemented with urllib.request.urlopen(timeout=...)
  • Concurrent installation: Fixed with FileLock
  • Better error messages: Implemented comprehensive error handling

Remaining Concerns

  • Docker offline mode: As mentioned above, this needs resolution
  • Module-level initialization: This design choice may need reconsideration for better user experience

Minor Issues

  1. Code Style: Consider extracting constants for magic numbers like the lock timeout (-1)
  2. Logging: The current logging is good, but consider adding debug-level logs for troubleshooting
  3. Type Hints: Some return types could be more specific (e.g., the installer imports)

Testing Recommendations

Since I cannot run tests due to permissions, please ensure coverage for:

  • First-time installation in clean environment
  • Concurrent installation attempts
  • Network failure scenarios
  • Permission error scenarios
  • Docker build with and without network access
  • Hash mismatch scenarios

Final Assessment

This is a solid implementation of a complex feature. The core logic is sound and the error handling has been significantly improved. The main concerns are around Docker integration and user experience during import. With the Docker offline mode issue resolved, this would be ready for production use.

Recommendation: Address the Docker offline mode conflict before merging, as this could cause silent failures in production deployments.

Copy link
Copy Markdown

@cursor cursor bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Bugbot Autofix is OFF. To automatically fix reported issues with Cloud Agents, enable Autofix in the Cursor dashboard.

@badGarnet
Copy link
Copy Markdown
Collaborator

@claude: the hf offline mode is for huggingface, not for spacy models; please review with that in mind

@claude
Copy link
Copy Markdown

claude bot commented Feb 23, 2026

Claude finished @badGarnet's task —— View job


Updated Code Review for spaCy Model Self-Installation

Re-reviewing the implementation with clarification that HF_HUB_OFFLINE=1 is for HuggingFace models, not spaCy.

Todo List:

  • Read current tokenize.py implementation
  • Review Docker integration with corrected HF understanding
  • Assess remaining Bugbot feedback validity
  • Evaluate recent commits and fixes
  • Provide updated comprehensive review feedback

Updated Review Summary

Thank you for the clarification about HF_HUB_OFFLINE=1 being specific to HuggingFace models. This significantly changes my assessment of the Docker integration.

Critical Issues Resolved ✅

1. Docker Integration (Previously Flagged - Now Resolved)

With the understanding that HF_HUB_OFFLINE=1 only affects HuggingFace models, the Docker setup is actually well-designed:

  • Line 76: uv run triggers spaCy model installation during build when network is available
  • Line 81: HF_HUB_OFFLINE=1 only restricts HuggingFace downloads at runtime, not spaCy
  • This ensures the spaCy model is installed at build time while allowing HF model restrictions

2. Bugbot Issues - Most Addressed ✅

Looking at the current implementation, I can see that most Bugbot concerns have been resolved:

  • File handle leaks: Fixed with proper with open() context manager (line 64)
  • Import cache invalidation: Fixed with importlib.invalidate_caches() (lines 123, 129)
  • Launcher kind: Fixed using get_launcher_kind() (line 82)
  • Download timeout: Implemented with 120-second timeout (line 42)
  • Concurrent installation: Fixed with FileLock (line 121)
  • Better error messages: Comprehensive error handling throughout

Current Implementation Strengths ✅

Excellent Design Decisions

  1. True Lazy Loading: Changed from module-level initialization to @lru_cache(maxsize=1) decorator on _get_nlp() - this addresses the import blocking concern
  2. Atomic Installation: Staging directory approach prevents corruption from concurrent installs
  3. Comprehensive Error Handling: All failure modes now provide actionable error messages
  4. Security: SHA256 verification prevents tampering
  5. Cross-Process Safety: FileLock with double-check pattern handles concurrency perfectly
  6. Proper Cleanup: Staging remnants are cleaned up before installation

Minor Areas for Enhancement

1. Error Recovery Enhancement

Consider adding verification that the installed model version matches expectations:

# After successful spacy.load(), verify version
if nlp.meta.get("version") != _SPACY_MODEL_VERSION:
    logger.warning("Installed model version mismatch: expected %s, got %s", 
                   _SPACY_MODEL_VERSION, nlp.meta.get("version"))

2. Timeout Configuration

The 120-second timeout is reasonable but could be made configurable:

_DOWNLOAD_TIMEOUT_SECONDS: Final[int] = int(os.getenv("SPACY_MODEL_DOWNLOAD_TIMEOUT", "120"))

3. Debug Logging

Consider adding debug-level logs for troubleshooting:

logger.debug("Attempting to install spaCy model to %s", site_packages)
logger.debug("Downloaded model hash: %s", sha256)

Remaining Bugbot Assessment

Valid Concerns (Addressed)

  • Module-level initialization: ✅ Fixed with lazy loading via _get_nlp()
  • Docker offline mode: ✅ Not an issue - HF_HUB_OFFLINE doesn't affect spaCy
  • Network failures: ✅ Fixed with comprehensive error handling and timeouts
  • Permission errors: ✅ Fixed with helpful error messages
  • Concurrency issues: ✅ Fixed with FileLock

Invalid/Low Priority Concerns

  • Staging directory creation: User correctly noted that installer creates directories as needed
  • Lock file permissions: As noted, if temp dir isn't writable, installation would fail anyway
  • Empty staging directory: Recent error handling improvements make this non-silent

Final Assessment

This is now a solid, production-ready implementation. The architectural change from build-time to runtime installation is well-executed with:

  • ✅ Proper lazy loading (no import blocking)
  • ✅ Excellent error handling and user guidance
  • ✅ Robust concurrency protection
  • ✅ Security via hash verification
  • ✅ Clean Docker integration that respects both spaCy and HuggingFace requirements

Recommendation: APPROVE

The implementation addresses all major concerns and follows best practices. The minor enhancements suggested above are optional improvements that don't block merging.


@badGarnet badGarnet enabled auto-merge February 23, 2026 00:25
@badGarnet badGarnet added this pull request to the merge queue Feb 23, 2026
Merged via the queue into main with commit 4a77a8c Feb 23, 2026
52 checks passed
@badGarnet badGarnet deleted the fix/spacy-model-self-install branch February 23, 2026 01:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants