Skip to content

[BUG FIX]: hf_hub_download crashes when stderr lacks a real file descriptor#4065

Merged
Wauplin merged 4 commits intohuggingface:mainfrom
tobocop2:fix/tqdm-bad-fd-safety
Apr 27, 2026
Merged

[BUG FIX]: hf_hub_download crashes when stderr lacks a real file descriptor#4065
Wauplin merged 4 commits intohuggingface:mainfrom
tobocop2:fix/tqdm-bad-fd-safety

Conversation

@tobocop2
Copy link
Copy Markdown
Contributor

@tobocop2 tobocop2 commented Apr 7, 2026

Problem

hf_hub_download crashes with ValueError: bad value(s) in fds_to_keep when sys.stderr.fileno() returns -1. The crash happens during tqdm's progress bar setup, before the download itself runs.

In practice this hits Textual TUI apps (Textual's app.py returns -1 from fileno()) and any custom stderr wrapper that does the same. CPython's multiprocessing/resource_tracker silently swallows exceptions from fileno(), so streams like io.StringIO, Jupyter's OutStream, and pytest's CaptureIO — which all raise on fileno() — are not affected. Two more preconditions are needed for the crash: the multiprocessing start method must be spawn or forkserver, and the resource tracker must not have been started already.

Root cause

The first time tqdm's default lock is touched, it lazily constructs TqdmDefaultWriteLock, which calls cls.create_mp_lock(). That spawns the multiprocessing resource tracker via fork_exec, which rejects stderr.fileno() == -1.

Why this fix

Setting a thread-only RLock on the HF tqdm subclass at module import short-circuits the lazy init: tqdm.get_lock() finds _lock already set and skips the create_mp_lock path entirely. The patch is one line plus a comment.

The trade-off is no inter-process tqdm bar coordination on the HF subclass — multiple separate Python processes printing tqdm bars to the same TTY may visually interleave. Downloads themselves are unaffected, and huggingface_hub parallelizes via threads inside one process (not subprocesses), so this is not a scenario the library targets.

Crash demo

Crash demo

Fixes #4066


Note

Medium Risk
Changes global progress-bar locking behavior by overriding tqdm's default multiprocessing lock at import time, which could affect multi-process progress bar coordination. Download logic is unchanged, and risk is largely limited to progress display behavior.

Overview
Prevents hf_hub_download from crashing when sys.stderr.fileno() returns -1 (e.g. Textual TUIs) by setting a thread-only RLock via tqdm.set_lock(...), avoiding tqdm’s lazy multiprocessing lock/resource-tracker initialization.

Adds a regression test that patches sys.stderr with a bad fileno() implementation and verifies the download completes successfully.

Reviewed by Cursor Bugbot for commit 16d5a58. Bugbot is set up for automated code reviews on this repo. Configure here.

@tobocop2 tobocop2 force-pushed the fix/tqdm-bad-fd-safety branch from f08d2e6 to 6eff9fb Compare April 7, 2026 21:52
@tobocop2 tobocop2 changed the title fix: handle tqdm crash when stderr lacks a real file descriptor fix: handle hf_hub_download crash when stderr lacks a real file descriptor Apr 7, 2026
@tobocop2 tobocop2 force-pushed the fix/tqdm-bad-fd-safety branch from 6f4b2a4 to 43399d3 Compare April 8, 2026 03:52
@tobocop2 tobocop2 changed the title fix: handle hf_hub_download crash when stderr lacks a real file descriptor fix: hf_hub_download crashes when stderr lacks a real file descriptor Apr 8, 2026
@tobocop2 tobocop2 changed the title fix: hf_hub_download crashes when stderr lacks a real file descriptor [BUG FIX]: hf_hub_download crashes when stderr lacks a real file descriptor Apr 8, 2026
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 59aa035. Configure here.

Comment thread src/huggingface_hub/utils/tqdm.py Outdated
@tobocop2 tobocop2 force-pushed the fix/tqdm-bad-fd-safety branch 2 times, most recently from 98872dc to cf1615c Compare April 13, 2026 17:11
Add _SafeTqdm fallback with thread lock for environments where
stderr.fileno() is invalid (Textual, Jupyter, pytest, WSGI, etc.).
Wrap progress bar construction in try/except to degrade gracefully
instead of crashing the download.
@tobocop2 tobocop2 force-pushed the fix/tqdm-bad-fd-safety branch from cf1615c to 037f4ad Compare April 13, 2026 17:12
@tobocop2
Copy link
Copy Markdown
Contributor Author

Friendly bump on this, happy to address any feedback if needed. Let me know if there's anything blocking review.

@Wauplin
Copy link
Copy Markdown
Contributor

Wauplin commented Apr 23, 2026

(generated by Claude)

Thanks for digging into this @tobocop2! I reproduced the crash and it's a real bug but I'd like to push back a bit on both the framing and the size of the fix before merging.

On the affected-environments list

The PR description says this affects Textual, Jupyter, pytest capture, gunicorn/uvicorn, sandboxed runtimes, and IDE terminals. In practice that list is largely overstated. The relevant code path in CPython is:

# multiprocessing/resource_tracker.py (ensure_running)
fds_to_pass = []
try:
    fds_to_pass.append(sys.stderr.fileno())
except Exception:
    pass

Anything that raises from fileno() is silently swallowed — only a wrapper that explicitly returns -1 triggers the crash. Quick survey:

Environment fileno() behavior Crashes?
io.StringIO / BytesIO raises UnsupportedOperation No
pytest capture (CaptureIO, DontReadFromInput) raises UnsupportedOperation No
Jupyter / ipykernel OutStream raises UnsupportedOperation (or returns a real fd) No
Textual (src/textual/app.py) literal return -1 Yes
uvicorn / gunicorn don't replace stderr No

Plus two compounding preconditions: the multiprocessing start method has to be spawn/forkserver (not the Linux fork default), and the resource tracker must not have been started by something else first.

So in practice the realistic trigger is Textual (and any custom wrapper that returns -1 on purpose). That's still worth fixing — TUIs that download models are a legitimate use case — but I'd trim the description so future readers don't go hunting for a Jupyter repro that doesn't exist (at least AFAIK).

On the fix

The current patch adds a _SafeTqdm subclass plus two try/except (OSError, ValueError) guards. I would prefer a smaller alternative: proactively set a thread lock on our tqdm subclass at module import, before tqdm ever tries to build an mp lock. Something like:

# in utils/tqdm.py, near the tqdm subclass definition
import threading
tqdm.set_lock(threading.RLock())  # avoid the mp-lock fork_exec path entirely

That would:

  • Eliminate the crash proactively (no try/except, no fallback class).
  • Keep behavior identical for the 99.9% of users on fork / real terminals.
  • Only "lose" inter-process tqdm coordination for people who use the HF subclass across multiple processes — which I don't think is a real use case for us.

If you'd rather keep the reactive shape, then the _SafeTqdm class can be dropped: the fallback path always disables the bar, so just returning a disabled old_tqdm with _lock patched on first failure would be enough.

Happy to be wrong about the proactive approach if there's a multiprocessing scenario I'm forgetting — but either way I think we can land this in ~5 lines instead of ~30. WDYT?

@Wauplin
Copy link
Copy Markdown
Contributor

Wauplin commented Apr 23, 2026

Hey @tobocop2 , the comment above has been AI-generated after some investigation from both me and a local Claude, so I agree with what's stated above. I also confirm that I reproduced the bug in Textual but not in other cases you've mentioned (explaining why we haven't got any reports so far). About the solution I'd prefer a smaller one even if it means poorer UX in rare scenarios.

Replace the _SafeTqdm class and the two try/except (OSError, ValueError)
guards in _create_progress_bar with a single tqdm.set_lock(threading.RLock())
call at module import. This proactively short-circuits tqdm.get_lock()'s
lazy TqdmDefaultWriteLock construction, so the multiprocessing
create_mp_lock path that calls fork_exec is never taken on the HF subclass.

The fork_exec path is what fails when sys.stderr.fileno() returns -1
(realistically: Textual TUIs and any custom wrapper that explicitly returns
-1 from fileno()). CPython's multiprocessing/resource_tracker silently
swallows fileno() exceptions, so StringIO-backed streams (Jupyter, pytest
capture, etc.) are not affected.

The proactive lock sacrifices inter-process tqdm bar coordination on the
HF subclass, which is not a supported use case. Drops the now-obsolete
_SafeTqdm class, the warnings.warn fallback, and the two unit tests that
exercised the try/except fallback path. Keeps the end-to-end regression
test in test_file_download.py.
@tobocop2 tobocop2 force-pushed the fix/tqdm-bad-fd-safety branch from 3ab64b5 to fc92298 Compare April 23, 2026 17:40
@tobocop2
Copy link
Copy Markdown
Contributor Author

tobocop2 commented Apr 23, 2026

Hey @tobocop2 , the comment above has been AI-generated after some investigation from both me and a local Claude, so I agree with what's stated above. I also confirm that I reproduced the bug in Textual but not in other cases you've mentioned (explaining why we haven't got any reports so far). About the solution I'd prefer a smaller one even if it means poorer UX in rare scenarios.

thanks for the review @Wauplin
I like your solution a lot. Thanks again.

I've just reflected your feedback and updated the PR and description.

@bot-ci-comment
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Copy link
Copy Markdown
Contributor

@Wauplin Wauplin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change @tobocop2 👍

Learnt a few things along the way :) Feels like we are fixing an issue mostly due to how Textual handles no file descriptors (with a -1 instead of a proper UnsupportedOperation) + how CPython do not check for invalid fileno values here. Wonder if this should live in tqdm directly instead but I'm fine with having the fix here since it's small.

Note that if we notice any drawback in the future we might rollback to another solution (e.g. by explicitly calling sys.stderr.fileno() in HF custom tqdm to disable it if running inside Textual?)

@Wauplin Wauplin merged commit b2e1f9d into huggingface:main Apr 27, 2026
7 of 17 checks passed
@tobocop2
Copy link
Copy Markdown
Contributor Author

greatly appreciate you @Wauplin

This means a lot and thanks again

@huggingface-hub-bot
Copy link
Copy Markdown
Contributor

This PR has been shipped as part of the v1.13.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: hf_hub_download crashes with ValueError when stderr lacks a real file descriptor

2 participants