Skip to content

[Buckets] Skip local walk for download sync without delete#4123

Merged
Wauplin merged 3 commits intomainfrom
fix-bucket-sync-download-walk
Apr 23, 2026
Merged

[Buckets] Skip local walk for download sync without delete#4123
Wauplin merged 3 commits intomainfrom
fix-bucket-sync-download-walk

Conversation

@abidlabs
Copy link
Copy Markdown
Member

@abidlabs abidlabs commented Apr 17, 2026

I created a PR that is motivated by experience using hf buckets sync in trackio, where downloading a bucket locally took a very long time. I've fixed it trackio but wondering whether we should try to optimize this upstream. In this PR, we:

  • avoid walking the full local destination tree when syncing bucket -> local without --delete
  • stat only the remote-relative paths that can affect the plan, which keeps download sync fast even if the destination lives under a large directory like ~/.cache/huggingface/
  • keep the full local walk for --delete, where local-only files still need to be discovered

The main reasoning here is that local-only files are irrelevant to the sync plan unless we intend to delete them. For regular download syncs, the plan only depends on remote files and whether each corresponding local path already exists / differs, so a targeted stat is enough. But leaving this as a draft PR as this could, in other cases, lead to worse performance, i.e. if the remote bucket is very large and the local destination is relatively small, doing one stat per remote path can be slower than a single full local walk. This might actually be more common, although this was not the case for trackio. Also, according to Codex, there are also some filesystem edge cases where targeted stat calls may behave a bit differently from the previous os.walk-based scan


Note

Medium Risk
Changes the download sync planning logic to stat only remote-known paths when --delete is not set, which could affect performance characteristics and edge-case filesystem behavior compared to a full local scan.

Overview
Optimizes hf buckets sync when downloading to an existing local directory by skipping a full os.walk unless --delete is enabled.

Adds a _stat_local helper (single os.stat, ignores missing paths and directories) and uses it to either (a) fully scan the local tree when --delete is set or (b) only stat the remote-relative paths to decide downloads/skips, significantly reducing time spent scanning large destination directories.

Reviewed by Cursor Bugbot for commit 6cf628d. Bugbot is set up for automated code reviews on this repo. Configure here.

@bot-ci-comment
Copy link
Copy Markdown

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@abidlabs abidlabs requested a review from Wauplin April 17, 2026 19:47
Replace isfile + getsize + getmtime with one os.stat, and share the
stat logic between the full-walk and targeted-stat branches.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@Wauplin
Copy link
Copy Markdown
Contributor

Wauplin commented Apr 23, 2026

Pushed a small follow-up (6cf628d): replaced the isfile + getsize + getmtime combo with a single os.stat() via a shared _stat_local() helper that both _list_local_files and the new targeted-stat branch use. Cuts the stat cost ~3× in the hot path and keeps the two branches from drifting.

On the local-huge vs remote-huge question: I think it's worth keeping the PR biased toward local-huge (this optimization). Walking a deep tree is unbounded and gives users a silent multi-minute hang, whereas the remote-huge penalty is ~one extra syscall per remote file — noticeable but bounded, and small next to the download itself. We can always add an adaptive threshold later if someone actually hits the remote-huge case.

🤖 Claude Code generated

@Wauplin
Copy link
Copy Markdown
Contributor

Wauplin commented Apr 23, 2026

Thanks @abidlabs for the PR! The changes and your reasoning makes total sense. The comment above is a result of me discussing with Claude Code and it convinced me that it's best to optimize for a huge local directory rather than optimizing for a huge remote directory (which in any case will take a large amount of time to complete). I'll merge the PR and let you know once it's shipped.

@Wauplin Wauplin marked this pull request as ready for review April 23, 2026 12:53
@Wauplin Wauplin merged commit 6e9e383 into main Apr 23, 2026
17 of 21 checks passed
@Wauplin Wauplin deleted the fix-bucket-sync-download-walk branch April 23, 2026 13:05
@abidlabs
Copy link
Copy Markdown
Member Author

Nice thanks @Wauplin!

@huggingface-hub-bot
Copy link
Copy Markdown
Contributor

This PR has been shipped as part of the v1.12.0 release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants