[Buckets] Skip local walk for download sync without delete#4123
[Buckets] Skip local walk for download sync without delete#4123
Conversation
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Replace isfile + getsize + getmtime with one os.stat, and share the stat logic between the full-walk and targeted-stat branches. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
Pushed a small follow-up (6cf628d): replaced the On the local-huge vs remote-huge question: I think it's worth keeping the PR biased toward local-huge (this optimization). Walking a deep tree is unbounded and gives users a silent multi-minute hang, whereas the remote-huge penalty is ~one extra syscall per remote file — noticeable but bounded, and small next to the download itself. We can always add an adaptive threshold later if someone actually hits the remote-huge case. 🤖 Claude Code generated |
|
Thanks @abidlabs for the PR! The changes and your reasoning makes total sense. The comment above is a result of me discussing with Claude Code and it convinced me that it's best to optimize for a huge local directory rather than optimizing for a huge remote directory (which in any case will take a large amount of time to complete). I'll merge the PR and let you know once it's shipped. |
|
Nice thanks @Wauplin! |
|
This PR has been shipped as part of the v1.12.0 release. |
I created a PR that is motivated by experience using
hf buckets syncintrackio, where downloading a bucket locally took a very long time. I've fixed ittrackiobut wondering whether we should try to optimize this upstream. In this PR, we:--delete~/.cache/huggingface/--delete, where local-only files still need to be discoveredThe main reasoning here is that local-only files are irrelevant to the sync plan unless we intend to delete them. For regular download syncs, the plan only depends on remote files and whether each corresponding local path already exists / differs, so a targeted
statis enough. But leaving this as a draft PR as this could, in other cases, lead to worse performance, i.e. if the remote bucket is very large and the local destination is relatively small, doing onestatper remote path can be slower than a single full local walk. This might actually be more common, although this was not the case fortrackio. Also, according to Codex, there are also some filesystem edge cases where targetedstatcalls may behave a bit differently from the previousos.walk-based scanNote
Medium Risk
Changes the download sync planning logic to stat only remote-known paths when
--deleteis not set, which could affect performance characteristics and edge-case filesystem behavior compared to a full local scan.Overview
Optimizes
hf buckets syncwhen downloading to an existing local directory by skipping a fullos.walkunless--deleteis enabled.Adds a
_stat_localhelper (singleos.stat, ignores missing paths and directories) and uses it to either (a) fully scan the local tree when--deleteis set or (b) only stat the remote-relative paths to decide downloads/skips, significantly reducing time spent scanning large destination directories.Reviewed by Cursor Bugbot for commit 6cf628d. Bugbot is set up for automated code reviews on this repo. Configure here.