Comparing changes
base repository: huggingface/xet-core
base: v1.3.2
head repository: huggingface/xet-core
compare: v1.4.1
- 18 commits
- 377 files changed
- 6 contributors
Commits on Feb 27, 2026
-
Scale download buffer memory limit by number of active downloads (#666)
Previously, the total download buffer size was fixed regardless of how many downloads were in flight. As the number of concurrent downloads grows, a fixed total can lead to waiting on individual segments that arrive out of order or don't have enough turnaround time to saturate the output. Writing to disk or the download itself often becomes the bottleneck before these effects appear, but planned features such as streaming files and caching could be affected by this limit. This PR alleviates it by allocating an additional 512 MB of buffer per active file, prioritized for that specific download, and releasing the capacity when the file finishes downloading. The default download buffer size is now 2 GB + 512 MB per concurrent download, up to a maximum of 8 GB (all adjustable). The mechanism is the AdjustableSemaphore class, first introduced for concurrency scaling, which allows the total number of permits in a semaphore to be incremented or decremented; on decrement, permits are discarded upon return until the total reaches the target number.
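The scaling rule described above can be sketched as a one-line formula. This is an illustrative sketch, not the actual xet-core code; the constant and function names are assumptions, only the numbers (2 GB base, 512 MB per download, 8 GB cap) come from the description.

```rust
// Hypothetical sketch of the download buffer scaling rule (names are
// illustrative): the limit grows with the number of in-flight downloads
// and is clamped to a ceiling. All three constants are adjustable in the
// real implementation.
const BASE_LIMIT: u64 = 2 * 1024 * 1024 * 1024; // 2 GB base allocation
const PER_DOWNLOAD: u64 = 512 * 1024 * 1024; // +512 MB per active download
const MAX_LIMIT: u64 = 8 * 1024 * 1024 * 1024; // 8 GB ceiling

fn download_buffer_limit(active_downloads: u64) -> u64 {
    (BASE_LIMIT + PER_DOWNLOAD * active_downloads).min(MAX_LIMIT)
}

fn main() {
    // One active download: 2 GB + 512 MB.
    assert_eq!(download_buffer_limit(1), BASE_LIMIT + PER_DOWNLOAD);
    // Twelve downloads reach the cap exactly; beyond that the limit is clamped.
    assert_eq!(download_buffer_limit(12), MAX_LIMIT);
    assert_eq!(download_buffer_limit(100), MAX_LIMIT);
}
```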
Commit: 543914d
-
Feature to monitor client process system usage (#617)
Introduces a client benchmark utility to track the system resource usage (CPU, memory, disk I/O, and network I/O) of a process, so we don't need to write per-OS scripts to capture usage stats. This is especially helpful when benchmarking on Python notebook instances, e.g. Google Colab, where a system monitor is not easily accessible or running a separate monitor script is not easy.
## Usage
Enable monitoring by setting `HF_XET_SYSTEM_MONITOR_ENABLED` to true and set the usage sample interval with `HF_XET_SYSTEM_MONITOR_SAMPLE_INTERVAL`. Metrics are output to the tracing stream at `INFO` level by default; they can be redirected to a separate file by setting the sample log path with `HF_XET_SYSTEM_MONITOR_LOG_PATH`.
## Output
The stats are output in JSON format, which can be queried using tools like `jq`, e.g.:
1. Trace of peak memory usage: `jq '.memory.peak_used_bytes' [HF_XET_SYSTEM_MONITOR_LOG_PATH]`
2. Trace of disk write speed: `jq '.disk.average_write_speed' [HF_XET_SYSTEM_MONITOR_LOG_PATH]`
3. Trace of network receive speed: `jq '.network.average_rx_speed' [HF_XET_SYSTEM_MONITOR_LOG_PATH]`
Commit: c4111eb
-
This PR adds an integrated API for streaming downloads, exposing a DownloadStream object that is integrated with the file reconstructor, and applies the same memory-management buffer limiting to the stream object. It also introduces cancellation support in the FileReconstructor so that tasks waiting on a long-running download or a semaphore don't hang when an error is reported or the user drops the stream.
Commit: 9b3278a
Commits on Mar 2, 2026
-
Fix command injection in release workflow (CVE) (#677)
## Summary
- Fix command injection vulnerability in `.github/workflows/release.yml` (HackerOne #3581567, severity High 8.8)
- `${{ github.event.inputs.tag }}` was interpolated directly in `run:` blocks, allowing arbitrary RCE via a crafted tag input (e.g. `v0.1.0; id; cat /etc/passwd;#`)
- Moved all 6 occurrences to `env:` variables so the value is passed as a shell environment variable instead of being interpolated into the script
## Jobs fixed
- `linux` — "Update version in toml" step
- `musllinux` — "Update version in toml" step
- `windows` — "Update version in toml" step
- `macos` — "Update version in toml" step
- `sdist` — "Update version in toml" step
- `github-release` — "Create GitHub Release" step (`gh release create`)
Commit: e66dcef
Commits on Mar 3, 2026
-
feat: accept pre-computed SHA-256 in upload_files() (#678)
## Summary
- Add an optional `sha256s` keyword parameter to the Python-exposed `upload_files()` function
- Forward it to `data_client::upload_async()`, which already supports it
## Context
### Double computation today
`huggingface_hub` computes SHA-256 on every file during `CommitOperationAdd.__post_init__()` for LFS batch negotiation, then `hf_xet` recomputes it internally because `upload_files()` doesn't accept pre-computed hashes.
### Performance impact
This change eliminates the redundant computation entirely.
### Backward compatibility
- `sha256s` is a keyword-only parameter with default `None` — no change for existing callers
- `data_client::upload_async()` has accepted `sha256s: Option<Vec<String>>` since day one
- When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue` and skips internal recomputation

Companion PR: huggingface/huggingface_hub#3876
Commit: 40b45fb
Commits on Mar 4, 2026
-
This PR introduces a new `xet_session` crate that provides a session-based hierarchical API: users create a XetSession to manage runtime and configuration, then batch uploads into UploadCommit objects and downloads into DownloadGroup objects — each of which runs transfers in the background on the inner XetRuntime. All public functions are exposed as synchronous functions, making them easy to use from other languages, e.g. Python, C, etc.
Commit: c4a56f8
Commits on Mar 5, 2026
-
Naming clarification: A Xorb is a data object, CAS is the remote server. (#680)
This PR makes the use of the `cas` and `xorb` terms consistent. Previously, "cas" (for content-addressed store) could refer to either the remote server or the data bytes stored as a collection of chunks. After the renames in this PR, we consistently use `xorb` to refer to the data object and `cas` to refer to the remote server. This renames quite a few places; to aid in rebasing current work or updating downstream dependencies, this PR includes a file `API_UPDATES.md` that can be fed into an AI agent to quickly and accurately perform the renaming on any downstream dependencies.
Commit: e6e0413
-
Fix for incorrect error propagation on truncated download stream. (#683)
Currently, the async stream logic silently swallows an UnexpectedEOF, treating it the same as a normal EOF. This is a bug; this PR fixes it to propagate UnexpectedEOF while still handling a correct EOF as the end of the stream.
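The distinction the fix makes can be illustrated with a minimal sketch. This is not the actual xet-core stream code (which is async); it is a std-only illustration of the same rule: `Ok(0)` from a reader is a clean EOF that ends the stream, while `io::ErrorKind::UnexpectedEof` signals truncation and must be propagated rather than swallowed.

```rust
use std::io::{self, Read};

// Drain a reader to completion, propagating all errors. Swallowing an
// UnexpectedEof here would make a truncated download look complete.
fn drain_stream<R: Read>(mut reader: R) -> io::Result<Vec<u8>> {
    let mut out = Vec::new();
    let mut buf = [0u8; 4096];
    loop {
        match reader.read(&mut buf) {
            Ok(0) => return Ok(out), // clean EOF: end of stream
            Ok(n) => out.extend_from_slice(&buf[..n]),
            Err(e) => return Err(e), // includes ErrorKind::UnexpectedEof
        }
    }
}

// A reader that yields some bytes, then fails as if the stream was truncated.
struct Truncated(Vec<u8>);
impl Read for Truncated {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        if self.0.is_empty() {
            Err(io::Error::new(io::ErrorKind::UnexpectedEof, "truncated"))
        } else {
            let n = self.0.len().min(buf.len());
            buf[..n].copy_from_slice(&self.0[..n]);
            self.0.drain(..n);
            Ok(n)
        }
    }
}

fn main() {
    // Complete stream: clean EOF, all bytes returned.
    assert_eq!(drain_stream(&b"abc"[..]).unwrap(), b"abc");
    // Truncated stream: the error is propagated, not swallowed.
    let err = drain_stream(Truncated(b"abc".to_vec())).unwrap_err();
    assert_eq!(err.kind(), io::ErrorKind::UnexpectedEof);
}
```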
Commit: 70807bf
-
Simulation interface for LocalTestServer: supports deletion, direct access, data dumps, etc. (#681)
This PR adds interface functions to the LocalServer class that will allow it to become a full simulation environment for testing all the garbage collection stages.
Commit: ebd780d
Commits on Mar 9, 2026
-
Rework simulation pipeline for adaptive concurrency and connection resiliency. (#648)
This PR replaces the previous collection of scripts for setting up Docker containers with a much more nimble and lightweight set of Rust scripts built from reusable components, including a simple proxy that can limit bandwidth and simulate congestion. New tools:
- cas_client/src/simulation/network_simulation: a lightweight, in-process network congestion simulation proxy that sits between the LocalServer instance and the RemoteClient instance, allowing simulation tests to run on a network with realistic congestion conditions and gated bandwidth. It can be controlled dynamically through a LocalTestServer instance.
- simulation/: a new package for collecting simulation scripts and analyzing the results.
To run the new simulation scripts for adaptive concurrency on upload, compile in release mode and run one of the scripts in `simulation/src/adaptive_concurrency/scripts/`. Docker is no longer needed to run any of the simulations. The old `cas_client/tests/adaptive_concurrency/` paths were removed.
Commit: 6a5535b
Commits on Mar 10, 2026
-
feat: add skip_sha256 option to SingleFileCleaner (#679)
## Summary
- Add a `ShaGenerator::Skip` variant that skips SHA-256 computation entirely
- `ShaGenerator::finalize()` now returns `Option<Sha256>` (None when skipped)
- `SingleFileCleaner::new()` and `FileUploadSession::start_clean()` accept a `skip_sha256` boolean
- When skipped, no `FileMetadataExt` is included in the shard
## Context
Bucket uploads don't need SHA-256 in the shard metadata — the `sha_index` GSI is only used for LFS pointer resolution, which doesn't apply to buckets. Skipping SHA-256 for bucket uploads removes the main CPU bottleneck in the upload pipeline on non-SHA-NI instances.
## Alternative: dummy SHA-256
Instead of skipping entirely, the client could send a zeroed/dummy `FileMetadataExt`. The server would still store it, but queries would never match. This avoids the server-side schema change (xetcas PR) but pollutes the GSI with dummy entries.

Companion PRs:
- xetcas: huggingface-internal/xetcas#498 (make `FileIdItem.sha256` optional server-side)
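The per-file policy described above (compute, use a provided digest, or skip) can be sketched as a small enum. This is an illustrative sketch, not the actual xet-core types: the enum, variant, and function names are assumptions modeled on the `ShaGenerator` variants named in the summary, and the digest value is a placeholder string rather than a real hash.

```rust
// Hypothetical sketch of a per-file SHA-256 policy whose finalize step
// returns Option — None when hashing was skipped, so no file metadata
// extension is emitted for that file.
#[derive(Clone, Debug, PartialEq)]
enum Sha256Policy {
    Compute,          // hash the file contents while streaming chunks
    Provided(String), // caller already computed the digest (e.g. huggingface_hub)
    Skip,             // bucket uploads: no SHA-256 needed at all
}

// `computed` stands in for the streaming hasher; it only runs when the
// policy actually requires computing a digest.
fn finalize(policy: &Sha256Policy, computed: impl FnOnce() -> String) -> Option<String> {
    match policy {
        Sha256Policy::Compute => Some(computed()),       // pay the CPU cost
        Sha256Policy::Provided(hex) => Some(hex.clone()), // reuse the known digest
        Sha256Policy::Skip => None,                       // skip entirely
    }
}

fn main() {
    let hex = "deadbeef".to_string(); // placeholder, not a real digest
    assert_eq!(
        finalize(&Sha256Policy::Provided(hex.clone()), || unreachable!()),
        Some(hex)
    );
    // Skip: no digest is produced and the hasher closure never runs.
    assert_eq!(finalize(&Sha256Policy::Skip, || unreachable!()), None);
}
```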
Commit: a48f1f8
Commits on Mar 11, 2026
-
fix: prevent download stall on large file reconstruction (#698)
## Summary
Fixes download stalls/deadlocks on large file reconstruction (reported on 48.5 GB GGUF files). The root cause is a circular dependency: the main reconstruction loop holds a buffer semaphore permit while blocking on CAS connection permit acquisition, and xorb write locks held during HTTP downloads cause CAS permit starvation.
### Changes
1. **Single-flight xorb downloads via `OnceCell`** (`xorb_block.rs`): replaces `RwLock<Option<...>>` with `tokio::sync::OnceCell`. Only one task per xorb block acquires a CAS permit and downloads the data; concurrent callers wait on the same result without acquiring permits or duplicating work. This eliminates duplicate downloads, prevents double-counted transfer progress, and avoids a failing duplicate killing the reconstruction.
2. **Decouple CAS permit from buffer permit** (`file_term.rs`): the main loop no longer blocks on CAS permits while holding a buffer permit. The spawned download task delegates to `retrieve_data`, which handles permit acquisition internally via the OnceCell single-flight. This breaks the circular dependency that causes stalls.
3. **Improve error propagation** (`sequential_writer.rs`): when the background writer channel closes, check `RunState` for the original error before returning a generic "channel closed" message.
### Root cause
The reconstruction pipeline has three resource pools: buffer permits (bounded semaphore), CAS download permits (64 concurrent), and per-xorb write locks. Before this fix, the main loop would:
1. Acquire a **buffer permit** (blocking if the buffer was full)
2. Call `get_data_task()`, which acquires a **CAS permit** (blocking if the pool was exhausted)
3. Inside `retrieve_data()`, hold a **write lock** for the entire HTTP download
This creates two deadlock vectors:
- **Buffer vs CAS**: the buffer fills up with terms waiting for CAS permits, but CAS permits are held by tasks blocked behind xorb write locks, and the writer can't drain the buffer because it's waiting for those tasks
- **CAS vs write lock**: multiple tasks sharing the same xorb each hold a CAS permit while blocked on the write lock, starving other xorbs of permits
## Reproduction
Reliably reproducible with a small buffer:
```
HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb \
HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT=64mb \
python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/Qwen3-Coder-Next-GGUF', 'Qwen3-Coder-Next-Q4_K_M.gguf', local_dir='/tmp/test')"
```
- **Before fix**: stalls at ~3.4 GB, no progression (deadlock)
- **After fix**: continuous progression, completes successfully
With the default buffer (2 GB), the stall is intermittent depending on network speed (consistently reproduced on slower connections).
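The single-flight idea from change 1 can be demonstrated without the async machinery. The actual fix uses `tokio::sync::OnceCell`; this std-only sketch uses `std::sync::OnceLock` to show the same semantics: many concurrent readers want the same xorb, exactly one performs the "download", and every other reader waits on and shares the cached result.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, OnceLock};
use std::thread;

fn main() {
    // Shared per-xorb cell: the first caller populates it, everyone else reuses it.
    let block: Arc<OnceLock<Vec<u8>>> = Arc::new(OnceLock::new());
    let downloads = Arc::new(AtomicUsize::new(0));

    let handles: Vec<_> = (0..8)
        .map(|_| {
            let block = Arc::clone(&block);
            let downloads = Arc::clone(&downloads);
            thread::spawn(move || {
                // get_or_init runs the closure at most once across all threads;
                // concurrent callers block until the value is available.
                let data = block.get_or_init(|| {
                    downloads.fetch_add(1, Ordering::SeqCst);
                    vec![0u8; 16] // stand-in for the HTTP xorb download
                });
                data.len()
            })
        })
        .collect();

    for h in handles {
        assert_eq!(h.join().unwrap(), 16);
    }
    // Exactly one "download" happened despite 8 concurrent readers.
    assert_eq!(downloads.load(Ordering::SeqCst), 1);
}
```

In the real fix, the initialization closure is also where the CAS permit is acquired, so waiters never hold permits of their own — which is what removes the permit starvation described above.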
Commit: 9ba5fb3
-
fix: no timeout for shard uploads (XET-885) (#685)
Fixes [XET-885](https://linear.app/xet/issue/XET-885/investigate-unsloth-upload-failure-shard-upload-timeout-on-cas)
## Summary
Shard uploads to CAS can take a long time due to server-side processing (DynamoDB writes scale with file entry count). The default `read_timeout(120s)` on the reqwest client kills these uploads.
**Key insight:** reqwest's per-request `RequestBuilder::timeout()` does NOT override the client-level `read_timeout()` — they are independent mechanisms polled as separate futures. So the original approach of using per-request timeouts was ineffective.
**Fix:** Create a dedicated `shard_upload_http_client` on `RemoteClient` with **no `read_timeout`**, built once at construction time and reused for all shard uploads. All other settings (connect timeout, pool config, auth middleware) are identical to the standard client.
## Changes
### `cas_client/src/http_client.rs`
- Added `reqwest_client_no_read_timeout()` — creates a reqwest client with no `read_timeout`
- Added `build_auth_http_client_no_read_timeout()` — public API wrapping it with middleware
- 4 unit tests for the new builder
### `cas_client/src/remote_client.rs`
- Added a `shard_upload_http_client` field to `RemoteClient` (cfg'd out on wasm)
- `upload_shard()` uses the pre-built no-timeout client instead of building one per request
### `cas_client/tests/test_shard_upload_timeout.rs`
- Updated: the slow-server test now asserts **success** (shard uploads should wait as long as needed)
### `xet_config/src/groups/client.rs`
- Removed the `shard_read_timeout` config field (no longer needed)

Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
Commit: 83a2827
-
Commit: 02da1d2
-
Code reorganization towards release of xet cargo package (#693)
This PR is a massive rearrangement of the code base into 5 packages intended for release on cargo. The directories and corresponding packages are:
1. xet_runtime/ — compiles into the xet-runtime package. Contains the runtime, config, and logging management.
2. xet_core_structures/ — compiles into the xet-core-structures package. Contains core data structures for hashing, shards, and xorbs, as well as internal data structures that depend on these.
3. xet_client/ — compiles into the xet-client package. Contains client code for remotely connecting to the Hugging Face servers.
4. xet_data/ — compiles into the xet-data package. Contains the data processing pipeline: chunking/deduplication, file reconstruction, clean/smudge operations, and progress tracking.
5. xet_pkg/ — compiles into the hf-xet package. Provides the top-level session-based API for file upload and download with user-facing error categorization; this is the primary package downstream dependencies would use. It also contains a single summary error type, XetError, that translates cleanly into Python error types.
In addition, the other tools are:
- git_xet/ — the git_xet CLI binary crate (location preserved).
- hf_xet/ — the hf_xet Python package (location preserved).
- simulation/ — the simulation crate for upload scenario benchmarking.
- wasm/ — the wasm objects.
The full description — and information for an AI agent to use to update downstream dependencies — is at api_changes/update_260309_package_restructure.md.
Summary of moves:
- xet_runtime: became xet_runtime::core inside xet_runtime/.
- utils: became xet_runtime::utils inside xet_runtime/.
- xet_config: became xet_runtime::config inside xet_runtime/.
- xet_logging: became xet_runtime::logging inside xet_runtime/.
- error_printer: became xet_runtime::error_printer inside xet_runtime/.
- file_utils: became xet_runtime::file_utils inside xet_runtime/.
- merklehash: became xet_core_structures::merklehash inside xet_core_structures/.
- mdb_shard: became xet_core_structures::metadata_shard inside xet_core_structures/.
- xorb_object: became xet_core_structures::xorb_object inside xet_core_structures/.
- cas_client: became xet_client::cas_client inside xet_client/.
- hub_client: became xet_client::hub_client inside xet_client/.
- cas_types: became xet_client::cas_types inside xet_client/.
- chunk_cache: became xet_client::chunk_cache inside xet_client/.
- data: became xet_data::processing inside xet_data/.
- deduplication: became xet_data::deduplication inside xet_data/.
- file_reconstruction: became xet_data::file_reconstruction inside xet_data/.
- progress_tracking: became xet_data::progress_tracking inside xet_data/.
- xet_session: became xet::xet_session inside xet_pkg/.
- Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top level into wasm/; internal imports updated, public APIs unchanged.
Commit: 45d38a1
-
Record API changes in api_changes/updates_<date>_<description>.md (#689)
This PR creates a folder, api_changes, in which AI agents can record updates to the API surface that could affect downstream PRs and dependencies. This can be scanned by AI agents to reliably perform merges or to propagate changes. See api_changes/README.md for a description of how this should work.
Commit: 6061deb
-
Rework the interface for session task to get result from registered upload (#690)
This PR updates the interface for retrieving per-task results after UploadCommit::commit() or DownloadGroup::finish(). The problem with the previous interface is that commit() and finish() return a vector of FileMetadata or DownloadResult, making it difficult for users to associate each result with a specific task. The new interface uses `task_id` as a strong binding bridge.
## Upload per-task result access patterns
After commit() completes, there are two equivalent ways to retrieve a per-task FileMetadata result:
1. Lookup in the global result map:
```
let commit = session.new_upload_commit()?;
let handle = commit.upload_from_path(src)?;
let results = commit.commit()?;
let result = results.get(&handle.task_id);
```
2. Direct access from the handle:
```
let commit = session.new_upload_commit()?;
let handle = commit.upload_from_path(src)?;
commit.commit()?;
// handle.result() is populated by commit() via the shared Arc.
let result = handle.result();
```
## Download per-task result access patterns
The pattern is similar to the above.
## Why not put results in a vector in the same order as tasks are registered to the commit instance?
After a commit instance is created, it can be cloned (since it is itself an Arc wrapping an internal struct) and sent to different threads. When multiple threads are registering tasks, there is no static registration order that a program can observe upfront.
Commit: cacd713
Commits on Mar 12, 2026
-
feat: expose skip_sha256 parameter in Python upload API (#705)
## Summary
Add `skip_sha256` and `sha256s` parameters to the `upload_bytes()` Python binding for per-file SHA-256 policies:
- `skip_sha256: bool = False` — skip SHA-256 computation entirely (sets `Sha256Policy::Skip`)
- `sha256s: Optional[List[str]] = None` — provide pre-computed SHA-256 hashes (companion to the existing parameter on `upload_files()`)
- These parameters are mutually exclusive
## Changes
**Python binding changes:**
- Add `skip_sha256` + `sha256s` params to `upload_bytes()` / `upload_files()`
- All policy conversion happens at the Python boundary
**Internal refactoring:**
- Add `Clone`/`Copy` derives + `from_skip()`/`from_hex()` helpers to `Sha256Policy`
- Update `upload_bytes_async`, `upload_async`, `clean_file` to use `Vec<Sha256Policy>`
- Update all internal callers across `git_xet`, `xet_pkg`, the migration tool, and tests
## Motivation
`huggingface_hub` already knows whether SHA-256 is required. This change enables skipping the expensive computation when it is unnecessary, or passing pre-computed hashes for bulk operations. Companion to #678.

Co-authored-by: Wauplin <[email protected]>
Commit: 0fb930c