Comparing changes

base repository: huggingface/xet-core, base: v1.3.2
head repository: huggingface/xet-core, compare: v1.4.1
  • 18 commits
  • 377 files changed
  • 6 contributors

Commits on Feb 27, 2026

  1. Scale download buffer memory limit by number of active downloads (#666)

    Previously, the total download buffer size was fixed, regardless of
    the number of downloads currently in flight. As the number of
    concurrent downloads increases, a fixed total can lead to waiting on
    individual segments that download out of order or don't have enough
    turnaround time to saturate the output. While writing to disk or the
    download itself often becomes the bottleneck before these effects
    matter, planned features such as streaming files and caching could be
    affected by this limit.

    This PR alleviates the issue by allocating an additional 512 MB of
    buffer capacity per file, prioritized for that specific download and
    released when the file finishes downloading. The default formula for
    the download buffer size is now 2 GB + 512 MB per concurrent download,
    up to a maximum of 8 GB (all adjustable). This is done with the
    AdjustableSemaphore class, first introduced for concurrency scaling,
    which allows the total number of permits in a semaphore to be
    incremented or decremented; on decrement, returned permits are
    discarded until the total reaches the target.
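Stated as code, the new default sizing works out as follows (a sketch using the numbers from the PR text; the function name is illustrative, not the actual xet-core API):

```python
# Sketch of the default buffer-size formula described above (constants from
# the PR text; the function name is illustrative, not the xet-core API).
GIB = 1024 ** 3
MIB = 1024 ** 2

def download_buffer_limit(active_downloads: int) -> int:
    """2 GB base + 512 MB per concurrent download, capped at 8 GB."""
    return min(2 * GIB + 512 * MIB * active_downloads, 8 * GIB)

print(download_buffer_limit(0) // MIB)   # 2048 (2 GB floor)
print(download_buffer_limit(4) // MIB)   # 4096
print(download_buffer_limit(16) // MIB)  # 8192 (8 GB cap reached)
```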
    hoytak authored Feb 27, 2026
    Commit 543914d
  2. Feature to monitor client process system usage (#617)

    Introduces a client benchmark utility that tracks the system resource
    usage (CPU, memory, disk I/O, and network I/O) of a process, so we no
    longer need to write per-OS scripts to capture usage stats. This is
    especially helpful when benchmarking on Python notebook instances,
    e.g. Google Colab, where a system monitor is not easily accessible and
    running a separate monitoring script is impractical.
    
    # Usage #
    Enable monitoring by setting `HF_XET_SYSTEM_MONITOR_ENABLED` to true,
    and set the sampling interval with
    `HF_XET_SYSTEM_MONITOR_SAMPLE_INTERVAL`. By default, metrics are
    written to the tracing stream at `INFO` level; they can instead be
    redirected to a separate file by setting
    `HF_XET_SYSTEM_MONITOR_LOG_PATH`.
    
    # Output #
    The stats are output in JSON format, which can be queried using tools
    like `jq`, e.g.
    1. Trace of peak memory usage: `jq '.memory.peak_used_bytes'
    [HF_XET_SYSTEM_MONITOR_LOG_PATH]`
    2. Trace of disk write speed: `jq '.disk.average_write_speed'
    [HF_XET_SYSTEM_MONITOR_LOG_PATH]`
    3. Trace of network receive speed: `jq '.network.average_rx_speed'
    [HF_XET_SYSTEM_MONITOR_LOG_PATH]`
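The same queries can be done without `jq` in a few lines of Python, assuming the log holds one JSON sample per line with the field names used in the examples above (the exact schema is an assumption):

```python
import json

# Fake metrics log: one JSON sample per line, mirroring the fields the jq
# examples above query (the exact schema is an assumption).
log_lines = [
    '{"memory": {"peak_used_bytes": 1048576}, "disk": {"average_write_speed": 10.0}}',
    '{"memory": {"peak_used_bytes": 4194304}, "disk": {"average_write_speed": 25.5}}',
    '{"memory": {"peak_used_bytes": 2097152}, "disk": {"average_write_speed": 18.0}}',
]

samples = [json.loads(line) for line in log_lines]
peak_memory = max(s["memory"]["peak_used_bytes"] for s in samples)
print(peak_memory)  # 4194304
```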
    seanses authored Feb 27, 2026
    Commit c4111eb
  3. Streaming data writer (#656)

    This PR adds an integrated API for streaming downloads, exposing a
    DownloadStream object that is integrated with the file reconstructor.
    The stream object uses the same memory-management buffer limiting as
    regular downloads.
    
    It also introduces cancellation support in the FileReconstructor so
    that tasks waiting on a long-running download or on a semaphore don't
    hang when an error is reported or the user drops the stream.
    hoytak authored Feb 27, 2026
    Commit 9b3278a

Commits on Mar 2, 2026

  1. Fix command injection in release workflow (CVE) (#677)

    ## Summary
    
    - Fix command injection vulnerability in `.github/workflows/release.yml`
    (HackerOne #3581567, severity High 8.8)
    - `${{ github.event.inputs.tag }}` was interpolated directly in `run:`
    blocks, allowing arbitrary RCE via crafted tag input (e.g. `v0.1.0; id;
    cat /etc/passwd;#`)
    - Moved all 6 occurrences to `env:` variables so the value is passed as
    a shell environment variable instead of being interpolated into the
    script
    
    ## Jobs fixed
    
    - `linux` — "Update version in toml" step
    - `musllinux` — "Update version in toml" step
    - `windows` — "Update version in toml" step
    - `macos` — "Update version in toml" step
    - `sdist` — "Update version in toml" step
    - `github-release` — "Create GitHub Release" step (`gh release create`)
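The fix pattern, sketched on one of the affected steps (the script being called here is illustrative; only the `env:` indirection is the point):

```yaml
# Before (vulnerable): the expression is expanded directly into the script,
# so a crafted tag like `v0.1.0; id; cat /etc/passwd;#` runs as shell code.
#   run: ./update_version.sh "${{ github.event.inputs.tag }}"

# After: the value is passed through an environment variable, which the
# shell treats as data rather than code.
- name: Update version in toml
  env:
    TAG: ${{ github.event.inputs.tag }}
  run: ./update_version.sh "$TAG"
```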
    XciD authored Mar 2, 2026
    Commit e66dcef

Commits on Mar 3, 2026

  1. feat: accept pre-computed SHA-256 in upload_files() (#678)

    ## Summary
    
    - Add optional `sha256s` keyword parameter to the Python-exposed
    `upload_files()` function
    - Forward it to `data_client::upload_async()` which already supports it
    
    ## Context
    
    ### Double computation today
    
    `huggingface_hub` computes SHA-256 on every file during
    `CommitOperationAdd.__post_init__()` for LFS batch negotiation, then
    `hf_xet` recomputes it internally because `upload_files()` doesn't
    accept pre-computed hashes.
    
    ### Performance impact
    
    This change eliminates the redundant computation entirely.
    
    ### Backward compatibility
    
    - `sha256s` is a keyword-only parameter with default `None` — no change
    for existing callers
    - `data_client::upload_async()` has accepted `sha256s:
    Option<Vec<String>>` since day one
    - When provided, `SingleFileCleaner` uses `ShaGenerator::ProvidedValue`
    and skips internal recomputation
    
    Companion PR: huggingface/huggingface_hub#3876
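A minimal sketch of the intended flow; the `upload_files(..., sha256s=...)` call shape is taken from this PR's summary, and the payloads are illustrative:

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    # Pre-compute the digest once, as huggingface_hub already does in
    # CommitOperationAdd.__post_init__().
    return hashlib.sha256(data).hexdigest()

payloads = [b"hello", b"world"]              # illustrative file contents
digests = [sha256_hex(p) for p in payloads]
print(digests[0])
# 2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

# With this PR, the pre-computed digests can be forwarded so hf_xet skips
# its internal recomputation (keyword-only parameter, defaults to None):
# upload_files(paths, ..., sha256s=digests)
```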
    XciD authored Mar 3, 2026
    Commit 40b45fb

Commits on Mar 4, 2026

  1. XetSession API (#657)

    This PR introduces a new `xet_session` crate that provides a
    session-based hierarchical API: users create a XetSession to manage
    runtime and configuration, then batch uploads into UploadCommit objects
    and downloads into DownloadGroup objects, each of which runs its
    transfers in the background on the inner XetRuntime.
    
    All pub functions are exposed as sync functions, making them easy to
    call from other languages, e.g. Python, C, etc.
    seanses authored Mar 4, 2026
    Commit c4a56f8

Commits on Mar 5, 2026

  1. Naming clarification: A Xorb is a data object, CAS is the remote server. (#680)
    
    This PR makes the use of the `cas` and `xorb` terms consistent.
    Previously, `cas` (for content-addressed store) could refer either to
    the remote server or to the data bytes stored as a collection of
    chunks. After the renames in this PR, we consistently use `xorb` for
    the data object and `cas` for the remote server.
    
    This renames quite a few places; to aid in rebasing current work or
    updating downstream dependencies, this PR includes a file
    `API_UPDATES.md` that can be fed into an AI agent to quickly and
    accurately perform the renaming on any downstream dependencies.
    hoytak authored Mar 5, 2026
    Commit e6e0413
  2. Fix for incorrect error propagation on truncated download stream. (#683)

    Currently, the async stream logic silently swallows an UnexpectedEOF,
    treating it the same as a normal EOF. This is a bug; this PR fixes it
    to propagate UnexpectedEOF while treating a true EOF as the end of the
    stream.
    hoytak authored Mar 5, 2026
    Commit 70807bf
  2. Simulation interface for LocalTestServer: supports deletion, direct access, data dumps, etc. (#681)
    
    This PR adds interface functions to the LocalServer class that will
    allow it to become a full simulation environment for testing all the
    garbage collection stages.
    hoytak authored Mar 5, 2026
    Commit ebd780d

Commits on Mar 9, 2026

  1. Rework simulation pipeline for adaptive concurrency and connection resiliency. (#648)
    
    This PR replaces the previous collection of scripts for setting up
    Docker containers with a lightweight set of Rust scripts and a simple,
    reusable proxy that can simulate limited bandwidth and network
    congestion. The previous scripts are rewritten around these reusable
    components.
    
    New tools: 
    - cas_client/src/simulation/network_simulation: A lightweight,
    in-process network congestion simulation proxy that lives between the
    LocalServer instance and the RemoteClient instance, allowing simulation
    tests to run on a network with realistic congestion conditions and a
    gated bandwidth. This can be controlled dynamically through a
    LocalTestServer instance.
    - simulation/: A new package for collecting simulation scripts and
    analyzing the results.
    
    To run the new simulation scripts for the adaptive concurrency on
    upload, compile in release mode and run one of the scripts in
    `simulation/src/adaptive_concurrency/scripts/`. Docker is no longer
    needed to run any of the simulations.
    
    The old `cas_client/tests/adaptive_concurrency/` paths were removed.
    hoytak authored Mar 9, 2026
    Commit 6a5535b

Commits on Mar 10, 2026

  1. feat: add skip_sha256 option to SingleFileCleaner (#679)

    ## Summary
    
    - Add `ShaGenerator::Skip` variant that skips SHA-256 computation
    entirely
    - `ShaGenerator::finalize()` now returns `Option<Sha256>` (None when
    skipped)
    - `SingleFileCleaner::new()` and `FileUploadSession::start_clean()`
    accept a `skip_sha256` boolean
    - When skipped, no `FileMetadataExt` is included in the shard
    
    ## Context
    
    Bucket uploads don't need SHA-256 in the shard metadata — the
    `sha_index` GSI is only used for LFS pointer resolution, which doesn't
    apply to buckets. Skipping SHA-256 for bucket uploads removes the main
    CPU bottleneck in the upload pipeline on non-SHA-NI instances.
    
    ## Alternative: dummy SHA-256
    
    Instead of skipping entirely, the client could send a zeroed/dummy
    `FileMetadataExt`. The server would still store it but queries would
    never match. This avoids the server-side schema change (xetcas PR) but
    pollutes the GSI with dummy entries.
    
    Companion PRs:
    - xetcas: huggingface-internal/xetcas#498 (make `FileIdItem.sha256`
    optional server-side)
    XciD authored Mar 10, 2026
    Commit a48f1f8

Commits on Mar 11, 2026

  1. fix: prevent download stall on large file reconstruction (#698)

    ## Summary
    
    Fixes download stalls/deadlocks on large file reconstruction (reported
    on 48.5 GB GGUF files). The root cause is a circular dependency: the
    main reconstruction loop holds a buffer semaphore permit while blocking
    on CAS connection permit acquisition, and xorb write locks held during
    HTTP downloads cause CAS permit starvation.
    
    ### Changes
    
    1. **Single-flight xorb downloads via `OnceCell`** (`xorb_block.rs`):
    replaces `RwLock<Option<...>>` with `tokio::sync::OnceCell`. Only one
    task per xorb block acquires a CAS permit and downloads the data;
    concurrent callers wait on the same result without acquiring permits or
    duplicating work. This eliminates duplicate downloads, prevents
    double-counted transfer progress, and avoids a failing duplicate from
    killing the reconstruction.
    
    2. **Decouple CAS permit from buffer permit** (`file_term.rs`): the main
    loop no longer blocks on CAS permits while holding a buffer permit. The
    spawned download task delegates to `retrieve_data` which handles permit
    acquisition internally via the OnceCell single-flight. This breaks the
    circular dependency that causes stalls.
    
    3. **Improve error propagation** (`sequential_writer.rs`): when the
    background writer channel closes, check `RunState` for the original
    error before returning a generic "channel closed" message.
    
    ### Root cause
    
    The reconstruction pipeline has three resource pools: buffer permits
    (bounded semaphore), CAS download permits (64 concurrent), and per-xorb
    write locks.
    
    Before this fix, the main loop would:
    1. Acquire a **buffer permit** (blocking if buffer full)
    2. Call `get_data_task()` which acquires a **CAS permit** (blocking if
    pool exhausted)
    3. Inside `retrieve_data()`, hold a **write lock** during the entire
    HTTP download
    
    This creates two deadlock vectors:
    - **Buffer vs CAS**: buffer fills up with terms waiting for CAS permits,
    but CAS permits are held by tasks blocked behind xorb write locks, and
    the writer can't drain the buffer because it's waiting for those tasks
    - **CAS vs write lock**: multiple tasks sharing the same xorb each hold
    a CAS permit while blocked on the write lock, starving other xorbs of
    permits
    
    ## Reproduction
    
    Reliably reproducible with small buffer:
    ```
    HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_SIZE=64mb \
    HF_XET_RECONSTRUCTION_DOWNLOAD_BUFFER_LIMIT=64mb \
    python3 -c "from huggingface_hub import hf_hub_download; hf_hub_download('unsloth/Qwen3-Coder-Next-GGUF', 'Qwen3-Coder-Next-Q4_K_M.gguf', local_dir='/tmp/test')"
    ```
    
    - **Before fix**: stalls at ~3.4 GB, no progression (deadlock)
    - **After fix**: continuous progression, completes successfully
    
    With default buffer (2 GB), the stall is intermittent depending on
    network speed (consistently reproduced on slower connections).
    XciD authored Mar 11, 2026
    Commit 9ba5fb3
  2. fix: no timeout for shard uploads (XET-885) (#685)

    Fixes
    [XET-885](https://linear.app/xet/issue/XET-885/investigate-unsloth-upload-failure-shard-upload-timeout-on-cas)
    
    ## Summary
    
    Shard uploads to CAS can take a long time due to server-side processing
    (DynamoDB writes scale with file entry count). The default
    `read_timeout(120s)` on the reqwest client kills these uploads.
    
    **Key insight:** reqwest's per-request `RequestBuilder::timeout()` does
    NOT override the client-level `read_timeout()` — they are independent
    mechanisms polled as separate futures. So the original approach of using
    per-request timeouts was ineffective.
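The "two independent mechanisms" behavior can be illustrated with an asyncio analog (not reqwest itself): when both timeouts are polled as separate futures, the shorter one always fires first, so the effective timeout is the minimum of the two, and a generous per-request timeout cannot relax a short client-level one.

```python
import asyncio

async def slow_upload(seconds: float) -> str:
    # Stand-in for a shard upload with long server-side processing.
    await asyncio.sleep(seconds)
    return "ok"

async def with_two_timeouts(work_s, client_read_timeout_s, request_timeout_s):
    try:
        # Outer wait_for plays the per-request timeout; the inner one plays
        # the client-level read timeout. Both are polled independently.
        return await asyncio.wait_for(
            asyncio.wait_for(slow_upload(work_s), client_read_timeout_s),
            request_timeout_s)
    except asyncio.TimeoutError:
        return "timed out"

# A generous per-request timeout does not override the short client one:
print(asyncio.run(with_two_timeouts(0.2, 0.05, 10.0)))  # timed out
# Removing the client-level read timeout (None) lets the upload finish:
print(asyncio.run(with_two_timeouts(0.2, None, 10.0)))  # ok
```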
    
    **Fix:** Create a dedicated `shard_upload_http_client` on `RemoteClient`
    with **no `read_timeout`**, built once at construction time and reused
    for all shard uploads. All other settings (connect timeout, pool config,
    auth middleware) are identical to the standard client.
    
    ## Changes
    
    ### `cas_client/src/http_client.rs`
    - Added `reqwest_client_no_read_timeout()` — creates a reqwest client
    with no `read_timeout`
    - Added `build_auth_http_client_no_read_timeout()` — public API wrapping
    it with middleware
    - 4 unit tests for the new builder
    
    ### `cas_client/src/remote_client.rs`
    - Added `shard_upload_http_client` field to `RemoteClient` (cfg'd out on
    wasm)
    - `upload_shard()` uses the pre-built no-timeout client instead of
    building one per request
    
    ### `cas_client/tests/test_shard_upload_timeout.rs`
    - Updated: slow server test now asserts **success** (shard uploads
    should wait as long as needed)
    
    ### `xet_config/src/groups/client.rs`
    - Removed `shard_read_timeout` config field (no longer needed)
    
    ---------
    
    Co-authored-by: Claude Opus 4.6 (1M context) <[email protected]>
    rajatarya and claude authored Mar 11, 2026
    Commit 83a2827
  3. Commit 02da1d2
  4. Code reorganization towards release of xet cargo package (#693)

    This PR is a massive rearrangement of the code base into 5 packages
    intended for release on cargo. The directories and corresponding
    packages are:
    
    1. xet_runtime/ — compiles into the xet-runtime package. Contains the
    runtime, config, and logging management.
    2. xet_core_structures/ — compiles into the xet-core-structures package.
    Contains core data structures for hashing, shards, and xorbs as well as
    internal data structures that depend on these.
    3. xet_client/ — compiles into the xet-client package, contains client
    code for remotely connecting to the Hugging Face servers.
    4. xet_data/ — compiles into the xet-data package, contains the data
    processing pipeline: chunking/deduplication, file reconstruction,
    clean/smudge operations, and progress tracking.
    5. xet_pkg/ — compiles into the hf-xet package, provides the top-level
    session-based API for file upload and download with user-facing error
    categorization. This is the primary package downstream dependencies
    would use. This also contains a single summary error type, XetError,
    that translates cleanly into python error types.
    
    In addition, the other tools are:
    
    - git_xet/ — the git_xet CLI binary crate (location preserved).
    - hf_xet/ — the hf_xet Python package (location preserved).
    - simulation/ — the simulation crate for upload scenario benchmarking.
    - wasm/ — the wasm objects.
    
    The full description — and information for an AI agent to use to update
    downstream dependencies — is at
    api_changes/update_260309_package_restructure.md.
    
    Summary of moves:
    
    - xet_runtime: became xet_runtime::core inside xet_runtime/.
    - utils: became xet_runtime::utils inside xet_runtime/.
    - xet_config: became xet_runtime::config inside xet_runtime/.
    - xet_logging: became xet_runtime::logging inside xet_runtime/.
    - error_printer: became xet_runtime::error_printer inside xet_runtime/.
    - file_utils: became xet_runtime::file_utils inside xet_runtime/.
    - merklehash: became xet_core_structures::merklehash inside
    xet_core_structures/.
    - mdb_shard: became xet_core_structures::metadata_shard inside
    xet_core_structures/.
    - xorb_object: became xet_core_structures::xorb_object inside
    xet_core_structures/.
    - cas_client: became xet_client::cas_client inside xet_client/.
    - hub_client: became xet_client::hub_client inside xet_client/.
    - cas_types: became xet_client::cas_types inside xet_client/.
    - chunk_cache: became xet_client::chunk_cache inside xet_client/.
    - data: became xet_data::processing inside xet_data/.
    - deduplication: became xet_data::deduplication inside xet_data/.
    - file_reconstruction: became xet_data::file_reconstruction inside
    xet_data/.
    - progress_tracking: became xet_data::progress_tracking inside
    xet_data/.
    - xet_session: became xet::xet_session inside xet_pkg/.
    
    - Wasm packages (hf_xet_wasm, hf_xet_thin_wasm): moved from top-level
    into wasm/; internal imports updated, public APIs unchanged.
    hoytak authored Mar 11, 2026
    Commit 45d38a1
  5. Record API changes in api_changes/updates_<date>_<description>.md (#689)

    This PR creates a folder, api_changes, in which AI agents can record
    updates to the API surface that could affect downstream PRs and
    dependencies. This can be scanned by AI agents to reliably perform
    merges or to propagate changes. See api_changes/README.md for a
    description of how this should work.
    hoytak authored Mar 11, 2026
    Commit 6061deb
  6. Rework the interface for session task to get result from registered upload (#690)
    
    This PR updates the interface for retrieving per-task results after
    UploadCommit::commit() or DownloadGroup::finish(). The problem with the
    previous interface is that commit() and finish() return a vector of
    FileMetadata or DownloadResult, making it difficult for users to
    associate each result with a specific task.
    
    The new interface uses `task_id` as the binding between each
    registered task and its result:
    
    ## Upload per-task result access patterns
    After commit() completes, there are two equivalent ways to retrieve a
    per-task FileMetadata result:
    
    1. Lookup in the global result map:
    ```
    let commit = session.new_upload_commit()?;
    let handle = commit.upload_from_path(src)?;
    let results = commit.commit()?;
    let result = results.get(&handle.task_id);
    ```
    
    2. Direct access from the handle:
    ```
    let commit = session.new_upload_commit()?;
    let handle = commit.upload_from_path(src)?;
    commit.commit()?;
    // handle.result() is populated by commit() via the shared Arc.
    let result = handle.result();
    ```
    
    ## Download per-task result access patterns
    The pattern is similar to the above.
    
    ## Why not put results in a vector in the same order as tasks are registered to the commit instance?
    After a commit instance is created, it can be cloned (since it is itself
    an Arc wrapping an internal struct) and sent to different threads. When
    multiple threads are registering tasks, there is no static registration
    order that a program can observe upfront.
    seanses authored Mar 11, 2026
    Commit cacd713

Commits on Mar 12, 2026

  1. feat: expose skip_sha256 parameter in Python upload API (#705)

    ## Summary
    
    Add `skip_sha256` and `sha256s` parameters to `upload_bytes()` Python
    binding for per-file SHA-256 policies:
    - `skip_sha256: bool = False` - Skip SHA-256 computation entirely (sets
    `Sha256Policy::Skip`)
    - `sha256s: Optional[List[str]] = None` - Provide pre-computed SHA-256
    hashes (companion to existing parameter on `upload_files()`)
    - These parameters are mutually exclusive
    
    ## Changes
    
    **Python binding changes:**
    - Add `skip_sha256` + `sha256s` params to `upload_bytes()` /
    `upload_files()`
    - All policy conversion happens at Python boundary
    
    **Internal refactoring:**
    - Add `Clone`/`Copy` derives + `from_skip()`/`from_hex()` helpers to
    `Sha256Policy`
    - Update `upload_bytes_async`, `upload_async`, `clean_file` to use
    `Vec<Sha256Policy>`
    - Update all internal callers across `git_xet`, `xet_pkg`, migration
    tool, tests
    
    ## Motivation
    
    `huggingface_hub` already knows whether SHA-256 is required. This change
    enables skipping expensive computation when unnecessary, or passing
    pre-computed hashes for bulk operations.
    
    Companion to #678.
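A sketch of the mutual exclusion and per-file policy selection described above (the function and the policy strings are illustrative; the real conversion happens at the Python boundary using `Sha256Policy`):

```python
from typing import List, Optional

# Illustrative sketch of per-file SHA-256 policy selection. Policy names
# mirror the PR (skip, provided hex value, or compute internally); this is
# not hf_xet's actual binding code.
def resolve_sha256_policies(n_files: int,
                            skip_sha256: bool = False,
                            sha256s: Optional[List[str]] = None) -> List[str]:
    if skip_sha256 and sha256s is not None:
        raise ValueError("skip_sha256 and sha256s are mutually exclusive")
    if sha256s is not None:
        if len(sha256s) != n_files:
            raise ValueError("one sha256 per file is required")
        return [f"provided:{h}" for h in sha256s]
    if skip_sha256:
        return ["skip"] * n_files
    return ["compute"] * n_files    # default: hash internally as before

print(resolve_sha256_policies(2, skip_sha256=True))  # ['skip', 'skip']
```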
    
    ---------
    
    Co-authored-by: Wauplin <[email protected]>
    XciD and Wauplin authored Mar 12, 2026
    Commit 0fb930c