Conversation
Introduces a single ``parse_hf_uri`` helper and ``HfUri`` dataclass under ``huggingface_hub.utils``, plus a reference page documenting the canonical ``hf://`` syntax and what does (and does not) qualify as a HF URI. Related constants (``HF_PROTOCOL``, ``HfUriType``, ``HF_URI_TYPE_PREFIXES``) live in ``huggingface_hub.constants``. Addresses #3971. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Adds a dedicated `HfUriError(ValueError)` in `huggingface_hub.errors` and raises it from `parse_hf_uri` instead of plain `ValueError`. Inheriting from `ValueError` keeps backward compatibility for callers that catch the latter. Also drops the unnecessary `from __future__ import annotations` in the new modules — the project requires Python 3.10+ where `str | None` works at runtime, no forward refs are involved, and only 4/136 src files use it. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Switch from `[\`name\`][huggingface_hub.utils.module.name]` to the shorter `[\`name\`]` form throughout the new module's docstrings and reference page, matching the convention used elsewhere. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Use the `type` and `id` fields directly. The `is_repo`/`is_bucket` booleans remain available to disambiguate when needed. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
More descriptive — pairs naturally with parse_hf_uri. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
t # Lines starting with '#' will be ignored, and an empty message aborts
…nvert refs - to_uri() now URL-encodes '/' in non-special revisions so 'feature/foo' round-trips correctly. - _SPECIAL_REFS_REVISION_REGEX accepts '-' and '.' in convert ref names (e.g. 'refs/convert/parquet-v2').
| "spaces": "space", | ||
| "kernels": "kernel", | ||
| "buckets": "bucket", | ||
| } |
There was a problem hiding this comment.
New constant duplicates existing REPO_TYPES_MAPPING data
Low Severity
HF_URI_TYPE_PREFIXES duplicates the four repo-type entries already defined in REPO_TYPES_MAPPING (same file), only adding "buckets": "bucket". If a new repo type is later added to REPO_TYPES_MAPPING, a developer could easily forget to update HF_URI_TYPE_PREFIXES, causing the URI parser to silently reject valid URIs. Building HF_URI_TYPE_PREFIXES from REPO_TYPES_MAPPING (e.g. {**REPO_TYPES_MAPPING, "buckets": "bucket"}) would keep a single source of truth for repo-type mappings.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 63df690. Configure here.
hanouticelina
left a comment
There was a problem hiding this comment.
Made a first pass, looks good overall! mostly cosmetic comments
| # (Pull Request refs) and 'refs/convert/<name>' (e.g. parquet conversions). | ||
| # The conversion name allows the typical git ref characters '[a-zA-Z0-9_.-]' | ||
| # so names like 'parquet-v2' or 'duckdb.v1' round-trip correctly. | ||
| _SPECIAL_REFS_REVISION_REGEX = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)") |
There was a problem hiding this comment.
it's different from SPECIAL_REFS_REVISION_REGEX defined in hf_api.py right? let's maybe drop the one in hf_api.py, make this one public and import it there?
There was a problem hiding this comment.
no it's basically the same but I did not want to touch any existing code. The SPECIAL_REFS_REVISION_REGEX regex you are referring to will disappear once we use parse_hf_uri everywhere
| # (Pull Request refs) and 'refs/convert/<name>' (e.g. parquet conversions). | ||
| # The conversion name allows the typical git ref characters '[a-zA-Z0-9_.-]' | ||
| # so names like 'parquet-v2' or 'duckdb.v1' round-trip correctly. | ||
| _SPECIAL_REFS_REVISION_REGEX = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)") |
There was a problem hiding this comment.
Duplicated special-refs regex with divergent matching behavior
Low Severity
_SPECIAL_REFS_REVISION_REGEX in _hf_uris.py uses [\w.-]+ for convert ref names, while the existing SPECIAL_REFS_REVISION_REGEX in hf_api.py uses \w+. The new pattern matches hyphens and dots (e.g. refs/convert/parquet-v2) that the old one rejects. Having two subtly different regexes for the same semantic concept risks inconsistent behavior across code paths that haven't yet migrated to the new parser.
Reviewed by Cursor Bugbot for commit 439df11. Configure here.
There was a problem hiding this comment.
as mentioned above (#4158 (comment)), SPECIAL_REFS_REVISION_REGEX will soon disappear so don't worry
| """Parse the body of a bucket URI: 'namespace/name[/path]'.""" | ||
| if "@" in location: | ||
| raise HfUriError(uri=raw, msg="Bucket URIs do not support a revision marker ('@').") | ||
| location = location.strip("/") |
There was a problem hiding this comment.
strip("/") silently accepts double slashes after type prefix
Low Severity
location.strip("/") in both _parse_bucket_body and _parse_repo_body removes leading slashes that result from double-slash inputs like hf://models//org/model or hf://buckets//org/b. After _split_type returns rest = /org/model (from the double slash), strip silently normalizes it to org/model, causing the URI to parse successfully. This is inconsistent with the explicit rejection of // within paths (e.g. hf://models/org/m//sub), which the PR reviewer specifically requested to be treated as errors.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 0586cc5. Configure here.
Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Thanks for the review @hanouticelina ! I have addressed all of your comments. Also as discussed in DMs, I have split the logic between |
hanouticelina
left a comment
There was a problem hiding this comment.
Thank you! left two small nits, it looks good and works as expected 🙌
|
Thank you! Will merge if CI gets green and move forward on the integration :) |
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 4 total unresolved issues (including 3 from previous reviews).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 116f6ea. Configure here.
| try: | ||
| validate_repo_id(self.id) | ||
| except HFValidationError as e: | ||
| raise HfUriError(uri=uri, msg=str(e)) from e |
There was a problem hiding this comment.
Bucket IDs with empty segments bypass validation
Medium Severity
The __post_init__ validation for HfUri checks self.id.count("/") != 1 but doesn't verify both segments are non-empty. For non-bucket types, validate_repo_id catches IDs like "org/" or "/name" via its regex. But bucket types explicitly skip validate_repo_id, so directly constructing HfUri(type="bucket", id="org/") or HfUri(type="bucket", id="/b") silently produces an invalid object. The parser's _parse_bucket_body has the correct not parts[0] or not parts[1] guard, but __post_init__ doesn't replicate it for the bucket case.
Reviewed by Cursor Bugbot for commit 116f6ea. Configure here.
|
This PR has been shipped as part of the v1.13.0 release. |


Summary
Adds a single source of truth for parsing
hf://...URIs, addressing #3971.HfUridataclass andparse_hf_urihelper, plus shared constants inhuggingface_hub.constants(HF_PROTOCOL,HfUriType,HF_URI_TYPE_PREFIXES).HfMountdataclass andparse_hf_mounthelper for volume mount specifications (hf://...:<MOUNT_PATH>[:ro|:rw]). Mount logic is cleanly separated from URI parsing —HfUriis a pure location identifier,HfMountwraps aHfUriwith a mount path and read-only flag.HfUri.to_uri()andHfMount.to_uri().Follow-ups (not in this PR)
HfFileSystem.resolve_path(keeps its 1- vs 2-segment network fallback on top of the parser)_parse_hf_copy_handleinhf_api.py(the spot already carries a# TODOreferencing Harmonize hf:// parsing logic #3971)parse_volumesincli/_cli_utils.py_split_bucket_id_and_prefix/_parse_bucket_path/_is_bucket_pathin_buckets.pyand_parse_bucket_argument/_is_hf_handleincli/buckets.pyVolume.to_hf_handleformatter in_space_api.py(mirror ofHfMount.to_uri)repo_type_and_id_from_hf_id(used byRepoUrl): it's a mixed parser accepting bothhf://URIs andhttps://huggingface.co/...URLs. Likely split into two helpers, with the URI half delegating toparse_hf_uri.RepoUrlmigration would benefit.Note
Medium Risk
Adds new public parsing APIs and stricter URI semantics (e.g., requiring
namespace/name), which could affect downstream users who adopt the new helpers or rely on older, looser examples.Overview
Introduces a new centralized
hf://...parser inutils/_hf_uris.py, including frozenHfUri/HfMountdataclasses with canonical round-tripping (to_uri) and strict validation/error reporting via a newHfUriError.Exports
HfUriandparse_hf_urifrom the package (huggingface_hubandhuggingface_hub.utils), adds shared URI constants/types inconstants.py, and adds comprehensive unit tests plus a new docs reference page (package_reference/hf_uris) linked from the docs toctree. CLI volume help text/examples are updated to reflect namespaced model URIs (no morehf://gpt2).Reviewed by Cursor Bugbot for commit 116f6ea. Bugbot is set up for automated code reviews on this repo. Configure here.