Centralize hf:// URI parsing #4158

Merged
Wauplin merged 27 commits into main from harmonize-hf-handles on Apr 29, 2026

Contributor

@Wauplin Wauplin commented Apr 28, 2026

Summary

Adds a single source of truth for parsing hf://... URIs, addressing #3971.

  • New HfUri dataclass and parse_hf_uri helper, plus shared constants in huggingface_hub.constants (HF_PROTOCOL, HfUriType, HF_URI_TYPE_PREFIXES).
  • Separate HfMount dataclass and parse_hf_mount helper for volume mount specifications (hf://...:<MOUNT_PATH>[:ro|:rw]). Mount logic is kept separate from URI parsing: HfUri is a pure location identifier, and HfMount wraps an HfUri with a mount path and a read-only flag.
  • A reference page documenting the canonical syntax for both URIs and mounts, and what does (and does not) qualify as an HF URI. The syntax is strict by design (plural-only type prefixes, no canonical repos without a namespace, etc.).
  • Pure string parser, no network calls. Round-trippable via HfUri.to_uri() and HfMount.to_uri().
>>> from huggingface_hub import parse_hf_uri
>>> parse_hf_uri("hf://datasets/namespace/my-dataset@refs/pr/3/train.json")
HfUri(type='dataset', id='namespace/my-dataset', revision='refs/pr/3', path_in_repo='train.json')

>>> from huggingface_hub import parse_hf_mount
>>> parse_hf_mount("hf://buckets/my-org/my-bucket/sub/dir:/mnt:ro")
HfMount(source=HfUri(type='bucket', id='my-org/my-bucket', revision=None, path_in_repo='sub/dir'), mount_path='/mnt', read_only=True)
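
For readers who want to see the shape of a pure-string parser like the one described above, here is a minimal sketch. It is not the code from this PR: SketchUri, parse_sketch_uri and PREFIXES are hypothetical names, the prefix table is abbreviated, and the real parse_hf_uri validates considerably more (for example, it rejects '@' in bucket URIs).

```python
import re
from dataclasses import dataclass
from typing import Optional

# Plural URI prefix -> singular type (abbreviated, illustrative table).
PREFIXES = {"models": "model", "datasets": "dataset", "spaces": "space", "buckets": "bucket"}
# Revisions that legitimately contain '/' (pattern taken from this PR's review thread).
SPECIAL_REFS = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)")


@dataclass(frozen=True)
class SketchUri:
    type: str
    id: str
    revision: Optional[str] = None
    path_in_repo: str = ""


def parse_sketch_uri(uri: str) -> SketchUri:
    """Parse 'hf://<prefix>/<namespace>/<name>[@<revision>][/<path>]' with no network call."""
    if not uri.startswith("hf://"):
        raise ValueError(f"Not an hf:// URI: {uri!r}")
    prefix, _, rest = uri[len("hf://"):].partition("/")
    if prefix not in PREFIXES:
        raise ValueError(f"Unknown type prefix {prefix!r} in {uri!r}")

    repo_part, at, after_at = rest.partition("@")
    revision: Optional[str] = None
    if at:
        # A special ref keeps its slashes; any other revision ends at the next '/'.
        special = SPECIAL_REFS.match(after_at)
        revision = special.group(0) if special else after_at.split("/", 1)[0]
        path = after_at[len(revision):].lstrip("/")
        segments = repo_part.split("/")
    else:
        pieces = rest.split("/", 2)
        segments, path = pieces[:2], (pieces[2] if len(pieces) > 2 else "")

    # Strict by design: exactly 'namespace/name', both parts non-empty.
    if len(segments) != 2 or not all(segments):
        raise ValueError(f"Expected 'namespace/name' in {uri!r}")
    return SketchUri(PREFIXES[prefix], "/".join(segments), revision, path)


print(parse_sketch_uri("hf://datasets/namespace/my-dataset@refs/pr/3/train.json"))
```

The sketch rejects hf://gpt2 and hf://models/gpt2 for the same reason the PR does: no plural type prefix in the first case, no namespace in the second.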

Follow-ups (not in this PR)

  • Migrate existing call sites to the new parser:
    • HfFileSystem.resolve_path (keeps its 1- vs 2-segment network fallback on top of the parser)
    • _parse_hf_copy_handle in hf_api.py (the spot already carries a # TODO referencing Harmonize hf:// parsing logic #3971)
    • parse_volumes in cli/_cli_utils.py
    • _split_bucket_id_and_prefix / _parse_bucket_path / _is_bucket_path in _buckets.py and _parse_bucket_argument / _is_hf_handle in cli/buckets.py
    • Volume.to_hf_handle formatter in _space_api.py (mirror of HfMount.to_uri)
  • Decide what to do with repo_type_and_id_from_hf_id (used by RepoUrl): it's a mixed parser accepting both hf:// URIs and https://huggingface.co/... URLs. Likely split into two helpers, with the URI half delegating to parse_hf_uri.
  • Possibly extend the parser later to accept HTTPS Hub URLs, if the RepoUrl migration would benefit.

Note

Medium Risk
Adds new public parsing APIs and stricter URI semantics (e.g., requiring namespace/name), which could affect downstream users who adopt the new helpers or rely on older, looser examples.

Overview
Introduces a new centralized hf://... parser in utils/_hf_uris.py, including frozen HfUri/HfMount dataclasses with canonical round-tripping (to_uri) and strict validation/error reporting via a new HfUriError.

Exports HfUri and parse_hf_uri from the package (huggingface_hub and huggingface_hub.utils), adds shared URI constants/types in constants.py, and adds comprehensive unit tests plus a new docs reference page (package_reference/hf_uris) linked from the docs toctree. CLI volume help text/examples are updated to reflect namespaced model URIs (no more hf://gpt2).

Reviewed by Cursor Bugbot for commit 116f6ea.

Introduces a single ``parse_hf_uri`` helper and ``HfUri`` dataclass under
``huggingface_hub.utils``, plus a reference page documenting the canonical
``hf://`` syntax and what does (and does not) qualify as a HF URI. Related
constants (``HF_PROTOCOL``, ``HfUriType``, ``HF_URI_TYPE_PREFIXES``) live in
``huggingface_hub.constants``.

Addresses #3971.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@bot-ci-comment

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Wauplin Wauplin changed the title from "[Utils] Centralize hf:// URI parsing" to "Centralize hf:// URI parsing" on Apr 28, 2026
Wauplin and others added 11 commits April 28, 2026 10:56
Adds a dedicated `HfUriError(ValueError)` in `huggingface_hub.errors` and
raises it from `parse_hf_uri` instead of plain `ValueError`. Inheriting from
`ValueError` keeps backward compatibility for callers that catch the latter.
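
The backward-compatibility argument can be illustrated with a small sketch. SketchHfUriError and parse_or_raise are hypothetical stand-ins; the uri/msg keyword arguments mirror the call seen in this PR's diff, but the message format is an assumption.

```python
class SketchHfUriError(ValueError):
    """Hypothetical stand-in for the PR's HfUriError."""

    def __init__(self, uri: str, msg: str) -> None:
        self.uri = uri
        self.msg = msg
        # Message layout is an assumption made for this sketch.
        super().__init__(f"Invalid HF URI {uri!r}: {msg}")


def parse_or_raise(uri: str) -> str:
    """Return the part after 'hf://' or raise the dedicated error."""
    if not uri.startswith("hf://"):
        raise SketchHfUriError(uri=uri, msg="missing 'hf://' prefix")
    return uri[len("hf://"):]


caught = None
try:
    parse_or_raise("https://huggingface.co/gpt2")
except ValueError as err:  # handlers written for plain ValueError keep working
    caught = err

print(type(caught).__name__, caught.uri)
```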

Also drops the unnecessary `from __future__ import annotations` in the new
modules — the project requires Python 3.10+ where `str | None` works at
runtime, no forward refs are involved, and only 4/136 src files use it.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Switch from `[\`name\`][huggingface_hub.utils.module.name]` to the shorter
`[\`name\`]` form throughout the new module's docstrings and reference
page, matching the convention used elsewhere.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Use the `type` and `id` fields directly. The `is_repo`/`is_bucket` booleans
remain available to disambiguate when needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
More descriptive — pairs naturally with parse_hf_uri.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@Wauplin Wauplin requested a review from hanouticelina April 28, 2026 10:11
@Wauplin Wauplin marked this pull request as ready for review April 28, 2026 10:12
Comment thread src/huggingface_hub/utils/_hf_uris.py
Wauplin added 3 commits April 28, 2026 12:20
Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
…nvert refs

- to_uri() now URL-encodes '/' in non-special revisions so 'feature/foo' round-trips correctly.
- _SPECIAL_REFS_REVISION_REGEX accepts '-' and '.' in convert ref names (e.g. 'refs/convert/parquet-v2').
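
A rough sketch of the encoding rule described in this commit message, assuming percent-encoding via urllib.parse.quote (the PR does not say which function it uses). encode_revision and format_uri are hypothetical helper names.

```python
import re
from urllib.parse import quote, unquote

# Pattern quoted in this PR for special refs, which keep their slashes.
SPECIAL_REFS = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)")


def encode_revision(revision: str) -> str:
    """Leave special refs untouched; percent-encode '/' in everything else."""
    if SPECIAL_REFS.fullmatch(revision):
        return revision
    return quote(revision, safe="")


def format_uri(prefix: str, repo_id: str, revision: str, path: str = "") -> str:
    """Build 'hf://<prefix>/<repo_id>@<revision>[/<path>]'."""
    uri = f"hf://{prefix}/{repo_id}@{encode_revision(revision)}"
    return f"{uri}/{path}" if path else uri


print(format_uri("models", "org/model", "feature/foo", "config.json"))
print(format_uri("datasets", "org/data", "refs/convert/parquet-v2"))
```

Without the encoding step, a parser reading hf://models/org/model@feature/foo/config.json could not tell where the branch name ends and the file path begins.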
"spaces": "space",
"kernels": "kernel",
"buckets": "bucket",
}

New constant duplicates existing REPO_TYPES_MAPPING data

Low Severity

HF_URI_TYPE_PREFIXES duplicates the four repo-type entries already defined in REPO_TYPES_MAPPING (same file), only adding "buckets": "bucket". If a new repo type is later added to REPO_TYPES_MAPPING, a developer could easily forget to update HF_URI_TYPE_PREFIXES, causing the URI parser to silently reject valid URIs. Building HF_URI_TYPE_PREFIXES from REPO_TYPES_MAPPING (e.g. {**REPO_TYPES_MAPPING, "buckets": "bucket"}) would keep a single source of truth for repo-type mappings.
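
The suggestion could look roughly like this. The REPO_TYPES_MAPPING below is a local stand-in with illustrative values, not the library's constant, and build_uri_prefixes is a hypothetical helper.

```python
# Local stand-in for the existing mapping; values are illustrative only.
REPO_TYPES_MAPPING = {
    "models": "model",
    "datasets": "dataset",
    "spaces": "space",
    "kernels": "kernel",
}


def build_uri_prefixes(repo_types: dict) -> dict:
    """Derive the URI prefix table from the repo-type table, then add buckets."""
    return {**repo_types, "buckets": "bucket"}


HF_URI_TYPE_PREFIXES = build_uri_prefixes(REPO_TYPES_MAPPING)
print(sorted(HF_URI_TYPE_PREFIXES))
```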

Reviewed by Cursor Bugbot for commit 63df690.

Contributor Author

@Wauplin Wauplin Apr 28, 2026

not an issue IMO

Comment thread tests/test_utils_hf_uris.py Outdated
Comment thread src/huggingface_hub/utils/_hf_uris.py
Contributor

@hanouticelina hanouticelina left a comment

Made a first pass, looks good overall! mostly cosmetic comments

Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
Comment thread src/huggingface_hub/errors.py
Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
# (Pull Request refs) and 'refs/convert/<name>' (e.g. parquet conversions).
# The conversion name allows the typical git ref characters '[a-zA-Z0-9_.-]'
# so names like 'parquet-v2' or 'duckdb.v1' round-trip correctly.
_SPECIAL_REFS_REVISION_REGEX = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)")
Contributor

it's different from SPECIAL_REFS_REVISION_REGEX defined in hf_api.py right? let's maybe drop the one in hf_api.py, make this one public and import it there?

Contributor Author

no it's basically the same but I did not want to touch any existing code. The SPECIAL_REFS_REVISION_REGEX regex you are referring to will disappear once we use parse_hf_uri everywhere

# (Pull Request refs) and 'refs/convert/<name>' (e.g. parquet conversions).
# The conversion name allows the typical git ref characters '[a-zA-Z0-9_.-]'
# so names like 'parquet-v2' or 'duckdb.v1' round-trip correctly.
_SPECIAL_REFS_REVISION_REGEX = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)")

Duplicated special-refs regex with divergent matching behavior

Low Severity

_SPECIAL_REFS_REVISION_REGEX in _hf_uris.py uses [\w.-]+ for convert ref names, while the existing SPECIAL_REFS_REVISION_REGEX in hf_api.py uses \w+. The new pattern matches hyphens and dots (e.g. refs/convert/parquet-v2) that the old one rejects. Having two subtly different regexes for the same semantic concept risks inconsistent behavior across code paths that haven't yet migrated to the new parser.
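
The divergence can be checked with a short sketch. NEW_REGEX is the pattern quoted in this thread; OLD_STYLE_REGEX is only a stand-in built from the description "uses \w+", so the exact expression in hf_api.py is an assumption.

```python
import re
from typing import Optional

# Pattern quoted in this thread for _hf_uris.py.
NEW_REGEX = re.compile(r"^refs/(?:convert/[\w.-]+|pr/\d+)")
# Stand-in for the older pattern; only the '\w+' part is known from the comment.
OLD_STYLE_REGEX = re.compile(r"^refs/(?:convert/\w+|pr/\d+)")


def matched_ref(pattern: re.Pattern, revision: str) -> Optional[str]:
    """Return the prefix of `revision` that the pattern treats as a special ref."""
    match = pattern.match(revision)
    return match.group(0) if match else None


for revision in ("refs/convert/parquet-v2", "refs/convert/duckdb.v1", "refs/pr/3"):
    print(revision, matched_ref(NEW_REGEX, revision), matched_ref(OLD_STYLE_REGEX, revision))
```

With this unanchored stand-in, the old-style pattern stops at the first hyphen or dot instead of failing outright. Whether a call site treats that as a rejection depends on how it uses the match, which the sketch does not model.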

Reviewed by Cursor Bugbot for commit 439df11.

Contributor Author

as mentioned above (#4158 (comment)), SPECIAL_REFS_REVISION_REGEX will soon disappear so don't worry

"""Parse the body of a bucket URI: 'namespace/name[/path]'."""
if "@" in location:
    raise HfUriError(uri=raw, msg="Bucket URIs do not support a revision marker ('@').")
location = location.strip("/")

strip("/") silently accepts double slashes after type prefix

Low Severity

location.strip("/") in both _parse_bucket_body and _parse_repo_body removes leading slashes that result from double-slash inputs like hf://models//org/model or hf://buckets//org/b. After _split_type returns rest = /org/model (from the double slash), strip silently normalizes it to org/model, causing the URI to parse successfully. This is inconsistent with the explicit rejection of // within paths (e.g. hf://models/org/m//sub), which the PR reviewer specifically requested to be treated as errors.
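
The behavior described here is easy to reproduce in isolation. lenient_body mirrors the strip("/") call from the diff; strict_body is a hypothetical alternative, not something this PR implements.

```python
def lenient_body(rest: str) -> str:
    """What strip('/') does: leading and trailing slashes vanish silently."""
    return rest.strip("/")


def strict_body(rest: str) -> str:
    """Hypothetical stricter variant: any empty segment is an error."""
    if rest.startswith("/") or "//" in rest:
        raise ValueError(f"Empty path segment in {rest!r}")
    return rest


# 'hf://models//org/model' leaves '/org/model' once the type prefix is split off.
print(lenient_body("/org/model"))
try:
    strict_body("/org/model")
except ValueError as err:
    print(err)
```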

Reviewed by Cursor Bugbot for commit 0586cc5.

Co-Authored-By: Claude Opus 4.6 <[email protected]>
@Wauplin
Contributor Author

Wauplin commented Apr 29, 2026

Thanks for the review @hanouticelina ! I have addressed all of your comments.

Also as discussed in DMs, I have split the logic between HfUri and HfMount as it didn't feel right to me to include mount logic into HfUri. We now have parse_hf_mount which depends on parse_hf_uri to parse the mount source. This split should make the API more consistent while keeping the core logic unified.
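
As a rough illustration of the split, a mount parser only needs to peel off the optional :ro/:rw flag and the mount path, then hand the remainder to the URI parser. The sketch below keeps the source as a plain string instead of an HfUri; SketchMount and parse_sketch_mount are hypothetical names, and returning None when no flag is given is this sketch's choice, not a documented default.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SketchMount:
    source: str  # the real HfMount wraps an HfUri here
    mount_path: str
    read_only: Optional[bool]  # None when no flag is given (sketch's choice)


def parse_sketch_mount(spec: str) -> SketchMount:
    """Parse 'hf://...:<MOUNT_PATH>[:ro|:rw]' as a pure string operation."""
    if not spec.startswith("hf://"):
        raise ValueError(f"Not an hf:// mount specification: {spec!r}")
    body, read_only = spec, None
    for suffix, flag in ((":ro", True), (":rw", False)):
        if body.endswith(suffix):
            body, read_only = body[: -len(suffix)], flag
            break
    # Skip the scheme so its ':' is never mistaken for the mount separator.
    source_rest, sep, mount_path = body[len("hf://"):].rpartition(":")
    if not sep or not source_rest or not mount_path.startswith("/"):
        raise ValueError(f"Expected 'hf://...:<MOUNT_PATH>' in {spec!r}")
    # The real parser would pass this source string to parse_hf_uri.
    return SketchMount("hf://" + source_rest, mount_path, read_only)


print(parse_sketch_mount("hf://buckets/my-org/my-bucket/sub/dir:/mnt:ro"))
```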

@Wauplin Wauplin requested a review from hanouticelina April 29, 2026 08:42
Contributor

@hanouticelina hanouticelina left a comment

Thank you! left two small nits, it looks good and works as expected 🙌

Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
Comment thread src/huggingface_hub/utils/_hf_uris.py Outdated
@Wauplin
Contributor Author

Wauplin commented Apr 29, 2026

Thank you! Will merge if CI gets green and move forward on the integration :)


@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Reviewed by Cursor Bugbot for commit 116f6ea.

try:
    validate_repo_id(self.id)
except HFValidationError as e:
    raise HfUriError(uri=uri, msg=str(e)) from e

Bucket IDs with empty segments bypass validation

Medium Severity

The __post_init__ validation for HfUri checks self.id.count("/") != 1 but doesn't verify both segments are non-empty. For non-bucket types, validate_repo_id catches IDs like "org/" or "/name" via its regex. But bucket types explicitly skip validate_repo_id, so directly constructing HfUri(type="bucket", id="org/") or HfUri(type="bucket", id="/b") silently produces an invalid object. The parser's _parse_bucket_body has the correct not parts[0] or not parts[1] guard, but __post_init__ doesn't replicate it for the bucket case.
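
One way to close the gap described here is to check both segments in __post_init__ itself, so direct construction gets the same guard as the parser. This is a sketch under that assumption; SketchBucketUri is a hypothetical class, not the PR's HfUri.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SketchBucketUri:
    type: str
    id: str

    def __post_init__(self) -> None:
        # Require exactly one '/', with a non-empty part on each side.
        namespace, sep, name = self.id.partition("/")
        if not sep or "/" in name or not namespace or not name:
            raise ValueError(f"Expected 'namespace/name', got {self.id!r}")


print(SketchBucketUri(type="bucket", id="my-org/my-bucket"))
for bad_id in ("org/", "/b"):
    try:
        SketchBucketUri(type="bucket", id=bad_id)
    except ValueError as err:
        print(err)
```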

Reviewed by Cursor Bugbot for commit 116f6ea.

@Wauplin Wauplin merged commit 5fb553d into main Apr 29, 2026
21 checks passed
@Wauplin Wauplin deleted the harmonize-hf-handles branch April 29, 2026 15:29
@huggingface-hub-bot
Contributor

This PR has been shipped as part of the v1.13.0 release.

Member

this doc page is very cool
