Document exclusive-end --chunk range in curate_pai_samples.py #87

Open
lonexreb wants to merge 1 commit into NVlabs:main from lonexreb:docs/curate-pai-chunk-range-help

Conversation

Contributor

@lonexreb lonexreb commented May 4, 2026

Problem

scripts/curate_pai_samples.py and scripts/download_pai.py both accept a start-end range form for chunk IDs and both implement it as range(start, end) (exclusive of end).

download_pai.py spells this out in its --chunk-ids help string:

range '0-3' (exclusive end, downloads 0,1,2)

curate_pai_samples.py does not, and its docstring example actively misleads:

"""
example:
python scripts/curate_pai_samples.py \\
  --clip-index-path /path/to/PAI_datset/clip_index.parquet \\
  --chunk 3116-3119 --num-samples 16 \\
  ...
"""

A user following that example reasonably expects 4 chunks (3116, 3117, 3118, 3119) but silently gets 3 (3116, 3117, 3118), because:

def _parse_chunk_ids(chunk_arg: str) -> list[str]:
    if "-" in chunk_arg:
        start_s, end_s = chunk_arg.split("-", 1)
        start, end = int(start_s.strip()), int(end_s.strip())
        return [str(i) for i in range(start, end)]
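
To make the off-by-one concrete, here is a minimal sketch that reuses the range branch quoted above. The function name `parse_chunk_ids` and the single-ID fallback are assumptions for illustration only; the PR does not show what the real `_parse_chunk_ids` does when the argument is not a range.

```python
def parse_chunk_ids(chunk_arg: str) -> list[str]:
    """Illustrative copy of the range branch; the fallback is an assumption."""
    if "-" in chunk_arg:
        start_s, end_s = chunk_arg.split("-", 1)
        start, end = int(start_s.strip()), int(end_s.strip())
        # range() stops before `end`, so the upper bound is never included.
        return [str(i) for i in range(start, end)]
    # Assumed fallback (not shown in the PR): treat the argument as one ID.
    return [chunk_arg.strip()]


# Old docstring example: three chunks, not four.
print(parse_chunk_ids("3116-3119"))  # ['3116', '3117', '3118']
# New docstring example: four chunks, with 3120 excluded.
print(parse_chunk_ids("3116-3120"))  # ['3116', '3117', '3118', '3119']
```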

Fix

Pure docs / UX change. The _parse_chunk_ids implementation is unchanged.

  • Update the --chunk argparse help to mirror download_pai.py's wording and explicitly call out the exclusive-end behavior, with a cross-reference so the two stay in lockstep.
  • Adjust the docstring example from --chunk 3116-3119 to --chunk 3116-3120 and add an inline note clarifying that 3120 is excluded. The new example selects 4 chunks (3116, 3117, 3118, 3119), matching the obvious reading of the original example.
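
As a rough sketch of the kind of help string described in the first bullet, the snippet below defines a standalone parser with a `--chunk` option. The help wording, the `build_parser` helper, and the parser description are hypothetical; the actual text added by this PR may differ.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical stand-in for the parser in curate_pai_samples.py.
    parser = argparse.ArgumentParser(
        description="Sketch of a --chunk option with exclusive-end help text."
    )
    parser.add_argument(
        "--chunk",
        type=str,
        help=(
            "Chunk ID or range. A range '3116-3120' has an exclusive end and "
            "selects 3116,3117,3118,3119 (same convention as --chunk-ids in "
            "download_pai.py)."
        ),
    )
    return parser


args = build_parser().parse_args(["--chunk", "3116-3120"])
print(args.chunk)  # the raw string; range expansion happens later
```

Keeping the example range and the listed IDs in the help text itself means `--help` output shows the convention without the user having to read the source.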

Verification

No code logic touched. New help text correctly describes existing behavior of _parse_chunk_ids.

scripts/curate_pai_samples.py and scripts/download_pai.py both accept a
"start-end" range form for chunk IDs and both implement it as
range(start, end), i.e. exclusive of `end`. download_pai.py spells this
out in its --chunk-ids help string ("range '0-3' (exclusive end,
downloads 0,1,2)"), but curate_pai_samples.py does not -- so a user
following the in-file example `--chunk 3116-3119 --num-samples 16`
silently gets 3 chunks (3116-3118), not the 4 they probably expected.

This commit:

- Updates the --chunk argparse help to mirror the wording used by
  download_pai.py, explicitly calls out the exclusive-end behavior,
  and cross-references download_pai.py so the two stay in lockstep.
- Adjusts the docstring example from `--chunk 3116-3119` to
  `--chunk 3116-3120` and adds an inline note clarifying that 3120 is
  excluded. The new example selects 4 chunks (3116, 3117, 3118,
  3119), which matches the obvious reading of the original example.

Pure docs/UX change. The implementation of `_parse_chunk_ids` is
unchanged.

Signed-off-by: lonexreb <[email protected]>