Skip to content

Accept multiple --chunk-ids without quoting in download_pai.py#81

Open
lonexreb wants to merge 1 commit intoNVlabs:mainfrom
lonexreb:fix/download-pai-chunk-ids-multi-arg
Open

Accept multiple --chunk-ids without quoting in download_pai.py#81
lonexreb wants to merge 1 commit intoNVlabs:mainfrom
lonexreb:fix/download-pai-chunk-ids-multi-arg

Conversation

@lonexreb
Copy link
Copy Markdown
Contributor

@lonexreb lonexreb commented May 4, 2026

Summary

The `--chunk-ids` help text already advertises a `multi '0 1'` form, and `docs/FINETUNE_SFT.md` reinforces it ("multiple ids (`0 1`)"). But `argparse` was wired with `type=str` and no `nargs`, so that form actually fails:

```console
$ python scripts/download_pai.py --chunk-ids 0 1 2
usage: download_pai.py [-h] ...
download_pai.py: error: unrecognized arguments: 1 2
```

Switching to `nargs="+"` plus a small `_expand_chunk_ids` helper lines the CLI up with what the docs already promise.

Behavior matrix

Invocation Before After
`--chunk-ids 0` `[0]` `[0]`
`--chunk-ids 0-3` `[0, 1, 2]` `[0, 1, 2]`
`--chunk-ids "0 1 2"` (quoted) `["0", "1", "2"]` (later `int()`-coerced) `[0, 1, 2]`
`--chunk-ids 0 1 2` (unquoted) argparse error `[0, 1, 2]`
`--chunk-ids 0-2 5` argparse error `[0, 1, 5]`

Every previously valid invocation still produces the same flat list of integer chunk ids that `build_allow_patterns` consumes — strictly more permissive.

Test plan

  • Manual: ran `_expand_chunk_ids` against 7 input shapes (None, single, multi, range, mixed, multi-range, legacy quoted) — all produce the expected `list[int]`.
  • `python scripts/download_pai.py --help` renders the updated help string and the new `CHUNK_IDS [CHUNK_IDS ...]` signature.
  • No download was triggered (the actual HF call is unchanged).

Existing example invocations in the docs continue to work:

  • `docs/FINETUNE_SFT.md`: `--chunk-ids 0-10` ✓
  • `finetune/rl/README.md`: `--chunk-ids 3116` ✓

The `--chunk-ids` help text and the FINETUNE_SFT example both document a
``multi '0 1'`` form, but argparse was using ``type=str`` with no ``nargs``,
so that form failed:

    $ python scripts/download_pai.py --chunk-ids 0 1 2
    error: unrecognized arguments: 1 2

Switch to ``nargs="+"`` and route the list through a small ``_expand_chunk_ids``
helper. Now all of these work and produce the same flat ``list[int]`` of chunk
ids:

  * ``--chunk-ids 0``          -> ``[0]``
  * ``--chunk-ids 0 1 2``      -> ``[0, 1, 2]``  (newly works unquoted)
  * ``--chunk-ids 0-3``        -> ``[0, 1, 2]``  (range, end exclusive)
  * ``--chunk-ids 0-2 5``      -> ``[0, 1, 5]``  (mix of range and singles)
  * ``--chunk-ids "0 1 2"``    -> ``[0, 1, 2]``  (legacy quoted form preserved)

The change is strictly more permissive — every previously valid invocation
still produces the same list of integers.

Signed-off-by: lonexreb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant