Skip to content

Comments

feat: automatically batch large file lists to prevent ARG_MAX errors#338

Merged
jdx merged 3 commits intomainfrom
feat/auto-batch-large-file-lists
Oct 4, 2025
Merged

feat: automatically batch large file lists to prevent ARG_MAX errors#338
jdx merged 3 commits intomainfrom
feat/auto-batch-large-file-lists

Conversation

@jdx
Copy link
Owner

@jdx jdx commented Oct 4, 2025

Summary

  • Adds automatic batching of file lists when they would exceed system ARG_MAX limits
  • Prevents "Argument list too long" errors when passing thousands of files to commands via {{files}}
  • Uses binary search to find optimal batch size within safe limits

Problem

When {{files}} template variable expands to a very large string (e.g., thousands of files with long paths), it can exceed the system's ARG_MAX limit, causing "Argument list too long" errors. Users had no way to work around this without manually batching.

Solution

Implemented transparent automatic batching that:

  • Detects system ARG_MAX using sysconf() with conservative 128KB fallback
  • Estimates file list size including shell quoting overhead
  • Automatically splits jobs when estimated size exceeds 50% of ARG_MAX (safety margin)
  • Uses binary search to find optimal batch size
  • Works transparently without user configuration

Changes

  • src/env.rs: Added memoized ARG_MAX detection
  • src/step.rs:
    • estimate_files_string_size() - calculates expected {{files}} string size
    • auto_batch_jobs_if_needed() - automatically batches large jobs
    • Integrated into build_step_jobs() pipeline
  • Cargo.toml: Added libc dependency
  • test/auto_batch_large_files.bats: Comprehensive tests for batching behavior

Test plan

  • Added new bats tests for auto-batching with large file lists
  • Added tests verifying small file lists aren't unnecessarily batched
  • Added tests for fix commands with batching
  • All existing cargo tests pass (91 passed)
  • All existing bats tests pass (check.bats, glob_dir_bug.bats, dir.bats, exclude.bats)

Notes

  • Auto-batching doesn't preserve workspace_indicator due to StepJob private field constraints
  • This is acceptable since workspace jobs are typically small and won't trigger batching
  • The 50% safety margin accounts for environment variables and command overhead

🤖 Generated with Claude Code


Note

Automatically batches steps when {{files}} would exceed safe ARG_MAX, estimating size and splitting via binary search using sysconf-detected limits.

  • Core batching logic:
    • src/step.rs:
      • Add estimate_files_string_size() to approximate {{files}} expansion size (with quoting/spacing).
      • Add auto_batch_jobs_if_needed() to split oversized jobs using binary search and a 50% ARG_MAX safety margin.
      • Integrate auto-batching into build_step_jobs() after initial job creation.
  • Environment:
    • src/env.rs: Add memoized ARG_MAX (via libc::sysconf(_SC_ARG_MAX) with 128KB fallback).
  • Dependencies:
    • Cargo.toml: Add libc dependency.
  • Tests:
    • test/auto_batch_large_files.bats: Add tests for large lists, small lists (no batching), and fix-mode batching.

Written by Cursor Bugbot for commit b940021. This will update automatically on new commits. Configure here.

When {{files}} template variable expands to a very large string (e.g., thousands
of files), it can exceed the system's ARG_MAX limit, causing "Argument list too long"
errors. This change automatically detects when file lists would exceed safe limits
and batches them into multiple job executions.

Implementation:
- Added ARG_MAX detection in env.rs using sysconf() with 128KB fallback
- Implemented auto_batch_jobs_if_needed() in step.rs that:
  - Estimates file list size including shell quoting overhead
  - Uses binary search to find optimal batch size
  - Splits jobs when estimated size exceeds 50% of ARG_MAX (safety margin)
- Added comprehensive bats tests for auto-batching behavior

The batching is transparent to users and preserves correct behavior for:
- Small file lists (no batching needed)
- Fix and check commands
- Multiple file patterns

Note: Auto-batching doesn't preserve workspace_indicator due to private field
constraints in StepJob. This is acceptable since workspace jobs are typically
small and won't trigger batching.
cursor[bot]

This comment was marked as outdated.

jdx added 2 commits October 4, 2025 17:25
- Remove initialization of batch_size to avoid using stale values
- Ensure low is always updated correctly in binary search
- Add .max(1) safeguard to ensure batch_size is never 0
- Remove redundant .min() call since mid is always <= high <= len()
- Add debug logging for batch size selection

This fixes potential panics when:
1. job.files.len() is 1 (batch_size was 0)
2. Single file exceeds limit (batch_size might be too large)
@jdx jdx enabled auto-merge (squash) October 4, 2025 22:28
@jdx jdx merged commit 9a0466f into main Oct 4, 2025
12 checks passed
@jdx jdx deleted the feat/auto-batch-large-file-lists branch October 4, 2025 22:31
@jdx jdx mentioned this pull request Oct 4, 2025
jdx added a commit that referenced this pull request Oct 4, 2025
…338)

## Summary
- Adds automatic batching of file lists when they would exceed system
ARG_MAX limits
- Prevents "Argument list too long" errors when passing thousands of
files to commands via {{files}}
- Uses binary search to find optimal batch size within safe limits

## Problem
When {{files}} template variable expands to a very large string (e.g.,
thousands of files with long paths), it can exceed the system's ARG_MAX
limit, causing "Argument list too long" errors. Users had no way to work
around this without manually batching.

## Solution
Implemented transparent automatic batching that:
- Detects system ARG_MAX using sysconf() with conservative 128KB
fallback
- Estimates file list size including shell quoting overhead
- Automatically splits jobs when estimated size exceeds 50% of ARG_MAX
(safety margin)
- Uses binary search to find optimal batch size
- Works transparently without user configuration

## Changes
- **src/env.rs**: Added memoized ARG_MAX detection
- **src/step.rs**: 
- `estimate_files_string_size()` - calculates expected {{files}} string
size
  - `auto_batch_jobs_if_needed()` - automatically batches large jobs
  - Integrated into `build_step_jobs()` pipeline
- **Cargo.toml**: Added libc dependency
- **test/auto_batch_large_files.bats**: Comprehensive tests for batching
behavior

## Test plan
- [x] Added new bats tests for auto-batching with large file lists
- [x] Added tests verifying small file lists aren't unnecessarily
batched
- [x] Added tests for fix commands with batching
- [x] All existing cargo tests pass (91 passed)
- [x] All existing bats tests pass (check.bats, glob_dir_bug.bats,
dir.bats, exclude.bats)

## Notes
- Auto-batching doesn't preserve workspace_indicator due to StepJob
private field constraints
- This is acceptable since workspace jobs are typically small and won't
trigger batching
- The 50% safety margin accounts for environment variables and command
overhead

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Automatically batches steps when `{{files}}` would exceed safe
ARG_MAX, estimating size and splitting via binary search using
sysconf-detected limits.
> 
> - **Core batching logic**:
>   - `src/step.rs`:
> - Add `estimate_files_string_size()` to approximate `{{files}}`
expansion size (with quoting/spacing).
> - Add `auto_batch_jobs_if_needed()` to split oversized jobs using
binary search and a 50% ARG_MAX safety margin.
> - Integrate auto-batching into `build_step_jobs()` after initial job
creation.
> - **Environment**:
> - `src/env.rs`: Add memoized `ARG_MAX` (via
`libc::sysconf(_SC_ARG_MAX)` with 128KB fallback).
> - **Dependencies**:
>   - `Cargo.toml`: Add `libc` dependency.
> - **Tests**:
> - `test/auto_batch_large_files.bats`: Add tests for large lists, small
lists (no batching), and fix-mode batching.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
b940021. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->
jdx added a commit that referenced this pull request Oct 5, 2025
## [1.17.0](https://github.com/jdx/hk/compare/v1.16.0..v1.17.0) -
2025-10-05

### 🚀 Features

- Add hk util trailing-whitespace command by
[@jdx](https://github.com/jdx) in
[#319](#319)
- add mixed_line_ending builtin by [@jdx](https://github.com/jdx) in
[#324](#324)
- add check_symlinks builtin by [@jdx](https://github.com/jdx) in
[#326](#326)
- add check_executables_have_shebangs builtin by
[@jdx](https://github.com/jdx) in
[#325](#325)
- Add check-merge-conflict util command and builtin by
[@jdx](https://github.com/jdx) in
[#322](#322)
- add check_case_conflict builtin by [@jdx](https://github.com/jdx) in
[#323](#323)
- add detect_private_key builtin by [@jdx](https://github.com/jdx) in
[#332](#332)
- add check_added_large_files builtin by [@jdx](https://github.com/jdx)
in [#329](#329)
- add python_debug_statements builtin by [@jdx](https://github.com/jdx)
in [#331](#331)
- add python_check_ast builtin by [@jdx](https://github.com/jdx) in
[#330](#330)
- add no_commit_to_branch builtin by [@jdx](https://github.com/jdx) in
[#333](#333)
- add check_byte_order_marker and fix_byte_order_marker builtins by
[@jdx](https://github.com/jdx) in
[#328](#328)
- add regex pattern support for glob and exclude by
[@jdx](https://github.com/jdx) in
[#336](#336)
- automatically batch large file lists to prevent ARG_MAX errors by
[@jdx](https://github.com/jdx) in
[#338](#338)

### 🐛 Bug Fixes

- Add validation for stage attribute requiring fix command by
[@jdx](https://github.com/jdx) in
[#327](#327)
- display stderr when check_list_files returns empty list by
[@jdx](https://github.com/jdx) in
[#334](#334)
- added new builtins to Builtins.pkl by [@jdx](https://github.com/jdx)
in
[b8a2b17](b8a2b17)
- enable experimental settings in mise.toml for swift support by
[@jdx](https://github.com/jdx) in
[#342](#342)
- correct airflow migration test to expect local imports by
[@jdx](https://github.com/jdx) in
[#343](#343)
- make final CI check always run and fail if dependencies fail by
[@jdx](https://github.com/jdx) in
[#344](#344)
- add ruff format to ruff builtin by [@jdx](https://github.com/jdx) in
[#340](#340)

### 🚜 Refactor

- Split util module into separate files by
[@jdx](https://github.com/jdx) in
[#321](#321)

### 🛡️ Security

- migrate pre-commit by [@jdx](https://github.com/jdx) in
[#318](#318)

### 🔍 Other Changes

- split CI runs into parallel jobs and add docs-sync mise task by
[@jdx](https://github.com/jdx) in
[#337](#337)
- remove v0 pkl files from docs/public by [@jdx](https://github.com/jdx)
in [#341](#341)

<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Release 1.17.0 with new migrate command, many util subcommands,
refreshed docs, and dependency updates.
> 
> - **Version**: bump `hk` to `1.17.0` (Cargo.toml, usage specs, docs).
> - **CLI**:
> - **New Command**: `migrate pre-commit` with flags (`--config`,
`--output`, `--force`, `--hk-pkl-root`).
> - **Util Subcommands**: add `check-added-large-files`,
`check-byte-order-marker`, `fix-byte-order-marker`,
`check-case-conflict`, `check-executables-have-shebangs`,
`check-merge-conflict`, `check-symlinks`, `detect-private-key`,
`end-of-file-fixer`, `mixed-line-ending`, `no-commit-to-branch`,
`python-check-ast`, `python-debug-statements`, `trailing-whitespace`.
> - **Docs**:
> - Update `docs/cli/index.md`, regenerate `docs/cli/commands.json`, and
`hk.usage.kdl`.
>   - Split `util` docs into per-command pages and add `migrate` docs.
> - **Dependencies**: update `Cargo.lock` crate versions and set `hk`
crate version to `1.17.0`.
> - **Changelog**: add `CHANGELOG.md` entry for `1.17.0` with
features/bug fixes.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
75b972a. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

Co-authored-by: mise-en-dev <[email protected]>
jdx pushed a commit that referenced this pull request Nov 19, 2025
As an attempt to avoid the automatic batching and `ARG_MAX_STRLEN`
issues, add a `stdin` step config. The expectation is that the user will
configure `stdin` be the joined files (they do the joining via Tera
filters) and then `check`/`fix` use `stdin` somehow (most likely by
leveraging `xargs`).

This change also:
- Errors if the user configures `interactive` and` stdin`
- Along with the necessary plumbing of `-> Result<()>` through
`.init(...)` calls
- Adds new Tera variable `files_list`

Open Qs:
- Where should this be suggested/documented?
- Should any of the builtins leverage this? (At least the `ruff` version
which allows `@/dev/stdin`)
- ...why does `interactive` hook up stdin? Is there a real-world
use-case in mind?
- Because if it didn't we could still use `interactive` 😉 and wouldn't
need to error
- `xargs` would actually avoid the strlen issue. Was it considered
(either implicit or explicit) as an alternative to
#338? (This wouldn't fix `ARG_MAX_STRLEN`
issues, but would fix `ARG_MAX` issues that wouldn't already hit
`ARG_MAX_STRLEN`)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant