feat: automatically batch large file lists to prevent ARG_MAX errors#338
Merged
feat: automatically batch large file lists to prevent ARG_MAX errors#338
Conversation
When {{files}} template variable expands to a very large string (e.g., thousands
of files), it can exceed the system's ARG_MAX limit, causing "Argument list too long"
errors. This change automatically detects when file lists would exceed safe limits
and batches them into multiple job executions.
Implementation:
- Added ARG_MAX detection in env.rs using sysconf() with 128KB fallback
- Implemented auto_batch_jobs_if_needed() in step.rs that:
- Estimates file list size including shell quoting overhead
- Uses binary search to find optimal batch size
- Splits jobs when estimated size exceeds 50% of ARG_MAX (safety margin)
- Added comprehensive bats tests for auto-batching behavior
The batching is transparent to users and preserves correct behavior for:
- Small file lists (no batching needed)
- Fix and check commands
- Multiple file patterns
Note: Auto-batching doesn't preserve workspace_indicator due to private field
constraints in StepJob. This is acceptable since workspace jobs are typically
small and won't trigger batching.
- Remove initialization of batch_size to avoid using stale values - Ensure low is always updated correctly in binary search - Add .max(1) safeguard to ensure batch_size is never 0 - Remove redundant .min() call since mid is always <= high <= len() - Add debug logging for batch size selection This fixes potential panics when: 1. job.files.len() is 1 (batch_size was 0) 2. Single file exceeds limit (batch_size might be too large)
Merged
jdx
added a commit
that referenced
this pull request
Oct 4, 2025
…338) ## Summary - Adds automatic batching of file lists when they would exceed system ARG_MAX limits - Prevents "Argument list too long" errors when passing thousands of files to commands via {{files}} - Uses binary search to find optimal batch size within safe limits ## Problem When {{files}} template variable expands to a very large string (e.g., thousands of files with long paths), it can exceed the system's ARG_MAX limit, causing "Argument list too long" errors. Users had no way to work around this without manually batching. ## Solution Implemented transparent automatic batching that: - Detects system ARG_MAX using sysconf() with conservative 128KB fallback - Estimates file list size including shell quoting overhead - Automatically splits jobs when estimated size exceeds 50% of ARG_MAX (safety margin) - Uses binary search to find optimal batch size - Works transparently without user configuration ## Changes - **src/env.rs**: Added memoized ARG_MAX detection - **src/step.rs**: - `estimate_files_string_size()` - calculates expected {{files}} string size - `auto_batch_jobs_if_needed()` - automatically batches large jobs - Integrated into `build_step_jobs()` pipeline - **Cargo.toml**: Added libc dependency - **test/auto_batch_large_files.bats**: Comprehensive tests for batching behavior ## Test plan - [x] Added new bats tests for auto-batching with large file lists - [x] Added tests verifying small file lists aren't unnecessarily batched - [x] Added tests for fix commands with batching - [x] All existing cargo tests pass (91 passed) - [x] All existing bats tests pass (check.bats, glob_dir_bug.bats, dir.bats, exclude.bats) ## Notes - Auto-batching doesn't preserve workspace_indicator due to StepJob private field constraints - This is acceptable since workspace jobs are typically small and won't trigger batching - The 50% safety margin accounts for environment variables and command overhead 🤖 Generated with [Claude Code](https://claude.com/claude-code) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Automatically batches steps when `{{files}}` would exceed safe ARG_MAX, estimating size and splitting via binary search using sysconf-detected limits. > > - **Core batching logic**: > - `src/step.rs`: > - Add `estimate_files_string_size()` to approximate `{{files}}` expansion size (with quoting/spacing). > - Add `auto_batch_jobs_if_needed()` to split oversized jobs using binary search and a 50% ARG_MAX safety margin. > - Integrate auto-batching into `build_step_jobs()` after initial job creation. > - **Environment**: > - `src/env.rs`: Add memoized `ARG_MAX` (via `libc::sysconf(_SC_ARG_MAX)` with 128KB fallback). > - **Dependencies**: > - `Cargo.toml`: Add `libc` dependency. > - **Tests**: > - `test/auto_batch_large_files.bats`: Add tests for large lists, small lists (no batching), and fix-mode batching. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit b940021. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY -->
jdx
added a commit
that referenced
this pull request
Oct 5, 2025
## [1.17.0](https://github.com/jdx/hk/compare/v1.16.0..v1.17.0) - 2025-10-05 ### 🚀 Features - Add hk util trailing-whitespace command by [@jdx](https://github.com/jdx) in [#319](#319) - add mixed_line_ending builtin by [@jdx](https://github.com/jdx) in [#324](#324) - add check_symlinks builtin by [@jdx](https://github.com/jdx) in [#326](#326) - add check_executables_have_shebangs builtin by [@jdx](https://github.com/jdx) in [#325](#325) - Add check-merge-conflict util command and builtin by [@jdx](https://github.com/jdx) in [#322](#322) - add check_case_conflict builtin by [@jdx](https://github.com/jdx) in [#323](#323) - add detect_private_key builtin by [@jdx](https://github.com/jdx) in [#332](#332) - add check_added_large_files builtin by [@jdx](https://github.com/jdx) in [#329](#329) - add python_debug_statements builtin by [@jdx](https://github.com/jdx) in [#331](#331) - add python_check_ast builtin by [@jdx](https://github.com/jdx) in [#330](#330) - add no_commit_to_branch builtin by [@jdx](https://github.com/jdx) in [#333](#333) - add check_byte_order_marker and fix_byte_order_marker builtins by [@jdx](https://github.com/jdx) in [#328](#328) - add regex pattern support for glob and exclude by [@jdx](https://github.com/jdx) in [#336](#336) - automatically batch large file lists to prevent ARG_MAX errors by [@jdx](https://github.com/jdx) in [#338](#338) ### 🐛 Bug Fixes - Add validation for stage attribute requiring fix command by [@jdx](https://github.com/jdx) in [#327](#327) - display stderr when check_list_files returns empty list by [@jdx](https://github.com/jdx) in [#334](#334) - added new builtins to Builtins.pkl by [@jdx](https://github.com/jdx) in [b8a2b17](b8a2b17) - enable experimental settings in mise.toml for swift support by [@jdx](https://github.com/jdx) in [#342](#342) - correct airflow migration test to expect local imports by [@jdx](https://github.com/jdx) in [#343](#343) - make final CI check always run and fail if dependencies fail by [@jdx](https://github.com/jdx) in [#344](#344) - add ruff format to ruff builtin by [@jdx](https://github.com/jdx) in [#340](#340) ### 🚜 Refactor - Split util module into separate files by [@jdx](https://github.com/jdx) in [#321](#321) ### 🛡️ Security - migrate pre-commit by [@jdx](https://github.com/jdx) in [#318](#318) ### 🔍 Other Changes - split CI runs into parallel jobs and add docs-sync mise task by [@jdx](https://github.com/jdx) in [#337](#337) - remove v0 pkl files from docs/public by [@jdx](https://github.com/jdx) in [#341](#341) <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Release 1.17.0 with new migrate command, many util subcommands, refreshed docs, and dependency updates. > > - **Version**: bump `hk` to `1.17.0` (Cargo.toml, usage specs, docs). > - **CLI**: > - **New Command**: `migrate pre-commit` with flags (`--config`, `--output`, `--force`, `--hk-pkl-root`). > - **Util Subcommands**: add `check-added-large-files`, `check-byte-order-marker`, `fix-byte-order-marker`, `check-case-conflict`, `check-executables-have-shebangs`, `check-merge-conflict`, `check-symlinks`, `detect-private-key`, `end-of-file-fixer`, `mixed-line-ending`, `no-commit-to-branch`, `python-check-ast`, `python-debug-statements`, `trailing-whitespace`. > - **Docs**: > - Update `docs/cli/index.md`, regenerate `docs/cli/commands.json`, and `hk.usage.kdl`. > - Split `util` docs into per-command pages and add `migrate` docs. > - **Dependencies**: update `Cargo.lock` crate versions and set `hk` crate version to `1.17.0`. > - **Changelog**: add `CHANGELOG.md` entry for `1.17.0` with features/bug fixes. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit 75b972a. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> Co-authored-by: mise-en-dev <[email protected]>
jdx
pushed a commit
that referenced
this pull request
Nov 19, 2025
As an attempt to avoid the automatic batching and `ARG_MAX_STRLEN` issues, add a `stdin` step config. The expectation is that the user will configure `stdin` be the joined files (they do the joining via Tera filters) and then `check`/`fix` use `stdin` somehow (most likely by leveraging `xargs`). This change also: - Errors if the user configures `interactive` and` stdin` - Along with the necessary plumbing of `-> Result<()>` through `.init(...)` calls - Adds new Tera variable `files_list` Open Qs: - Where should this be suggested/documented? - Should any of the builtins leverage this? (At least the `ruff` version which allows `@/dev/stdin`) - ...why does `interactive` hook up stdin? Is there a real-world use-case in mind? - Because if it didn't we could still use `interactive` 😉 and wouldn't need to error - `xargs` would actually avoid the strlen issue. Was it considered (either implicit or explicit) as an alternative to #338? (This wouldn't fix `ARG_MAX_STRLEN` issues, but would fix `ARG_MAX` issues that wouldn't already hit `ARG_MAX_STRLEN`)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Problem
When {{files}} template variable expands to a very large string (e.g., thousands of files with long paths), it can exceed the system's ARG_MAX limit, causing "Argument list too long" errors. Users had no way to work around this without manually batching.
Solution
Implemented transparent automatic batching that:
Changes
estimate_files_string_size()- calculates expected {{files}} string sizeauto_batch_jobs_if_needed()- automatically batches large jobsbuild_step_jobs()pipelineTest plan
Notes
🤖 Generated with Claude Code
Note
Automatically batches steps when
{{files}}would exceed safe ARG_MAX, estimating size and splitting via binary search using sysconf-detected limits.src/step.rs:estimate_files_string_size()to approximate{{files}}expansion size (with quoting/spacing).auto_batch_jobs_if_needed()to split oversized jobs using binary search and a 50% ARG_MAX safety margin.build_step_jobs()after initial job creation.src/env.rs: Add memoizedARG_MAX(vialibc::sysconf(_SC_ARG_MAX)with 128KB fallback).Cargo.toml: Addlibcdependency.test/auto_batch_large_files.bats: Add tests for large lists, small lists (no batching), and fix-mode batching.Written by Cursor Bugbot for commit b940021. This will update automatically on new commits. Configure here.