-
-
Notifications
You must be signed in to change notification settings - Fork 1.7k
csplit: rewrite with one function per pattern type #7806
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
GNU testsuite comparison: |
1 similar comment
|
GNU testsuite comparison: |
|
It is still in draft mode :) |
89302d5 to
257f26f
Compare
|
GNU testsuite comparison: |
7f0ef59 to
77d8a07
Compare
|
GNU testsuite comparison: |
This pull request fixes several bugs in `csplit` by rewriting the code
to have one function per pattern type. The functions are
* `up_to_line_loop` for patterns like `123 {45}` and `123 {*}`,
* `up_to_match_pos_offset_repeat` for `/regex/+123 {45}`,
* `up_to_match_pos_offset_always` for `/regex/+123 {*}`,
* `up_to_match_neg_offset_repeat` for `/regex/-123 {45}`,
* `up_to_match_neg_offset_always` for `/regex/-123 {*}`,
* `skip_to_match_pos_offset_repeat` for `%regex%+123 {45}`,
* `skip_to_match_pos_offset_always` for `%regex%+123 {*}`,
* `skip_to_match_neg_offset_repeat` for `%regex%-123 {45}`,
* `skip_to_match_neg_offset_always` for `%regex%-123 {*}`,
The command-line options, global state, and per-chunk state live in the
`Splitter` struct (which replaces the `SplitWriter` and `InputSplitter`
structs). There are a ton of ugly corner cases where different types of
patterns interact; most of these should be called out with comments.
I added some tests that were already passing, but were missing test
coverage, including
* same regex pattern twice in a row, as in `/foo/ /foo/`
* line number out of range on empty input, as in `: | csplit - 1`,
* skip to match with negative offset repeating indefinitely, as in
`csplit %foo%-5 {*}`.
Some bugs fixed include:
* missing a regex match on the final line of a line number pattern, as
in `seq 20 | csplit - 10 /10/`,
* missing output `0\n` when input is a directory,
* uutils#7286, a missing empty file with `--suppress-matched`.
Some bugs remain:
* The GNU test file `tests/csplit/csplit-suppress-matched` remains
failing. I'm hoping that the change in this pull request will make it
easier to reason about csplit and fix the remaining failures.
* Probably related to the previous item, there is some weird behavior of
GNU csplit that I could not understand. This was already partially
covered by a test named `repeat_everything`. I've extracted just the
relevant part of that into its own test named
`test_repeat_line_after_match_suppress_matched` that has TODO comments
for the behavior that differs from GNU `csplit`.
Fixes uutils#7286
77d8a07 to
6e9d4f8
Compare
|
GNU testsuite comparison: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR refactors csplit by rewriting pattern handling into dedicated functions per pattern type and fixes several bugs. Key changes include:
- Refactored pattern matching logic to use one function per pattern type.
- Expanded test coverage to validate various edge cases and updated expected outputs.
- Consolidated global and per-chunk state in the Splitter struct.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| tests/by-util/test_csplit.rs | New and updated tests to cover edge cases and validate new behavior. |
| src/uu/csplit/src/patterns.rs | Simplified pattern execution by replacing an iterator with a function. |
Comments suppressed due to low confidence (3)
tests/by-util/test_csplit.rs:1558
- [nitpick] The Windows section in test_directory_input_file contains a commented-out stdout expectation. Please clarify the intended behavior on Windows to ensure the test covers the proper output.
// .stdout_is("0\n")
src/uu/csplit/src/patterns.rs:51
- [nitpick] Consider renaming the parameter 'i' to a more descriptive name, such as 'iteration_count', for improved readability.
pub(crate) fn should_continue(&self, i: usize) -> bool {
tests/by-util/test_csplit.rs:1388
- The change in test_line_num_range_with_up_to_match3 from failing to succeeding, with output '18\n0\n123\n', differs from previous behavior. Confirm that this behavior change is intentional and well documented.
.succeeds()
|
Sorry, I was hoping to find a way to make this easier to review, but I couldn't. I will come back to this and rebase it |
…ixes uutils#7286) Problem - With , uutils csplit failed to create the final empty output file when the last split point consumed the trailing input. - GNU csplit always creates a final segment after processing patterns; with , that empty final segment is elided. - This broke the GNU test (tests/csplit/csplit-suppress-matched.pl) and the scenario: Root cause - Final-file creation was conditional on there being remaining input after pattern processing. If none remained, no final file was created, contrary to GNU semantics. Fix (minimal, targeted) - In , after : - If there is remaining input, always create a final split and copy the remainder, then finish. - Else, if all patterns were integer-based and is set, create a final (possibly empty) split and finish; elides it when is set. Tests (Rust integration) - Added two Rust tests under to lock down GNU-compatible behavior: - (expects sizes 2,2,2,0 and a final empty ) - (final empty file is correctly elided) Verification - All tests pass locally. The originally reported case now matches GNU. Relation to PR uutils#7806 - uutils#7806 proposes a broader refactor to fix multiple issues, but remains a draft and notes remaining GNU suppress-matched differences. - This PR provides a small, reviewable fix specifically for uutils#7286, plus precise integration tests to safeguard behavior. Fixes uutils#7286
…ixes uutils#7286) Problem - With , uutils csplit failed to create the final empty output file when the last split point consumed the trailing input. - GNU csplit always creates a final segment after processing patterns; with , that empty final segment is elided. - This broke the GNU test (tests/csplit/csplit-suppress-matched.pl) and the scenario: Root cause - Final-file creation was conditional on there being remaining input after pattern processing. If none remained, no final file was created, contrary to GNU semantics. Fix (minimal, targeted) - In , after : - If there is remaining input, always create a final split and copy the remainder, then finish. - Else, if all patterns were integer-based and is set, create a final (possibly empty) split and finish; elides it when is set. Tests (Rust integration) - Added two Rust tests under to lock down GNU-compatible behavior: - (expects sizes 2,2,2,0 and a final empty ) - (final empty file is correctly elided) Verification - All tests pass locally. The originally reported case now matches GNU. Relation to PR uutils#7806 - uutils#7806 proposes a broader refactor to fix multiple issues, but remains a draft and notes remaining GNU suppress-matched differences. - This PR provides a small, reviewable fix specifically for uutils#7286, plus precise integration tests to safeguard behavior. Fixes uutils#7286
|
it has been draft for a long time, please reopen when ready |
This pull request fixes several bugs in csplit by rewriting the code to have one function per pattern type. The functions are
up_to_line_repeatfor patterns like123 {45}up_to_line_alwaysfor123 {*}up_to_match_pos_offset_repeatfor/regex/+123 {45}up_to_match_pos_offset_alwaysfor/regex/+123 {*}up_to_match_neg_offset_repeatfor/regex/-123 {45}up_to_match_neg_offset_alwaysfor/regex/-123 {*}skip_to_match_pos_offset_repeatfor%regex%+123 {45}skip_to_match_pos_offset_alwaysfor%regex%+123 {*}skip_to_match_neg_offset_repeatfor%regex%-123 {45}skip_to_match_neg_offset_alwaysfor%regex%-123 {*}The command-line options, global state, and per-chunk state live in the
Splitterstruct (which replaces theSplitWriterandInputSplitterstructs). There are a ton of ugly corner cases where different types of patterns interact. I've tried to call those out with comments.I added some tests that were already passing, but were missing test coverage, including
/foo/ /foo/: | csplit - 1,csplit %foo%-5 {*}.Some bugs fixed include:
seq 20 | csplit - 10 /10/,0\nwhen input is a directory,--suppress-matched.Some bugs remain:
tests/csplit/csplit-suppress-matchedremains failing. I'm hoping that the change in this pull request will make it easier to reason about csplit and fix the remaining failures.repeat_everything. I've extracted just the relevant part of that into its own test namedtest_repeat_line_after_match_suppress_matchedthat has TODO comments for the behavior that differs from GNU csplit.Fixes #7286