Skip to content

Conversation

@jfinkels
Copy link
Collaborator

@jfinkels jfinkels commented Apr 20, 2025

This pull request fixes several bugs in csplit by rewriting the code to have one function per pattern type. The functions are

  • up_to_line_repeat for patterns like 123 {45}
  • up_to_line_always for 123 {*}
  • up_to_match_pos_offset_repeat for /regex/+123 {45}
  • up_to_match_pos_offset_always for /regex/+123 {*}
  • up_to_match_neg_offset_repeat for /regex/-123 {45}
  • up_to_match_neg_offset_always for /regex/-123 {*}
  • skip_to_match_pos_offset_repeat for %regex%+123 {45}
  • skip_to_match_pos_offset_always for %regex%+123 {*}
  • skip_to_match_neg_offset_repeat for %regex%-123 {45}
  • skip_to_match_neg_offset_always for %regex%-123 {*}

The command-line options, global state, and per-chunk state live in the Splitter struct (which replaces the SplitWriter and InputSplitter structs). There are a ton of ugly corner cases where different types of patterns interact. I've tried to call those out with comments.

I added some tests that were already passing, but were missing test coverage, including

  • same regex pattern twice in a row, as in /foo/ /foo/
  • line number out of range on empty input, as in : | csplit - 1,
  • skip to match with negative offset repeating indefinitely, as in csplit %foo%-5 {*}.

Some bugs fixed include:

Some bugs remain:

  • The GNU test file tests/csplit/csplit-suppress-matched remains failing. I'm hoping that the change in this pull request will make it easier to reason about csplit and fix the remaining failures.
  • Probably related to the previous item, there is some weird behavior of GNU csplit that I could not understand. This was already partially covered by a test named repeat_everything. I've extracted just the relevant part of that into its own test named test_repeat_line_after_match_suppress_matched that has TODO comments for the behavior that differs from GNU csplit.

Fixes #7286

@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

1 similar comment
@github-actions
Copy link

GNU testsuite comparison:

GNU test failed: tests/misc/tee. tests/misc/tee is passing on 'main'. Maybe you have to rebase?
Skip an intermittent issue tests/timeout/timeout (fails in this run but passes in the 'main' branch)

@sylvestre
Copy link
Contributor

It is still in draft mode :)

@jfinkels jfinkels force-pushed the csplit-simplify-2 branch from 89302d5 to 257f26f Compare April 27, 2025 02:39
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/misc/tee (passes in this run but fails in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@jfinkels jfinkels force-pushed the csplit-simplify-2 branch from 7f0ef59 to 77d8a07 Compare April 27, 2025 18:04
@github-actions
Copy link

GNU testsuite comparison:

Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

This pull request fixes several bugs in `csplit` by rewriting the code
to have one function per pattern type. The functions are

* `up_to_line_loop` for patterns like `123 {45}` and `123 {*}`,
* `up_to_match_pos_offset_repeat` for `/regex/+123 {45}`,
* `up_to_match_pos_offset_always` for `/regex/+123 {*}`,
* `up_to_match_neg_offset_repeat` for `/regex/-123 {45}`,
* `up_to_match_neg_offset_always` for `/regex/-123 {*}`,
* `skip_to_match_pos_offset_repeat` for `%regex%+123 {45}`,
* `skip_to_match_pos_offset_always` for `%regex%+123 {*}`,
* `skip_to_match_neg_offset_repeat` for `%regex%-123 {45}`,
* `skip_to_match_neg_offset_always` for `%regex%-123 {*}`,

The command-line options, global state, and per-chunk state live in the
`Splitter` struct (which replaces the `SplitWriter` and `InputSplitter`
structs). There are a ton of ugly corner cases where different types of
patterns interact; most of these should be called out with comments.

I added some tests that were already passing, but were missing test
coverage, including

* same regex pattern twice in a row, as in `/foo/ /foo/`
* line number out of range on empty input, as in `: | csplit - 1`,
* skip to match with negative offset repeating indefinitely, as in
  `csplit %foo%-5 {*}`.

Some bugs fixed include:

* missing a regex match on the final line of a line number pattern, as
  in `seq 20 | csplit - 10 /10/`,
* missing output `0\n` when input is a directory,
* uutils#7286, a missing empty file with `--suppress-matched`.

Some bugs remain:

* The GNU test file `tests/csplit/csplit-suppress-matched` remains
  failing. I'm hoping that the change in this pull request will make it
  easier to reason about csplit and fix the remaining failures.
* Probably related to the previous item, there is some weird behavior of
  GNU csplit that I could not understand. This was already partially
  covered by a test named `repeat_everything`. I've extracted just the
  relevant part of that into its own test named
  `test_repeat_line_after_match_suppress_matched` that has TODO comments
  for the behavior that differs from GNU `csplit`.

Fixes uutils#7286
@jfinkels jfinkels force-pushed the csplit-simplify-2 branch from 77d8a07 to 6e9d4f8 Compare April 27, 2025 20:33
@github-actions
Copy link

GNU testsuite comparison:

Skip an intermittent issue tests/misc/tee (fails in this run but passes in the 'main' branch)
Skipping an intermittent issue tests/timeout/timeout (passes in this run but fails in the 'main' branch)

@sylvestre sylvestre requested a review from Copilot April 27, 2025 21:19
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors csplit by rewriting pattern handling into dedicated functions per pattern type and fixes several bugs. Key changes include:

  • Refactored pattern matching logic to use one function per pattern type.
  • Expanded test coverage to validate various edge cases and updated expected outputs.
  • Consolidated global and per-chunk state in the Splitter struct.

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
tests/by-util/test_csplit.rs New and updated tests to cover edge cases and validate new behavior.
src/uu/csplit/src/patterns.rs Simplified pattern execution by replacing an iterator with a function.
Comments suppressed due to low confidence (3)

tests/by-util/test_csplit.rs:1558

  • [nitpick] The Windows section in test_directory_input_file contains a commented-out stdout expectation. Please clarify the intended behavior on Windows to ensure the test covers the proper output.
        // .stdout_is("0\n")

src/uu/csplit/src/patterns.rs:51

  • [nitpick] Consider renaming the parameter 'i' to a more descriptive name, such as 'iteration_count', for improved readability.
pub(crate) fn should_continue(&self, i: usize) -> bool {

tests/by-util/test_csplit.rs:1388

  • The change in test_line_num_range_with_up_to_match3 from failing to succeeding, with output '18\n0\n123\n', differs from previous behavior. Confirm that this behavior change is intentional and well documented.
        .succeeds()

@jfinkels
Copy link
Collaborator Author

Sorry, I was hoping to find a way to make this easier to review, but I couldn't. I will come back to this and rebase it

naoNao89 added a commit to naoNao89/coreutils that referenced this pull request Sep 17, 2025
…ixes uutils#7286)

Problem
- With , uutils csplit failed to create the final empty output file when the last split point consumed the trailing input.
- GNU csplit always creates a final segment after processing patterns; with , that empty final segment is elided.
- This broke the GNU test (tests/csplit/csplit-suppress-matched.pl) and the scenario:

Root cause
- Final-file creation was conditional on there being remaining input after pattern processing. If none remained, no final file was created, contrary to GNU semantics.

Fix (minimal, targeted)
- In , after :
  - If there is remaining input, always create a final split and copy the remainder, then finish.
  - Else, if all patterns were integer-based and  is set, create a final (possibly empty) split and finish;  elides it when  is set.

Tests (Rust integration)
- Added two Rust tests under  to lock down GNU-compatible behavior:
  -  (expects sizes 2,2,2,0 and a final empty )
  -  (final empty file is correctly elided)

Verification
- All  tests pass locally. The originally reported case now matches GNU.

Relation to PR uutils#7806
- uutils#7806 proposes a broader refactor to fix multiple issues, but remains a draft and notes remaining GNU suppress-matched differences.
- This PR provides a small, reviewable fix specifically for uutils#7286, plus precise integration tests to safeguard behavior.

Fixes uutils#7286
naoNao89 added a commit to naoNao89/coreutils that referenced this pull request Sep 17, 2025
…ixes uutils#7286)

Problem
- With , uutils csplit failed to create the final empty output file when the last split point consumed the trailing input.
- GNU csplit always creates a final segment after processing patterns; with , that empty final segment is elided.
- This broke the GNU test (tests/csplit/csplit-suppress-matched.pl) and the scenario:

Root cause
- Final-file creation was conditional on there being remaining input after pattern processing. If none remained, no final file was created, contrary to GNU semantics.

Fix (minimal, targeted)
- In , after :
  - If there is remaining input, always create a final split and copy the remainder, then finish.
  - Else, if all patterns were integer-based and  is set, create a final (possibly empty) split and finish;  elides it when  is set.

Tests (Rust integration)
- Added two Rust tests under  to lock down GNU-compatible behavior:
  -  (expects sizes 2,2,2,0 and a final empty )
  -  (final empty file is correctly elided)

Verification
- All  tests pass locally. The originally reported case now matches GNU.

Relation to PR uutils#7806
- uutils#7806 proposes a broader refactor to fix multiple issues, but remains a draft and notes remaining GNU suppress-matched differences.
- This PR provides a small, reviewable fix specifically for uutils#7286, plus precise integration tests to safeguard behavior.

Fixes uutils#7286
@sylvestre
Copy link
Contributor

it has been draft for a long time, please reopen when ready

@sylvestre sylvestre closed this Oct 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

csplit: --suppress-matched incorrectly elides last empty file

2 participants