Fix duplicate word typos in comments and documentation#4658
Fix duplicate word typos in comments and documentation#4658Liam-DeVoe merged 8 commits intoHypothesisWorks:masterfrom
Conversation
Combined fixes for: - database.py: "the the" -> "the" - shrinker.py: "to to" -> "to" - changelog.rst: Multiple instances of duplicate "the" and "a"
|
Let's add a check to our It looks like the CI failures are due to a formatting issue and an unexpected Z3 build failing, tagging @Liam-DeVoe re #4646. |
Add a `git grep` check for duplicate words (e.g. "the the") to the `./build.sh lint` rule, and fix a redundant-parentheses formatting issue in shrinker.py that was failing CI check-format. Co-Authored-By: Claude Opus 4.6 <[email protected]>
| sys.exit(1) | ||
|
|
||
| dupes = subprocess.run( |
There was a problem hiding this comment.
let's not skip the duplicates-check if we happen to violate this too; we can exit(1) just once at the end.
We'll also want to fix existing cases, by either deduplicating them or rephrasing as appropriate.
There was a problem hiding this comment.
Done - both points addressed in 6ef9519:
- Restructured to use a
failedflag so both the dataclass check and the duplicate-word check run before a singleexit(1)at the end. - All existing duplicate-word cases across the codebase have been fixed (deduplicated or rephrased).
One small deviation from the suggested pattern: I used ([a-z]{2,}) instead of ([a-z]+) to require at least 2-letter words. With ([a-z]+), there's a false positive in test_patching.py:37 where n n matches across an escaped newline in a string literal (",\n) followed by n=100). That case can't meaningfully be rephrased since it's test data representing actual code formatting. The {2,} minimum avoids single-letter false positives like this while still catching all real duplicate-word typos.
Restructure the duplicate-word check in lint() to set a failure flag instead of calling sys.exit(1) immediately, so all checks run before exiting. Fix all existing duplicate-word cases across the codebase by deduplicating or rephrasing as appropriate. Also require 2+ letter words in the duplicate-word pattern to avoid false positives from single-letter matches in escape sequences. Co-Authored-By: Claude Opus 4.6 <[email protected]>
|
Personally I disagree with adding a lint rule for this because there's no way to locally disable it, and the computer saying that my personal style of writing must be changed to satisfy an arbitrary rule really puts me off. AI is already good enough to catch this stuff (see: this pr), so let's just wait for such a tool to be available in CI and then run it every few months? Or wait for drive-by contributors? Idk this seems very not worth the CI rule to me. |
|
I think we could whitelist |
|
Sure, I don't have very strong feelings about this. I'd lean towards adding a grep -v or a negative-lookbehind to exclude allowed-duplicate words, but if you want to take out the check entirely that's fine with me. @Liam-DeVoe tag me if you need an approve on your push to merge or something. |
|
Thanks for the discussion @Zac-HD @Liam-DeVoe! Given the concerns about false positives with legitimate duplicates ('that that', etc.) and no way to locally disable a grep-based check, I think the simplest approach is to just keep the typo fixes in this PR without adding an automated lint rule. Happy to add a check with exclusions if you'd prefer, but dropping it seems like the path of least resistance here. |
|
Based on the discussion above, it sounds like the consensus is to keep just the typo fixes and drop the automated lint rule. I'll push an update to remove the duplicate-word check from |
Per discussion with Zac-HD and Liam-DeVoe, the automated duplicate-word grep check has too many false positives for legitimate constructs like "that that" and no way to locally disable it. Keeping just the typo fixes, dropping the lint rule. Co-Authored-By: Claude Opus 4.6 <[email protected]>
Liam-DeVoe
left a comment
There was a problem hiding this comment.
Let's keep the CI check with just the following unambiguously-incorrect double words whitelisted: the as a to because. That gets ~70% of this PR, and we'll just accept the rest as casualties.
I was tempted to use a blacklist, but these two diffs make me think a generic blacklist is not effective enough:
…ste changes Per Liam-DeVoe and Zac-HD's feedback: - Revert the "that that" -> "the" change in api-style.rst (intentional construct) - Re-add duplicate-word lint check, but only for unambiguously incorrect duplicates: the, as, a, to, because - Use deferred exit so all lint issues are reported before failing
This PR fixes several duplicate word typos found in comments and documentation:
Changes:
database.py: Fixed "the the" → "the" in GitHubArtifactDatabase commentshrinker.py: Fixed "to to" → "to" in shrinker commentchangelog.rst: Fixed multiple instances of duplicate "the" and "a" in historical changelog entriesThis combines fixes from PRs #4655, #4656, and #4657 as requested by the maintainer.