Add tsv-duplicates.py script by goodmami · Pull Request #33 · omwn/omw-data

goodmami · 2023-04-06T06:20:59Z

This pull request adds a script for detecting potential duplicate lemmas in OMW .tab files. We can also use this PR to fix the duplicate issues.

There are 3 kinds of duplicates detected:

underscores, e.g., canard colvert and canard_colvert
case differences, e.g., Renaissance and renaissance
diacritics, e.g., nanoséconde and nanoseconde

For all of the above, duplicates are only detected by normalizing the lemma forms within a single synset. There may be duplicate synsets, but the script does not test for these. But here's an example of two synsets that may ~~be duplicates~~ have redundant lemmas:

WARNING:tsv-duplicates:duplicate of 11640645-n: 'sequoia', 'séquoia'
WARNING:tsv-duplicates:duplicate of 11640898-n: 'sequoia', 'séquoia'
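
As a rough illustration of the three normalizations listed above, here is a minimal sketch. It is not the code in tsv-duplicates.py; the function names `normalize` and `find_duplicates` and their keyword arguments are hypothetical, chosen only to mirror the command-line options.

```python
import unicodedata
from collections import defaultdict


def normalize(lemma, ignore_case=True, underscore=True, diacritics=True):
    """Return a comparison key for a lemma (hypothetical helper)."""
    if underscore:
        # Treat 'canard_colvert' and 'canard colvert' as the same form.
        lemma = lemma.replace("_", " ")
    if ignore_case:
        # Treat 'Renaissance' and 'renaissance' as the same form.
        lemma = lemma.casefold()
    if diacritics:
        # Decompose accented characters, then drop the combining marks,
        # so 'nanoséconde' and 'nanoseconde' share a key.
        decomposed = unicodedata.normalize("NFD", lemma)
        lemma = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return lemma


def find_duplicates(lemmas, **options):
    """Group the lemmas of ONE synset by normalized form.

    Returns only the groups that contain more than one original form.
    """
    groups = defaultdict(list)
    for lemma in lemmas:
        groups[normalize(lemma, **options)].append(lemma)
    return [forms for forms in groups.values() if len(forms) > 1]
```

Because the grouping is done per synset, the same pair of forms can be reported once for each synset that contains it, as with the two sequoia warnings.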

You can run the script as follows:

$ python scripts/tsv-duplicates.py --ignore-case --underscore --diacritics wns/{arb,fra,msa}/*.tab
wn-data-arb.tab duplicates	1502 synsets	3013 lemmas
wn-nodia-arb.tab duplicates	176 synsets	355 lemmas
wn-data-fra.tab duplicates	3821 synsets	7702 lemmas
wn-data-ind.tab duplicates	466 synsets	934 lemmas
wn-data-zsm.tab duplicates	366 synsets	732 lemmas
total duplicates	6331 synsets	12736 lemmas

It takes a variable number of paths, so you can check one at a time or many at once. The --verbose option will print a warning for every duplicate it finds (best when only checking a single .tab file).

ekaf · 2024-10-31T09:32:24Z

The two sequoia synsets are not duplicates: one is for the "tree" and the other denotes the "wood" (i.e. material to make furniture). The issue (globalwordnet/english-wordnet#78) was discussed in OEWN some years ago.

fcbond · 2024-10-31T09:55:21Z

Hi, we are not saying that 11640645-n and 11640898-n are duplicates, but that within each one 'sequoia' and 'séquoia' are duplicates (or at least variants of the same word).

ekaf · 2024-10-31T12:06:28Z

Thanks @fcbond, I should have made it clearer that I was responding to the opening comment above, where @goodmami expresses his sentiment that these two synsets might be duplicates, which they aren't:

There may be duplicate synsets, but the script does not test for these. But here's an example of two synsets that may be duplicates:

WARNING:tsv-duplicates:duplicate of 11640645-n: 'sequoia', 'séquoia'
WARNING:tsv-duplicates:duplicate of 11640898-n: 'sequoia',

goodmami · 2025-01-21T03:58:19Z

Apologies, @ekaf, my wording was imprecise and my example ill-chosen. @fcbond is correct that the script looks for near-duplicate lemmas within a synset and not for multiple synsets that are duplicates of each other (I've updated the issue text above to hopefully make this clearer). The idea is that the TSV files have some lemmas that are only trivially different within the same synset, and that we'd rather not keep all of them. This example might be more illustrative:

WARNING:tsv-duplicates:duplicate of 15277118-n: 'mortalite', 'mortalité'

If we look at all the lemmas for that synset, we see two others that are more interestingly different:

$ grep 15277118-n wns/fra/wn-data-fra.tab
15277118-n	fra:lemma	taux de mortalité
15277118-n	fra:lemma	mortalite
15277118-n	fra:lemma	morbidité
15277118-n	fra:lemma	mortalité

My guess is that the mortalite without diacritics is redundant and can be removed from the TSV file.
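
To make the check above concrete, here is a small sketch that groups the lemma rows of a .tab file by synset and lists unaccented forms that match an accented lemma in the same synset. It assumes the three-column layout shown in the grep output (synset, `lang:lemma`, lemma) and that header lines start with `#`; the function names are hypothetical. It only proposes candidates for review and does not decide whether a form is redundant or simply wrong.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text):
    """Remove combining marks after canonical decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def lemmas_by_synset(lines):
    """Map each synset ID to its lemmas, skipping non-lemma rows."""
    synsets = defaultdict(list)
    for line in lines:
        if line.startswith("#"):  # assumed header/comment convention
            continue
        parts = line.rstrip("\n").split("\t")
        if len(parts) == 3 and parts[1].endswith(":lemma"):
            synsets[parts[0]].append(parts[2])
    return synsets


def unaccented_candidates(lemmas):
    """Lemmas without diacritics that equal another lemma once stripped."""
    stripped_accented = {
        strip_diacritics(lemma)
        for lemma in lemmas
        if strip_diacritics(lemma) != lemma
    }
    return [
        lemma
        for lemma in lemmas
        if strip_diacritics(lemma) == lemma and lemma in stripped_accented
    ]


rows = [
    "15277118-n\tfra:lemma\ttaux de mortalité",
    "15277118-n\tfra:lemma\tmortalite",
    "15277118-n\tfra:lemma\tmorbidité",
    "15277118-n\tfra:lemma\tmortalité",
]
for synset, lemmas in lemmas_by_synset(rows).items():
    print(synset, unaccented_candidates(lemmas))
```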

ekaf · 2025-01-22T16:49:47Z

@goodmami, rather than just redundant, "the mortalite without diacritics" is incorrect. But, as you wrote earlier, only native lexicographers can make such corrections:

I don't think we can get around having human annotators to fix the upper/lower case, diacritics, and plurals.

Maybe a spell checker could detect some incorrect forms, but any orthographic editing would need to be approached with great caution.

goodmami · 2025-01-30T01:30:52Z

I created #48 only for the scripts so we can move forward with that. The modifications to the data should happen in other PRs.

I think this PR should be closed without merging since the commits adding the scripts would cause conflicts. The manual modifications to the Icelandic wordnet could be cherry-picked for a new PR. Alternatively, we could repurpose this PR with a force-push that rebases without those commits. Let me know if you want help with either of those options.

goodmami · 2025-04-24T17:59:51Z

@fcbond I want to close this PR but I don't want to lose the manual fixes you made to isl. Unfortunately, git cherry-pick 0881ed9394a53d78ae3781f4bc01c536b1e9ea2c results in a conflict because I've already removed redundant lemmas in #50 and your changes affect the same lines. Furthermore, it looks like you sorted the tab file, so it's hard to identify which were meaningful changes and which were just movements.

Perhaps you can just re-run your isl2tab.py script on a fresh branch, followed by:

$ python3 -m scripts.clean-tsv --in-place wns/isl/wn-data-isl.tab

fcbond · 2025-05-22T19:20:38Z

Done! Sorry for the long delay. It was a very old version of nltk, I had to make a few changes to the script.

I also fixed the entry for föðurafi, móðurafi "grandmother", and stripped off excess white space, so I think there is nothing left to clean :-)

9d4b7f8..930f9ee

goodmami · 2025-05-22T20:05:20Z

Ok great, thanks! I'll close this PR.

But why did you remove the isl-changes.tab file?

fcbond · 2025-05-22T20:08:18Z

Because after I fixed the wn-data-isl.tab, there were no more changes, ...

goodmami · 2025-05-22T20:18:44Z

Ok, so perhaps you saw it as a record of changes made since the data was imported into the TSV file, and since you've re-imported it with more cleaning steps, those changes are no longer needed? I was using it more as a record of changes since the last release of the OMW data, in which case those changes still exist.

That's fine. I don't think the logs were the best solution anyway. We can probably recreate them using a script that compares versions in git history (which makes storing the logs in git unnecessary).
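
One possible shape for such a comparison, as a sketch only: take the text of a .tab file at two revisions (for example via `git show <rev>:wns/isl/wn-data-isl.tab`) and diff the rows as sets. The function names and the `-`/`+` log format are hypothetical, and header lines are assumed to start with `#`.

```python
def tab_rows(text):
    """Return the set of data rows in a .tab file's text."""
    return {
        tuple(line.split("\t"))
        for line in text.splitlines()
        if line and not line.startswith("#")  # assumed header convention
    }


def change_log(old_text, new_text):
    """List removed rows (marked '-') and then added rows (marked '+')."""
    old, new = tab_rows(old_text), tab_rows(new_text)
    removed = [("-",) + row for row in sorted(old - new)]
    added = [("+",) + row for row in sorted(new - old)]
    return removed + added
```

Comparing sets rather than lines means that re-sorting the file does not show up as a change, which sidesteps the problem of telling meaningful edits apart from row movements.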

Add tsv-duplicates.py script

f638a35

goodmami mentioned this pull request Apr 6, 2023

Duplicates in Tab and LMF files #32

Closed

fcbond added 2 commits October 28, 2024 19:42

hand fixed some issues

0881ed9

remove the duplicates and save the new file in the build directory

5dbfc27

goodmami mentioned this pull request Jan 30, 2025

TSV cleanup scripts #48

Merged

goodmami mentioned this pull request Apr 29, 2025

Create a new release with some improvements (2.0) #31

Closed

goodmami closed this May 22, 2025

goodmami deleted the fix-32 branch May 22, 2025 20:05

goodmami wants to merge 3 commits into main from fix-32
