|
The two sequoia synsets are not duplicates: one is for the "tree" and the other denotes the "wood" (i.e. material to make furniture). The issue was discussed in OEWN some years ago. |
|
Hi,
we are not saying that 11640645-n and 11640898-n are duplicates, but that
within each one 'sequoia' and 'séquoia' are duplicates (or at least
variants of the same word).
On Thu, 31 Oct 2024 at 10:32, Eric Kafe wrote:
The two *sequoia* synsets are not duplicates: one is for the "tree" and
the other denotes the "wood" (i.e. material to make furniture). The issue
<globalwordnet/english-wordnet#78> was
discussed in OEWN some years ago.
|
|
Thanks @fcbond, I should have made it clearer that I was responding to the opening comment above, where @goodmami suggests that these two synsets might be duplicates, which they aren't:
|
|
Apologies, @ekaf, my wording was imprecise and my example ill-chosen. @fcbond is correct that the script looks for near-duplicate lemmas within a synset and not for multiple synsets that are duplicates of each other (I've updated the issue text above to hopefully make this clearer). The idea is that the TSV files have some lemmas that are only trivially different within the same synset, and we'd rather not keep all of them. This example might be more illustrative:
If we look at all the lemmas for that synset, we see two others that are more interestingly different:
$ grep 15277118-n wns/fra/wn-data-fra.tab
15277118-n fra:lemma taux de mortalité
15277118-n fra:lemma mortalite
15277118-n fra:lemma morbidité
15277118-n fra:lemma mortalité
My guess is that the mortalite without diacritics is redundant and can be removed from the TSV file. |
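To make the "only trivially different within the same synset" idea concrete, here is a minimal Python sketch. It is not the script from this PR: the function names are hypothetical, and it assumes the `.tab` rows are tab-separated as `synset`, `lang:lemma`, `lemma`. It groups lemmas by synset after stripping diacritics and case-folding, and reports groups with more than one surface form.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(form):
    """Drop combining marks after canonical decomposition (é -> e)."""
    decomposed = unicodedata.normalize("NFD", form)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def near_duplicates(lines):
    """Map (synset, normalized lemma) to the distinct forms sharing that key.

    Assumes tab-separated rows of the form: synset, lang:lemma, lemma.
    Rows that do not match that shape are skipped.
    """
    groups = defaultdict(list)
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3 or not parts[1].endswith(":lemma"):
            continue
        synset, _, lemma = parts
        key = (synset, strip_diacritics(lemma).casefold())
        if lemma not in groups[key]:
            groups[key].append(lemma)
    # Only keys with more than one surface form are potential duplicates.
    return {key: forms for key, forms in groups.items() if len(forms) > 1}


sample = [
    "15277118-n\tfra:lemma\ttaux de mortalité",
    "15277118-n\tfra:lemma\tmortalite",
    "15277118-n\tfra:lemma\tmorbidité",
    "15277118-n\tfra:lemma\tmortalité",
]
result = near_duplicates(sample)
```

On the four French rows above, only `mortalite` and `mortalité` fall into the same group. A check like this can only flag candidates; deciding which form to keep is still a job for someone who knows the language.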
|
@goodmami, rather than just redundant, "the mortalite without diacritics" is incorrect. But, as you wrote earlier, only native lexicographers can make such corrections:
Maybe a spell checker could detect some incorrect forms, but any orthographic editing would need to be approached with great caution. |
|
I created #48 only for the scripts so we can move forward with that. The modifications to the data should happen in other PRs. I think this PR should be closed without merging since the commits adding the scripts would cause conflicts. The manual modifications to the Icelandic wordnet could be cherry-picked for a new PR. Alternatively, we could repurpose this PR with a force-push that rebases without those commits. Let me know if you want help with either of those options. |
|
@fcbond I want to close this PR but I don't want to lose the manual fixes you made to Perhaps you can just re-run your
$ python3 -m scripts.clean-tsv --in-place wns/isl/wn-data-isl.tab
|
Ok great, thanks! I'll close this PR. But why did you remove the isl-changes.tab file? |
|
Because after I fixed the wn-data-isl.tab, there were no more changes, ...
On Thu, 22 May 2025 at 22:05, Michael Wayne Goodman wrote:
*goodmami* left a comment (omwn/omw-data#33)
<#33 (comment)>
Ok great, thanks! I'll close this PR.
But why did you remove the isl-changes.tab file?
|
|
Ok, so perhaps you saw it as a record of changes made since the data was imported into the TSV file, and since you've re-imported it with more cleaning steps, those changes are no longer needed? I was using it more as a record of changes since the last release of the OMW data, in which case those changes still exist. That's fine. I don't think the logs were the best solution anyway. We can probably recreate them using a script that compares versions in git history (which makes storing the logs in git unnecessary). |
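As a rough illustration of recreating such a log from git history, the sketch below reads one `.tab` file at two revisions with `git show REV:PATH` and reports which lemma rows were removed or added. The helper names are hypothetical, the revisions are whatever the caller supplies, and the same tab-separated `synset`, `lang:lemma`, `lemma` layout is assumed.

```python
import subprocess


def read_at_revision(rev, path):
    """Return the lines of `path` as stored at git revision `rev`."""
    completed = subprocess.run(
        ["git", "show", f"{rev}:{path}"],
        capture_output=True,
        encoding="utf-8",
        check=True,
    )
    return completed.stdout.splitlines()


def lemma_rows(lines):
    """Collect (synset, lemma) pairs from tab-separated lemma rows."""
    rows = set()
    for line in lines:
        parts = line.split("\t")
        if len(parts) == 3 and parts[1].endswith(":lemma"):
            rows.add((parts[0], parts[2]))
    return rows


def diff_lemmas(old_lines, new_lines):
    """Return (removed, added) lemma rows between two versions of a file."""
    old, new = lemma_rows(old_lines), lemma_rows(new_lines)
    return sorted(old - new), sorted(new - old)


# Example call inside a checkout (the revision names are placeholders):
#   removed, added = diff_lemmas(
#       read_at_revision("OLD_REV", "wns/isl/wn-data-isl.tab"),
#       read_at_revision("HEAD", "wns/isl/wn-data-isl.tab"),
#   )
```

Because the comparison is computed on demand, nothing extra needs to be stored in the repository.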
This pull request adds a script for detecting potential duplicate lemmas in OMW .tab files. We can also use this PR to fix the duplicate issues.
There are 3 kinds of duplicates detected:
For all of the above, duplicates are only detected by normalizing the lemma forms within a single synset. There may be duplicate synsets, but the script does not test for these. But here's an example of two synsets that may have redundant lemmas:
You can run the script as follows:
It takes a variable number of paths, so you can check one at a time or many at once. The --verbose option will print a warning for every duplicate it finds (best when only checking a single .tab file).