Remove redundant lemmas found with clean.sh #50

Merged
fcbond merged 1 commit into main from remove-duplicates on Feb 28, 2025

goodmami commented Feb 2, 2025

Also add files that log the changes made.

This removes redundant lemmas with the following procedure:

  1. Normalize the lemma
    a. Strip quote pairs at the beginning and end of the lemma
    b. Replace each underscore in a lemma with a single space
    c. Strip any spaces at the beginning and end of the lemma
    d. Store the normalized form if it differs from the original
  2. Remove Redundancies: If multiple lemmas with the same synset identifier (offset + pos) and lemma type (e.g., fra:lemma) are the same string after normalization, only one is kept.
  3. Log Changes: Write a log indicating which lemmas have been REMOVED or MODIFIED. Wordnets without a log file did not have any changes.
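
The procedure above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the contents of clean.sh: the function names, the list of quote pairs, the tuple layout, and the choice to keep the first occurrence of a duplicate are all hypothetical.

```python
# Hypothetical sketch of the procedure; none of these names come from clean.sh.
# Assumption: which quote characters count as a "quote pair" is a guess.
QUOTE_PAIRS = [("«", "»"), ("“", "”"), ('"', '"'), ("'", "'")]


def normalize(lemma):
    """Steps 1a-1c: strip one outer quote pair, underscores, outer spaces."""
    for opening, closing in QUOTE_PAIRS:
        if len(lemma) >= 2 and lemma.startswith(opening) and lemma.endswith(closing):
            lemma = lemma[1:-1]
            break
    # str.strip() removes any surrounding whitespace, including the
    # non-breaking spaces that often follow « and precede ».
    return lemma.replace("_", " ").strip()


def clean(rows):
    """rows: iterable of (synset_id, lemma_type, lemma) tuples.

    Returns (kept_rows, log_entries). Assumption: the first occurrence
    of each normalized lemma is the one that is kept.
    """
    seen = set()
    kept, log = [], []
    for synset, kind, lemma in rows:
        norm = normalize(lemma)
        key = (synset, kind, norm)
        if key in seen:
            # Step 2: same synset, same lemma type, same normalized string.
            log.append(("REMOVED", synset, kind, lemma))
            continue
        seen.add(key)
        if norm != lemma:
            # Step 1d: keep the normalized form and record the change.
            log.append(("MODIFIED", synset, kind, lemma, norm))
        kept.append((synset, kind, norm))
    return kept, log


# Rows taken from the examples quoted later in this thread.
rows = [
    ("01448100-v", "als:lemma", "tërheqje"),
    ("01448100-v", "als:lemma", "tërheqje"),
    ("04467899-n", "fra:lemma", "bord de fuite"),
    ("04467899-n", "fra:lemma", "bord_de_fuite"),
    ("09283193-n", "fra:lemma", "« fomites »"),
]
kept, log = clean(rows)
for entry in log:  # Step 3: one log entry per REMOVED or MODIFIED lemma.
    print("\t".join(entry))
```

Keying on the normalized string means that a lemma differing only in underscores or outer quotes is treated as a duplicate of the plain form, which matches the examples given in the discussion.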

@arademaker

Hi all,

I just want to understand the nature of the changes being proposed. As far as I understand this repo consolidates data from different sources. Does it make sense to change the data here or report to each wordnet maintainer? Changing here requires some warning about it.

goodmami commented Feb 2, 2025

> As far as I understand this repo consolidates data from different sources.

Yes, although as I understand it, OMW is the de facto or de jure maintainer of many of them. This could be because the upstream projects are no longer active (e.g., the URLs we have for MCR and for the Finnish and Icelandic wordnets no longer work; others just point back to https://github.com/omwn/omw-data) or because Francis was a PI for their development (WordNet Bahasa, Chinese Open Wordnet, Japanese Wordnet, ...). I'm not sure which ones are still tied to an upstream project, but @fcbond might know.

> Does it make sense to change the data here or report to each wordnet maintainer?

I think it would be polite to notify upstream maintainers if they can be contacted. I don't think it's required, because they all have open licenses and OMW is essentially distributing a fork of those projects.

> Changing here requires some warning about it.

Agreed. These changes should be summarized in the release notes, with a pointer to the lists of changes that are part of this PR.

But also note that this PR only removes obviously redundant lemmas and applies simple normalization to the rest. The resulting lexicons should be essentially the same. E.g.:

exact duplicates:

 01448100-v	als:lemma	tërheqje
-01448100-v	als:lemma	tërheqje

duplicates after basic normalization:

 04467899-n	fra:lemma	bord de fuite
-04467899-n	fra:lemma	bord_de_fuite
...
 04539203-n	fra:lemma	terrarium
-04539203-n	fra:lemma	« terrarium »

normalized lemmas:

-09283193-n	fra:lemma	« fomites »
+09283193-n	fra:lemma	fomites

goodmami requested a review from fcbond on February 11, 2025 19:17
@goodmami

@fcbond at least for this PR I'd like your approval before merging.

@goodmami

@fcbond ping, in case this got buried

You were active in some other threads so maybe you have a spare moment to look at this?

fcbond merged commit e0b1e21 into main on Feb 28, 2025
goodmami deleted the remove-duplicates branch on March 2, 2025 20:39