Also add files that log the changes made.
Hi all, I just want to understand the nature of the changes being proposed. As far as I understand, this repo consolidates data from different sources. Does it make sense to change the data here, or should changes be reported to each wordnet maintainer? Changing it here requires some warning about it.
Yes, although as I understand, OMW is the de facto or de jure maintainer of many of them. This could be because the upstream projects are no longer active (e.g., the URLs we have for MCR, the Finnish, and Icelandic wordnets no longer work; others just point back to https://github.com/omwn/omw-data) or Francis was a PI for their development (WordNet Bahasa, Chinese Open Wordnet, Japanese Wordnet, ...). I'm not really sure which ones are still tied to an upstream project, but @fcbond might.
I think it would be polite to notify upstream maintainers if they are contactable. I don't think it's required because they all have open licenses, and OMW is essentially distributing a fork of those projects.
Agreed, and the notice of these changes should be summarized in the release notes and point to the lists of changes that are part of this PR. But also note that this PR only removes very obvious redundant lemmas and does simple normalization of lemmas. The resulting lexicons should be essentially the same. E.g.:

exact duplicates:

 01448100-v als:lemma tërheqje
-01448100-v als:lemma tërheqje

duplicates after basic normalization:

 04467899-n fra:lemma bord de fuite
-04467899-n fra:lemma bord_de_fuite
...
 04539203-n fra:lemma terrarium
-04539203-n fra:lemma « terrarium »

normalized lemmas:

-09283193-n fra:lemma « fomites »
+09283193-n fra:lemma fomites
@fcbond, at least for this PR I'd like your approval before merging.
@fcbond ping, in case this got buried. You were active in some other threads, so maybe you have a spare moment to look at this?
This removes redundant lemmas with the following procedure:
a. Strip quote pairs at the beginning and end of the lemma
b. Replace each underscore in a lemma with a single space
c. Strip any spaces at the beginning and end of the lemma
d. Store the normalized form if it differs from the original
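
Steps a to d, together with the removal of lemmas that become identical after normalization, could be sketched roughly as follows. This is only an illustration, not the script used for this PR: the function names, the `QUOTE_PAIRS` list, the `(synset, relation, lemma)` row layout, and the shape of the log entries are all assumptions made for the example.

```python
# Hypothetical sketch of the normalization and deduplication steps.
# QUOTE_PAIRS, the function names, and the row/log layouts are assumptions,
# not taken from the actual PR.

QUOTE_PAIRS = [('"', '"'), ("'", "'"), ("«", "»"), ("“", "”"), ("‘", "’")]


def normalize_lemma(lemma: str) -> str:
    """Apply steps a-c to a single lemma and return the normalized form."""
    s = lemma
    # a. Strip one quote pair at the beginning and end of the lemma.
    for open_q, close_q in QUOTE_PAIRS:
        if len(s) >= 2 and s.startswith(open_q) and s.endswith(close_q):
            s = s[1:-1]
            break
    # b. Replace each underscore with a single space.
    s = s.replace("_", " ")
    # c. Strip whitespace at the beginning and end.
    return s.strip()


def dedupe(rows):
    """Normalize lemmas and keep one row per (synset, relation, lemma).

    rows: iterable of (synset, relation, lemma) tuples.
    Returns (kept_rows, log); each log entry is
    (action, synset, relation, original_lemma, normalized_lemma).
    """
    seen = set()
    kept, log = [], []
    for synset, relation, lemma in rows:
        norm = normalize_lemma(lemma)
        key = (synset, relation, norm)
        if key in seen:
            # Same string after normalization: only the first one is kept.
            log.append(("REMOVED", synset, relation, lemma, norm))
            continue
        seen.add(key)
        if norm != lemma:
            # d. Store the normalized form if it differs from the original.
            log.append(("MODIFIED", synset, relation, lemma, norm))
        kept.append((synset, relation, norm))
    return kept, log


if __name__ == "__main__":
    example = [
        ("04539203-n", "fra:lemma", "terrarium"),
        ("04539203-n", "fra:lemma", "« terrarium »"),
        ("09283193-n", "fra:lemma", "« fomites »"),
    ]
    kept_rows, changes = dedupe(example)
    for entry in changes:
        print(*entry, sep="\t")
```

In this sketch, which of two equivalent rows survives depends on input order (the first one seen is kept), so the log for a real wordnet file could differ from the one in this PR even when the resulting lexicon is the same.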
fra:lemma) are the same string after normalization, only one is kept.
REMOVED or MODIFIED. Wordnets without a log file did not have any changes.
Also note: