I should have marked this as a draft as I didn't intend to merge it just yet. I've restored the branch so I can commit some more to it, but there's no need to revert the merge. |
Or recognizing both omw-en and omw-en30 as WordNet 3.0? |
That may be easier in the consumer application (e.g., Wn), where I can probably just create a second index entry pointing to the same file. The |
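A minimal sketch of the "second index entry" idea, assuming a consumer keeps a simple in-memory mapping from lexicon id to resource record. The `index` structure, the `add_alias` helper, and the file path are hypothetical and do not reflect Wn's actual index format:

```python
# Hypothetical index mapping a lexicon id to the record of the file providing it.
# The ids come from this discussion; the path and layout are illustrative only.
index = {
    "omw-en": {
        "path": "omw-en/omw-en.xml",  # assumed location, not the real one
        "label": "OMW English Wordnet based on WordNet-3.0",
    },
}


def add_alias(index, alias, target):
    """Register `alias` as a second entry sharing the record of `target`."""
    if target not in index:
        raise KeyError(f"unknown target id: {target}")
    if alias in index:
        raise ValueError(f"id already registered: {alias}")
    # Both ids point at the same record object, so no data is duplicated.
    index[alias] = index[target]


add_alias(index, "omw-en30", "omw-en")
```

Because both keys share one record, a lookup under either id resolves to the same file, and the data distribution itself keeps a single identifier.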
|
It is better when the name tells you what's in the data. A name like omw-en has poor information value, compared to omw-pwn30 or omw-oewn31. Concerning the use of aliases, the situation is indeed different for a data distribution like OMW-data, compared to a downstream library like Wn. For example, a recent nltk PR #3378 makes it easy for users to install any wordnet and call it what they want. A versatile approach like that can be nice in an application, but less so in a data distribution. |
I'm not too concerned about the "information value" of the identifier. The XML has a
<LexicalResource xmlns:dc="https://globalwordnet.github.io/schemas/dc/">
  <Lexicon id="omw-en"
           label="OMW English Wordnet based on WordNet-3.0"
           language="en"
           email="[email protected]"
           license="https://wordnet.princeton.edu/license-and-commercial-use"
           version="2.0"
           url="https://github.com/omwn/omw-data"
           citation="Christiane Fellbaum (1998, ed.) *WordNet: An Electronic Lexical Database*. MIT Press.">
Also, we had to change the lexicon so that it no longer mentions PWN or the Princeton WordNet, because those names refer only to the original WNDB files (which the NLTK reads) and not the WN-LMF derivatives (which Wn reads). I'm more concerned with confusion when using the identifiers with the OMW version, e.g.:
>>> import wn
>>> en = wn.Wordnet("omw-en:2.0")
The
>>> en20 = wn.Wordnet("omw-en20:2.0") |
|
You just showed that omw-en20 is a more informative id than omw-en. In my opinion, the same applies to omw-en30. That's information value, preventing confusion. |
|
... My point is not that the identifier has no information value, it's that there are other attributes with more and clearer information about the source data. To be clear, we've never released the lexicon derived from WordNet 3.0 as
>>> from nltk.corpus import wordnet30
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    from nltk.corpus import wordnet30
ImportError: cannot import name 'wordnet30' from 'nltk.corpus' (...). Did you mean: 'wordnet'?
but does allow this:
>>> from nltk.corpus import wordnet
>>> wordnet.get_version()
'3.0' |
|
You're completely right, @goodmami: nltk has the same issue, and well-established habits are unlikely to change. |
That's fair. I don't think we'll make this change in the data for 2.0, but if you think it's something we should consider for a future release, please raise a new issue so we can track it. The comments here will become harder to find when we're done with the PR. |
|
Thanks, there is no need for a new issue. As you wrote earlier "I no longer think the benefit of clarity and consistency outweighs the disruption caused by the change". |
Pushing more commits to the merged branch may cause some confusion, so I pushed a new branch release-2.0-pt2 and created a draft pull request #62. |
This branch is for final changes before producing the release.
omw-en instead of omw-en30 for WordNet 3.0. I no longer think the benefit of clarity and consistency outweighs the disruption caused by the change.