Release 2.0 #61

Merged
fcbond merged 1 commit into main from release-2.0
May 26, 2025

@goodmami
Collaborator

This branch is for final changes before producing the release.

  • Go back to using omw-en instead of omw-en30 for WordNet 3.0. I no longer think the benefit of clarity and consistency outweighs the disruption caused by the change.

@fcbond fcbond merged commit d515b98 into main May 26, 2025
@fcbond fcbond deleted the release-2.0 branch May 26, 2025 15:06
@goodmami goodmami restored the release-2.0 branch May 27, 2025 00:43
@goodmami
Collaborator Author

I should have marked this as a draft as I didn't intend to merge it just yet. I've restored the branch so I can commit some more to it, but there's no need to revert the merge.

@ekaf
Contributor

ekaf commented May 27, 2025

Go back to using omw-en instead of omw-en30 for WordNet 3.0

Or recognizing both omw-en and omw-en30 as WordNet 3.0?

@goodmami
Collaborator Author

Or recognizing both omw-en and omw-en30 as WordNet 3.0?

That may be easier in the consumer application (e.g., Wn), where I can probably just create a second index entry pointing to the same file. The index.toml file in this repository is used mainly to support building the WN-LMF data from .tab files, so I would need to create an alias mechanism or something. The alternative I don't prefer is to actually create two near-identical files, but the lexicon's ID would be omw-en30 instead of omw-en, as well as in all identifiers (e.g., omw-en30-woodworker-n in one and omw-en-woodworker-n in the other).
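
The "second index entry pointing to the same file" idea can be sketched roughly as follows. This is only an illustration: the `INDEX` and `ALIASES` tables, the file names, and the `resolve` helper are hypothetical and do not reflect the actual layout of index.toml or the internals of Wn.

```python
# Hypothetical index mapping lexicon IDs to data files.
# The file names are made up for illustration.
INDEX = {
    "omw-en": "omw-en.xml",
    "omw-en20": "omw-en20.xml",
}

# Hypothetical alias table: an alternative ID that points at an existing entry.
ALIASES = {
    "omw-en30": "omw-en",
}


def resolve(lexicon_id):
    """Return the data file for a lexicon ID, following at most one alias."""
    canonical = ALIASES.get(lexicon_id, lexicon_id)
    return INDEX[canonical]


print(resolve("omw-en30"))
print(resolve("omw-en"))
```

Note that an alias of this kind only changes how the file is looked up. The lexicon ID and the identifiers inside the file (e.g., omw-en-woodworker-n) would still use omw-en.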

@ekaf
Contributor

ekaf commented May 28, 2025

It is better when the name tells you what's in the data. A name like omw-en has poor information value, compared to omw-pwn30 or omw-oewn31.

Concerning the use of aliases, the situation is indeed different for a data distribution like OMW-data, compared to a downstream library like Wn. For example, the recent NLTK PR #3378 makes it easy for users to install any wordnet and call it whatever they want. A versatile approach like that can be nice in an application, but less so in a data distribution.

@goodmami
Collaborator Author

It is better when the name tells you what's in the data. A name like omw-en has poor information value, compared to omw-pwn30 or omw-oewn31.

I'm not too concerned about the "information value" of the identifier. The XML has a label attribute that is more descriptive:

<LexicalResource xmlns:dc="https://globalwordnet.github.io/schemas/dc/">
  <Lexicon id="omw-en"
           label="OMW English Wordnet based on WordNet-3.0"
           language="en"
           email="[email protected]"
           license="https://wordnet.princeton.edu/license-and-commercial-use"
           version="2.0"
           url="https://github.com/omwn/omw-data"
           citation="Christiane Fellbaum (1998, ed.) *WordNet: An Electronic Lexical Database*. MIT Press.">
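
For readers who want to check the label themselves, here is a minimal sketch that reads the id, version, and label attributes using Python's standard library. The `SNIPPET` string is a trimmed, self-closed copy of the excerpt above (the email, license, url, and citation attributes are left out), and `describe_lexicons` is a hypothetical helper, not part of Wn or the NLTK.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the excerpt above, closed so that it parses on its own.
SNIPPET = """
<LexicalResource xmlns:dc="https://globalwordnet.github.io/schemas/dc/">
  <Lexicon id="omw-en"
           label="OMW English Wordnet based on WordNet-3.0"
           language="en"
           version="2.0"/>
</LexicalResource>
""".strip()


def describe_lexicons(xml_text):
    """Return (id, version, label) for every Lexicon element in the text."""
    root = ET.fromstring(xml_text)
    return [
        (lex.get("id"), lex.get("version"), lex.get("label"))
        for lex in root.iter("Lexicon")
    ]


for lex_id, version, label in describe_lexicons(SNIPPET):
    print(f"{lex_id}:{version} -> {label}")
```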

Also, we had to stop mentioning PWN or the Princeton WordNet in the lexicon, because those names refer only to the original WNDB files (which the NLTK reads) and not to the WN-LMF derivatives (which Wn reads).

I'm more concerned with confusion when using the identifiers with the OMW version, e.g.:

>>> import wn
>>> en = wn.Wordnet("omw-en:2.0")

The 2.0 above is the version of the OMW, but it loads the data derived from WordNet 3.0. To get data from WordNet 2.0, you'd do:

>>> en20 = wn.Wordnet("omw-en20:2.0")
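
To make the two parts of such a specifier explicit, here is a small sketch that splits "id:version" strings. The `split_specifier` helper is hypothetical and written only for illustration; it is not how Wn parses specifiers internally.

```python
from typing import Optional, Tuple


def split_specifier(spec: str) -> Tuple[str, Optional[str]]:
    """Split 'lexicon-id:version' into its parts; version is None if absent."""
    lexicon_id, sep, version = spec.partition(":")
    return lexicon_id, (version if sep else None)


# In both cases the version part is the OMW release, not the WordNet release;
# only the lexicon ID hints at the underlying WordNet version.
print(split_specifier("omw-en:2.0"))
print(split_specifier("omw-en20:2.0"))
```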

@ekaf
Contributor

ekaf commented May 29, 2025

You just showed that omw-en20 is a more informative id than omw-en. In my opinion, the same applies to omw-en30. That's information value, preventing confusion.

@goodmami
Collaborator Author

... My point is not that the identifier has no information value, it's that there are other attributes with more and clearer information about the source data.

To be clear, we've never released the lexicon derived from WordNet 3.0 as omw-en30. I changed it in the repository here and then changed it back before making a release because, even though it is more explicit (helping resolve confusion about omw-en:2.0 as described above), I decided it would be more confusing to change years of precedent. Probably for the same reason, the NLTK doesn't allow this:

>>> from nltk.corpus import wordnet30
Traceback (most recent call last):
  File "<python-input-0>", line 1, in <module>
    from nltk.corpus import wordnet30
ImportError: cannot import name 'wordnet30' from 'nltk.corpus' (...). Did you mean: 'wordnet'?

but does allow this:

>>> from nltk.corpus import wordnet
>>> wordnet.get_version()
'3.0'

@ekaf
Contributor

ekaf commented May 30, 2025

You're completely right @goodmami, nltk has the same issue, and well-established habits are unlikely to change.
My point is that while it was clearer in the past what "the English Wordnet" meant, there is now a greater need of future-proofing by using more explicit identifiers.

@goodmami
Collaborator Author

My point is that while it was clearer in the past what "the English Wordnet" meant, there is now a greater need of future-proofing by using more explicit identifiers.

That's fair. I don't think we'll make this change in the data for 2.0, but if you think it's something we should consider for a future release, please raise a new issue so we can track it. The comments here will become harder to find when we're done with the PR.

@ekaf
Contributor

ekaf commented May 30, 2025

Thanks, there is no need for a new issue. As you wrote earlier "I no longer think the benefit of clarity and consistency outweighs the disruption caused by the change".

@goodmami goodmami deleted the release-2.0 branch June 2, 2025 19:32
@goodmami
Collaborator Author

goodmami commented Jun 2, 2025

I should have marked this as a draft as I didn't intend to merge it just yet. I've restored the branch so I can commit some more to it, but there's no need to revert the merge.

Pushing more commits to the merged branch may cause some confusion, so I pushed a new branch release-2.0-pt2 and created a draft pull request #62.
