52 allow the input file to have counts and pronunciation #53

Merged
goodmami merged 10 commits into main from
52-allow-the-input-file-to-have-counts-and-pronunciation
Mar 8, 2025

Conversation

@fcbond
Contributor

@fcbond fcbond commented Mar 2, 2025

No description provided.

@fcbond fcbond linked an issue Mar 2, 2025 that may be closed by this pull request
goodmami added 2 commits March 3, 2025 12:35
A previous commit removed these lines, but this commit changes the
scripts/clean-tsv.py script to do the same thing and generates records
for the *-changes.tab files. I verified that this reproduced the
hand-committed changes by copying in earlier versions of the affected
files and running the cleaning script. The result had no diffs with
what's in Git and I was able to get the change records.

This commit also changes the clean.sh script so it appends to
the *-changes.tab files instead of overwriting them.
Also remove spurious whitespace
@goodmami
Collaborator

goodmami commented Mar 3, 2025

@fcbond I added a couple of commits (so far).

The first commit replicates the removal of blank entries using a script; the result is the same but now we have logs and can repeat the process in the future with clean.sh.

The second commit fixes the issue you had with lexicalized="false". The main problem was that the initially assembled objects are not the final objects sent for serialization, and the conversion from the former to the latter skipped synsets without any members and never set the lexicalized attribute. Now it works and we see things like this:

    <Synset id="omw-it-00029343-a" ili="i145" partOfSpeech="a" lexicalized="false">
      <Definition>a cui piace possedere</Definition>
    </Synset>

However, there are also lots of things like this:

    <Synset id="omw-it-00033206-a" ili="i161" partOfSpeech="a" lexicalized="false" members="omw-it-troppo_attivo-00033206-a" />

... which is the result of these lines in the .tab file:

00033206-a	ita:lemma	GAP!
00033206-a	ita:lemma	troppo attivo

What does it mean for something ostensibly not lexicalized to nevertheless have a lexical entry? Maybe the answer is in the MultiWordnet paper, but I could not find a freely-available version online.

@fcbond
Contributor Author

fcbond commented Mar 4, 2025

It means that the lexical entry is compositional: for example, an entry 'buckwheat noodles' for the synset for soba. A lexicon just for that language would not have it, but a multilingual lexicon may.

So 00033206-a overactive does not really exist in Italian, but the concept could be realized as troppo attivo "too active".

It's useful to have the entry from the point of view of a multi-lingual resource, but it is also nice to be able to note that it is not really something you would have in a monolingual lexicon.

As for MultiWordNet, which these Italian (and Hebrew) wordnets come from: they marked all entries not existing in a reference lexicon of Italian as lexicalized false (GAP or PSEUDOGAP).

@fcbond
Contributor Author

fcbond commented Mar 4, 2025

Should we maybe also mark it on the sense as well as on the synset? Strictly speaking we should, as otherwise it defaults to true. It exists at the sense level because we could have a concept that is lexicalized while one of its senses isn't. For example, if we wanted to have guitar guitar "real guitar" for acoustic guitar, ... I think no one uses this yet.

@goodmami
Collaborator

goodmami commented Mar 4, 2025

It means that the lexical entry is compositional

Ok, that makes sense. There are also a handful of synset groups that have GAP! alongside single-word entries:

00593071-a	ita:lemma	GAP!
00593071-a	ita:lemma	persistente
00593071-a	ita:lemma	che non dà tregua
00593071-a	ita:lemma	senza tregua

For reference:

>>> import wn
>>> en = wn.Wordnet('omw-en')
>>> en.synset('omw-en-00593071-s').definition()
'never-ceasing'
>>> en.synset('omw-en-00593071-s').lemmas()
['persistent', 'relentless', 'unrelenting']

We should maybe also mark it on the sense as well as the synset? Strictly speaking we should, as otherwise it defaults to True. It is there on the sense level because we could have a concept that is lexicalized, but one of the senses isn't.

Yes, I agree, and I think this can solve the problem above. The difficulty is identifying which entries are compositional and which may be fixed collocations or idioms. I just found this paper which explains the lexical gaps: http://www.lrec-conf.org/proceedings/lrec2000/pdf/236.pdf, and it seems like "simple word" is the first test (which I guess means a word without spaces), but there are many other tests which we cannot easily replicate.

So how about:

  1. If a synset group has GAP!, then the sense for any entry with spaces gets lexicalized="false"
  2. If all senses of the synset have lexicalized="false", then the synset gets lexicalized="false", too

I still have no idea how PSEUDOGAP! differs from GAP!. I guess we just treat it the same as GAP!.
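The two rules above could be sketched roughly like this. (This is a minimal illustration with hypothetical function names and data shapes, not the actual repository code; it treats PSEUDOGAP! cases the same way they would be once normalized to GAP!.)

```python
def mark_lexicalized(groups):
    """groups maps a synset id to its list of lemma strings,
    where the sentinel 'GAP!' marks a lexical-gap synset group."""
    sense_lex = {}   # (synset_id, lemma) -> bool
    synset_lex = {}  # synset_id -> bool
    for synset_id, lemmas in groups.items():
        has_gap = "GAP!" in lemmas
        entries = [lemma for lemma in lemmas if lemma != "GAP!"]
        for lemma in entries:
            # Rule 1: in a GAP! group, an entry containing spaces
            # gets lexicalized="false" on its sense
            sense_lex[(synset_id, lemma)] = not (has_gap and " " in lemma)
        # Rule 2: the synset is unlexicalized only if all of its
        # senses are unlexicalized (or it has no members at all)
        synset_lex[synset_id] = (
            any(sense_lex[(synset_id, lemma)] for lemma in entries)
            if entries else False
        )
    return sense_lex, synset_lex


# The two cases discussed in this thread:
groups = {
    "00033206-a": ["GAP!", "troppo attivo"],
    "00593071-a": ["GAP!", "persistente", "che non dà tregua", "senza tregua"],
}
senses, synsets = mark_lexicalized(groups)
```

Under these rules, 00033206-a ends up fully unlexicalized, while 00593071-a stays lexicalized because "persistente" is a single word.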

goodmami added 5 commits March 7, 2025 12:33
This commit has substantial changes to the module.
Before, it was trying to set sys.stderr as the logging destination when --log is
unset, which created a strange file.
@goodmami
Collaborator

goodmami commented Mar 8, 2025

I added some more commits:

  • Use logging instead of print calls (so we don't have to pass logfile around all the time)
  • Did a nearly-full rewrite of the tsv2lmf.py file. The result is hopefully easier to follow and modify. There is still a 2-pass process where we load the TSV data into intermediate structures, then process those structures to create the LMF, but the current version is closer to the LMF, so the build() step is pretty simple.
  • Made validation (checking headers, redundant senses, empty synsets, etc.) a separate function.
  • Senses on LexicalEntry elements now appear in the order they were in the TSV file. I wasn't trying to make this change, but I noticed it after I finished.
  • Empty synsets are now included in the LMF when they have definitions or examples.
  • Added some unit tests for the TSV loading and linted/formatted/type-checked the code.

This version models lexicalized on synsets and senses but doesn't set the value when we see GAP!. This PR is already doing too many things, so let's save that for the next one.
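The 2-pass shape described above (load the TSV into intermediate structures, then a simple build step) might look something like this. (All function names and data shapes here are hypothetical illustrations, not the actual tsv2lmf.py code.)

```python
def load(rows):
    """Pass 1: group (synset_id, relation, value) TSV rows by synset."""
    synsets = {}
    for synset_id, rel, value in rows:
        synsets.setdefault(synset_id, {}).setdefault(rel, []).append(value)
    return synsets


def build(synsets):
    """Pass 2: turn the intermediate structures into records that are
    already close to the LMF shape, so serialization stays simple."""
    records = []
    for synset_id, data in synsets.items():
        # The GAP! sentinel is metadata, not a real member
        members = [m for m in data.get("lemma", []) if m != "GAP!"]
        records.append({
            "id": synset_id,
            "members": members,
            "definitions": data.get("def", []),  # kept even when members is empty
        })
    return records


rows = [
    ("00029343-a", "def", "a cui piace possedere"),
    ("00033206-a", "lemma", "GAP!"),
    ("00033206-a", "lemma", "troppo attivo"),
]
records = build(load(rows))
```

Note how the first synset survives with no members but a definition, matching the behavior described in the bullet list above.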

@goodmami goodmami merged commit 79ecef8 into main Mar 8, 2025
@goodmami goodmami deleted the 52-allow-the-input-file-to-have-counts-and-pronunciation branch March 8, 2025 03:42

Development

Successfully merging this pull request may close these issues.

Allow the input file to have counts and pronunciation

2 participants