52 allow the input file to have counts and pronunciation #53

Merged
goodmami merged 10 commits into main from
52-allow-the-input-file-to-have-counts-and-pronunciation
Mar 8, 2025

Conversation

@fcbond
Contributor

@fcbond fcbond commented Mar 2, 2025

No description provided.

@fcbond fcbond linked an issue Mar 2, 2025 that may be closed by this pull request
goodmami added 2 commits March 3, 2025 12:35
A previous commit removed these lines, but this commit changes the
scripts/clean-tsv.py script to do the same thing and generates records
for the *-changes.tab files. I verified that this reproduced the
hand-committed changes by copying in earlier versions of the affected
files and running the cleaning script. The result had no diffs with
what's in Git and I was able to get the change records.

This commit also changes the clean.sh script so it appends to
the *-changes.tab files instead of overwriting them.
Also remove spurious whitespace
@goodmami
Collaborator

goodmami commented Mar 3, 2025

@fcbond I added a couple of commits (so far).

The first commit replicates the removal of blank entries using a script; the result is the same but now we have logs and can repeat the process in the future with clean.sh.

The second commit fixes the issue you had with lexicalized="false". The main problem was that the initially assembled objects are not the final objects sent for serialization, and the conversion from the former to the latter skipped synsets without any members and never set the lexicalized attribute. Now it works and we see things like this:

    <Synset id="omw-it-00029343-a" ili="i145" partOfSpeech="a" lexicalized="false">
      <Definition>a cui piace possedere</Definition>
    </Synset>

However, there are also lots of things like this:

    <Synset id="omw-it-00033206-a" ili="i161" partOfSpeech="a" lexicalized="false" members="omw-it-troppo_attivo-00033206-a" />

... which is the result of these lines in the .tab file:

00033206-a	ita:lemma	GAP!
00033206-a	ita:lemma	troppo attivo

What does it mean for something ostensibly not lexicalized to nevertheless have a lexical entry? Maybe the answer is in the MultiWordnet paper, but I could not find a freely-available version online.

@fcbond
Contributor Author

fcbond commented Mar 4, 2025

It means that the lexical entry is compositional: for example, an entry 'buckwheat noodles' for the synset for soba. A lexicon just for that language would not have it, but a multilingual lexicon may.

So 00033206-a overactive does not really exist in Italian, but the concept could be realized as troppo attivo "too active".

It's useful to have the entry from the point of view of a multi-lingual resource, but it is also nice to be able to note that it is not really something you would have in a monolingual lexicon.

As for MultiWordNet, which these Italian (and Hebrew) wordnets come from: they marked all entries not existing in a reference lexicon of Italian as lexicalized false (GAP or PSEUDOGAP).

@fcbond
Contributor Author

fcbond commented Mar 4, 2025

Should we maybe also mark it on the sense as well as on the synset? Strictly speaking we should, as otherwise it defaults to true. It exists at the sense level because we could have a concept that is lexicalized while one of its senses isn't. For example, if we wanted to have guitar guitar "real guitar" for acoustic guitar, ... I think no one uses this yet.

@goodmami
Collaborator

goodmami commented Mar 4, 2025

It means that the lexical entry is compositional

Ok, that makes sense. There are also a handful of synset groups that have GAP! alongside single-word entries:

00593071-a	ita:lemma	GAP!
00593071-a	ita:lemma	persistente
00593071-a	ita:lemma	che non dà tregua
00593071-a	ita:lemma	senza tregua

For reference:

>>> import wn
>>> en = wn.Wordnet('omw-en')
>>> en.synset('omw-en-00593071-s').definition()
'never-ceasing'
>>> en.synset('omw-en-00593071-s').lemmas()
['persistent', 'relentless', 'unrelenting']

We should maybe also mark it on the sense as well as the synset? Strictly speaking we should, as otherwise it defaults to True. It is there on the sense level because we could have a concept that is lexicalized, but one of the senses isn't.

Yes, I agree, and I think this can solve the problem above. The difficulty is identifying which entries are compositional and which may be fixed collocations or idioms. I just found this paper which explains the lexical gaps: http://www.lrec-conf.org/proceedings/lrec2000/pdf/236.pdf, and it seems like "simple word" is the first test (which I guess means a word without spaces), but there are many other tests which we cannot easily replicate.

So how about:

  1. If a synset group has GAP!, then the sense for any entry with spaces gets lexicalized="false"
  2. If all senses of the synset have lexicalized="false", then the synset gets lexicalized="false", too

I still have no idea how PSEUDOGAP! differs from GAP!. I guess we just treat it the same as GAP!.
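The two rules above could be sketched roughly like this. (This is a minimal illustration with hypothetical function names and data shapes, not the actual repository code; it treats PSEUDOGAP! cases the same way they would be once normalized to GAP!.)

```python
def mark_lexicalized(groups):
    """groups maps a synset id to its list of lemma strings,
    where the sentinel 'GAP!' marks a lexical-gap synset group."""
    sense_lex = {}   # (synset_id, lemma) -> bool
    synset_lex = {}  # synset_id -> bool
    for synset_id, lemmas in groups.items():
        has_gap = "GAP!" in lemmas
        entries = [lemma for lemma in lemmas if lemma != "GAP!"]
        for lemma in entries:
            # Rule 1: in a GAP! group, an entry containing spaces
            # gets lexicalized="false" on its sense
            sense_lex[(synset_id, lemma)] = not (has_gap and " " in lemma)
        # Rule 2: the synset is unlexicalized only if all of its
        # senses are unlexicalized (or it has no members at all)
        synset_lex[synset_id] = (
            any(sense_lex[(synset_id, lemma)] for lemma in entries)
            if entries else False
        )
    return sense_lex, synset_lex


# The two cases discussed in this thread:
groups = {
    "00033206-a": ["GAP!", "troppo attivo"],
    "00593071-a": ["GAP!", "persistente", "che non dà tregua", "senza tregua"],
}
senses, synsets = mark_lexicalized(groups)
```

Under these rules, 00033206-a ends up fully unlexicalized, while 00593071-a stays lexicalized because "persistente" is a single word.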

goodmami added 5 commits March 7, 2025 12:33
This commit has substantial changes to the module.
Before, it was trying to set sys.stderr as the logging destination when --log is
unset, which created a strange file.
@goodmami
Collaborator

goodmami commented Mar 8, 2025

I added some more commits:

  • Use logging instead of print calls (so we don't have to pass logfile around all the time)
  • Did a nearly-full rewrite of the tsv2lmf.py file. The result is hopefully easier to follow and modify. There is still a 2-pass process where we load the TSV data into intermediate structures, then process those structures to create the LMF, but the current version is closer to the LMF, so the build() step is pretty simple.
  • Made validation (checking headers, redundant senses, empty synsets, etc.) a separate function.
  • Senses on LexicalEntry elements now appear in the order they were in the TSV file. I wasn't trying to make this change, but I noticed it after I finished.
  • Empty synsets are now included in the LMF when they have definitions or examples.
  • Added some unit tests for the TSV loading and linted/formatted/type-checked the code.

This version models lexicalized on synsets and senses but doesn't set the value when we see GAP!. This PR is already doing too many things, so let's save that for the next one.
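The 2-pass shape described above (load the TSV into intermediate structures, then a simple build step) might look something like this. (All function names and data shapes here are hypothetical illustrations, not the actual tsv2lmf.py code.)

```python
def load(rows):
    """Pass 1: group (synset_id, relation, value) TSV rows by synset."""
    synsets = {}
    for synset_id, rel, value in rows:
        synsets.setdefault(synset_id, {}).setdefault(rel, []).append(value)
    return synsets


def build(synsets):
    """Pass 2: turn the intermediate structures into records that are
    already close to the LMF shape, so serialization stays simple."""
    records = []
    for synset_id, data in synsets.items():
        # The GAP! sentinel is metadata, not a real member
        members = [m for m in data.get("lemma", []) if m != "GAP!"]
        records.append({
            "id": synset_id,
            "members": members,
            "definitions": data.get("def", []),  # kept even when members is empty
        })
    return records


rows = [
    ("00029343-a", "def", "a cui piace possedere"),
    ("00033206-a", "lemma", "GAP!"),
    ("00033206-a", "lemma", "troppo attivo"),
]
records = build(load(rows))
```

Note how the first synset survives with no members but a definition, matching the behavior described in the bullet list above.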

@goodmami goodmami merged commit 79ecef8 into main Mar 8, 2025
@goodmami goodmami deleted the 52-allow-the-input-file-to-have-counts-and-pronunciation branch March 8, 2025 03:42

Development

Successfully merging this pull request may close these issues.

Allow the input file to have counts and pronunciation

2 participants