52 allow the input file to have counts and pronunciation #53
A previous commit removed these lines, but this commit changes the scripts/clean-tsv.py script to do the same thing and generates records for the *-changes.tab files. I verified that this reproduced the hand-committed changes by copying in earlier versions of the affected files and running the cleaning script. The result had no diffs with what's in Git and I was able to get the change records. This commit also changes the clean.sh script so it appends to the *-changes.tab files instead of overwriting them.
Also remove spurious whitespace
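As a rough illustration of the cleaning step described in these commits, here is a minimal sketch. It is not the actual scripts/clean-tsv.py: the function names, the definition of a "blank entry" (a row whose last tab-separated field is empty), and the layout of the change records are all assumptions made for the example.

```python
def clean_rows(rows):
    """Return (kept, changes) for an iterable of tab-separated rows.

    Assumption: a "blank entry" is a row whose last field is empty.
    Each change is a (kind, original_row) pair; this record layout is
    hypothetical, not the real *-changes.tab format.
    """
    kept, changes = [], []
    for row in rows:
        original = row.rstrip("\n")
        # Strip spurious whitespace around every field.
        fields = [field.strip() for field in original.split("\t")]
        if not fields[-1]:
            changes.append(("removed", original))
            continue
        cleaned = "\t".join(fields)
        if cleaned != original:
            changes.append(("whitespace", original))
        kept.append(cleaned)
    return kept, changes


def append_changes(path, changes):
    """Append change records to a log file instead of overwriting it."""
    with open(path, "a", encoding="utf-8") as log:
        for kind, original in changes:
            log.write(f"{kind}\t{original}\n")
```

Opening the log in append mode (`"a"`) is what keeps earlier records intact across runs, which matches the change to clean.sh described above.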
@fcbond I added a couple of commits (so far). The first commit replicates the removal of blank entries using a script; the result is the same, but now we have logs and can repeat the process in the future. The second commit fixes the issue you had with
<Synset id="omw-it-00029343-a" ili="i145" partOfSpeech="a" lexicalized="false">
  <Definition>a cui piace possedere</Definition>
</Synset>
However, there are also lots of things like this:
<Synset id="omw-it-00033206-a" ili="i161" partOfSpeech="a" lexicalized="false" members="omw-it-troppo_attivo-00033206-a" />
... which is the result of these lines:
00033206-a ita:lemma GAP!
00033206-a ita:lemma troppo attivo
What does it mean for something ostensibly not lexicalized to nevertheless have a lexical entry? Maybe the answer is in the MultiWordnet paper, but I could not find a freely available version online.
It means that the lexical entry is compositional, for example an entry 'buckwheat noodles' for the synset for soba. A lexicon just for that language would not have it, but a multilingual lexicon may. So 00033206-a overactive does not really exist in Italian, but the concept could be realized as troppo attivo "too active". It's useful to have the entry from the point of view of a multilingual resource, but it is also nice to be able to note that it is not really something you would have in a monolingual lexicon. For MultiWordnet, which the Italian (and Hebrew) wordnets come from, they marked all entries not existing in a reference lexicon of Italian as
We should maybe also mark it on the sense as well as the synset? Strictly speaking we should, as otherwise it defaults to True. It is there on the sense level because we could have a concept that is lexicalized, but one of the senses isn't. For example, if we wanted to have guitar guitar "real guitar" for acoustic guitar, ... I think no one uses this yet.
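To make the synset-level behaviour discussed above concrete, here is a small sketch of how lemma lines could be grouped so that a `GAP!` line turns into `lexicalized="false"` on the synset while the remaining lemmas stay as members. This is an illustration, not the converter's actual code: the function name is hypothetical, and it assumes the three fields are tab-separated. It says nothing about the sense-level flag, which would need a separate decision.

```python
GAP = "GAP!"  # marker seen in the lemma lines quoted above


def group_lemmas(lines):
    """Map synset id -> (lexicalized, lemmas) from tab-separated lemma lines.

    Assumption: each line is "<synset id>\t<relation>\t<lemma>".
    A synset is treated as not lexicalized if any of its lines is GAP!.
    """
    members = {}
    gapped = set()
    for line in lines:
        synset_id, _relation, lemma = line.rstrip("\n").split("\t", 2)
        members.setdefault(synset_id, [])
        if lemma == GAP:
            gapped.add(synset_id)
        else:
            members[synset_id].append(lemma)
    return {sid: (sid not in gapped, lemmas) for sid, lemmas in members.items()}


example = [
    "00033206-a\tita:lemma\tGAP!",
    "00033206-a\tita:lemma\ttroppo attivo",
]
print(group_lemmas(example))
```

With the two lines quoted earlier, the synset comes out as not lexicalized but still keeps troppo attivo as a member, which is the combination the XML above shows.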
Ok, that makes sense. There are also a handful of synset groups with entries like these:
00593071-a ita:lemma GAP!
00593071-a ita:lemma persistente
00593071-a ita:lemma che non dà tregua
00593071-a ita:lemma senza tregua
For reference:
>>> import wn
>>> en = wn.Wordnet('omw-en')
>>> en.synset('omw-en-00593071-s').definition()
'never-ceasing'
>>> en.synset('omw-en-00593071-s').lemmas()
['persistent', 'relentless', 'unrelenting']
Yes, I agree, and I think this can solve the problem above. The problem is identifying which entries are compositional and which may be fixed collocations or idioms. I just found this paper which explains the lexical gaps: http://www.lrec-conf.org/proceedings/lrec2000/pdf/236.pdf, and it seems like "simple word" is the first test (which I guess means a word without spaces), but there are many other tests which we cannot easily replicate. So how about:
I still have no idea how
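The "simple word" test mentioned above, read as "a word without spaces" (which is only a guess at what the paper means), could be sketched as follows. The function names are hypothetical, and none of the paper's other tests are attempted here, so the split is only a first-pass heuristic.

```python
def is_simple_word(lemma):
    """Heuristic: a lemma counts as a simple word if it contains no whitespace."""
    return not any(ch.isspace() for ch in lemma.strip())


def split_by_simple_word(lemmas):
    """Partition lemmas into (simple, multiword) lists, preserving order."""
    simple = [lemma for lemma in lemmas if is_simple_word(lemma)]
    multiword = [lemma for lemma in lemmas if not is_simple_word(lemma)]
    return simple, multiword


# Lemmas of 00593071-a from the lines quoted earlier in this thread.
simple, multiword = split_by_simple_word(
    ["persistente", "che non dà tregua", "senza tregua"]
)
print(simple)
print(multiword)
```

A multiword lemma is not necessarily compositional (it may be a fixed collocation or an idiom), so this only separates candidates and does not decide what is lexicalized.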
This commit makes substantial changes to the module.
Previously it tried to set sys.stderr as the logging destination when --log was unset, which created a strange file.
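For the logging destination, one way to keep the two cases separate is to attach a stream handler when `--log` is unset and a file handler otherwise. The sketch below is an assumption about how this could look, not the code in this commit; the logger name and function names are hypothetical, and only the `--log` option comes from the description above.

```python
import argparse
import logging
import sys


def configure_logging(log_path=None):
    """Log to the given file, or to sys.stderr when no path is given."""
    logger = logging.getLogger("clean-tsv")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    # Drop handlers from earlier calls so messages are not duplicated.
    for old in list(logger.handlers):
        logger.removeHandler(old)
    if log_path is None:
        handler = logging.StreamHandler(sys.stderr)
    else:
        handler = logging.FileHandler(log_path, encoding="utf-8")
    logger.addHandler(handler)
    return logger


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--log", default=None, help="path of the log file")
    args = parser.parse_args(argv)
    logger = configure_logging(args.log)
    logger.info("logging configured")
    return logger
```

Keeping the default as `None` rather than a stream object means the file branch only ever sees a real path.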
I added some more commits:
This version models