Gh 24 unexpected identifiers #59

Merged
goodmami merged 2 commits into main from gh-24-unexpected-identifiers
May 22, 2025

@goodmami
Collaborator

This PR does two things:

  • Replaces any -s synset IDs with -a for nld, lit, and slk. This happened anyway during conversion to WN-LMF, but for some reason the .tab files still retained the -s IDs. I confirmed that the resulting XML files had no diffs.
  • Removes entries (see list below) from tab files where the offset does not match anything in WordNet 3.0. Some of these had lemmas in other synsets, but many did not. Of those that did not, some do not appear to be good candidates because the POS was wrong or the lemma was a phrase. It would be unfortunate if some useful data were lost, but: (1) we have version control; (2) if the entries do not match a WordNet 3.0 ID, they do not have much value; (3) some lemmas are represented by other entries; and (4) examples without lemmas probably aren't very useful (I did not check whether any examples without lemmas exist for actual WordNet 3.0 synset IDs).
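
For illustration, the -s to -a replacement in the first item could be sketched roughly as below. This is not the script used for this PR. It assumes each data line starts with an 8-digit offset, a hyphen, and a POS letter, followed by a tab. The function name `normalize_satellite_ids` and the sample lines are made up for the example.

```python
import re

# Assumed line layout: "<8-digit offset>-<pos>\t<lang>:<type>\t<value>".
# Only an ID at the very start of a line is touched, so a lemma that
# happens to end in "-s" is left alone.
_SATELLITE_ID = re.compile(r"^(\d{8})-s(?=\t)")


def normalize_satellite_ids(lines):
    """Return a copy of lines with leading '-s' synset IDs rewritten to '-a'."""
    return [_SATELLITE_ID.sub(r"\1-a", line) for line in lines]


# Hypothetical sample lines, not taken from any real tab file.
sample = [
    "# header line\n",
    "00000001-s\txxx:lemma\tvoorbeeld\n",
    "00000002-n\txxx:lemma\tlemma-s\n",
]
normalized = normalize_satellite_ids(sample)
```

Comment lines and lines whose ID already ends in another POS letter pass through unchanged, which keeps the diff limited to the ID column.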

The list of removed entries, from #24, is as follows:

wns/mcr/wn-data-cat.tab:00001837-n      cat:exe 0       187 DC
wns/hrv/wn-data-hrv.tab:01498548-a      hrv:lemma       amoralan
wns/hrv/wn-data-hrv.tab:01498548-a      hrv:lemma       nemoralan
wns/hrv/wn-data-hrv.tab:01505508-a      hrv:lemma       mnogo više
wns/hrv/wn-data-hrv.tab:01505508-a      hrv:lemma       puno više
wns/hrv/wn-data-hrv.tab:02002046-a      hrv:lemma       izuzev
wns/hrv/wn-data-hrv.tab:02002046-a      hrv:lemma       izuzevši
wns/hrv/wn-data-hrv.tab:02002046-a      hrv:lemma       izuzimajući
wns/hrv/wn-data-hrv.tab:02002046-a      hrv:lemma       osim
wns/hrv/wn-data-hrv.tab:02917945-a      hrv:lemma       mahunast
wns/hrv/wn-data-hrv.tab:03202339-n      hrv:lemma       modne potrepštine
wns/cow/wn-data-cmn.tab:14869976-n      cmn:lemma       污点
wns/cow/wn-data-cmn.tab:14869977-n      cmn:lemma       小斑
wns/cow/wn-data-cmn.tab:15168570-n      cmn:lemma       规定的睡觉时间
wns/cow/wn-data-cmn.tab:15171146-n      cmn:lemma       节日
wns/cow/wn-data-cmn.tab:15171147-n      cmn:lemma       纪念日
wns/cow/wn-data-cmn.tab:15171739-n      cmn:lemma       竞技状态不佳的日子
wns/cow/wn-data-cmn.tab:15171858-n      cmn:lemma       存取时间
wns/cow/wn-data-cmn.tab:15172882-n      cmn:lemma       选举日
wns/cow/wn-data-cmn.tab:15173065-n      cmn:lemma       教会年
wns/cow/wn-data-cmn.tab:15176162-n      cmn:lemma       雾月
wns/cow/wn-data-cmn.tab:15177867-n      cmn:lemma       希伯来历
wns/cow/wn-data-cmn.tab:15178842-n      cmn:lemma       回历
wns/mcr/wn-data-glg.tab:15300653-n      glg:lemma       métopa
wns/mcr/wn-data-spa.tab:15300823-n      spa:exe 0       En 1850, el Dr. Green publicó un artículo en la revista Lancet en el cual niega la relación del "asma del heno" con el heno.

The removals are noted in the *-changes.tab files, but the -s to -a modifications are not (it's trivial, and there are thousands of them).
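
As a rough sketch of the removal step (again, not the actual procedure used here), the entries could be split by whether the leading synset ID appears in a set of known WordNet 3.0 IDs. The function `split_by_known_ids` and the `known_ids` set are assumptions for illustration. How that set is built is left open, and it would need to use the same ID form as the tab files (for example, -a rather than -s).

```python
def split_by_known_ids(lines, known_ids):
    """Split tab-file lines into (kept, removed) by their leading synset ID.

    Comment lines (starting with '#') and blank lines are always kept.
    """
    kept, removed = [], []
    for line in lines:
        if line.startswith("#") or not line.strip():
            kept.append(line)
            continue
        # The synset ID is assumed to be the first tab-separated field.
        synset_id = line.split("\t", 1)[0]
        if synset_id in known_ids:
            kept.append(line)
        else:
            removed.append(line)
    return kept, removed


# Hypothetical IDs and entries, for illustration only.
known_ids = {"00000001-n"}
lines = [
    "# header line\n",
    "00000001-n\txxx:lemma\tfoo\n",
    "99999999-n\txxx:lemma\tbar\n",
]
kept, removed = split_by_known_ids(lines, known_ids)
```

Returning the removed lines separately makes it straightforward to record them elsewhere, such as in a changes file.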

The -s would be replaced with -a during conversion anyway.
Confirmed that the resulting XML file has no diffs.
Contributor

@fcbond fcbond left a comment

Looks good to me, thanks.

Definitely not helpful to keep the '-s' around, or non-linked synsets.

@goodmami goodmami merged commit 6f8494f into main May 22, 2025
@goodmami goodmami deleted the gh-24-unexpected-identifiers branch May 22, 2025 18:46