HTML anchors in French lemmas? #47

@goodmami

There are a number of lemmas in the French wordnet that have # in them, apparently like an HTML anchor:

$ grep -rP --include='wn-data*.tab' 'lemma\t.*#'
wns/fra/wn-data-fra.tab:00074092-n	fra:lemma	Lapsus#La m.C3.A9canique inconsciente du lapsus
wns/fra/wn-data-fra.tab:01242962-n	fra:lemma	jeûne#je.c3.bbne politique
wns/fra/wn-data-fra.tab:01242962-n	fra:lemma	Jeûne#Je.C3.BBne politique
wns/fra/wn-data-fra.tab:01304820-n	fra:lemma	Guerre de Sept Ans#Le_th.C3.A9.C3.A2tre_d.E2.80.99op.C3.A9rations_am.C3.A9ricain
wns/fra/wn-data-fra.tab:02515560-n	fra:lemma	c&#339;lacanthe
wns/fra/wn-data-fra.tab:02515713-n	fra:lemma	c&#339;lacanthe
wns/fra/wn-data-fra.tab:02922798-n	fra:lemma	jute#toile_de_jute
wns/fra/wn-data-fra.tab:02922798-n	fra:lemma	Jute#Toile_de_jute
wns/fra/wn-data-fra.tab:03321103-n	fra:lemma	Turboréacteur#Simple et double flux
wns/fra/wn-data-fra.tab:03321103-n	fra:lemma	turboréacteur#simple et double flux
wns/fra/wn-data-fra.tab:03321419-n	fra:lemma	Turboréacteur#Simple et double flux
wns/fra/wn-data-fra.tab:03321419-n	fra:lemma	turboréacteur#simple et double flux
wns/fra/wn-data-fra.tab:03507963-n	fra:lemma	moteur#moteur_thermique
wns/fra/wn-data-fra.tab:03507963-n	fra:lemma	Moteur#Moteur_thermique
wns/fra/wn-data-fra.tab:03510244-n	fra:lemma	Radiateur#.C3.89changeur_solide.2Fair
wns/fra/wn-data-fra.tab:03510244-n	fra:lemma	radiateur#.c3.89changeur_solide.2fair
wns/fra/wn-data-fra.tab:03833750-n	fra:lemma	semi-conducteur#dopage de type n
wns/fra/wn-data-fra.tab:04017993-n	fra:lemma	semi-conducteur#dopage de type p
wns/fra/wn-data-fra.tab:04445952-n	fra:lemma	interrupteur#levier
wns/fra/wn-data-fra.tab:04445952-n	fra:lemma	Interrupteur#Levier
wns/fra/wn-data-fra.tab:06860826-n	fra:lemma	mode (musique tonale)#mode_majeur
wns/fra/wn-data-fra.tab:06861020-n	fra:lemma	mode (musique tonale)#mode_mineur
wns/fra/wn-data-fra.tab:07544647-n	fra:lemma	Affection#G.C3.A9n.C3.A9ralit.C3.A9
wns/fra/wn-data-fra.tab:07544647-n	fra:lemma	affection#g.c3.a9n.c3.a9ralit.c3.a9
wns/fra/wn-data-fra.tab:07596452-n	fra:lemma	sucre#les diff.c3.a9rentes formes du sucre
wns/fra/wn-data-fra.tab:07596452-n	fra:lemma	Sucre#Les diff.C3.A9rentes formes du sucre
wns/fra/wn-data-fra.tab:08020242-n	fra:lemma	Septembre noir#Le_massacre_de_septembre_1970
wns/fra/wn-data-fra.tab:08085824-n	fra:lemma	Cardinal (religion)#Le_Coll.C3.A8ge_cardinalice
wns/fra/wn-data-fra.tab:08110648-n	fra:lemma	espèce#sous-esp.c3.a8ce
wns/fra/wn-data-fra.tab:08110648-n	fra:lemma	Espèce#Sous-esp.C3.A8ce
wns/fra/wn-data-fra.tab:08327616-n	fra:lemma	alimentation en grèce antique#les banquets
wns/fra/wn-data-fra.tab:09040998-n	fra:lemma	Antioche#Histoire
wns/fra/wn-data-fra.tab:09294877-n	fra:lemma	Grotte#Culture
wns/fra/wn-data-fra.tab:09294877-n	fra:lemma	grotte#culture
wns/fra/wn-data-fra.tab:09821253-n	fra:lemma	dispositifs tactiques en football#l'attaque
wns/fra/wn-data-fra.tab:10818088-n	fra:lemma	André#Sens_et_origine_du_nom
wns/fra/wn-data-fra.tab:10991936-n	fra:lemma	Gates#Personnalit.C3.A9s
wns/fra/wn-data-fra.tab:11083656-n	fra:lemma	Christ#Religion
wns/fra/wn-data-fra.tab:13879947-n	fra:lemma	triangle#triangle_.c3.a9quilat.c3.a9ral
wns/fra/wn-data-fra.tab:13879947-n	fra:lemma	Triangle#Triangle_.C3.A9quilat.C3.A9ral
wns/fra/wn-data-fra.tab:14187869-n	fra:lemma	psoriasis#arthrite_psoriatique
wns/fra/wn-data-fra.tab:14187869-n	fra:lemma	Psoriasis#Arthrite_psoriatique
wns/fra/wn-data-fra.tab:14510401-n	fra:lemma	monoxyde de carbone#intoxication au monoxyde de carbone
wns/fra/wn-data-fra.tab:14839322-n	fra:lemma	Alliages d'aluminium pour corroyage#S.C3.A9rie_2000_.28aluminium_cuivre.29
wns/slv/wn-data-slv.tab:02001428-n	slv:lemma	mo&#269;virniki

The last one, for Slovenian, and some of the French ones look more like HTML numeric character references than anchors. The upper/lower-case initial letters in these examples also seem to create a lot of redundancy.
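To tell the two kinds of `#` apart before changing anything, a rough check like the following might help. It is only a sketch: the function name `classify` is made up for illustration, and it assumes the escapes are HTML numeric character references of the form `&#NNN;` or `&#xHHH;`.

```python
import re

# Assumption: escapes are HTML numeric character references,
# e.g. "&#339;" (decimal) or "&#x153;" (hexadecimal).
CHAR_REF = re.compile(r"&#(?:[0-9]+|[xX][0-9A-Fa-f]+);")


def classify(lemma):
    """Return 'anchor', 'char_ref', or 'clean' for a lemma string.

    'anchor' wins when a '#' remains after removing character references.
    """
    without_refs = CHAR_REF.sub("", lemma)
    if "#" in without_refs:
        return "anchor"
    if CHAR_REF.search(lemma):
        return "char_ref"
    return "clean"
```

For example, `classify("Antioche#Histoire")` would be reported as an anchor, while a lemma with no `#` at all is reported as clean.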

@ekaf, since I don't know French, can you help me verify whether the # and everything after it can simply be removed? My guess at a correction is as follows, with the #... stripped, the HTML character reference for œ replaced with a literal œ, and redundant (case-normalized) lemmas for the same synset removed:

00074092-n	fra:lemma	Lapsus
01242962-n	fra:lemma	jeûne
01304820-n	fra:lemma	Guerre de Sept Ans
02515560-n	fra:lemma	cœlacanthe
02515713-n	fra:lemma	cœlacanthe
02922798-n	fra:lemma	jute
03321103-n	fra:lemma	turboréacteur
03321419-n	fra:lemma	turboréacteur
03507963-n	fra:lemma	moteur
03510244-n	fra:lemma	radiateur
03833750-n	fra:lemma	semi-conducteur
04017993-n	fra:lemma	semi-conducteur
04445952-n	fra:lemma	interrupteur
06860826-n	fra:lemma	mode (musique tonale)
06861020-n	fra:lemma	mode (musique tonale)
07544647-n	fra:lemma	affection
07596452-n	fra:lemma	sucre
08020242-n	fra:lemma	Septembre noir
08085824-n	fra:lemma	Cardinal (religion)
08110648-n	fra:lemma	espèce
08327616-n	fra:lemma	alimentation en grèce antique
09040998-n	fra:lemma	Antioche
09294877-n	fra:lemma	grotte
09821253-n	fra:lemma	dispositifs tactiques en football
10818088-n	fra:lemma	André
10991936-n	fra:lemma	Gates
11083656-n	fra:lemma	Christ
13879947-n	fra:lemma	triangle
14187869-n	fra:lemma	psoriasis
14510401-n	fra:lemma	monoxyde de carbone
14839322-n	fra:lemma	Alliages d'aluminium pour corroyage
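The cleanup described above could be scripted roughly as follows. This is a sketch under assumptions, not a tested fix for the data: the helper names `clean_lemma` and `clean_rows` are hypothetical, rows are assumed to be already split into (synset, relation, lemma) tuples, the escapes are assumed to be HTML numeric character references, and when two variants differ only in case the lowercase-initial one is kept (which matches the list above but is a guess at the intended rule).

```python
import html


def clean_lemma(lemma):
    """Decode HTML character references, then drop '#' and what follows."""
    decoded = html.unescape(lemma)  # e.g. "c&#339;lacanthe" -> "cœlacanthe"
    return decoded.split("#", 1)[0].strip()


def clean_rows(rows):
    """Clean lemmas and remove case-variant duplicates per synset.

    rows: iterable of (synset, relation, lemma) tuples.
    Returns a list of tuples in first-seen order.
    """
    chosen = {}  # (synset, relation, casefolded lemma) -> preferred spelling
    for synset, relation, lemma in rows:
        cleaned = clean_lemma(lemma)
        key = (synset, relation, cleaned.casefold())
        current = chosen.get(key)
        # Assumption: prefer the lowercase-initial variant when both exist.
        if current is None or (current[:1].isupper() and not cleaned[:1].isupper()):
            chosen[key] = cleaned
    return [(s, r, lemma) for (s, r, _), lemma in chosen.items()]


rows = [
    ("02515560-n", "fra:lemma", "c&#339;lacanthe"),
    ("02922798-n", "fra:lemma", "jute#toile_de_jute"),
    ("02922798-n", "fra:lemma", "Jute#Toile_de_jute"),
    ("03321103-n", "fra:lemma", "Turboréacteur#Simple et double flux"),
    ("03321103-n", "fra:lemma", "turboréacteur#simple et double flux"),
    ("00074092-n", "fra:lemma", "Lapsus#La m.C3.A9canique inconsciente du lapsus"),
]
cleaned_rows = clean_rows(rows)
```

A cleaned lemma may also collide with a lemma already present for the same synset elsewhere in the file, so the whole file would need to go through `clean_rows`, not just the matching lines.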

Labels: data (Something is wrong in the data), wontfix (This will not be worked on)
