TSV cleanup scripts#48

Merged
goodmami merged 8 commits into main from cleanup-scripts
Feb 2, 2025
Conversation

@goodmami
Collaborator

I'm splitting this PR off from #33 so that it covers only the scripts; the actual data cleanup can go in other PRs.

This PR adds two scripts:

  • scripts/tsv-duplicates.py finds exact and normalized redundant lemmas
  • scripts/clean-tsv.py modifies TSV files (in place or to stdout) to strip quotes, replace _ with a space, and remove duplicate lemmas in a synset.

The normalized duplicates of tsv-duplicates.py are probably best fixed by hand, so clean-tsv.py only looks for exact duplicates after some basic cleaning.
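The cleaning steps described above can be sketched roughly as follows. This is not the code from scripts/clean-tsv.py; it is a minimal illustration that assumes tab-separated rows of the form synset, type, lemma, where lemma rows have a type containing `:lemma`. The names `clean_lemma` and `clean_rows` are hypothetical.

```python
def clean_lemma(lemma: str) -> str:
    """Strip surrounding quotes and replace underscores with spaces."""
    lemma = lemma.strip()
    # Peel off matching quote characters wrapped around the whole lemma.
    while len(lemma) >= 2 and lemma[0] == lemma[-1] and lemma[0] in "\"'":
        lemma = lemma[1:-1].strip()
    return lemma.replace("_", " ")


def clean_rows(rows):
    """Yield cleaned rows, dropping exact duplicate lemmas within a synset.

    Assumption: each row is a list of column strings, and only rows whose
    second column contains ':lemma' are lemma rows. Everything else
    (comments, definitions, examples) is passed through unchanged.
    """
    seen = set()
    for row in rows:
        if len(row) < 3 or ":lemma" not in row[1]:
            yield row
            continue
        synset, kind, lemma = row[0], row[1], clean_lemma(row[2])
        key = (synset, kind, lemma)
        if key in seen:
            continue  # exact duplicate after cleaning; skip it
        seen.add(key)
        yield [synset, kind, lemma, *row[3:]]


rows = [
    ["00001740-n", "eng:lemma", '"able_bodied"'],
    ["00001740-n", "eng:lemma", "able bodied"],
    ["00001740-n", "eng:lemma", "fit"],
]
cleaned = list(clean_rows(rows))
```

Here the first two rows collapse into one because they are identical once the quotes and underscore are cleaned, which is the "exact duplicates after some basic cleaning" case.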

@fcbond I reused parts of your quote-stripping code from the other PR, so we might not need it in two places. Better to do it once and modify the TSV than to do it every time we create a WN-LMF file, I think.

* Report exact and normalized duplicates separately
* Add support for polysemy thresholds

Replace '_' with ' ', strip quotes, and remove exact duplicate lemmas.

Use --ignore-lemma-type to only look at lemmas. This is mainly for the Arabic wordnet, which uses lemma types like arb:lemma:root and arb:lemma:brokenplural.

This makes it more consistent with clean-tsv.py and makes it more amenable to scripting or CI uses.
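To illustrate the difference between exact and normalized duplicates, and one possible reading of --ignore-lemma-type, here is a hedged sketch. It is not taken from scripts/tsv-duplicates.py: the normalization rules (NFKC, case folding, treating `_` and `-` as spaces) are assumptions, as is the interpretation that ignoring the lemma type means comparing lemma text regardless of its type column. The function names are hypothetical.

```python
import unicodedata
from collections import Counter, defaultdict


def normalize(lemma: str) -> str:
    """Assumed normalization: NFKC, case folding, unified separators."""
    lemma = unicodedata.normalize("NFKC", lemma).casefold()
    return " ".join(lemma.replace("_", " ").replace("-", " ").split())


def find_duplicates(rows, ignore_lemma_type=False):
    """Return (exact, normalized) duplicate reports for (synset, type, lemma) rows.

    exact:      (synset, lemma) pairs where the same string occurs more than once
    normalized: (synset, variants) where distinct strings normalize identically
    """
    groups = defaultdict(list)
    for synset, kind, lemma in rows:
        # Assumption: ignoring the lemma type drops it from the grouping key.
        key = (synset, None if ignore_lemma_type else kind, normalize(lemma))
        groups[key].append(lemma)

    exact, normalized = [], []
    for (synset, _kind, _norm), lemmas in groups.items():
        counts = Counter(lemmas)
        exact.extend((synset, lemma) for lemma, n in counts.items() if n > 1)
        if len(counts) > 1:
            normalized.append((synset, sorted(counts)))
    return exact, normalized


rows = [
    ("s1", "eng:lemma", "Hot dog"),
    ("s1", "eng:lemma", "hot_dog"),
    ("s1", "eng:lemma", "frank"),
    ("s1", "eng:lemma", "frank"),
    ("s2", "arb:lemma", "kitab"),
    ("s2", "arb:lemma:root", "kitab"),
]
exact, normalized = find_duplicates(rows)
exact_any_type, _ = find_duplicates(rows, ignore_lemma_type=True)
```

Under these assumptions, the two `s2` rows count as duplicates only when the lemma type is ignored, since by default a lemma and its root form are kept in separate groups.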
@goodmami goodmami merged commit cb39a15 into main Feb 2, 2025
@goodmami goodmami deleted the cleanup-scripts branch February 2, 2025 00:30