TSV cleanup scripts by goodmami · Pull Request #48 · omwn/omw-data

goodmami · 2025-01-30T01:27:07Z

This is a separate PR from #33 just for the scripts; actually cleaning up the data can go in other PRs.

This PR adds two scripts:

* scripts/tsv-duplicates.py finds exact and normalized redundant lemmas.
* scripts/clean-tsv.py modifies TSV files (in-place or to stdout) to strip quotes, replace _ with a space, and remove duplicate lemmas in a synset.

The normalized duplicates of tsv-duplicates.py are probably best fixed by hand, so clean-tsv.py only looks for exact duplicates after some basic cleaning.
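
To illustrate the difference between the two kinds of duplicates, here is a minimal sketch, not the code in scripts/tsv-duplicates.py. The `normalize` function (NFKC, casefolding, underscores to spaces, collapsed whitespace) and the `(synset, lemma)` row shape are assumptions made for the example; the real script may normalize differently.

```python
import unicodedata
from collections import defaultdict


def normalize(lemma: str) -> str:
    """Hypothetical normalization: NFKC, casefold, '_' -> ' ', collapse spaces."""
    text = unicodedata.normalize("NFKC", lemma).casefold().replace("_", " ")
    return " ".join(text.split())


def find_duplicates(rows):
    """Split duplicates into exact repeats and normalized collisions.

    rows: iterable of (synset, lemma) pairs (an assumed, simplified row shape).
    Returns (exact, normalized) where
      exact      = [(synset, lemma), ...] for each repeated identical row
      normalized = [(synset, [form, ...]), ...] for distinct surface forms
                   that normalize to the same string within one synset
    """
    seen_exact = set()
    forms_by_key = defaultdict(set)  # (synset, normalized lemma) -> surface forms
    exact = []
    for synset, lemma in rows:
        if (synset, lemma) in seen_exact:
            exact.append((synset, lemma))
        seen_exact.add((synset, lemma))
        forms_by_key[(synset, normalize(lemma))].add(lemma)
    normalized = [
        (synset, sorted(forms))
        for (synset, _), forms in forms_by_key.items()
        if len(forms) > 1
    ]
    return exact, normalized
```

An exact duplicate is safe to drop automatically, while a normalized collision such as "Dog" vs. "dog" needs someone to decide which form to keep, which is why only the former is handled by clean-tsv.py.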

@fcbond I reused parts of your quote-stripping code from the other PR, so we might not need it in two places. Better to do it once and modify the TSV than to do it every time we create a WN-LMF file, I think.


goodmami added 4 commits January 29, 2025 11:17

Add tsv-duplicates.py script

38fcdd0

Update tsv-duplicates.py

19944d5

* Report exact and normalized duplicates separately
* Add support for polysemy thresholds

Add scripts/clean-tsv.py

9bbe590

Replace '_' with ' ', strip quotes, and remove exact duplicate lemmas.
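
The three cleaning steps can be sketched roughly as follows. This is an illustration under assumptions, not the contents of scripts/clean-tsv.py: `clean_lemma` and `clean_rows` are hypothetical names, rows are simplified to `(synset, lemma)` pairs, and the quote rule (strip one pair of matching surrounding quotes) is a guess at what "strip quotes" means.

```python
def clean_lemma(lemma: str) -> str:
    """Strip one pair of matching surrounding quotes, then replace '_' with ' '."""
    lemma = lemma.strip()
    # Assumed rule: only remove quotes that wrap the whole lemma.
    if len(lemma) >= 2 and lemma[0] == lemma[-1] and lemma[0] in "\"'":
        lemma = lemma[1:-1].strip()
    return lemma.replace("_", " ")


def clean_rows(rows):
    """Yield cleaned (synset, lemma) rows, skipping exact repeats per synset.

    Duplicates are checked after cleaning, so '"hot_dog"' and 'hot dog'
    in the same synset collapse into a single row.
    """
    seen = set()
    for synset, lemma in rows:
        cleaned = clean_lemma(lemma)
        key = (synset, cleaned)
        if key in seen:
            continue  # exact duplicate within this synset
        seen.add(key)
        yield synset, cleaned
```

The same lemma in a different synset is kept, since the duplicate check is scoped to the synset.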

Make in-place optional, log edits in clean-tsv.py

3554ed9

goodmami mentioned this pull request Jan 30, 2025

Add tsv-duplicates.py script #33

Closed

goodmami added 4 commits January 31, 2025 15:50

Add script to clean files; put date in stderr logs

1947a61

Make clean.sh work with wns with multiple lexicons

d02670b

Consider lemma type in deduping by default

c3e9077

Use --ignore-lemma-type to only look at lemmas. This is mainly for the Arabic wordnet which uses lemma types like arb:lemma:root and arb:lemma:brokenplural.
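
A small sketch of how the lemma type might enter the duplicate check. The `dedupe` function, its `ignore_lemma_type` parameter, and the `(synset, lemma_type, lemma)` row shape are assumptions for illustration, not the script's actual code.

```python
def dedupe(rows, ignore_lemma_type=False):
    """Drop repeated rows, keeping the first occurrence.

    rows: iterable of (synset, lemma_type, lemma) triples (assumed shape).
    By default the lemma type is part of the key, so the same string under
    'arb:lemma' and 'arb:lemma:root' counts as two distinct entries.
    With ignore_lemma_type=True only (synset, lemma) is compared.
    """
    seen = set()
    kept = []
    for synset, lemma_type, lemma in rows:
        key = (synset, lemma) if ignore_lemma_type else (synset, lemma_type, lemma)
        if key in seen:
            continue
        seen.add(key)
        kept.append((synset, lemma_type, lemma))
    return kept
```

With the default, a root and a plain lemma that share a spelling both survive; with the type ignored, only the first one does.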

Add --ignore-lemma-type and --check to tsv-dup

cde7cf2

This makes it more consistent with clean-tsv.py and makes it more amenable to scripting or CI uses.

goodmami mentioned this pull request Feb 2, 2025

Possible duplicates in the Arabic wordnet #49

Closed

goodmami merged commit cb39a15 into main Feb 2, 2025

goodmami deleted the cleanup-scripts branch February 2, 2025 00:30

goodmami merged 8 commits into main from cleanup-scripts
