Load PunktParameters from tab files #3283
Conversation
More calling functions need editing, so I'm changing this to a draft.

I think this PR is complete now. However, merging the corresponding nltk_data package is a prerequisite for testing this PR.
(Edited) After applying a git rebase, the following problem seems solved. It resembles a CI error seen recently: here, CI already failed at the pre-commit stage ("reformatted nltk/test/unit/test_disagreement.py"), even though this PR does not modify that file:
All the doctests in nltk/tokenize/*.py succeed.
CI now fails with this:

@stevenbird, nltk_data/index.xml does not mention the newest data packages, which suggests that the index was not rebuilt after merging the latest nltk_data PRs.
@ekaf, just doing
Thanks @sadra-barikbin, this is indeed a sad situation: it is not even possible to import nltk. The reason is that the plaintext corpus reader fails to initialize a sent_tokenizer.
|
@alvations, @stevenbird, @purificant, with plain NLTK v3.8.1 I can confirm that the nltk_data index needs rebuilding:
This nltk_data PR should fix the index.
|
After rebuilding the nltk_data index, the "punkt_tab" package can now be downloaded using nltk from the develop branch, but not from the branch associated with this PR, so nltk still cannot start when using this PR. I guess this is because the plaintext corpus reader (called from meteor) now needs to initialize its sent_tokenizer from the "punkt_tab" package. Without that package, nltk fails to start, and hence it is not able to download the package. But once the package has been downloaded using the develop branch, it is possible to test this PR. Maybe it would be possible to avoid requiring a sent_tokenizer to be loaded while starting nltk.
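The import-time failure described above could in principle be avoided by deferring the tokenizer load until first use. A minimal sketch of that lazy-initialization pattern (the class and parameter names here are invented for illustration; the real PlaintextCorpusReader API differs):

```python
class LazyReader:
    """Sketch of a corpus reader that defers loading its sentence
    tokenizer until the first request, so that merely importing the
    module never requires the data package to be present."""

    def __init__(self, load_tokenizer):
        # load_tokenizer is any zero-argument callable returning a
        # tokenizer; it is deliberately NOT called here.
        self._load_tokenizer = load_tokenizer
        self._sent_tokenizer = None

    @property
    def sent_tokenizer(self):
        # Load lazily: the data package is only needed once someone
        # actually asks for sentence tokenization.
        if self._sent_tokenizer is None:
            self._sent_tokenizer = self._load_tokenizer()
        return self._sent_tokenizer
```

With this shape, constructing the reader (and importing the module that does so) succeeds even when the data package is missing; the download error only surfaces on first tokenization.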
CI succeeds and everything seems OK now.
Fixed a small incompatibility in the new version of ne_chunk(), so that nltk's whole tokenizer/tagger/chunker suite runs using the same high-level calls as before dropping the pickles:

['Consolidated', 'Gold', 'Fields', 'is', 'a', 'British', 'industrial', 'conglomerate', '.']
[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')]
Tree('S', [Tree('GSP', [('Consolidated', 'NNP')]), Tree('ORGANIZATION', [('Gold', 'NNP'), ('Fields', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), Tree('GPE', [('British', 'JJ')]), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')])
Thanks @ekaf, and sorry for the delay; I'm doing fieldwork with almost zero bandwidth.
Tab files are not exposed to the well-known security vulnerabilities reported with Python pickles.
This PR adds a loader for the corresponding data package.
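As a rough illustration of why plain-text parameter files are safer: parsing them can only produce data, whereas unpickling untrusted input can execute arbitrary code. A minimal sketch of reading an abbreviation table (the file format and function name are assumptions for illustration, not the PR's actual format):

```python
def load_abbrev_table(lines):
    """Read abbreviation types from plain-text lines, one entry per
    line (hypothetical format). Unlike pickle.load(), this can only
    ever yield strings, never executable objects."""
    abbrevs = set()
    for line in lines:
        entry = line.strip()
        if entry and not entry.startswith("#"):
            # Punkt stores abbreviation types lowercased and without
            # the trailing period.
            abbrevs.add(entry.lower().rstrip("."))
    return abbrevs
```

Any file-like iterable of lines works here, so the same sketch covers reading from a downloaded data package or from an in-memory test fixture.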
['tél.', 'abst.', 'dec. 15']
Using the new load_lang function, this tokenizer can change language on the fly:
['tél. abst. dec.', '15']
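The on-the-fly switch can be mimicked with a toy splitter that only knows one abbreviation set per language. This is a drastically simplified stand-in for the Punkt algorithm (the class name and table contents are invented for illustration), but under those assumed tables it reproduces the two outputs shown above:

```python
class TabTokenizer:
    """Toy sentence splitter: break after any token ending in '.'
    unless that token is a known abbreviation for the current
    language. (The real Punkt tokenizer is far more elaborate.)"""

    def __init__(self, tables):
        self._tables = tables  # lang -> set of abbreviations, no final '.'
        self._abbrevs = set()

    def load_lang(self, lang):
        # Swap the parameter table in place, analogous to the PR's
        # load_lang switching Punkt parameters on the fly.
        self._abbrevs = self._tables[lang]
        return self

    def tokenize(self, text):
        sents, current = [], []
        for token in text.split():
            current.append(token)
            if token.endswith(".") and token[:-1].lower() not in self._abbrevs:
                sents.append(" ".join(current))
                current = []
        if current:
            sents.append(" ".join(current))
        return sents


# Hypothetical tables: English knows "dec.", French knows "tél." and "abst."
tables = {"english": {"dec"}, "french": {"tél", "abst"}}
tok = TabTokenizer(tables)
print(tok.load_lang("english").tokenize("tél. abst. dec. 15"))
# ['tél.', 'abst.', 'dec. 15']
print(tok.load_lang("french").tokenize("tél. abst. dec. 15"))
# ['tél. abst. dec.', '15']
```

The point of the sketch is only that swapping the abbreviation table changes where sentence breaks fall, without rebuilding the tokenizer object.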