Skip to content

Load PunktParameters from tab files#3283

Merged
stevenbird merged 17 commits intonltk:developfrom
ekaf:punkt_tab
Aug 5, 2024
Merged

Load PunktParameters from tab files#3283
stevenbird merged 17 commits intonltk:developfrom
ekaf:punkt_tab

Conversation

@ekaf
Copy link
Copy Markdown
Member

@ekaf ekaf commented Jul 9, 2024

Tab files are not exposed to the "sleepy" vulnerabilities reported with Python pickles.

This PR adds a loader for the corresponding data package.

from nltk.tokenize.punkt import PunktTokenizer
tokenizer = PunktTokenizer()

text = 'tél. abst. dec. 15'

tokenizer.load_lang('english')
print(tokenizer.tokenize(text))

['tél.', 'abst.', 'dec. 15']

Using the new load_lang function, this tokenizer can change language on the fly:

tokenizer.load_lang('french')
print(tokenizer.tokenize(text))

['tél. abst. dec.', '15']

@ekaf ekaf requested a review from alvations July 9, 2024 15:51
@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 24, 2024

More calling functions need editing, so changing to draft.

@ekaf ekaf marked this pull request as draft July 24, 2024 09:18
@ekaf ekaf reopened this Jul 26, 2024
@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 26, 2024

I think this PR is complete now. However, first merging the corresponding nltk_data package would be a prerequisite for testing this PR.

@ekaf ekaf requested a review from stevenbird July 26, 2024 08:09
@ekaf ekaf marked this pull request as ready for review July 26, 2024 08:10
@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 27, 2024

(Edited) After applying git rebase, the following problem seems solved.

This seems similar to a CI error seen recently. Here, CI failed already on pre-commit ("reformatted nltk/test/unit/test_disagreement.py"), though this PR does not modify that file:

2024-07-26T07:32:44.6552315Z black....................................................................Failed
2024-07-26T07:32:44.6553250Z - hook id: black
2024-07-26T07:32:44.6553575Z - files were modified by this hook
2024-07-26T07:32:44.6553816Z
2024-07-26T07:32:44.6553980Z reformatted nltk/test/unit/test_disagreement.py
2024-07-26T07:32:44.6554251Z
2024-07-26T07:32:44.6554436Z All done! ✨ 🍰 ✨
2024-07-26T07:32:44.6554745Z 1 file reformatted, 360 files left unchanged.
2024-07-26T07:32:44.6555006Z
2024-07-26T07:32:45.6834128Z isort....................................................................Passed
2024-07-26T07:32:45.7018446Z ##[error]Process completed with exit code 1.

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 27, 2024

All the doctests in nltk/tokenize/*py succeed.
In particular, sent_tokenize() does not use pickles anymore.
Users are encouraged to test this PR and report any difference with the previous pickles.

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 27, 2024

CI now fails with this:

Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:

@stevenbird, nltk_data/index.xml does not mention the newest data packages, so it could look like the index was not rebuilt after merging the latest nltk_data PRs.

@sadra-barikbin
Copy link
Copy Markdown

@ekaf , just doing import nltk raises the error that you mentioned above. LookupError Resource punkt_tab not found.

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 28, 2024

Thanks @sadra-barikbin, this is indeed a sad situation: it is not even possible to import nltk. The reason is that the plaintext corpus reader fails to initialize a sent_tokenizer.
As a consequence, nltk.download (or anything else) won't work, since nltk is not defined.
@stevenbird, this is almost the worst that can happen.

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 28, 2024

@alvations, @stevenbird, @purificant, the plain NLTK v. 3.8.1 can confirm that the nltk_data index needs rebuilding:

import nltk
print(f"NLTK v. {nltk.__version__}")

NLTK v. 3.8.1

nltk.download("punkt_tab")

[nltk_data] Error loading punkt_tab: Package 'punkt_tab' not found in
[nltk_data] index

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 28, 2024

This nltk_data PR should fix the index

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 29, 2024

After rebuilding the nltk_data index, the "punkt_tab" package can now be downloaded using nltk from the develop branch, but not from the branch associated with this PR. So nltk still cannot start using this PR.

I guess it is because the plaintext corpora reader (called from meteor) now needs to initialize its sent_tokenizer using the "punkt_tab" package. Without that package, nltk fails to start, and hence, it is not able to download the package.

But once the package is downloaded using the develop branch, it is possible to test this PR.

Maybe it would be possible to avoid the requirement of loading a sent_tokenizer while starting nltk.

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Jul 29, 2024

CI succeeds and everything seems ok now.

@ekaf
Copy link
Copy Markdown
Member Author

ekaf commented Aug 2, 2024

Fixed a small incompatibility in the new version of ne_chunk(), so that the nltk's whole suite of tokenizer/tagger/chunker runs using the same high-level calls as before dropping the pickles:

>>> sent = "Consolidated Gold Fields is a British industrial conglomerate."
from nltk.tokenize import word_tokenize
>>> wsent = word_tokenize(sent)
>>> print(wsent)

['Consolidated', 'Gold', 'Fields', 'is', 'a', 'British', 'industrial', 'conglomerate', '.']

>>> from nltk.tag import pos_tag
>>> psent = pos_tag(wsent)
>>> print(psent)

[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')]

>>> from nltk.chunk import ne_chunk
>>> from pprint import pprint
>>> pprint(ne_chunk(psent))

Tree('S', [Tree('GSP', [('Consolidated', 'NNP')]), Tree('ORGANIZATION', [('Gold', 'NNP'), ('Fields', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), Tree('GPE', [('British', 'JJ')]), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')])

@stevenbird stevenbird merged commit b73aa9b into nltk:develop Aug 5, 2024
@stevenbird
Copy link
Copy Markdown
Member

Thanks @ekaf and sorry for the delay I'm doing fieldwork with almost zero bandwidth

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants