Load PunktParameters from tab files by ekaf · Pull Request #3283 · nltk/nltk

ekaf · 2024-07-09T15:51:30Z

Tab files are not exposed to the "sleepy" vulnerabilities reported with Python pickles.

This PR adds a loader for the corresponding data package.

from nltk.tokenize.punkt import PunktTokenizer
tokenizer = PunktTokenizer()

text = 'tél. abst. dec. 15'

tokenizer.load_lang('english')
print(tokenizer.tokenize(text))

['tél.', 'abst.', 'dec. 15']

Using the new load_lang function, this tokenizer can change language on the fly:

tokenizer.load_lang('french')
print(tokenizer.tokenize(text))

['tél. abst. dec.', '15']
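
For context on why tab files sidestep the pickle problem: unpickling can execute code embedded in the data, whereas reading tab-separated text only ever yields strings. The following is a minimal sketch of that idea, not the loader added by this PR. The helper name load_params_from_tab, the file names and the one-entry-per-line layout are all illustrative assumptions.

```python
import tempfile
from pathlib import Path


def load_params_from_tab(directory):
    """Hypothetical loader: read Punkt-style parameters from plain-text files.

    Assumed layout (illustrative only, not confirmed by the PR):
      abbrev_types.txt  - one abbreviation per line
      collocations.tab  - two tab-separated words per line
    """
    directory = Path(directory)

    # Explicit encoding, so the result does not depend on the platform default.
    with open(directory / "abbrev_types.txt", encoding="utf-8") as f:
        abbrev_types = {line.strip() for line in f if line.strip()}

    collocations = set()
    with open(directory / "collocations.tab", encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            first, second = line.split("\t")
            collocations.add((first, second))

    # Only strings, tuples and sets are built here; nothing from the file
    # is ever executed, which is the point of avoiding pickle.
    return {"abbrev_types": abbrev_types, "collocations": collocations}


def demo():
    """Write two tiny sample files and load them back."""
    with tempfile.TemporaryDirectory() as tmp:
        tmp = Path(tmp)
        (tmp / "abbrev_types.txt").write_text("tél\nabst\n", encoding="utf-8")
        (tmp / "collocations.tab").write_text("first\tsecond\n", encoding="utf-8")
        return load_params_from_tab(tmp)
```

Calling demo() returns a dictionary with a set of abbreviations and a set of word pairs, built purely from text parsing.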

ekaf · 2024-07-24T07:54:40Z

More calling functions need editing, so changing to draft.

ekaf · 2024-07-26T08:09:11Z

I think this PR is complete now. However, the corresponding nltk_data package must be merged first, before this PR can be tested.

ekaf · 2024-07-27T05:38:05Z

(Edited) After applying git rebase, the following problem seems solved.

This seems similar to a CI error seen recently. Here, CI failed already on pre-commit ("reformatted nltk/test/unit/test_disagreement.py"), though this PR does not modify that file:

2024-07-26T07:32:44.6552315Z black....................................................................Failed
2024-07-26T07:32:44.6553250Z - hook id: black
2024-07-26T07:32:44.6553575Z - files were modified by this hook
2024-07-26T07:32:44.6553816Z
2024-07-26T07:32:44.6553980Z reformatted nltk/test/unit/test_disagreement.py
2024-07-26T07:32:44.6554251Z
2024-07-26T07:32:44.6554436Z All done! ✨ 🍰 ✨
2024-07-26T07:32:44.6554745Z 1 file reformatted, 360 files left unchanged.
2024-07-26T07:32:44.6555006Z
2024-07-26T07:32:45.6834128Z isort....................................................................Passed
2024-07-26T07:32:45.7018446Z ##[error]Process completed with exit code 1.

ekaf · 2024-07-27T05:45:07Z

All the doctests in nltk/tokenize/*py succeed.
In particular, sent_tokenize() does not use pickles anymore.
Users are encouraged to test this PR and report any difference with the previous pickles.
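
One way to look for such differences is a small comparison harness. The sketch below is only an illustration: diff_tokenizations and the two stand-in splitters are hypothetical names, and in practice the two callables would be the pickle-based and the tab-based sentence tokenizers.

```python
def diff_tokenizations(texts, old_tokenize, new_tokenize):
    """Return (text, old_result, new_result) for every text where results differ."""
    differences = []
    for text in texts:
        old_result = list(old_tokenize(text))
        new_result = list(new_tokenize(text))
        if old_result != new_result:
            differences.append((text, old_result, new_result))
    return differences


# Stand-in tokenizers for demonstration only (not NLTK code).
def split_on_period(text):
    return [s.strip() for s in text.split(".") if s.strip()]


def split_on_period_space(text):
    return [s.strip() for s in text.split(". ") if s.strip()]


if __name__ == "__main__":
    samples = ["a.b. c", "No period"]
    for text, old, new in diff_tokenizations(
        samples, split_on_period, split_on_period_space
    ):
        print(f"{text!r}: old={old} new={new}")
```

Passing the tokenizers in as callables keeps the harness independent of how each one loads its parameters.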

ekaf · 2024-07-27T12:50:28Z

CI now fails with this:

Resource punkt_tab not found.
Please use the NLTK Downloader to obtain the resource:

@stevenbird, nltk_data/index.xml does not mention the newest data packages, so it looks as if the index was not rebuilt after merging the latest nltk_data PRs.

sadra-barikbin · 2024-07-27T20:39:59Z

@ekaf, just doing import nltk raises the error that you mentioned above: LookupError, Resource punkt_tab not found.

ekaf · 2024-07-28T03:47:00Z

Thanks @sadra-barikbin, this is indeed a sad situation: it is not even possible to import nltk. The reason is that the plaintext corpus reader fails to initialize a sent_tokenizer.
As a consequence, nltk.download (or anything else) won't work, since nltk is not defined.
@stevenbird, this is almost the worst that can happen.

ekaf · 2024-07-28T05:15:27Z

@alvations, @stevenbird, @purificant, plain NLTK v. 3.8.1 confirms that the nltk_data index needs rebuilding:

import nltk
print(f"NLTK v. {nltk.__version__}")

NLTK v. 3.8.1

nltk.download("punkt_tab")

[nltk_data] Error loading punkt_tab: Package 'punkt_tab' not found in
[nltk_data] index

ekaf · 2024-07-28T05:47:09Z

This nltk_data PR should fix the index

ekaf · 2024-07-29T15:40:42Z

After rebuilding the nltk_data index, the "punkt_tab" package can now be downloaded using nltk from the develop branch, but not from the branch associated with this PR. So nltk still fails to start on this PR's branch.

I guess this is because the plaintext corpus reader (called from meteor) now needs to initialize its sent_tokenizer using the "punkt_tab" package. Without that package, nltk fails to start and therefore cannot download the package.

But once the package is downloaded using the develop branch, it is possible to test this PR.
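
For anyone testing, a check-then-download guard can make the missing-package case explicit. This is a sketch under assumptions: ensure_resource is a hypothetical helper, the resource path "tokenizers/punkt_tab" is assumed, and it only helps once import nltk itself succeeds. The find and download parameters default to nltk.data.find and nltk.download but can be replaced, which keeps the logic testable without network access.

```python
def ensure_resource(path, package, find=None, download=None):
    """Download `package` only if `path` cannot be found.

    Returns True if a download was triggered, False if the resource
    was already present. Hypothetical helper, not part of NLTK.
    """
    if find is None or download is None:
        # Imported lazily so the function can be exercised with stand-ins.
        import nltk

        find = find or nltk.data.find
        download = download or nltk.download

    try:
        find(path)
        return False
    except LookupError:
        # A missing resource is reported by the finder as a LookupError.
        download(package)
        return True


if __name__ == "__main__":
    # Assumed resource path; adjust if the package lives elsewhere.
    ensure_resource("tokenizers/punkt_tab", "punkt_tab")
```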

Maybe it would be possible to avoid the requirement of loading a sent_tokenizer while starting nltk.
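
The general pattern for this is lazy initialization: construct the tokenizer on first use rather than in the constructor. The sketch below illustrates the pattern only and is not the change made in this PR. LazyReader and its tokenizer_factory argument are hypothetical names.

```python
class LazyReader:
    """Hypothetical reader that builds its sentence tokenizer on first use."""

    def __init__(self, tokenizer_factory):
        # Store the factory only; nothing is loaded at construction time,
        # so a missing data package cannot break object creation (or import).
        self._tokenizer_factory = tokenizer_factory
        self._sent_tokenizer = None

    @property
    def sent_tokenizer(self):
        if self._sent_tokenizer is None:
            # First access: this is where a LookupError would surface
            # if the required data were missing.
            self._sent_tokenizer = self._tokenizer_factory()
        return self._sent_tokenizer

    def sents(self, text):
        return self.sent_tokenizer(text)


def missing_data_factory():
    """Stand-in factory that simulates a missing data package."""
    raise LookupError("Resource punkt_tab not found.")


if __name__ == "__main__":
    reader = LazyReader(missing_data_factory)  # does not raise
    try:
        reader.sents("Some text.")
    except LookupError as err:
        print(f"Raised only on first use: {err}")
```

With this shape, the error moves from startup to the first call that actually needs sentence splitting, so downloading the package remains possible.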

ekaf · 2024-07-29T23:53:40Z

CI succeeds and everything seems ok now.

ekaf · 2024-08-02T07:01:56Z

Fixed a small incompatibility in the new version of ne_chunk(), so that NLTK's whole tokenizer/tagger/chunker suite runs with the same high-level calls as before the pickles were dropped:

>>> sent = "Consolidated Gold Fields is a British industrial conglomerate."
>>> from nltk.tokenize import word_tokenize
>>> wsent = word_tokenize(sent)
>>> print(wsent)

['Consolidated', 'Gold', 'Fields', 'is', 'a', 'British', 'industrial', 'conglomerate', '.']

>>> from nltk.tag import pos_tag
>>> psent = pos_tag(wsent)
>>> print(psent)

[('Consolidated', 'NNP'), ('Gold', 'NNP'), ('Fields', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('British', 'JJ'), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')]

>>> from nltk.chunk import ne_chunk
>>> from pprint import pprint
>>> pprint(ne_chunk(psent))

Tree('S', [Tree('GSP', [('Consolidated', 'NNP')]), Tree('ORGANIZATION', [('Gold', 'NNP'), ('Fields', 'NNP')]), ('is', 'VBZ'), ('a', 'DT'), Tree('GPE', [('British', 'JJ')]), ('industrial', 'JJ'), ('conglomerate', 'NN'), ('.', '.')])

stevenbird · 2024-08-05T10:00:42Z

Thanks @ekaf, and sorry for the delay; I'm doing fieldwork with almost zero bandwidth.

Load PickleParameters from tab files

c34662e

github-actions bot added the tokenizer label Jul 9, 2024

ekaf requested a review from alvations July 9, 2024 15:51

ekaf added 2 commits July 13, 2024 11:24

Use new tabdata module

9b72b72

Update tabdata module

cb59825

ekaf mentioned this pull request Jul 14, 2024

Remote code execution vulnerability in NLTK #3266

Closed

This was referenced Jul 22, 2024

Pickle-free maxent chunkers #3286

Merged

Prevent data.load from unpickling classes or functions #3290

Merged

ekaf marked this pull request as draft July 24, 2024 09:18

ekaf added 4 commits July 25, 2024 09:47

Fix sent_tokenize

13e0031

Replace Punkt pickle calls by PunktTokenizer

f2054c7

Fix doctests

1e4366a

Import Punkt classes in __init__.py

e347cba

github-actions bot added the corpus and sentiment labels Jul 26, 2024

Update compat.py

2fcb627

ekaf closed this Jul 26, 2024

ekaf force-pushed the punkt_tab branch from 2fcb627 to 11be99e July 26, 2024 06:57

Fix compat

315d73d

ekaf reopened this Jul 26, 2024

ekaf requested a review from stevenbird July 26, 2024 08:09

ekaf marked this pull request as ready for review July 26, 2024 08:10

ekaf mentioned this pull request Jul 26, 2024

BLEU Score Exceeds 1 for Certain Test Cases #3291

Closed

ekaf added 3 commits July 27, 2024 13:35

Load PickleParameters from tab files

f70fed4

Use new tabdata module

496515e

Replace Punkt pickle calls by PunktTokenizer

9bdc14f

ekaf added 2 commits July 27, 2024 13:43

Reformat metrics/agreement

233ecdb

Merge remote-tracking branch 'origin/punkt_tab' into punkt_tab

bbcfa56

github-actions bot added the metrics label Jul 27, 2024

ekaf added 2 commits July 29, 2024 20:42

Don't initialize sent_tokenizer in plaintext reader

5ee6c93

Don't use remove_suffix()

520475b

JoshuaPeddle approved these changes Jul 30, 2024

Fix ne_chunk and add doctest

45731bc

stevenbird merged commit b73aa9b into nltk:develop Aug 5, 2024

peterboost mentioned this pull request Aug 12, 2024

Incompatibility with NLTK 3.8.2 (punkt_tab) estnltk/estnltk#122

Closed

soras mentioned this pull request Aug 12, 2024

[BUG] NLTK's PunktTokenizer fails to initialize due to UnicodeDecodeError [Windows specific] #3294

Closed

ekaf mentioned this pull request Aug 13, 2024

[BUG] punkt_tab breaking change #3293

Closed

ryanamannion mentioned this pull request Aug 20, 2024

load_punkt_params() loads PunktParameters.collocations as list instead of set #3310

Closed

stumpylog mentioned this pull request Aug 23, 2024

Lookup error issue in nltk even with new version 3.9.1, similar to PR #3308 #3312

Closed

juhoinkinen mentioned this pull request Sep 20, 2024

Automate NLTK datapackage punkt_tab download NatLibFi/Annif#803

Merged
