Simplify processors - add Fasttokenizers #649
Commit 270996c introduces Python multiprocessing after the multithreading done by the Rust tokenizers. Although the multithreading should be finished by then, forking processes in Python results in: When setting the env var to false, Python mp won't start. My next experiment will be adding mp at a higher level (calling dataset_from_dicts) again.
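For illustration, here is a minimal sketch of what "mp at a higher level" could look like: the input dicts are split into chunks and each chunk is handed to a worker process. It assumes the env var in question is Hugging Face tokenizers' `TOKENIZERS_PARALLELISM`; `chunk_dicts`, `process_chunk` and `process_in_parallel` are hypothetical names, and `process_chunk` is only a stand-in for the real `dataset_from_dicts` call, not FARM's implementation. Whether this setup avoids the problem described above is not verified here.

```python
import os

# Assumption: the env var discussed above is TOKENIZERS_PARALLELISM.
# It has to be set before the Rust tokenizer is used for the first time.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

import multiprocessing as mp


def chunk_dicts(dicts, chunk_size):
    """Split the input list into consecutive chunks of at most chunk_size items."""
    return [dicts[i:i + chunk_size] for i in range(0, len(dicts), chunk_size)]


def process_chunk(chunk):
    """Hypothetical stand-in for a processor's dataset_from_dicts call.

    A real implementation would tokenize and featurize the chunk; here we
    only count whitespace-separated tokens so the sketch stays runnable.
    """
    return [{"text": d["text"], "n_tokens": len(d["text"].split())} for d in chunk]


def process_in_parallel(dicts, chunk_size=2, num_processes=2):
    """Process chunks in worker processes and flatten the results in order."""
    chunks = chunk_dicts(dicts, chunk_size)
    if num_processes <= 1:
        # Sequential fallback: no processes are forked at all.
        results = [process_chunk(c) for c in chunks]
    else:
        # pool.map preserves the order of the input chunks.
        with mp.Pool(processes=num_processes) as pool:
            results = pool.map(process_chunk, chunks)
    return [features for chunk_result in results for features in chunk_result]
```

Calling `process_in_parallel(dicts, num_processes=1)` runs everything in the main process, which makes it easy to compare against the multiprocessing path when debugging.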
I did some performance benchmarking and found the culprit: I will now work on separating the create_samples_qa and sample_to_features_qa functions into more meaningful functions:
    "context": f"{context}",
    "label": f"{tag}",
    "probability": prob,
    "probability": np.float32(0.0),
@brandenchan is the prob completely gone, or was this just a quick fix that you forgot to revert?
The old probability calculation was wrong. I have opened issue #658 to address this.
…to refactor_processor_qa
…adjust qa benchmark to new values
* WIP lm finetuning refactoring
* WIP refactoring bert style lm
* first working version of bert_style_lm
* optimize speed of mask_random_words
* move get_start_of_words to tokenization module
* Update docstrings. fix estimation
* add multithreading_rust arg
* fix import. fix vocab index out of range
* fix empty sequence b
* make bert-style to new default for lm finetuning. disable eval_report
* change evaluate_every to 1000
…to refactor_processor_qa
I tested this branch with haystack in the following ways:
Working on fixing the remaining tests.
I cannot reproduce the failing s3 test, and nothing has changed code-wise. I presume it is a CI problem that we can fix later. Merging now.
* increase transformers version
* Make fast tokenizers possible
* refactor QA processing
* Move all fcts into dataset from dicts for QA
* refactor doc classification
* refactor bert_style_lm
* refactor inference_processor

Co-authored-by: Bogdan Kostić
Co-authored-by: brandenchan
Co-authored-by: Malte Pietsch
Former-commit-id: 18e7fc7
Former-commit-id: 4fdadbe87ea1a0dbfdb02959a23e56a653d1aed2
Simplifying the processor by:
Some older commits were made by Bogdan and me to make FARM work with transformers 3.5.1 and fast tokenizers.
For descriptions of the progress, see the comments below.