Data Handler dataset creation fails if features are not computed for some baskets #457
Description
Describe the bug
The method farm.data_handler.processor.Processor._create_dataset fails if features are not computed for some baskets.
The error message below comes from training SQuAD question answering. The context of the issue is the following:
- the input SQuAD data contains some "errors" (misalignments between the answer text and the text selected in the context);
- this misalignment causes the feature computation to be skipped for the affected samples in the _featurize_samples method (see the try-except), FARM/farm/data_handler/processor.py, line 295 in b8b59c4:
  def _featurize_samples(self):
- finally, in _create_dataset, when a basket has no features, the features_flat.extend call, FARM/farm/data_handler/processor.py, line 308 in b8b59c4:
  features_flat.extend(sample.features)
  gives the error message reported below.
Error message
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
result = (True, func(*args, **kwds))
File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 124, in _dataset_from_chunk
dataset = processor.dataset_from_dicts(dicts=dicts, indices=indices)
File "/home/fabio/src/git_repositories/FARM/farm/data_handler/processor.py", line 1144, in dataset_from_dicts
dataset, tensor_names = self._create_dataset(keep_baskets=False)
File "/home/fabio/src/git_repositories/FARM/farm/data_handler/processor.py", line 308, in _create_dataset
features_flat.extend(sample.features)
TypeError: 'NoneType' object is not iterable
"""
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/farm_qa", line 10, in <module>
farm_qa.main()
File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/farm_qa.py", line 211, in main
options.func(options)
File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/farm_qa.py", line 107, in train_general_purpose
train_mock(options, base_lm_model, out_model_basename, train_filename, dev_filename, test_filename)
File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/farm_qa.py", line 146, in train_mock
question_answering(options_mock)
File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/question_answering.py", line 79, in question_answering
data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False)
File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 105, in __init__
self._load_data()
File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 207, in _load_data
self.data["train"], self.tensor_names = self._get_dataset(train_file)
File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 176, in _get_dataset
for dataset, tensor_names in results:
File "/usr/lib/python3.8/multiprocessing/pool.py", line 865, in next
raise value
TypeError: 'NoneType' object is not iterable
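The final TypeError in the traceback can be reproduced in isolation. The sketch below is a minimal illustration, not FARM code: the Sample class and flatten_features function are hypothetical stand-ins that only mimic the features_flat.extend(sample.features) call, assuming that a sample whose featurization failed is left with features set to None.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    # Hypothetical stand-in for a FARM sample; features stays None
    # when featurization failed (assumption based on the traceback).
    features: Optional[List[dict]] = None


def flatten_features(samples):
    # Mimics the flattening step: no guard against features being None.
    features_flat = []
    for sample in samples:
        features_flat.extend(sample.features)
    return features_flat


samples = [
    Sample(features=[{"input_ids": [1, 2, 3]}]),  # featurized correctly
    Sample(features=None),                        # featurization failed
]

try:
    flatten_features(samples)
    error_message = None
except TypeError as err:
    error_message = str(err)

print(error_message)
```

Calling list.extend with None is enough to trigger the error, so a single unfeaturized sample breaks the whole chunk.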
Expected behavior
Log the errors (perhaps with more information) and automatically exclude the baskets with errors from the dataset.
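One possible shape of this behavior is sketched below. It is only an illustration under assumptions: Basket, Sample, and collect_features are hypothetical names rather than FARM's actual classes or API, and a failed featurization is assumed to leave features as None.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class Sample:
    # Hypothetical stand-in; features is None when featurization failed.
    id: str
    features: Optional[List[dict]] = None


@dataclass
class Basket:
    # Hypothetical stand-in for a basket grouping several samples.
    id: str
    samples: List[Sample] = field(default_factory=list)


def collect_features(baskets):
    """Flatten features, logging and skipping baskets with missing features."""
    features_flat = []
    skipped_ids = []
    for basket in baskets:
        if any(sample.features is None for sample in basket.samples):
            logger.warning(
                "Skipping basket %s: features could not be computed", basket.id
            )
            skipped_ids.append(basket.id)
            continue
        for sample in basket.samples:
            features_flat.extend(sample.features)
    return features_flat, skipped_ids


baskets = [
    Basket(id="0", samples=[Sample(id="0-0", features=[{"input_ids": [1, 2]}])]),
    Basket(id="1", samples=[Sample(id="1-0", features=None)]),
    Basket(id="2", samples=[Sample(id="2-0", features=[{"input_ids": [3]}])]),
]
features, skipped = collect_features(baskets)
print(len(features), skipped)
```

Dropping the whole basket, rather than only the failed sample, keeps the remaining dataset consistent at the basket level; whether that is the right granularity for FARM is a design choice left open here.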
To Reproduce
Manually corrupt the SQuAD data (e.g. add a character to an answer text in train-v2.0.json) and then run examples/question_answering.py
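The corruption step can also be scripted. The following sketch assumes the standard SQuAD v2.0 JSON layout (data, paragraphs, qas, answers with text and answer_start); corrupt_first_answer is a hypothetical helper, and the tiny in-memory example stands in for the real file.

```python
import copy


def corrupt_first_answer(squad, extra_char="#"):
    """Return a copy of a SQuAD-style dict with one answer text misaligned."""
    corrupted = copy.deepcopy(squad)
    for article in corrupted["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa.get("answers"):
                    # Append a character so the answer text no longer
                    # matches the span at answer_start in the context.
                    qa["answers"][0]["text"] += extra_char
                    return corrupted
    raise ValueError("no answerable question found")


# Tiny in-memory example in SQuAD v2.0 layout (illustrative data only).
example = {
    "version": "v2.0",
    "data": [{
        "title": "example",
        "paragraphs": [{
            "context": "FARM is a framework for transfer learning.",
            "qas": [{
                "id": "q1",
                "question": "What is FARM?",
                "is_impossible": False,
                "answers": [{"text": "a framework", "answer_start": 8}],
            }],
        }],
    }],
}

corrupted = corrupt_first_answer(example)
paragraph = corrupted["data"][0]["paragraphs"][0]
answer = paragraph["qas"][0]["answers"][0]
start = answer["answer_start"]
span = paragraph["context"][start:start + len(answer["text"])]
is_aligned = span == answer["text"]
print(is_aligned)
```

For the real dataset, the dict would be read from train-v2.0.json with json.load and the corrupted copy written back with json.dump before running the example script.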
System:
- OS: Linux
- GPU/CPU: CPU
- FARM version: 0.4.6