This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Data Handler dataset creation fails if features are not computed for some baskets  #457

@ftesser

Describe the bug
The method farm.data_handler.processor.Processor._create_dataset fails if features were not computed for some baskets.
The error message below comes from training SQuAD question answering. The context of the issue is the following:

  • the input SQuAD data contains some "errors" (misalignments between the answer text and the text selected in the context);
  • this misalignment causes feature computation to be skipped for the affected samples in the _featurize_samples method (see the try-except):
    def _featurize_samples(self):
  • finally, in _create_dataset, when a basket has no features, the call
    features_flat.extend(sample.features)

    raises the error reported below.
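
The failure mode described above can be reproduced in isolation. The following is a minimal sketch, not FARM code: the `Sample` class and the `flatten_features` function are hypothetical stand-ins that only mimic a flattening loop calling `extend` on a sample whose `features` attribute was left as `None`.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Sample:
    # Hypothetical stand-in for a sample; features stays None
    # when featurization was skipped because of an error.
    features: Optional[List[dict]] = None


def flatten_features(samples):
    """Flatten per-sample feature lists into one list, without any guard."""
    features_flat = []
    for sample in samples:
        # list.extend(None) raises TypeError
        features_flat.extend(sample.features)
    return features_flat


samples = [
    Sample(features=[{"input_ids": [1, 2, 3]}]),  # featurized correctly
    Sample(features=None),                        # featurization failed
]

error_message = None
try:
    flatten_features(samples)
except TypeError as err:
    error_message = str(err)
    print(error_message)
```

A single sample without features is enough to abort the whole flattening step, which matches the traceback below.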

Error message

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 124, in _dataset_from_chunk
    dataset = processor.dataset_from_dicts(dicts=dicts, indices=indices)
  File "/home/fabio/src/git_repositories/FARM/farm/data_handler/processor.py", line 1144, in dataset_from_dicts
    dataset, tensor_names = self._create_dataset(keep_baskets=False)
  File "/home/fabio/src/git_repositories/FARM/farm/data_handler/processor.py", line 308, in _create_dataset
    features_flat.extend(sample.features)
TypeError: 'NoneType' object is not iterable
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/farm_qa", line 10, in <module>
    farm_qa.main()
  File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/farm_qa.py", line 211, in main
    options.func(options)
  File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/farm_qa.py", line 107, in train_general_purpose
    train_mock(options, base_lm_model, out_model_basename, train_filename, dev_filename, test_filename)
  File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/farm_qa.py", line 146, in train_mock
    question_answering(options_mock)
  File "/home/fabio/src/git_repositories/DocumentFeaturesIdentificator/bin/../documentfeaturesidentificator/farm_utils/question_answering.py", line 79, in question_answering
    data_silo = DataSilo(processor=processor, batch_size=batch_size, distributed=False)
  File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 105, in __init__
    self._load_data()
  File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 207, in _load_data
    self.data["train"], self.tensor_names = self._get_dataset(train_file)
  File "/home/fabio/src/git_repositories/FARM/farm/data_handler/data_silo.py", line 176, in _get_dataset
    for dataset, tensor_names in results:
  File "/usr/lib/python3.8/multiprocessing/pool.py", line 865, in next
    raise value
TypeError: 'NoneType' object is not iterable

Expected behavior
Log the errors (perhaps with more information) and automatically exclude the baskets with errors from the dataset.
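
One possible shape for this behavior is sketched below. It is an illustration under assumptions, not a patch for FARM: `Sample`, `Basket`, and `flatten_valid_features` are hypothetical names, and dropping the whole basket (rather than only the failing sample) is one design choice among several.

```python
import logging
from dataclasses import dataclass, field
from typing import List, Optional

logger = logging.getLogger(__name__)


@dataclass
class Sample:
    # Hypothetical stand-in; features is None when featurization failed.
    features: Optional[List[dict]] = None


@dataclass
class Basket:
    # Hypothetical stand-in for a basket grouping the samples of one input.
    id: str
    samples: List[Sample] = field(default_factory=list)


def flatten_valid_features(baskets):
    """Return (features_flat, skipped_ids), skipping baskets with missing features."""
    features_flat = []
    skipped_ids = []
    for basket in baskets:
        if any(sample.features is None for sample in basket.samples):
            # Log which basket is excluded instead of crashing later.
            logger.warning("Skipping basket %s: features were not computed", basket.id)
            skipped_ids.append(basket.id)
            continue
        for sample in basket.samples:
            features_flat.extend(sample.features)
    return features_flat, skipped_ids


baskets = [
    Basket(id="0-0", samples=[Sample(features=[{"input_ids": [1, 2]}])]),
    Basket(id="0-1", samples=[Sample(features=None)]),
]
features_flat, skipped_ids = flatten_valid_features(baskets)
print(len(features_flat), skipped_ids)
```

Returning the skipped identifiers alongside the features lets the caller report how much data was excluded.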

To Reproduce
Manually corrupt the SQuAD data (e.g., add a character to an answer in train-v2.0.json), then run examples/question_answering.py.
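
The manual corruption step could also be scripted. The sketch below assumes the standard SQuAD v2.0 JSON layout (`data` → `paragraphs` → `qas` → `answers` with `text` and `answer_start`) and works on a tiny in-memory example. The helper names `corrupt_first_answer` and `is_aligned` are hypothetical, and applying the same change to train-v2.0.json would additionally require loading and saving the file with `json`.

```python
import copy


def corrupt_first_answer(squad, extra="#"):
    """Return a corrupted copy of the data and the id of the affected question."""
    corrupted = copy.deepcopy(squad)
    for article in corrupted["data"]:
        for paragraph in article["paragraphs"]:
            for qa in paragraph["qas"]:
                if qa["answers"]:
                    # Append a character so the answer no longer matches the context.
                    qa["answers"][0]["text"] += extra
                    return corrupted, qa["id"]
    raise ValueError("no answerable question found")


def is_aligned(context, answer):
    """Check that the answer text equals the context span at answer_start."""
    start = answer["answer_start"]
    return context[start:start + len(answer["text"])] == answer["text"]


# Tiny made-up example in SQuAD v2.0 layout (not real dataset content).
tiny = {
    "version": "v2.0",
    "data": [{
        "title": "Example",
        "paragraphs": [{
            "context": "FARM is a framework for transfer learning.",
            "qas": [{
                "id": "q1",
                "question": "What is FARM?",
                "is_impossible": False,
                "answers": [{"text": "a framework", "answer_start": 8}],
            }],
        }],
    }],
}

corrupted, qa_id = corrupt_first_answer(tiny)
paragraph = corrupted["data"][0]["paragraphs"][0]
answer = paragraph["qas"][0]["answers"][0]
print(qa_id, is_aligned(paragraph["context"], answer))
```

The same `is_aligned` check can be run over a dataset beforehand to list the questions whose answers are misaligned.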

System:

  • OS: Linux
  • GPU/CPU: CPU
  • FARM version: 0.4.6

Labels: bug (Something isn't working)
