This repository was archived by the owner on Apr 8, 2025. It is now read-only.

Catch empty datasets in Inferencer #605

Merged
Timoeller merged 1 commit into master from remove_empty_chunks
Oct 28, 2020

@Timoeller
Contributor

fixes #454
The error only occurs with multiprocessing.

When a multiprocessing chunk contains only invalid examples, the Inferencer runs into errors. This PR catches these errors and reports them with an error message. It does not solve the root cause.
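A minimal sketch of the kind of guard described here, not the actual code from this PR. All names (`looks_valid`, `featurize_chunk`, `run_inference`) are hypothetical, the chunks are processed in a plain loop rather than in worker processes, and `predict` stands in for the model call:

```python
import logging

logger = logging.getLogger(__name__)


def looks_valid(example):
    """Hypothetical validity check: the example has a non-empty text field."""
    text = example.get("text")
    return isinstance(text, str) and text.strip() != ""


def featurize_chunk(chunk):
    """Stand-in for per-worker preprocessing: keep only the valid examples."""
    return [ex["text"].strip() for ex in chunk if looks_valid(ex)]


def run_inference(chunks, predict=len):
    """Process chunks one by one and skip chunks without any valid example."""
    results = []
    for index, chunk in enumerate(chunks):
        features = featurize_chunk(chunk)
        if not features:
            # Report the empty chunk instead of failing further down the pipeline.
            logger.error("Chunk %d contains no valid examples, skipping it.", index)
            continue
        results.append(predict(features))
    return results
```

As in the PR, this only reports the empty chunk and moves on. The invalid examples themselves are not repaired.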

Invalid examples can be:

  1. empty/malformed documents for classification,
  2. empty/malformed documents for QA,
  3. QA annotations where the answer text and offsets do not match.
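For illustration, the checks below show how the invalid cases listed above could be detected before inference. This is a sketch under assumptions: the function names and the dict layout with a `text` key are hypothetical, and a SQuAD-style character offset is assumed for the answer start.

```python
def is_valid_document(document):
    """Cases 1 and 2: reject empty or malformed documents."""
    if not isinstance(document, dict):
        return False
    text = document.get("text")
    return isinstance(text, str) and bool(text.strip())


def answer_matches_offset(context, answer_text, answer_start):
    """Case 3: the answer text must appear in the context at the given offset."""
    if not isinstance(answer_start, int) or answer_start < 0:
        return False
    if not context or not answer_text:
        return False
    end = answer_start + len(answer_text)
    return context[answer_start:end] == answer_text
```

For example, `answer_matches_offset("Berlin is in Germany", "Germany", 13)` holds, while an offset of 12 or an empty context does not.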

@Timoeller Timoeller requested a review from tholor October 27, 2020 14:28
Member

@tholor tholor left a comment

This would fix the Inferencer, but don't we have similar issues in other use cases (e.g. training, see deepset-ai/haystack#488)? Maybe a similar fix inside the DataSilo could tackle this?

@Timoeller
Contributor Author

I thought so, too, and tried corrupted files in normal training (using the DataSilo). I could not get QA processing to fail with these corrupted files.

I can try corrupting the files a bit more to see if we can reproduce DataSilo errors. Will report back here.

@Timoeller
Contributor Author

OK, I found some strange behaviour when the context is either empty or does not contain the answer. I opened a separate issue and assigned @bogdankostic to it.

I would prefer to merge this solution for the Inferencer now and work on the DataSilo in a separate PR.

Member

@tholor tholor left a comment

OK, sounds good. Let's tackle that separately.

@Timoeller Timoeller merged commit c21466c into master Oct 28, 2020

Development

Successfully merging this pull request may close these issues.

QA inference breaks on large dataset

2 participants