Conversation
As it stands right now, I have removed the sample-level error messages in NER that would, for example, say that a given sample has a label not in the label list, or that it doesn't have a label field at all. Instead, failed featurization is handled at a higher level so that we get a single line mentioning how many samples were not properly featurized. However, since we chunk our data, we actually get one such line per chunk stating how many problematic samples it contains. Pros:
Cons:
If we really want a single line of error printout, we will need to make changes to some very high-level functions that will have an impact across all tasks (e.g. …). This is also a more general problem that will apply to QA, so it's worth strategizing about how we should tackle this. See the sketch below for what the single-line variant could look like. @Timoeller thoughts?
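For illustration, the single-line alternative could look something like this (a minimal sketch; `log_problematic_once` and its argument shape are hypothetical, not existing FARM API):

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical aggregation: collect the per-chunk problematic ids first,
# then emit a single summary line instead of one line per chunk.
def log_problematic_once(per_chunk_problematic_ids):
    all_ids = set()
    for chunk_ids in per_chunk_problematic_ids:
        all_ids.update(chunk_ids)
    if all_ids:
        logger.error(
            f"Unable to convert {len(all_ids)} samples to features. "
            f"Their ids are: {sorted(all_ids)}"
        )
```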
Timoeller left a comment:
I think with problematic_ids we need some further testing, especially to check whether other tasks break.
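A small regression check along these lines might help (a sketch only; `processor` and `sample_dicts` are assumed to come from existing test fixtures):

```python
def test_dataset_from_dicts_returns_three_objects(processor, sample_dicts):
    # Guard against arity changes breaking other tasks: with the new
    # behaviour, dataset_from_dicts() should always unpack to three objects.
    dataset, tensor_names, problematic_ids = processor.dataset_from_dicts(
        dicts=sample_dicts, return_problematic=True
    )
    assert tensor_names is not None
```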
farm/data_handler/processor.py (Outdated)
```python
if return_baskets:
    dataset, tensor_names = self._create_dataset(keep_baskets=True)
    return dataset, tensor_names, self.baskets
ret = [dataset, tensor_names]
```
I am not sure about the consequences, but lists are mutable and tuples (the previous return type) are not. This might introduce some pretty weird behaviour. Do we need it?
Please also check the other return statement above.
Ok, sure, I agree a tuple would be more stable. I wrote it like this to avoid having too many nested if statements, since both return_baskets and return_problematic can alter the number of objects that are returned. I will convert ret into a tuple in the next commit to avoid this mutability.
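For what it's worth, tuple concatenation would keep the return immutable without nesting the conditionals (a sketch; the variable names mirror the discussion, but the helper itself is hypothetical):

```python
def _build_return(dataset, tensor_names, baskets, problematic_ids,
                  return_baskets, return_problematic):
    # Start from the fixed part of the return value as an immutable tuple.
    ret = (dataset, tensor_names)
    if return_baskets:
        ret += (baskets,)  # tuple concatenation creates a new tuple
    if return_problematic:
        ret += (problematic_ids,)
    return ret
```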
| desc += f" {filename}" | ||
| with tqdm(total=len(dicts), unit=' Dicts', desc=desc) as pbar: | ||
| for dataset, tensor_names in results: | ||
| for dataset, tensor_names, problematic_samples in results: |
Doesn't this statement break in the case when the problematic samples are not returned?
In this context, results is always the output of _dataset_from_chunk(), which I modified to always return three objects, so I don't think this unpacking is an issue. As an aside, this function (i.e. _get_dataset()) is the only method in our code base which calls _dataset_from_chunk(), so it shouldn't be problematic that I've modified it to return three objects.
Ok, agreed, but this means we always need to return those ids.
We do have a flag in dataset_from_dicts, return_problematic=True, which seems to break the processing when set to False, right?
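One way to keep the flag from breaking downstream unpacking would be to make the return arity stable (purely a sketch; `_featurize` is a hypothetical placeholder for the existing featurization step):

```python
def dataset_from_dicts(self, dicts, return_problematic=True):
    dataset, tensor_names = self._featurize(dicts)  # hypothetical helper
    # Always return three objects so callers such as _dataset_from_chunk()
    # can unpack unconditionally; the flag only controls whether the real
    # ids are propagated or an empty set is returned instead.
    problematic_ids = self.problematic_sample_ids if return_problematic else set()
    return dataset, tensor_names, problematic_ids
```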
```python
logger.error(f"Error message: {e}")
curr_problematic_sample_ids.append(sample.id)
if curr_problematic_sample_ids:
    self.problematic_sample_ids.update(curr_problematic_sample_ids)
```
Why are we updating the problematic_sample_ids here and again when we collect results from multiprocessing?
The idea here is that we have two levels at which problematic_ids are stored. Within multiprocessing, each Processor needs to collect problematic_ids as it iterates through the baskets in its chunks. These are eventually returned out of multiprocessing via dataset_from_dicts(). At that point, the returned problematic_ids are collected together in the single non-multiprocessing Processor.
I went for this approach because it allows us to implement logging of problematic samples neatly as a method of the processor (Processor.log_problematic()), as long as the problematic sample ids are stored in the Processor.
I do admit that it might be a little confusing coordinating between the higher-level single Processor and the multiple lower-level Processors. Do you have any thoughts or suggestions on this?
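To make the two levels concrete, the pattern could be sketched like this (names such as `make_processor` and `featurize_chunk` are illustrative, not FARM's actual multiprocessing code):

```python
import multiprocessing as mp

def featurize_chunk(chunk):
    # Lower level: each worker builds its own Processor and collects the
    # problematic ids it encounters within its chunk.
    processor = make_processor()  # hypothetical per-worker factory
    dataset, tensor_names, problematic_ids = processor.dataset_from_dicts(
        dicts=chunk, return_problematic=True
    )
    return dataset, tensor_names, problematic_ids

def featurize_all(chunks, parent_processor):
    # Higher level: the single non-multiprocessing Processor aggregates the
    # ids returned by all workers and logs one summary at the end.
    with mp.Pool() as pool:
        for dataset, tensor_names, problematic_ids in pool.imap(featurize_chunk, chunks):
            parent_processor.problematic_sample_ids.update(problematic_ids)
    parent_processor.log_problematic()
```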
This PR addresses many of the issues in #590 regarding error handling and logging.