Improve error handling in make_dataset#3496
Conversation
| def test_imagefolder_empty(self): | ||
| with get_tmp_dir() as root: | ||
| with self.assertRaises(RuntimeError): | ||
| with self.assertRaises(FileNotFoundError): |
There was a problem hiding this comment.
This BC breaking if someone relies on DatasetFolder or ImageFolder to raise a RuntimeError if no samples are found. Since the error type was never documented anywhere, I think this is fine.
There was a problem hiding this comment.
I'd say this is fine but might be worth checking if it breaks something in FBCODE.
| if class_to_idx is None: | ||
| _, class_to_idx = find_classes(directory) | ||
| elif not class_to_idx: | ||
| raise ValueError("'class_to_index' must have at least one entry to collect any samples.") |
There was a problem hiding this comment.
This is BC breaking, but in a "right" way. Before make_dataset silently returned an empty list in case class_to_idx was empty. IMO it is a reasonable assumption that no user calls make_dataset without having at least a single class. If you disagree with this assumption, we can get BC back by return [] here.
There was a problem hiding this comment.
I think this makes sense, especially for the folder dataset.
Is there any other dataset that inherits this other than videodatasets?
There was a problem hiding this comment.
Is there any other dataset that inherits this other than videodatasets?
Nothing built-in.
There was a problem hiding this comment.
@pmeier DatasetFolder.make_dataset is public and the docstring is:
class_to_idx (Optional[Dict[str, int]]): Dictionary mapping class name to class index. If omitted, is generated by :func:`find_classes`
The problem is that users can override DatasetFolder.find_classes too, so there's a conflict with the find_classes() function: DatasetFolder.find_classes will rely on the function rather than on the method.
Should we make class_to_idx a mandatory parameter in DatasetFolder.make_dataset to avoid any potential issue?
There was a problem hiding this comment.
To better explain myself: imagine a scenario where someone has a class MyCoolNewDataset(DatasetFolder) and they override MyCoolNewDataset.find_classes with a custom class_to_idx logic.
If they call MyCoolNewDataset.make_dataset while passing None to the class_to_idx parameter, what will be used is the class_to_idx logic from the find_classes function, instead of using the logic from the find_classes method - which is different since they overrode it.
Does that make sense?
To avoid such issues I'm suggesting to force the user to pass class_to_idx in DatasetFolder.make_dataset, or more accurately to raise an error if None is passed in DatasetFolder.make_dataset
Codecov Report
@@ Coverage Diff @@
## master #3496 +/- ##
==========================================
+ Coverage 78.83% 78.99% +0.16%
==========================================
Files 105 105
Lines 9816 9793 -23
Branches 1581 1573 -8
==========================================
- Hits 7738 7736 -2
+ Misses 1588 1573 -15
+ Partials 490 484 -6
Continue to review full report at Codecov.
|
bjuncek
left a comment
There was a problem hiding this comment.
Looks good to me. Should we add a test for make-dataset with no classes as well just to be on the safe side?
Also, I've added a comment, but are there any other datasets that rely on folder dataset?
| if class_to_idx is None: | ||
| _, class_to_idx = find_classes(directory) | ||
| elif not class_to_idx: | ||
| raise ValueError("'class_to_index' must have at least one entry to collect any samples.") |
There was a problem hiding this comment.
I think this makes sense, especially for the folder dataset.
Is there any other dataset that inherits this other than videodatasets?
| def test_imagefolder_empty(self): | ||
| with get_tmp_dir() as root: | ||
| with self.assertRaises(RuntimeError): | ||
| with self.assertRaises(FileNotFoundError): |
There was a problem hiding this comment.
I'd say this is fine but might be worth checking if it breaks something in FBCODE.
bjuncek
left a comment
There was a problem hiding this comment.
Most of my questions answered. LGTM
| @staticmethod | ||
| def _find_classes(dir: str) -> Tuple[List[str], Dict[str, int]]: |
There was a problem hiding this comment.
nit: I wouldn't have made this a staticmethod, as other instantiations of this dataset could rely on self for generating a custom set of classes and class ids
But this is not really important, so let's move forward with this PR and get this merged
Summary: * factor out find_classes * use find_classes in video datasets * adapt old tests Reviewed By: fmassa Differential Revision: D27433918 fbshipit-source-id: 60d8da2f222a19e0757197f5d38b6a9cce7694a8
Closes #3495, fixes #2903.