Improve error handling for empty directories in make_dataset #3495
Description
🚀 Feature
Improve error handling for empty directories in make_dataset().
Motivation
datasets.folder.make_dataset() requires the class_to_idx parameter, which it then uses to collect the instances:
vision/torchvision/datasets/folder.py
Lines 69 to 74 in 945f3a8
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
Currently, we have four places where make_dataset() is used, and in all cases class_to_idx is generated the same way:
- vision/torchvision/datasets/folder.py, line 126 in 945f3a8:

      classes, class_to_idx = self._find_classes(self.root)

  where _find_classes(dir) is defined in vision/torchvision/datasets/folder.py, lines 164 to 167 in 945f3a8, as:

      classes = [d.name for d in os.scandir(dir) if d.is_dir()]
      classes.sort()
      class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
      return classes, class_to_idx

- vision/torchvision/datasets/hmdb51.py, lines 65 to 71 in 945f3a8:

      classes = sorted(list_dir(root))
      class_to_idx = {class_: i for (i, class_) in enumerate(classes)}
      self.samples = make_dataset(
          self.root,
          class_to_idx,
          extensions,
      )

- vision/torchvision/datasets/kinetics.py, lines 58 to 60 in 945f3a8:

      classes = list(sorted(list_dir(root)))
      class_to_idx = {classes[i]: i for i in range(len(classes))}
      self.samples = make_dataset(self.root, class_to_idx, extensions, is_valid_file=None)

- vision/torchvision/datasets/ucf101.py, lines 58 to 60 in 945f3a8:

      classes = list(sorted(list_dir(root)))
      class_to_idx = {classes[i]: i for i in range(len(classes))}
      self.samples = make_dataset(self.root, class_to_idx, extensions, is_valid_file=None)

Furthermore, only DatasetFolder has a built-in check for whether make_dataset found any samples:
vision/torchvision/datasets/folder.py
Lines 127 to 132 in 945f3a8
    samples = self.make_dataset(self.root, class_to_idx, extensions, is_valid_file)
    if len(samples) == 0:
        msg = "Found 0 files in subfolders of: {}\n".format(self.root)
        if extensions is not None:
            msg += "Supported extensions are: {}".format(",".join(extensions))
        raise RuntimeError(msg)
While this is better than passing silently and failing somewhere else (#2903), it still misses the underlying issue in the case of a directory without subfolders.
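To make the failure mode concrete, here is a minimal sketch that uses only the standard library. `find_classes_current` copies the quoted `_find_classes` logic, and `collect_samples` is a simplified, hypothetical stand-in for `make_dataset()` that skips the extension filtering. Both names are illustrative and not part of torchvision.

```python
import os
import tempfile


def find_classes_current(directory):
    # Same logic as the quoted lines 164 to 167 of folder.py.
    classes = [d.name for d in os.scandir(directory) if d.is_dir()]
    classes.sort()
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx


def collect_samples(directory, class_to_idx):
    # Simplified stand-in for make_dataset(): no extension filtering.
    instances = []
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            for fname in sorted(fnames):
                instances.append((os.path.join(root, fname), class_index))
    return instances


with tempfile.TemporaryDirectory() as root:
    # One file directly in the root, but no class subfolders.
    open(os.path.join(root, "image.png"), "w").close()
    classes, class_to_idx = find_classes_current(root)
    samples = collect_samples(root, class_to_idx)
```

With no subfolders, `classes` and `class_to_idx` are empty, so the loop never runs and `samples` is empty as well. The only symptom a caller sees is a missing-files error (or no error at all in the video datasets), even though the real problem is that no class folders exist.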
Pitch
I propose three things:
- Factor out the implementation of the DatasetFolder._find_classes() method into a find_classes() function, similar to what we did with make_dataset in "'make_dataset' as staticmethod of 'DatasetFolder'" #3215.
- Raise an expressive error in find_classes() if no classes were found.
- Make the class_to_idx parameter optional in make_dataset and call find_classes if it is omitted.
With this we are as flexible as before while we remove duplicated code.
- If one does not want the default behavior, class_to_idx can still be passed explicitly.
- If one needs the returned classes, e.g. the video datasets, a call could look like this:

      self.classes, class_to_idx = find_classes(root)
      self.samples = make_dataset(root, class_to_idx, ...)

- If one only needs the samples, calling self.samples = make_dataset(root, ...) is enough.
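The three points of the pitch could look roughly like the sketch below. It is an illustration under assumptions, not the final implementation: the exception type (`FileNotFoundError`), the error message, and the simplified extension handling are placeholders chosen here, and the real `make_dataset` also accepts `is_valid_file`.

```python
import os
from typing import Dict, List, Optional, Tuple


def find_classes(directory: str) -> Tuple[List[str], Dict[str, int]]:
    # Standalone version of DatasetFolder._find_classes().
    classes = sorted(entry.name for entry in os.scandir(directory) if entry.is_dir())
    if not classes:
        # Assumed exception type and message; the issue only asks for
        # "an expressive error".
        raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
    class_to_idx = {cls_name: i for i, cls_name in enumerate(classes)}
    return classes, class_to_idx


def make_dataset(
    directory: str,
    class_to_idx: Optional[Dict[str, int]] = None,
    extensions: Optional[Tuple[str, ...]] = None,
) -> List[Tuple[str, int]]:
    # class_to_idx is optional: fall back to find_classes() if omitted.
    if class_to_idx is None:
        _, class_to_idx = find_classes(directory)

    instances = []
    for target_class in sorted(class_to_idx.keys()):
        class_index = class_to_idx[target_class]
        target_dir = os.path.join(directory, target_class)
        if not os.path.isdir(target_dir):
            continue
        for root, _, fnames in sorted(os.walk(target_dir, followlinks=True)):
            for fname in sorted(fnames):
                # Simplified filter: keep everything if no extensions given.
                if extensions is None or fname.lower().endswith(extensions):
                    instances.append((os.path.join(root, fname), class_index))
    return instances
```

With this shape, a root directory without subfolders fails early in `find_classes()` with a message that names the actual problem, while callers that pass `class_to_idx` explicitly keep the existing behavior.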
cc @pmeier