Add GTSRB dataset to prototypes #5214
Conversation
pmeier left a comment:
Thanks a lot @NicolasHug! I've only reviewed the prototype part, let me know if I also need to look at the other changes.
```diff
 with opener(archive) as fh:
-    for file in files:
+    for file in sorted(files):
```
@pmeier LMK what you think of this.
Down below I set buffer_size=1 in the Zipper, because in the original .zip files, both the image_dp and the gt_dp are fully aligned: they both contain images 00001, 00002, etc. in this order. So I'm assuming that buffer_size=1 is better than buffer_size=UNLIMITED?
Without this call to sorted(), the tests would fail: the .zip archive created by make_zip would contain the files in a shuffled order (because files is a set), and so image_dp and the gt_dp would not be aligned anymore, leading to a failure to match keys in the Zipper. (Note: this is only a problem in the tests; the code works fine otherwise on my custom script iterating over the dataset).
I hope this won't make other tests fail. This might not be a problem that we have right now, but it's something to keep in mind for the future: we might need the test archives to exactly match the order of the "original" archives.
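To illustrate the buffer_size=1 reasoning, here is a hypothetical pure-Python sketch of a key-matching zipper (not the torchdata API; the function, file names, and key function are made up for illustration). When the two streams are perfectly aligned, every lookup is satisfied by the very next element, so a buffer of size 1 is enough; a shuffled stream exhausts the buffer immediately.

```python
def key_zip(primary, secondary, key_fn, buffer_size):
    """Pair each primary item with the secondary item sharing its key,
    holding at most `buffer_size` unmatched secondary items."""
    buffer = {}
    secondary = iter(secondary)
    for item in primary:
        key = key_fn(item)
        while key not in buffer:
            if len(buffer) >= buffer_size:
                raise RuntimeError(f"buffer exhausted while looking for key {key!r}")
            other = next(secondary)
            buffer[key_fn(other)] = other
        yield item, buffer.pop(key)

# Hypothetical aligned file lists, as in the original .zip archives.
images = ["00001.ppm", "00002.ppm", "00003.ppm"]
gts = ["00001.csv", "00002.csv", "00003.csv"]
stem = lambda name: name.split(".")[0]

# Aligned streams: buffer_size=1 suffices, each lookup hits immediately.
pairs = list(key_zip(images, gts, stem, buffer_size=1))
```

With a shuffled `gts` list, the same call raises because the first fetched item already fills the one-slot buffer without matching.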
Good point, I didn't think of that. Do you know if the order of the returned paths of pathlib.Path.glob() is stable? If yes, we could simply replace the sets in _split_files_or_dirs with lists instead of sorting here.
I'm not sure about Path.glob(). I know that glob.glob has no guaranteed order, but I don't think Path.glob() relies on it. Maybe the safest is to not assume a specific order.
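For reference, Path.glob() walks directories in filesystem order and makes no ordering guarantee, so sorting the results is the safe way to get determinism regardless of how the paths were collected. A small self-contained sketch:

```python
import tempfile
from pathlib import Path

# Create files in a non-sorted order; directory listing order is
# filesystem-dependent, so we sort the glob results explicitly.
with tempfile.TemporaryDirectory() as root:
    for name in ("00002.ppm", "00001.ppm", "00003.ppm"):
        (Path(root) / name).touch()
    files = sorted(Path(root).glob("*.ppm"))
    names = [f.name for f in files]
```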
BTW, slightly related, what was the reason to use sets instead of lists?
> what was the reason to use sets instead of lists?
I think the reason was to avoid duplicates, but I don't remember if there was a case where I actually hit that.
Have you tried using lists rather than sorting afterwards? If CI is not complaining for other datasets, I feel like that would be the better approach.
I just saw this call to remove(), which might be the reason for using sets:
Lines 862 to 863 in afda28a
I can still switch to lists if you'd like; I guess I would have to write something like `dirs = [dir for dir in dirs if dir != root]`. LMK which one you prefer.
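The two alternatives being discussed can be compared side by side (hypothetical values; `root` and `dirs` stand in for the variables in `_split_files_or_dirs`). `set.remove()` mutates in place but loses ordering, while the list comprehension keeps insertion order, which is what matters for aligned archives:

```python
root = "archive"

# Set variant: duplicates are impossible, but iteration order is arbitrary.
dirs_set = {"archive", "b", "a"}
dirs_set.remove(root)

# List variant: order is preserved, root is dropped via a comprehension.
dirs_list = ["archive", "b", "a"]
dirs_list = [d for d in dirs_list if d != root]
```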
pmeier left a comment:
Only minor stuff inline. Thanks a lot @NicolasHug!
Besides that, one other question: given that the dataset also provides a bounding box for each image, shouldn't we also return it? For the test split we already load the data. For the training data, each image folder contains a GT-{label:05d}.csv file in the same format as the testing annotations. Basically, instead of filtering the images, you could use a Demultiplexer to get an images and an annotations datapipe.
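The demultiplexing idea can be sketched in plain Python (a hypothetical sketch with made-up paths; the real code would use torchdata's Demultiplexer datapipe with a classifier function): route each archive entry into an images stream or an annotations stream based on its suffix.

```python
def classify(path):
    """Classifier in the style of a Demultiplexer classifier_fn:
    returns 0 for image files, 1 for annotation .csv files."""
    return 1 if path.endswith(".csv") else 0

# Hypothetical contents of one training-image folder.
archive = ["00000/GT-00000.csv", "00000/00001.ppm", "00000/00002.ppm"]

images = [p for p in archive if classify(p) == 0]
annotations = [p for p in archive if classify(p) == 1]
```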
Co-authored-by: Philip Meier <[email protected]>
Summary: Co-authored-by: Philip Meier <[email protected]> Reviewed By: jdsgomes, prabhat00155 Differential Revision: D33739393 fbshipit-source-id: f65df4355c53a2fed2534b4bbd3ce7c1aa0606e2
This PR adds the prototype version of the GTSRB dataset.
cc @pmeier @bjuncek