shuffle data from hdf5 datasets #1347
|
Other than that it looks good to me! Cool! |
|
I changed some parts. |
|
Could you also run a speed benchmark to see how shuffling affects typical read speed? Random reads used to cause a lot of trouble with leveldb. Large-scale datasets usually don't need shuffling that much, so if speed is a concern, it might be better to keep sequential reads. (Since shuffling is turned off by default, I think having the capability is good.) |
|
@jeffdonahue can you review and merge if this looks good to you? |
Please clarify this comment to explain that the HDF5 files themselves are shuffled but the order within any given file is fixed.
Everything will be shuffled: hdf5 files and entries in these hdf5 files.
Thanks, I see that now, my bad. I think it should still be clarified though -- it's not actually a full shuffle of the dataset (i.e., some orderings of the dataset are impossible to obtain) unless you only have a single HDF5 file (or each HDF5 file only has a single entry).
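To make the point about unreachable orderings concrete, here is a minimal Python sketch of a two-level shuffle (file order first, then entries within each file). It is an illustration only, not the code from this PR: the function names `two_level_shuffle` and `reachable_orderings` and the toy entry labels are hypothetical, and each "file" is modeled as a plain list.

```python
import itertools
import random


def two_level_shuffle(files, rng):
    """Shuffle the file order, then shuffle the entries inside each file."""
    order = list(range(len(files)))
    rng.shuffle(order)
    out = []
    for i in order:
        entries = list(files[i])  # copy so the input is left untouched
        rng.shuffle(entries)
        out.extend(entries)
    return out


def reachable_orderings(files):
    """Enumerate every ordering a two-level shuffle could produce."""
    result = set()
    for file_order in itertools.permutations(files):
        per_file = [itertools.permutations(f) for f in file_order]
        for combo in itertools.product(*per_file):
            result.add(tuple(itertools.chain.from_iterable(combo)))
    return result


# Two toy "files" with two entries each (hypothetical data).
files = [["a1", "a2"], ["b1", "b2"]]
reachable = reachable_orderings(files)

# 2! file orders * 2! * 2! within-file orders = 8, versus 4! = 24 full permutations.
print(len(reachable))
# Entries from different files can never interleave.
print(("a1", "b1", "a2", "b2") in reachable)
```

With a single file, or with one entry per file, the reachable set would equal the full set of permutations, which matches the exception noted in the comment above.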
|
Replaced by #2118. |
The order in which the HDF5 files are read and the order of the entries within each HDF5 file can be shuffled by setting the shuffle flag in the hdf5data layer.