Conversation

@xzhu1900
Contributor

Add a sorting policy to ChunkDataset.

This is considered an advanced parameter for developers who want to apply a 'sorting policy' to the chunk data before it is sampled into minibatches.

Unlike the collate method, this policy is applied at the chunk level instead of the minibatch level. When a chunk of data is loaded (multiple chunks if cross_chunk_shuffle_count_ is greater than 1), the policy operates on the full loaded data. It is useful when developers want to perform pre-processing (such as bucketing) on the chunk data before the example sampler draws from it.

@pytorchbot pytorchbot added the module: cpp Related to C++ API label Jul 19, 2019
@cpuhrsch cpuhrsch added module: dataloader Related to torch.utils.data.DataLoader and Sampler low priority We're unlikely to get around to doing this in the near future triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jul 19, 2019
@ezyang ezyang removed the low priority We're unlikely to get around to doing this in the near future label Jul 19, 2019
Contributor

@soumith soumith left a comment


From what I read, it is more powerful than sorting: you are applying a lambda to the chunk that is loaded.
Can you rename it to transform or preprocess instead of sorting_policy?

@soumith
Contributor

soumith commented Jul 24, 2019

waiting for CI to finish

@soumith
Contributor

soumith commented Jul 24, 2019

@pytorchbot rebase this please

Contributor

@apaszke apaszke left a comment


Why can't you just have a wrapper around your ChunkReader that applies the processing? This is a concern orthogonal to the responsibility of ChunkDataset and this unnecessarily makes the API more complex.

@xzhu1900
Contributor Author

Why can't you just have a wrapper around your ChunkReader that applies the processing? This is a concern orthogonal to the responsibility of ChunkDataset and this unnecessarily makes the API more complex.

Thanks Adam for your feedback.
There are two reasons why having a ChunkReader wrapper is not the best solution:

  1. ChunkReader is by design just a reader: it should only know how to read a chunk of data given a chunk id. It would be odd to add preprocessing logic like sorting to this class (or to a wrapper around it).
  2. More importantly, the wrapper solution cannot achieve cross-chunk preprocessing. For example, if the user wants to do bucketing across multiple chunks, a wrapper cannot achieve that. This is the real deal breaker here.

@apaszke
Contributor

apaszke commented Jul 26, 2019

  1. I'm sorry, but to me it's worse to keep pushing things into a class that shouldn't be doing them than to have a nice, composable wrapper around readers. Readers are really the same thing as datasets in regular PyTorch, and no one objects that you can postcompose functions with them (or map). Hence, I don't think this is a good idea.
  2. Can you please explain what "bucketing" means in this case? Why can't you do the transforms at the batch level instead of at the chunk level? Multiple chunks are only loaded when you request cross_chunk_shuffle_count_, and using a "shuffling" argument for the purpose of implementing cross-chunk transforms is a misuse. I was also wary of adding cross_chunk_shuffle_count_ before because I anticipated it would go this way, but I was hoping it was a one-time thing. Now you're pushing this flag even further, and I feel this is a bit too far.

@xzhu1900
Contributor Author

xzhu1900 commented Jul 26, 2019

Thanks Adam.

  1. The preprocessing is just like a collate function: instead of applying to each batch, it applies to each chunk. If the dataset supports a batch-level collate function, it is natural and right for it to support chunk-level preprocessing as well. So the dataset is the right object to own this functionality.

  2. "Bucketing" is one example of preprocessing; there are other common operations, like sorting. For example, a user may want to sort a chunk of data by sequence length, and then, based on each batch's longest sequence, zero-pad the remaining sequences in the batch only to that length. This is a very common way to avoid inefficient zero-padding and out-of-memory errors on the GPU. The sorting has to be done at the chunk level.

  3. Now for the multi-chunk case. The sorting described in 2 cannot be achieved using a wrapper; this is the restriction of the wrapper approach and thus the deal breaker. Why? Say you are going to sort two chunks' worth of data. How are you going to randomly sample those two chunks from the wrapper? The ChunkSampler only produces one index at a time; how do you generate two random chunk indices from that? Fiddling with the index might get around it, but the result is still fake randomness and unnecessarily complicated logic.

Based on that, I think this PR proposes a very clean and straightforward approach. Given that there are only a few days until the next release, if you still have questions, maybe we can use a higher-bandwidth channel to discuss them and hopefully resolve this in time.

@apaszke
Contributor

apaszke commented Jul 27, 2019

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Jul 27, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@colesbury merged this pull request in 31f1928.


Labels

merge-this-please, Merged, module: cpp, module: dataloader, open source, triaged


8 participants