Conversation

@xzhu1900
Contributor

Add a sorting policy to ChunkDataset.

This is considered an advanced parameter for developers who want to apply a 'sorting policy' to the chunk data before it is sampled into minibatches.

Unlike the collate method, this policy is applied at the chunk level instead of the minibatch level. When a chunk of data is loaded (multiple chunks if cross_chunk_shuffle_count_ is greater than 1), the policy operates on the full loaded data. It is useful when developers want to perform pre-processing (such as bucketing) on the chunk data before the example sampler draws from it.

@pytorchbot pytorchbot added the module: cpp Related to C++ API label Jul 19, 2019
@cpuhrsch cpuhrsch added module: dataloader Related to torch.utils.data.DataLoader and Sampler low priority We're unlikely to get around to doing this in the near future triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module labels Jul 19, 2019
@ezyang ezyang removed the low priority We're unlikely to get around to doing this in the near future label Jul 19, 2019
Contributor

@soumith soumith left a comment


From what I read, it is more powerful than sorting: you are applying a lambda to the chunk that is loaded.
Can you rename it to transform or preprocess instead of sorting_policy?

@soumith
Contributor

soumith commented Jul 24, 2019

waiting for CI to finish

@soumith
Contributor

soumith commented Jul 24, 2019

@pytorchbot rebase this please

Contributor

@apaszke apaszke left a comment


Why can't you just have a wrapper around your ChunkReader that applies the processing? This is a concern orthogonal to the responsibility of ChunkDataset and this unnecessarily makes the API more complex.

@xzhu1900
Contributor Author

Why can't you just have a wrapper around your ChunkReader that applies the processing? This is a concern orthogonal to the responsibility of ChunkDataset and this unnecessarily makes the API more complex.

Thanks Adam for your feedback.
There are two reasons why having a ChunkReader wrapper is not the best solution:

  1. ChunkReader is by design just a reader: it should only know how to read a chunk of data given a chunk id. It would be odd to add preprocessing logic like sorting to this class (or to a wrapper around it).
  2. More importantly, the wrapper solution cannot achieve cross-chunk preprocessing. For example, if the user wants to do bucketing across multiple chunks, a wrapper cannot achieve that. This is the real deal breaker here.

@apaszke
Contributor

apaszke commented Jul 26, 2019

  1. I'm sorry, but to me it's worse to keep pushing things into a class that shouldn't be doing them than to have a nice, composable wrapper around readers. Readers are really the same thing as datasets in regular PyTorch, and no one objects that you can postcompose functions with them (or map). Hence, I don't think this is a good idea.
  2. Can you please explain what "bucketing" means in this case? Why can't you do the transforms at the batch level instead of at the chunk level? Multiple chunks are only loaded when you request cross_chunk_shuffle_count_, and using a "shuffling" argument for the purpose of implementing cross-chunk transforms is a misuse. I was also wary of adding cross_chunk_shuffle_count_ before because I anticipated it would go this way, but I was hoping it was a one-time thing. Now you're pushing this flag even further, and I feel this is a bit too far.

@xzhu1900
Contributor Author

xzhu1900 commented Jul 26, 2019

Thanks Adam.

  1. The preprocessing is just like a collate function: instead of applying to each batch, it applies to each chunk. If the dataset supports a batch-level collate function, it is natural and right for it to support chunk-level preprocessing as well. So the dataset is the right object to own this functionality.

  2. "Bucketing" is one example of preprocessing; there are other common operations, like sorting. For example, a user may want to sort a chunk of data by sequence length, and then, based on each batch's longest sequence, zero-pad the remaining sequences in the batch only to that length. This is a very common way to avoid inefficient zero-padding and out-of-memory errors on the GPU. The sorting has to be done at the chunk level.

  3. Now for the multi-chunk case. The sorting described in 2 cannot be achieved using a wrapper; this is the restriction of the wrapper approach and thus the deal breaker. Why? Say you are going to sort two chunks' worth of data. How are you going to randomly sample those two chunks from the wrapper? The ChunkSampler only produces one index at a time; how do you generate two random chunk indices from that? Fiddling with the index might get around it, but the result is still fake randomness and unnecessarily complicated logic.

Based on that, I think this PR proposes a very clean and straightforward approach. Given that there are only a few days until the next release, if you still have questions, maybe we can use a higher-bandwidth channel to discuss them and hopefully resolve this in time.

@apaszke
Contributor

apaszke commented Jul 27, 2019

@pytorchbot merge this please

@pytorchbot pytorchbot added the merge-this-please Was marked for merge with @pytorchbot merge this please label Jul 27, 2019
Contributor

@facebook-github-bot facebook-github-bot left a comment


@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@colesbury merged this pull request in 31f1928.


Labels

merge-this-please, Merged, module: cpp, module: dataloader, open source, triaged


8 participants