add sorting policy to ChunkDataset #23053
soumith
left a comment
From what I read, it is more powerful than sorting: you are applying a lambda to the chunk that is loaded.
Can you rename it to transform or preprocess instead of sorting_policy?
waiting for CI to finish
@pytorchbot rebase this please
apaszke
left a comment
Why can't you just have a wrapper around your ChunkReader that applies the processing? This is a concern orthogonal to the responsibility of ChunkDataset and this unnecessarily makes the API more complex.
Thanks Adam for your feedback.
Thanks Adam.
Based on that, I think this PR proposes a very clean and straightforward approach. Given that there are only a few days until the next release, if you still have any questions, maybe we can use a higher-bandwidth channel to discuss them, and hopefully get this resolved in time.
@pytorchbot merge this please
facebook-github-bot
left a comment
@ezyang is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
@colesbury merged this pull request in 31f1928.
Add a sorting policy to ChunkDataset.
This is an advanced parameter for developers who want to apply a 'sorting policy' to the chunk data before it is sampled into minibatches.
Unlike the collate method, this policy is applied at the chunk level rather than the minibatch level. When a chunk of data is loaded (or multiple chunks, if cross_chunk_shuffle_count_ is greater than 1), the policy is applied to the full set of loaded data. This is useful when developers want to pre-process the chunk data (for example, by bucketing) before the example sampler samples from it.
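To illustrate the idea only, here is a minimal conceptual sketch in Python. It is not the C++ ChunkDataset API: load_and_batch, bucket_by_length, and the Example type are hypothetical names invented for this example, and a simple sequential slicer stands in for the example sampler. It shows a policy being applied once to all data loaded in a round (one or more chunks), before any minibatch is formed.

```python
from typing import Callable, List, Optional, Sequence

Example = List[int]  # hypothetical example type: one token sequence


def bucket_by_length(loaded: List[Example]) -> None:
    """Illustrative policy: sort in place so examples of similar length are adjacent."""
    loaded.sort(key=len)


def load_and_batch(
    chunks: Sequence[List[Example]],
    batch_size: int,
    cross_chunk_shuffle_count: int = 1,
    policy: Optional[Callable[[List[Example]], None]] = None,
) -> List[List[Example]]:
    """Load chunks in groups, apply the policy to each group, then cut minibatches."""
    batches: List[List[Example]] = []
    for start in range(0, len(chunks), cross_chunk_shuffle_count):
        # Combine every chunk loaded together in this round into one list.
        group = chunks[start:start + cross_chunk_shuffle_count]
        loaded = [example for chunk in group for example in chunk]
        # Chunk-level step: the policy sees the full loaded data, not a minibatch.
        if policy is not None:
            policy(loaded)
        # Stand-in for the example sampler: take consecutive slices.
        for i in range(0, len(loaded), batch_size):
            batches.append(loaded[i:i + batch_size])
    return batches


chunks = [
    [[1, 2, 3, 4], [5], [6, 7]],
    [[8, 9, 10], [11, 12, 13, 14, 15], [16]],
]
batches = load_and_batch(
    chunks, batch_size=2, cross_chunk_shuffle_count=2, policy=bucket_by_length
)
print([[len(example) for example in batch] for batch in batches])
# [[1, 1], [2, 3], [4, 5]]
```

With cross_chunk_shuffle_count=2 in this sketch, both chunks are bucketed together, so each minibatch holds examples of similar length. A collate-style function could not achieve this, because it would only ever see the examples already chosen for one minibatch.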