-
Notifications
You must be signed in to change notification settings - Fork 26.3k
[C++ Frontend] Remove size() from BatchDataset and templatize IndexType #12960
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[C++ Frontend] Remove size() from BatchDataset and templatize IndexType #12960
Conversation
416e901 to
63f50db
Compare
| /// to configure the `DataLoader` with, and a `sampler` that specifies the | ||
| /// sampling strategy. | ||
| DataLoader(Dataset dataset, DataLoaderOptions options, Sampler sampler) | ||
| : options_(std::move(options)), |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
| index_ += index_batch.size(); | ||
| return index_batch; | ||
| } | ||
| optional<std::vector<size_t>> next(size_t batch_size) override; |
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
|
I didn't have time to review this yet, but I'm concerned that our C++ dataset API starts diverging from the Python one significantly. We should stop this and bring them together, or we should stop calling this a C++ interface to PyTorch. |
|
@apaszke that sounds a bit dramatic :) Let me give some more context on these two changes. @ssnl, @soumith and I actually sat together before meeting with some partners to discuss changes to the C++ dataloader (which MSFT wants to use), and how we can reconcile them with the Python dataloader. We want both to have the same features/functionality and @ssnl is working on making the
I hope this clarifies things and let me know if you have more thoughts. |
20b6ed3 to
9bb319a
Compare
8a97ccf to
862f8c6
Compare
7b554bd to
bcf8a94
Compare
apaszke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't checked the tests. I think the API for streams feels a bit like a hack and could be improved to make it much nicer. There are also some rough edges around making the dataset size optional.
If we're going to merge this change, I'd really like us to consider how would we integrate those things into our Python API. Those two interfaces already have different semantics, and I don't want them to diverge any further.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
bcf8a94 to
b3b2ae0
Compare
|
@apaszke I've renamed |
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@goldsborough has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
|
@apaszke what do you think about the StreamSampler issue? I don't want to have this PR linger too long, there are people at microsoft waiting for it |
b3b2ae0 to
ca8bf7c
Compare
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This PR brings to changes to the recently landed C++ Frontend dataloader:
size()method fromBatchDataset. This makes it cleaner to implement unsized ("infinite stream") datasets. The method was not used much beyond initial configuration.BatchDatasetandSampler. This essentially allows custom index types instead of onlyvector<size_t>. This greatly improves flexibility.See the
InfiniteStreamDatasetandTestIndexdatasets in the tests for what this enables.Some additional minor updates and code movements too.
@apaszke @ssnl