@goldsborough
Contributor

This PR brings two changes to the recently landed C++ Frontend dataloader:

  1. Removes the size() method from BatchDataset. This makes it cleaner to implement unsized ("infinite stream") datasets. The method was not used much beyond initial configuration.
  2. Makes the index type of a dataset a template parameter of BatchDataset and Sampler. This essentially allows custom index types instead of only vector<size_t>. This greatly improves flexibility.

See the InfiniteStreamDataset and TestIndex datasets in the tests for what this enables.
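To illustrate the second change, here is a minimal standalone sketch of a dataset and sampler that share a custom index type. It does not use the real headers from this PR: everything in the `sketch` namespace (`Sampler`, `BatchDataset`, `KeyRange`, `KeyRangeSampler`, `KeyedDataset`) is a hypothetical stand-in, and `std::optional` is assumed in place of whatever optional type the frontend uses.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string>
#include <vector>

namespace sketch {

// Hypothetical sampler interface: the index type is a template parameter
// instead of a hard-coded std::vector<size_t>.
template <typename BatchIndex>
class Sampler {
 public:
  virtual ~Sampler() = default;
  // Returns the next batch index, or nullopt once the sampler is exhausted.
  virtual std::optional<BatchIndex> next(std::size_t batch_size) = 0;
};

// Hypothetical dataset interface: no size() method, custom index type.
template <typename Batch, typename BatchIndex>
class BatchDataset {
 public:
  virtual ~BatchDataset() = default;
  virtual Batch get_batch(BatchIndex index) = 0;
};

// A custom index: a contiguous key range rather than a list of positions.
struct KeyRange {
  int first;
  int count;
};

// Hands out consecutive key ranges until `total` keys have been covered.
class KeyRangeSampler : public Sampler<KeyRange> {
 public:
  explicit KeyRangeSampler(int total) : total_(total) {}

  std::optional<KeyRange> next(std::size_t batch_size) override {
    if (cursor_ >= total_) return std::nullopt;
    const int remaining = total_ - cursor_;
    const int count = std::min<int>(remaining, static_cast<int>(batch_size));
    KeyRange range{cursor_, count};
    cursor_ += count;
    return range;
  }

 private:
  int total_;
  int cursor_ = 0;
};

// Produces one string per key in the requested range.
class KeyedDataset : public BatchDataset<std::vector<std::string>, KeyRange> {
 public:
  std::vector<std::string> get_batch(KeyRange index) override {
    std::vector<std::string> batch;
    for (int i = 0; i < index.count; ++i) {
      batch.push_back("item-" + std::to_string(index.first + i));
    }
    return batch;
  }
};

}  // namespace sketch
```

The dataset never sees a `std::vector<size_t>`; it only has to agree with its sampler on what `KeyRange` means.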

Some additional minor updates and code movements too.

@apaszke @ssnl

@goldsborough goldsborough requested a review from ebetica as a code owner October 23, 2018 00:27
@goldsborough goldsborough force-pushed the cpp-data-loader-improved branch from 416e901 to 63f50db Compare October 23, 2018 00:29
/// to configure the `DataLoader` with, and a `sampler` that specifies the
/// sampling strategy.
DataLoader(Dataset dataset, DataLoaderOptions options, Sampler sampler)
: options_(std::move(options)),

This comment was marked as off-topic.

index_ += index_batch.size();
return index_batch;
}
optional<std::vector<size_t>> next(size_t batch_size) override;

This comment was marked as off-topic.

@apaszke
Contributor

apaszke commented Oct 23, 2018

I didn't have time to review this yet, but I'm concerned that our C++ dataset API starts diverging from the Python one significantly. We should stop this and bring them together, or we should stop calling this a C++ interface to PyTorch.

@goldsborough
Contributor Author

@apaszke that sounds a bit dramatic :) Let me give some more context on these two changes. @ssnl, @soumith and I actually sat together before meeting with some partners to discuss changes to the C++ dataloader (which MSFT wants to use), and how we can reconcile them with the Python dataloader. We want both to have the same features/functionality and @ssnl is working on making the size() optional in Python too.

  1. We want to support the use case of infinite stream datasets cleanly. Sure, you can always have size() be INT_MAX, but (1) this is not a clean API and (2) if you ever accidentally used the (default) RandomSampler, you'd have a bit of a memory problem. By making the concept of unsized datasets more "first class", we can verify, for example, that you don't try to create a RandomSampler from an unsized dataset.

  2. This change actually makes the C++ frontend dataloader more like Python, because you can now have indices other than size_t. In Python this all works through duck typing of course. So in C++ we allow this now too, which yields much greater flexibility.
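As a rough sketch of the check described in point 1, assume a dataset reports its size as an optional value and a shuffling sampler materializes one index per example. The names below (`SizedDataset`, `StreamDataset`, `ShuffledIndexSampler`, `make_shuffled_sampler`) are hypothetical and are not the classes in this PR.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <optional>
#include <random>
#include <stdexcept>
#include <vector>

namespace sketch {

// Hypothetical sized dataset: size() holds a value.
struct SizedDataset {
  std::size_t n;
  std::optional<std::size_t> size() const { return n; }
};

// Hypothetical unsized ("infinite stream") dataset: size() is empty.
struct StreamDataset {
  std::optional<std::size_t> size() const { return std::nullopt; }
};

// Hypothetical shuffling sampler. It stores one index per example, which is
// where the memory cost of a very large size() comes from.
class ShuffledIndexSampler {
 public:
  explicit ShuffledIndexSampler(std::size_t size, unsigned seed = 0)
      : indices_(size) {
    std::iota(indices_.begin(), indices_.end(), std::size_t{0});
    std::mt19937 rng(seed);
    std::shuffle(indices_.begin(), indices_.end(), rng);
  }

  std::size_t count() const { return indices_.size(); }

 private:
  std::vector<std::size_t> indices_;
};

// Refuses to build a shuffling sampler for a dataset that has no size.
template <typename Dataset>
ShuffledIndexSampler make_shuffled_sampler(const Dataset& dataset) {
  const std::optional<std::size_t> size = dataset.size();
  if (!size.has_value()) {
    throw std::invalid_argument(
        "cannot build a shuffling sampler for an unsized dataset");
  }
  return ShuffledIndexSampler(*size);
}

}  // namespace sketch
```

Under these assumptions, a dataset that faked its size as INT_MAX would ask the sampler for 2147483647 indices of 8 bytes each on a 64-bit platform, roughly 16 GiB, whereas the optional size turns the mistake into an immediate error.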

I hope this clarifies things and let me know if you have more thoughts.

@goldsborough goldsborough force-pushed the cpp-data-loader-improved branch 2 times, most recently from 20b6ed3 to 9bb319a Compare October 23, 2018 20:15
@ssnl ssnl mentioned this pull request Oct 23, 2018
7 tasks
@goldsborough goldsborough force-pushed the cpp-data-loader-improved branch 3 times, most recently from 8a97ccf to 862f8c6 Compare October 26, 2018 16:53
@goldsborough
Contributor Author

Requesting review please: @apaszke @ezyang

@goldsborough goldsborough force-pushed the cpp-data-loader-improved branch 2 times, most recently from 7b554bd to bcf8a94 Compare October 29, 2018 20:58
Contributor

@apaszke apaszke left a comment

Haven't checked the tests. I think the API for streams feels a bit like a hack and could be improved to make it much nicer. There are also some rough edges around making the dataset size optional.

If we're going to merge this change, I'd really like us to consider how we would integrate those things into our Python API. Those two interfaces already have different semantics, and I don't want them to diverge any further.

This comment was marked as off-topic.

@goldsborough goldsborough force-pushed the cpp-data-loader-improved branch from bcf8a94 to b3b2ae0 Compare October 31, 2018 17:32
@goldsborough
Contributor Author

@apaszke I've renamed Index to BatchIndex and documented StreamSampler. I added a comment about why StreamSampler was written in the "just give me batch_size values" way. Let me know what you think.
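For readers following along, here is one way a "just give me batch_size values" sampler could look. This is a guess under stated assumptions, not the StreamSampler from this PR: `BatchSize`, `CountingStreamSampler`, and the `epoch_size` cap are hypothetical names chosen for illustration.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>

namespace sketch {

// Hypothetical batch index: it carries no positions, only how many
// examples the dataset should produce next.
struct BatchSize {
  std::size_t value;
};

// Hypothetical stream sampler: counts how many examples it has requested
// and stops once `epoch_size` examples have been asked for.
class CountingStreamSampler {
 public:
  explicit CountingStreamSampler(std::size_t epoch_size)
      : epoch_size_(epoch_size) {}

  // Returns the number of examples to fetch, or nullopt at end of epoch.
  std::optional<BatchSize> next(std::size_t batch_size) {
    if (served_ >= epoch_size_) return std::nullopt;
    const std::size_t n = std::min(batch_size, epoch_size_ - served_);
    served_ += n;
    return BatchSize{n};
  }

  // Starts a new epoch.
  void reset() { served_ = 0; }

 private:
  std::size_t epoch_size_;
  std::size_t served_ = 0;
};

}  // namespace sketch
```

The point of this shape is that a stream has no meaningful positions to sample, so the only thing the sampler can usefully tell the dataset is how many values to emit.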

Contributor

@facebook-github-bot facebook-github-bot left a comment

@goldsborough has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@goldsborough
Contributor Author

@apaszke what do you think about the StreamSampler issue? I don't want this PR to linger too long; there are people at Microsoft waiting for it.

@goldsborough goldsborough force-pushed the cpp-data-loader-improved branch from b3b2ae0 to ca8bf7c Compare November 5, 2018 19:36
Contributor

@facebook-github-bot facebook-github-bot left a comment

@goldsborough is landing this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
