Skip to content

Conversation

@thiagocrepaldi
Copy link
Collaborator

This PR restricts the BatchType template argument of ChunkDataReader to STL
vectors only. Internally, ChunkDataReader was assuming BatchType was a
vector, but the user could pass any type to the template argument,
leading to compiling issues during CPP extensions.

Additionally to the proposed API change, this PR adds missing include
headers to chunk.h. Currently the current implementation works but if
users try to create C++ extensions that implements new ChunkDataReaders
to be along with the existing ChunkDataset, the build will fail due to
the missing headers.

In terms of functionality, nothing has changed. This PR simply makes the
implementation slightly more robust for future extensions.

@ailzhang ailzhang added the module: cpp Related to C++ API label Apr 19, 2019
@ezyang ezyang requested a review from smessmer April 19, 2019 21:51
@ezyang
Copy link
Contributor

ezyang commented Apr 19, 2019

@smessmer, you've been thinking about this recently. Could you please review this patch?

@smessmer
Copy link
Contributor

@ezyang I was thinking about this in the context of kernel APIs. I think data readers are orthogonal to that. The changes in this diff are doing what the description claims, but we'd have to discuss if we want these changes.

@thiagocrepaldi is right with the claim that ChunkDataReader uses some std::vector APIs, so even before this PR, it would only have worked with container types that implement .reserve(), .emplace(), ..., but it would have worked with custom types implementing that API. After this PR, custom container types would not work anymore. I don't think I have enough context to make this call.

@jaliyae
Copy link
Contributor

jaliyae commented Apr 23, 2019

@smessmer Yes, you are right. However, in any of the use cases we see so far, we have not seen a case for chunk reader to produce non vector types. For example, the "BatchType" type used in the C++ API is a vector of "Tensor"s or "Examples". This PR makes the pybinding the dataloaders much easier in the future as well. Since it is a minimum change, just wondering if we can take this to pytorch before 1.1.

@yf225 yf225 requested a review from apaszke April 23, 2019 20:56
@yf225
Copy link
Contributor

yf225 commented Apr 23, 2019

@apaszke @goldsborough Do you see any potential issues with the change?

@jaliyae
Copy link
Contributor

jaliyae commented Apr 25, 2019

A gentle reminder...

Copy link
Contributor

@apaszke apaszke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see why would we do that. While the ChunkdDataReader might be written with an assumption that the object conforms to the vector API, it does not have to be exactly that, and I don't see why would you forbid someone from using e.g. SmallVector?

@jaliyae
Copy link
Contributor

jaliyae commented Apr 29, 2019

@apaszke I see your concern, but the whole point of the 'ChunkDataReader' is to read a large chunk. Also this PR makes the pybind of the 'ChunkDataSet' bit more easier. So while, I would like to keep the API flexible, I don't think the concern on a different vector is a valid one to hold this PR.

@apaszke
Copy link
Contributor

apaszke commented Apr 29, 2019

Ok, how about this. Instead of using a single template parameter for the chunk type we keep two - one for the data type, and another as a template parameter for a container, which defaults to std::vector? That's the approach which is often taken in STL.

@thiagocrepaldi
Copy link
Collaborator Author

Ok, how about this. Instead of using a single template parameter for the chunk type we keep two - one for the data type, and another as a template parameter for a container, which defaults to std::vector? That's the approach which is often taken in STL.

Hi Adam, we have a dual goal for this PR. The first is to fix the API issue that arises when we use a non-vector complaint API as template argument for the ChunkDataReader batch type. At the same type, this PR also targets to ease pybinding ChunkDataReader and ChunkDataset classes.

Your suggestion would fix the first concern beautifully in C++, but would also make Python binding very difficult because Python doesn't support templates and we would have to either 'hide' all templates from Python by either instantiating ChunkDataReader with all combinations of 'vector' / 'batch type' or create a generic batch type class that mimics STL vector API.
The former approach wouldnt be 100% bullet proof and from time to time we may have to pybind new custom classes. The second approach is less natural and more verbose for future implementation.

Although keeping the API generic is usually a good thing, I was wondering if we could keep the implementation as simple as possible until we have an use case that supports the additional complexity

@apaszke
Copy link
Contributor

apaszke commented Apr 30, 2019

How would it make pybinding any different? I don't see how e.g. exposing a pybind class for ChunkDataReader<int, std::vector> in my proposal (note that the second argument is also optional, so you don't have to specify that) would be different than exposing ChunkDataReader<int> in yours. As long as someone actually doesn't change the container type, there's nothing else they need to pybind.

@thiagocrepaldi
Copy link
Collaborator Author

How would it make pybinding any different? I don't see how e.g. exposing a pybind class for ChunkDataReader<int, std::vector> in my proposal (note that the second argument is also optional, so you don't have to specify that) would be different than exposing ChunkDataReader<int> in yours. As long as someone actually doesn't change the container type, there's nothing else they need to pybind.

Leveraging default arguments for the template during binding generations is a good idea! Although ChunkDataReader<int, std::vector> cant be instantiated because std::vector is incomplete without its arguments. ChunkDataReader<int, std::vector<int>> does the trick. All we would need to do in the current PR is changing ChunkDataReader API to something like:

template <typename DataType, typename ContainerType = std::vector<DataType>>
class ChunkDataReader { ... }

@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/improve-chunkdataset-api branch from 1d8d272 to 224cb28 Compare April 30, 2019 19:31
@thiagocrepaldi
Copy link
Collaborator Author

How would it make pybinding any different? I don't see how e.g. exposing a pybind class for ChunkDataReader<int, std::vector> in my proposal (note that the second argument is also optional, so you don't have to specify that) would be different than exposing ChunkDataReader<int> in yours. As long as someone actually doesn't change the container type, there's nothing else they need to pybind.

Leveraging default arguments for the template during binding generations is a good idea! Although ChunkDataReader<int, std::vector> cant be instantiated because std::vector is incomplete without its arguments. ChunkDataReader<int, std::vector<int>> does the trick. All we would need to do in the current PR is changing ChunkDataReader API to something like:

template <typename DataType, typename ContainerType = std::vector<DataType>>
class ChunkDataReader { ... }

I have pushed a new update with the proposed change. What do you think?

This PR restricts the BatchType template argument of ChunkDataReader to STL
vectors only. Internally, ChunkDataReader was assuming BatchType was a
vector, but the user could pass any type to the template argument,
leading to compiling issues during CPP extensions.

Additionally to the proposed API change, this PR adds missing include
headers to chunk.h. Currently the current implementation works but if
users try to create C++ extensions that implements new ChunkDataReaders
to be along with the existing ChunkDataset, the build will fail due to
the missing headers.

In terms of functionality, nothing has changed. This PR simply makes the
implementation slightly more robust for future extensions.
@thiagocrepaldi thiagocrepaldi force-pushed the thiagofc/improve-chunkdataset-api branch from 224cb28 to c787d8d Compare April 30, 2019 21:20
@thiagocrepaldi
Copy link
Collaborator Author

A gentle reminder...

@thiagocrepaldi thiagocrepaldi changed the title Restrict ChunkDataReader API to STL vectors only + fix missing headers Refactor ChunkDataReader API + fix missing headers May 6, 2019
@ezyang
Copy link
Contributor

ezyang commented May 6, 2019

@pytorchbot rebase this please

@soumith
Copy link
Contributor

soumith commented May 7, 2019

@pytorchbot merge this please

@soumith
Copy link
Contributor

soumith commented May 8, 2019

@pytorchbot rebase this please

Copy link
Contributor

@facebook-github-bot facebook-github-bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@yf225 yf225 added the triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module label May 9, 2019
@facebook-github-bot
Copy link
Contributor

@soumith merged this pull request in 3d4d7b9.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

module: cpp Related to C++ API open source triaged This issue has been looked at a team member, and triaged and prioritized into an appropriate module

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants