-
Notifications
You must be signed in to change notification settings - Fork 26.3k
Refactor ChunkDataReader API + fix missing headers #19485
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Refactor ChunkDataReader API + fix missing headers #19485
Conversation
|
@smessmer, you've been thinking about this recently. Could you please review this patch? |
|
@ezyang I was thinking about this in the context of kernel APIs. I think data readers are orthogonal to that. The changes in this diff are doing what the description claims, but we'd have to discuss if we want these changes. @thiagocrepaldi is right with the claim that ChunkDataReader uses some std::vector APIs, so even before this PR, it would only have worked with container types that implement |
|
@smessmer Yes, you are right. However, in any of the use cases we see so far, we have not seen a case for chunk reader to produce non vector types. For example, the "BatchType" type used in the C++ API is a vector of "Tensor"s or "Examples". This PR makes the pybinding the dataloaders much easier in the future as well. Since it is a minimum change, just wondering if we can take this to pytorch before 1.1. |
|
@apaszke @goldsborough Do you see any potential issues with the change? |
|
A gentle reminder... |
apaszke
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see why would we do that. While the ChunkdDataReader might be written with an assumption that the object conforms to the vector API, it does not have to be exactly that, and I don't see why would you forbid someone from using e.g. SmallVector?
|
@apaszke I see your concern, but the whole point of the 'ChunkDataReader' is to read a large chunk. Also this PR makes the pybind of the 'ChunkDataSet' bit more easier. So while, I would like to keep the API flexible, I don't think the concern on a different vector is a valid one to hold this PR. |
|
Ok, how about this. Instead of using a single template parameter for the chunk type we keep two - one for the data type, and another as a template parameter for a container, which defaults to |
Hi Adam, we have a dual goal for this PR. The first is to fix the API issue that arises when we use a non-vector complaint API as template argument for the Your suggestion would fix the first concern beautifully in C++, but would also make Python binding very difficult because Python doesn't support templates and we would have to either 'hide' all templates from Python by either instantiating Although keeping the API generic is usually a good thing, I was wondering if we could keep the implementation as simple as possible until we have an use case that supports the additional complexity |
|
How would it make pybinding any different? I don't see how e.g. exposing a pybind class for |
Leveraging default arguments for the template during binding generations is a good idea! Although |
1d8d272 to
224cb28
Compare
I have pushed a new update with the proposed change. What do you think? |
This PR restricts the BatchType template argument of ChunkDataReader to STL vectors only. Internally, ChunkDataReader was assuming BatchType was a vector, but the user could pass any type to the template argument, leading to compiling issues during CPP extensions. Additionally to the proposed API change, this PR adds missing include headers to chunk.h. Currently the current implementation works but if users try to create C++ extensions that implements new ChunkDataReaders to be along with the existing ChunkDataset, the build will fail due to the missing headers. In terms of functionality, nothing has changed. This PR simply makes the implementation slightly more robust for future extensions.
224cb28 to
c787d8d
Compare
|
A gentle reminder... |
|
@pytorchbot rebase this please |
|
@pytorchbot merge this please |
|
@pytorchbot rebase this please |
facebook-github-bot
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@soumith has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This PR restricts the BatchType template argument of ChunkDataReader to STL
vectors only. Internally, ChunkDataReader was assuming BatchType was a
vector, but the user could pass any type to the template argument,
leading to compiling issues during CPP extensions.
Additionally to the proposed API change, this PR adds missing include
headers to chunk.h. Currently the current implementation works but if
users try to create C++ extensions that implements new ChunkDataReaders
to be along with the existing ChunkDataset, the build will fail due to
the missing headers.
In terms of functionality, nothing has changed. This PR simply makes the
implementation slightly more robust for future extensions.