BugFix for function sparse_coo_tensor_ctor #40648
Conversation
💊 CI failures summary and remediations — as of commit 2c84beb (see the Dr. CI page for details): 💚 Looks good so far! There are no failures yet. 💚 (This comment was automatically generated by Dr. CI and has been revised 51 times.)
pearu
left a comment
The bug fix looks good to me.
@aocsa, could you also implement the corresponding unit test?
Force-pushed from bf8bc9f to 530f72a
pearu
left a comment
LGTM.
The CI failures seem to be unrelated to the changes in this PR.
Force-pushed from 530f72a to 839b070
So the behavior introduced by this PR (unless I'm misreading) is:
Sorry for arriving late. @aocsa, did you mean to close this PR? It would be nice to fix this bug. Is there something we can help with?
OK.
Force-pushed from ca764a0 to d4fceae
The logic looks good, but it looks like you need to merge with master or rebase (to help the CI merge the branch) and fix the flake8 and clang-tidy lint issues. Once that's done we should be good to go!
Force-pushed from 664459e to 5534084
Force-pushed from 5534084 to b892cce
I rebased this PR and fixed the flake8 and clang-tidy lint issues.
This still results in unexpected behavior if dtype is specified, but device is not (because in this case type_inference is false):
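The code example that originally followed this comment is not in this extract; a minimal sketch of the case being described (hypothetical values, assuming a CUDA build) could look like this:

```python
import torch

# both inputs live on a CUDA device
i = torch.tensor([[0, 1], [1, 0]], device='cuda')
v = torch.tensor([3.0, 4.0], device='cuda')

# without dtype, the device is inferred from the inputs
s1 = torch.sparse_coo_tensor(i, v, (2, 2))
print(s1.device)  # the inputs' CUDA device

# with dtype but no device, type_inference is false, so (with the behavior
# discussed here) the result could land on the default device instead
s2 = torch.sparse_coo_tensor(i, v, (2, 2), dtype=torch.float64)
print(s2.device)
```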
torch/_torch_docs.py
Outdated
| Default: if None, uses the current device for the default tensor type
| (see :func:`torch.set_default_tensor_type`). :attr:`device` will be the CPU
| for CPU tensor types and the current CUDA device for CUDA tensor types.
| Default: if None, the device of indices must match device of values, otherwise an exception is raised.
You should mention that if None, the device is inferred from :attr:`values`.
More correctly, it would be "and the sparse tensor is constructed on the same device"?
Both statements are technically correct.
Thanks for notifying me about this issue, @ngimel. For this case (dtype is specified, but device is not), on the current master branch, no matter where the indices or the values live, both are going to be moved to the default device (CPU). See the example below. Moreover, if I understood correctly, we should avoid internally changing the source device. So I am not sure what the expected behavior for this case should be.
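The example referred to above did not survive extraction; a sketch of the master-branch behavior being described (illustrative names and values, assuming a CUDA build) might be:

```python
import torch

# indices and values both created on a CUDA device
indices = torch.tensor([[0, 1], [1, 0]], device='cuda:0')
values = torch.tensor([3.0, 4.0], device='cuda:0')

# dtype specified, device not specified: on the master branch of that time,
# both inputs were silently moved to the default device (CPU)
s = torch.sparse_coo_tensor(indices, values, (2, 2), dtype=torch.float64)
print(s._indices().device, s._values().device)  # cpu cpu
```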
Existing behavior is very strange. The expected behavior that we should document is: when device is not specified (regardless of whether dtype is specified or not), indices and values should be on the same device, and the sparse tensor will be constructed on this device.
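A short sketch of that documented expectation (hypothetical example, not taken from the PR itself):

```python
import torch

i = torch.tensor([[0, 1], [1, 0]], device='cuda:0')
v = torch.tensor([3.0, 4.0], device='cuda:0')

# same-device inputs, no explicit device: the sparse tensor is built on that device
s = torch.sparse_coo_tensor(i, v, (2, 2))       # expected: cuda:0

# mismatched devices without an explicit device argument should raise
torch.sparse_coo_tensor(i.cpu(), v, (2, 2))     # expected: RuntimeError
```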
| auto tensor = tensor_from_cuda_array_interface(data);
| const auto& inferred_scalar_type = type_inference ? tensor.scalar_type() : scalar_type;
| auto device = device_opt.has_value() ? *device_opt : at::Device(computeDeviceType(dispatch_key));
| auto device = device_opt.has_value() ? *device_opt : at::Device(computeDeviceType(dispatch_key.first));
Why in this case and the following case does the call not also include the index?
Because this is the case when the tensor comes from NumPy, the index is not needed.
torch/csrc/utils/tensor_new.cpp
Outdated
| Tensor indices = internal_new_from_data(legacyExtractDispatchKey(values.key_set()), kLong, r.deviceOptional(3), r.pyobject(0), false, true, false);
| return at::sparse_coo_tensor(indices, values, values.options().layout(at::kSparse)).set_requires_grad(r.toBool(4));
| Tensor values = internal_new_from_data(inferred_dispatch_key, inferred_scalar_type, r.deviceOptional(ARG_DEVICE), r.pyobject(ARG_VALUES), false, true, type_inference);
| check_expected_devices(device_guard.original_device(), values, r.pyobject(ARG_INDICES));
What about moving this check to after the construction of the indices tensor and rewriting it as `TORCH_CHECK(values.device() == indices.device(), ...)`?
All right!
torch/csrc/utils/tensor_new.cpp
Outdated
| const auto inferred_scalar_type = r.scalartypeWithDefault(ARG_TYPE, scalar_type);
| at::OptionalDeviceGuard device_guard(r.deviceOptional(ARG_DEVICE));
| Tensor values = internal_new_from_data(inferred_dispatch_key, inferred_scalar_type, r.deviceOptional(ARG_DEVICE), r.pyobject(ARG_VALUES), false, true, type_inference);
| check_expected_devices(device_guard.original_device(), values, r.pyobject(ARG_INDICES));
Same question as above about the position and style of this check
torch/csrc/utils/tensor_new.cpp
Outdated
| at::OptionalDeviceGuard device_guard(r.deviceOptional(ARG_DEVICE));
| Tensor values = internal_new_from_data(inferred_dispatch_key, inferred_scalar_type, r.deviceOptional(ARG_DEVICE), r.pyobject(ARG_VALUES), false, true, type_inference);
| Tensor indices = internal_new_from_data(legacyExtractDispatchKey(values.key_set()), kLong, r.deviceOptional(ARG_DEVICE), r.pyobject(ARG_INDICES), false, true, false);
| check_expected_devices(device_guard.original_device(), values, r.pyobject(0));
Same question as above
| // are defined per-layout-type (e.g. tensor vs sparse_coo_tensor).
| const auto& inferred_scalar_type = type_inference ? var.scalar_type() : scalar_type;
| auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
| auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key.first), dispatch_key.second));
This is the existing line that needs to change to no longer cause dtype to have an effect on device, right? We really don't want there to be any connection between dtype and device.
Sure, but it can be solved by using the right dispatch_key. For now, I will not touch this internal_new_from_data function.
@aocsa, I had a chance to review/investigate the dtype issue more with @ngimel, and we actually found that it affects other functions, too. To simplify this fix and address that underlying issue, we thought we'd fix the underlying device-inference behavior directly. What are your thoughts on that approach?
Sure, I think it's a good idea. I am already testing new changes that only touch that code path. cc @mruberry
Great; I'll let you know when the fix is available.
FYI: #41984, which is currently being tested to see if it breaks any existing PyTorch programs we know of.
…vice of inputs tensors they're given, by default (#41984)

Summary:

**BC-Breaking Note**

This PR changes the behavior of the torch.tensor, torch.as_tensor, and sparse constructors. When given a tensor as input and a device is not explicitly specified, these constructors now always infer their device from the tensor. Historically, if the optional dtype kwarg was provided then these constructors would not infer their device from tensor inputs. Additionally, for the sparse ctor a runtime error is now thrown if the indices and values tensors are on different devices and the device kwarg is not specified.

**PR Summary**

This PR's functional change is a single line:

```
auto device = device_opt.has_value() ? *device_opt : (type_inference ? var.device() : at::Device(computeDeviceType(dispatch_key)));
```

=>

```
auto device = device_opt.has_value() ? *device_opt : var.device();
```

in `internal_new_from_data`. This line entangled whether the function was performing type inference with whether it inferred its device from an input tensor, and in practice meant that

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t, dtype=torch.float64)
```

would return a tensor on the CPU, not the default CUDA device, while

```
t = torch.tensor((1, 2, 3), device='cuda')
torch.tensor(t)
```

would return a tensor on the device of `t`! This behavior is niche and odd, but came up while aocsa was fixing #40648. An additional side effect of this change is that the indices and values tensors given to a sparse constructor must be on the same device, or the sparse ctor must specify the device kwarg. The tests in test_sparse.py have been updated to reflect this behavior.

Pull Request resolved: #41984
Reviewed By: ngimel
Differential Revision: D22721426
Pulled By: mruberry
fbshipit-source-id: 909645124837fcdf3d339d7db539367209eccd48
Great! I already reviewed and tested the changes in PR #41984 with the identified test cases, and they solve this issue too. So this PR should be closed, right?
If you've verified the behavior, then I think we're OK closing this, yes. I was surprised the change we identified as necessary here also solved this issue. Thank you for helping with it.
This PR solves a bug that occurs when the user does not specify the device for a sparse tensor created with the `sparse_coo_tensor` function, but the values and indices tensors are on a different CUDA device than the default one. Ex:
`sparse_tensor = torch.sparse_coo_tensor(sp_ind, sp_val, sp_shape)`.

This was solved by using the same device_index from the values tensor for the indices tensor in the `sparse_coo_tensor_ctor` function.

Resolves bug #28500.

Note that the `torch.sparse.FloatTensor` constructor worked as expected because it uses `legacy_sparse_tensor_ctor`. That function uses the device information of the original tensor Python object. In this fix we replicated this behavior in the new constructor, the `sparse_coo_tensor_ctor` function.

So now this works!
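The example that originally followed is not in this extract; a minimal sketch of the now-working scenario, assuming a machine with at least two CUDA devices and reusing the hypothetical names from the description above:

```python
import torch

# indices and values created on a non-default CUDA device
sp_ind = torch.tensor([[0, 1], [1, 0]], device='cuda:1')
sp_val = torch.tensor([3.0, 4.0], device='cuda:1')
sp_shape = (2, 2)

# no explicit device: with this fix, the indices follow the values' device
# instead of being moved to the default device
sparse_tensor = torch.sparse_coo_tensor(sp_ind, sp_val, sp_shape)
print(sparse_tensor.device)  # cuda:1
```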