Conversation

ezyang (Contributor) commented May 4, 2017

Fixes #1462

ezyang changed the title from "[WIP] sparse_select" to "Polymorphic sparse_mask and sparse_copy" on May 8, 2017
ezyang (Contributor, Author) commented May 8, 2017

This is ready for review!

apaszke (Contributor) left a comment

I'm not sure what this buys you: you end up with as many _sparse_select functions defined as the _sparse_mask variants you'd need, and now we have cyclic dependencies between TH(C) and TH(C)S.

ezyang (Contributor, Author) commented May 9, 2017

I'm not sure what this buys you: you end up with as many _sparse_select functions defined as the _sparse_mask variants you'd need

Sorry, I wasn't clear about how the savings work.

Recall that the type of sparse_mask is: THSTensor sparseMask(THTensor dense, THSTensor mask). Via our macro magic, this means I get one implementation of sparseMask per tensor type, O(n) in total; e.g., I have:

  • THSFloatTensor sparseMask(THFloatTensor dense, THSFloatTensor mask)
  • THSLongTensor sparseMask(THLongTensor dense, THSLongTensor mask)
  • etc

Now, the problem is that I want to be able to call sparseMask with a Float dense tensor but, say, a Long sparse mask. Intuitively, there's no problem with this sort of operation, since at the end of the day we only care about the indices of the sparse tensor. However, if I do this with macros, I end up with O(n^2) copies of sparseMask:

  • THSFloatTensor sparseMask(THFloatTensor dense, THSFloatTensor mask)
  • THSFloatTensor sparseMask(THFloatTensor dense, THSLongTensor mask)
  • THSFloatTensor sparseMask(THFloatTensor dense, ... mask)
  • THSLongTensor sparseMask(THLongTensor dense, THSFloatTensor mask)
  • THSLongTensor sparseMask(THLongTensor dense, THSLongTensor mask)
  • THSLongTensor sparseMask(THLongTensor dense, ... mask)
  • etc

In this patch, I get a polymorphic sparseMask with only O(n) functions, since they accept the index tensor directly (of which there is only one type). That's the benefit.
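(Concretely: with on the order of eight TH scalar types, that works out to roughly eight instantiations instead of sixty-four. The exact counts aren't spelled out in this thread, so treat these figures as illustrative.)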

ezyang (Contributor, Author) commented May 9, 2017

OK, nailed all of the review comments that I agreed with. (I also accidentally slipped in my ref-counting docs patch, oops.)

ezyang added 18 commits May 15, 2017 10:31
adamlerer (Contributor) left a comment

I think this PR is mostly okay functionally, but the whole sparseMask/sparseSelect/sparseCopy interface is the wrong level of abstraction, as these are not logically sensible operations (they don't represent any mathematical operation). The fact that you have to add the polymorphism is a sign of that (since you should be working directly on the indices).

Here's an alternative THS API that can achieve the same thing... I think it's simpler and more general.

sparse.linear_indices()  # returns a 1D tensor with the linearized indices, same as THSTensor_(flattenedIndices)
sparse.nelemI()  # product of nDimI dimensions
sparse.nelemV()  # product of nDimV dimensions

Now you can write sparseSelect and sparseCopy as:

dense.view(sparse.nelemI(), sparse.nelemV()).index_select(0, sparse.linear_indices())  # sparse_select
dense.view(sparse.nelemI(), sparse.nelemV()).index_copy(0, sparse.linear_indices(), sparse.values())  # sparse_copy

You could also do the index_add variant etc. with a Python one-liner (which means you could reduce the THCSTensor_(spadd) function to just use these function calls).
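For concreteness, here is a hedged sketch of that index_add variant under the same proposed API (linear_indices/nelemI/nelemV are the hypothetical methods above; index_add_ is the existing in-place Tensor method, and the values are flattened to match the two-dimensional view of dense):

dense.view(sparse.nelemI(), sparse.nelemV()).index_add_(0, sparse.linear_indices(), sparse.values().view(-1, sparse.nelemV()))  # spadd: dense += sparse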

ezyang (Contributor, Author) commented Jun 7, 2017

I'm cool with rewriting all of the interfaces along the lines of Adam's plan (it sounds good to me). I'd like to hear what other people think, though! CC @martinraison @apaszke @soumith @ebetica

We can of course bikeshed the names.

martinraison (Contributor) commented

I like @adamlerer's suggestion! We may still want to provide higher-level abstractions like sparse_mask, but implemented directly in Python using linear_indices/view/index_select.

I'm not sure what the best names would be. I'm just giving some additional suggestions for nelemI/nelemV: isize/vsize, sizeI/sizeV, inumel/vnumel, numelI/numelV (we just want to make sure these are not confused with nnz, and I think something close to size or numel would fit well with the Tensor API).

ezyang (Contributor, Author) commented Jun 9, 2017

This PR will be obsoleted.

ezyang (Contributor, Author) commented Jun 11, 2017

Quick update from the implementation world. Here is a more correct definition of sparse_mask, using the numelI/numelV nomenclature:

    def _sparse_mask(self, s):
        v = self.view(s.numelI(), *s._values().size()[1:]).index_select(0, s._linear_indices())
        return type(s)(s._indices(), v, s.size())

Notice that to get the correct dimensionality of v, we want to preserve the sizes of the V dimensions. I think there should be a more convenient way to refer to this, and I think sizeV is a good match. Then we have invariants:

  • s.sizeI() + s.sizeV() == s.size()
  • s.numelI() * s.numelV() == s.to_dense().numel() (Arguably we should add numel natively for sparse)
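
As a concrete illustration of those invariants (the sizes here are hypothetical, and sizeI/sizeV/numelI/numelV are the proposed methods, not an existing API): for a sparse tensor s of size (2, 3, 4, 5) with nDimI = 2 and nDimV = 2, we would expect

    s.sizeI()  == (2, 3)      # index dimensions
    s.sizeV()  == (4, 5)      # value dimensions
    s.sizeI() + s.sizeV() == (2, 3, 4, 5) == s.size()
    s.numelI() == 6           # 2 * 3
    s.numelV() == 20          # 4 * 5
    s.numelI() * s.numelV() == 120 == s.to_dense().numel()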

I guess I should also add new() support, so I don't have to use type(s) to hack the constructor.

CC @adamlerer and @martinraison

ezyang deleted the sparse-select branch on September 7, 2017 at 20:23
jjsjann123 added a commit to jjsjann123/pytorch that referenced this pull request Mar 2, 2022
Supports alias on cast inputs. This allows us to handle fp16 running stats (fp16 norm layers).

Arguably we would want to do this in the parser code instead of inside Fusion::aliasOutputToInput. But that means we'll duplicate this code in a few places, which is ugly as well.

TODO:
properly handle segment io alias cpp test in PR pytorch#1471
jjsjann123 added a commit to jjsjann123/pytorch that referenced this pull request Mar 2, 2022
fix pytorch#67610
drop aliased output from pushing to stack

Currently we don't support marking aliased outputs as real outputs; this is tracked in pytorch#1488
jagadish-amd pushed a commit to jagadish-amd/pytorch that referenced this pull request Sep 5, 2024
… sync (pytorch#1455) (pytorch#1471)

* [SWDEV-469514] hipGraphExecDestroy requires an explicit sync

There is a new hip feature where they do not free hipGraph memory
as soon as hipGraphExecDestroy is called. This is to support async
work on the GPU. See this for more details:
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#cuda-user-objects

We noticed this issue when an allreduce op inside a hipGraph hung.
Essentially, ncclCommAbort was waiting for all GPU activity to finish.
However, since hipGraph memory was technically still in use, we had an
infinite hang. So, I added an extra hipDeviceSynchronize in CUDAGraph's
destructor to ensure that memory is freed and got
test_allreduce_in_cudagraph UT to pass.

However, when I ran this on CUDA machine, I noticed that they did not
require this extra sync in order to successfully run the UT. It seems
that they were calling cudaGraphInstantiateWithFlags with
cudaGraphInstantiateFlagAutoFreeOnLaunch, which aggressively frees
memory after graph launch. There is support for this API in our ROCm
stack, but we were missing cuda to hip mappings in PyTorch. So, I
brought them in and added the necessary conditions to call this API in
HIP case also.

* Update comments

* Use USE_ROCM in keeping with convention

* Use USE_ROCM to match convention

---------

Co-authored-by: Jithun Nair <[email protected]>
(cherry picked from commit e752b4f)