
Conversation

@madsbk (Contributor) commented Jul 1, 2019

Addresses the issue raised in #22377.

PR #22016 introduced a temporary tensor, grad_weight_per_segment, with the same dtype as the end result, which can be a problem when using float16.
With this PR, a float32 temporary tensor is used instead when the input is float16.
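
As an aside for readers, here is a minimal sketch of the dtype choice described above, with illustrative names and shapes (this is not the PR's actual kernel code): accumulate the per-segment partial gradients in float32 whenever the result dtype is float16, and cast back only once at the end.

```cpp
// Hedged sketch, not the PR's actual code: pick a float32 accumulator dtype
// for the temporary buffer when the gradient dtype is float16, so the many
// small partial sums are not repeatedly rounded in half precision.
#include <ATen/ATen.h>

at::Tensor make_grad_weight_per_segment(const at::Tensor& grad_output,
                                        int64_t num_segments,
                                        int64_t embedding_dim) {
  auto acc_dtype = (grad_output.scalar_type() == at::kHalf)
                       ? at::kFloat
                       : grad_output.scalar_type();
  // Temporary accumulator; the result would be cast back to grad_output's
  // dtype only after all partial sums have been combined.
  return at::zeros({num_segments, embedding_dim},
                   grad_output.options().dtype(acc_dtype));
}
```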

@ngimel, can I get you to review? I think I have fixed the issues you have pointed out.

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@ngimel (Collaborator) left a comment:

There is still a partials_per_segment_offset device_vector, and direct access to its elements.

@madsbk (Contributor, Author) commented Jul 1, 2019

It was my impression that the policy in thrust::exclusive_scan would make it use the PyTorch allocator?

The direct access to the two elements is unfortunately necessary: the host code needs to know the size of partial_segment_offset and how many threads to spawn.
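
On the allocator question, here is a hedged sketch of the general mechanism (the CachingThrustAllocator type below is illustrative rather than PyTorch's actual helper, and routing through the caching allocator's raw_alloc/raw_delete entry points is an assumption): thrust only uses a custom allocator for its temporary storage when that allocator is attached to the execution policy passed to the algorithm, so a device_vector constructed with the default allocator elsewhere is unaffected.

```cpp
// Hedged sketch of the pattern under discussion, not PyTorch's actual helper.
#include <cstddef>
#include <thrust/execution_policy.h>
#include <thrust/scan.h>
#include <c10/cuda/CUDACachingAllocator.h>
#include <c10/cuda/CUDAStream.h>

// Minimal allocator that forwards thrust's temporary allocations to the
// PyTorch caching allocator instead of plain cudaMalloc/cudaFree.
struct CachingThrustAllocator {
  using value_type = char;
  char* allocate(std::ptrdiff_t num_bytes) {
    return static_cast<char*>(
        c10::cuda::CUDACachingAllocator::raw_alloc(num_bytes));
  }
  void deallocate(char* ptr, size_t /*num_bytes*/) {
    c10::cuda::CUDACachingAllocator::raw_delete(ptr);
  }
};

void exclusive_scan_offsets(const int64_t* in, int64_t* out, int64_t n) {
  CachingThrustAllocator alloc;
  auto stream = c10::cuda::getCurrentCUDAStream().stream();
  // The allocator only takes effect because it is part of the policy passed
  // here; containers such as thrust::device_vector ignore it entirely.
  thrust::exclusive_scan(thrust::cuda::par(alloc).on(stream),
                         in, in + n, out);
}
```

This only illustrates where the allocator has to be supplied for it to matter; whether the scan in the PR already passes such a policy is exactly what is being asked above.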

@ngimel (Collaborator) commented Jul 1, 2019

A device_vector with the default allocator template argument (https://github.com/pytorch/pytorch/pull/22401/files#diff-5e16509ef3a09789b80e5d16f8dc3062R248) won't use any specified policy, as there is no policy argument in the constructor.
The accesses to device memory are unfortunate; if they cause a slowdown for other workloads, the old path may need to be brought back (it will also be needed for CUDA graphs later on), but for now they are not a blocker. The device_vector is, though.
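
For comparison, a minimal sketch of the tensor-backed alternative being asked for (the names, dtype, and surrounding function are illustrative): own the scratch buffer with an ATen tensor, so its memory comes from the caching allocator and is returned to the cache when the tensor goes out of scope, instead of using a thrust::device_vector whose default allocator calls cudaMalloc directly.

```cpp
// Hedged sketch, assuming illustrative names: replace the device_vector
// scratch buffer with a Tensor allocated through ATen.
#include <ATen/ATen.h>

void compute_partial_segment_offsets(const at::Tensor& indices,
                                     int64_t num_of_segments) {
  // The buffer is owned by a Tensor, so the allocation goes through the
  // caching allocator and no synchronous cudaMalloc/cudaFree hits the
  // hot path.
  auto partials_per_segment_offset =
      at::empty({num_of_segments}, indices.options().dtype(at::kLong));
  int64_t* offsets_ptr = partials_per_segment_offset.data_ptr<int64_t>();
  // ... kernels or thrust algorithms would read/write `offsets_ptr` here ...
  (void)offsets_ptr;
}
```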

@madsbk (Contributor, Author) commented Jul 1, 2019

Makes sense, thanks.

@ngimel (Collaborator) left a comment:

Approving; I hope this will fix @chenyangyu1988's accuracy problems.

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) left a comment:

@mrshenli has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot (Contributor) commented: @mrshenli merged this pull request in c9a8413.

zdevito pushed a commit to zdevito/ATen that referenced this pull request Jul 2, 2019
Summary:
Address the issue raised in pytorch/pytorch#22377.

PR pytorch/pytorch#22016 introduced a temporary tensor, `grad_weight_per_segment`, with the same dtype as the end result, which can be a problem when using `float16`.
With this PR, a `float32` temporary tensor is used instead when the input is `float16`.

ngimel, can I get you to review? I think I have fixed the issues you have pointed out.
Pull Request resolved: pytorch/pytorch#22401

Differential Revision: D16077319

Pulled By: mrshenli

fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309
@madsbk deleted the embedding_bug branch on July 4, 2019 at 07:55
xzhu1900 pushed a commit to xzhu1900/pytorch that referenced this pull request Jul 5, 2019
Summary:
Address the issue raised in pytorch#22377.

PR pytorch#22016 introduced a temporary tensor, `grad_weight_per_segment`, with the same dtype as the end result, which can be a problem when using `float16`.
With this PR, a `float32` temporary tensor is used instead when the input is `float16`.

ngimel, can I get you to review? I think I have fixed the issues you have pointed out.
Pull Request resolved: pytorch#22401

Differential Revision: D16077319

Pulled By: mrshenli

fbshipit-source-id: 7cfad7f40b4d41a244052baa2982ab51bbbd7309

Labels

Merged · module: cuda (Related to torch.cuda, and CUDA support in general) · open source


7 participants