Conversation

@masnesral (Contributor) commented Sep 5, 2024

@pytorch-bot bot commented Sep 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135260

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a66a77e with merge base 58f2477:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 5, 2024
ghstack-source-id: 5e3a4af
Pull Request resolved: #135260
@masnesral (Contributor Author)

Also tested repro provided in #134720

@masnesral masnesral requested review from ezyang and lezcano September 5, 2024 22:42
f"libdevice.trunc({self._print(expr.args[0])}).to({V.kernel.index_dtype})"
)

def _print_Float(self, expr):
masnesral (Contributor Author)

I admit I don't know if it's always valid to assume constants are float64. I'm operating on the assumption that any float literal originated from Python and is technically float64.
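For context, a rough sketch of the kind of printer override being discussed; the class name and the exact emitted form are assumptions drawn from the generated code quoted later in this thread, not the actual diff (which initially emitted the `tl.full([1], ...)` form):

```python
import sympy

class TritonPrinterSketch:
    """Illustrative only: print a sympy Float as an explicitly typed float64
    Triton constant so the literal keeps Python's float64 semantics."""

    def _print_Float(self, expr: sympy.Float) -> str:
        # Treat the literal as a Python float64 and say so in the generated code.
        return f"tl.full([], {float(expr)}, tl.float64)"

print(TritonPrinterSketch()._print_Float(sympy.Float(0.12)))
# -> tl.full([], 0.12, tl.float64)
```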

@masnesral masnesral marked this pull request as ready for review September 5, 2024 22:44
def f(x):
return x * (0.12 * x.shape[0])

x = torch.ones(200, device=GPU_TYPE, dtype=torch.float64)
isuruf (Collaborator)

Can you parameterize the dtype here?

masnesral (Contributor Author)

@isuruf , I don't think so? At least, there's no dtype that I know about. We just have a sympy.core.numbers.Float object that we're printing.

masnesral (Contributor Author), Sep 9, 2024

Oh, I totally misunderstood your comment. You mean run this test for a few different dtypes?

masnesral (Contributor Author)

done
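For illustration, a dtype-parameterized version of the test above might look roughly like the following; the test class name, the `"cuda"` device string (the original uses GPU_TYPE), and the surrounding harness are assumptions rather than the exact code in the PR:

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class FloatConstantTests(TestCase):
    @parametrize("dtype", [torch.float32, torch.float64])
    def test_float_constant_in_index_expr(self, dtype):
        def f(x):
            # 0.12 * x.shape[0] mixes a Python float literal with a (symbolic) size
            return x * (0.12 * x.shape[0])

        x = torch.ones(200, device="cuda", dtype=dtype)
        torch.testing.assert_close(torch.compile(f, dynamic=True)(x), f(x))

instantiate_parametrized_tests(FloatConstantTests)

if __name__ == "__main__":
    run_tests()
```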

masnesral (Contributor Author)

... except I got it wrong. omg. fix forthcoming

@ezyang ezyang requested a review from jansel September 6, 2024 03:17
@ezyang (Contributor) commented Sep 6, 2024

Unfortunately, I can't tell if this is correct. In particular, I don't know what the correct types in the Triton IR are supposed to be in this codegen case... Use of float64 here seems right, or at least it is consistent with some of the other float codegen.

@jansel (Contributor) commented Sep 6, 2024

I believe this is only for constants in indexing expressions, not all constants...

@lezcano (Collaborator) commented Sep 6, 2024

This would turn every float computation within indexing into fp64: the explicit cast makes any computation that derives from it be performed in fp64, which may not be desirable.

triton-lang/triton#4613 would fix the repro from #134720 by upcasting constants to fp64 only when they are involved in a computation with an fp64 tensor.

@masnesral (Contributor Author)
> which may not be desirable

@lezcano , what problems would you anticipate? Of the people here, I certainly know the least about it. But if there's a float constant in an indexing expression, shouldn't that constant always be treated as fp64 to match eager semantics?

@lezcano (Collaborator) commented Sep 6, 2024

GPUs have less dedicated silicon to compute fp64 than to compute fp32, even less so on consumer GPUs.

Regarding eager semantics, if you do `torch.arange(8) * 3.0` you get an fp32 tensor, not an fp64 one. With this PR, everything will be computed in fp64, though.
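A quick eager-mode check of that promotion behavior (assuming the default float dtype is left at float32):

```python
import torch

t = torch.arange(8)      # int64 tensor
# A Python float scalar promotes an integer tensor to the default float dtype.
print((t * 3.0).dtype)   # torch.float32
# An explicit fp64 tensor, by contrast, promotes the result to float64.
print((t * torch.tensor(3.0, dtype=torch.float64)).dtype)  # torch.float64
```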

@masnesral (Contributor Author)
@jansel WDYT about @lezcano's input here? How would you prefer to fix it? I've verified that triton-lang/triton#4613 indeed fixes the original repro from #134720. I don't know how big of a deal it is to get that Triton change. Do we update the pin or is it a cherry-pick situation?

@ezyang (Contributor) commented Sep 8, 2024

While @lezcano is right, I kind of think we should ship this change anyway, mostly because I don't actually see any way to recover Python-style float64 semantics when we move them to CUDA without doing the computation in float64. This is different from int32/int64, where we have a chance of optimizing via value-range analysis; I don't see any way to go from float64 to float32 and guarantee bit-for-bit equivalence in the end.
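To illustrate the bit-for-bit point with a hypothetical example (not code from this PR): the same constant-times-size expression evaluated in float32 versus float64 (Python's float) yields different values, and when such a value later feeds something like `trunc(...)` in an index expression there is no cheap analysis, unlike integer value-range analysis, to prove the narrowed result would still match.

```python
import numpy as np

ks0 = 3278  # an arbitrary hypothetical size
as_f64 = np.float64(0.12) * np.float64(ks0)   # what Python / eager effectively computes
as_f32 = np.float32(0.12) * np.float32(ks0)   # the narrowed computation
print(as_f64 == as_f32)                       # False: the two results differ
print(abs(float(as_f64) - float(as_f32)))     # small but nonzero discrepancy
```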

@masnesral (Contributor Author)
Ok, landing this now given @ezyang's guidance (and because we have an internal usage waiting on a fix). If the discussion continues and we decide this is the wrong choice, we can always revert.

@jansel (Contributor) commented Sep 10, 2024

Yeah, indexing code is usually run in Python, which uses float64. I also think this will be uncommon and won't matter much for perf. If it does, we could optimize it.
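For reference, Python floats are IEEE-754 doubles, which is why host-side indexing arithmetic is effectively float64:

```python
import sys

print(sys.float_info.mant_dig)  # 53 significand bits, i.e. double precision
print((0.12).hex())             # the exact 64-bit value behind the literal 0.12
```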

…tly"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 10, 2024
ghstack-source-id: 832990c
Pull Request resolved: #135260
@masnesral (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 10, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

masnesral added a commit that referenced this pull request Sep 10, 2024
Summary: Landed #135260 too soon, and the test in that PR doesn't do exactly what I intended (i.e., actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 10, 2024
Summary: Landed #135260 too soon, and the test in that PR doesn't do exactly what I intended (i.e., actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

ghstack-source-id: 0fbe80d
Pull Request resolved: #135583
pytorchmergebot pushed a commit that referenced this pull request Sep 11, 2024
Summary: Landed #135260 too soon, and the test in that PR doesn't do exactly what I intended (i.e., actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: #135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
masnesral added a commit that referenced this pull request Sep 25, 2024
…ead of 1-element tensor

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

[ghstack-poisoned]
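A minimal sketch of the rank distinction behind that fix (an illustrative kernel, not Inductor-generated code): `tl.full([], c, dtype)` builds a 0-d scalar that broadcasts against a block of any shape, while `tl.full([1], c, dtype)` builds a rank-1 tensor, which is what produced the "rank mismatch: [1], [1, 2048]" error.

```python
import triton
import triton.language as tl

@triton.jit
def _constant_rank_demo(out_ptr, XBLOCK: tl.constexpr):
    xindex = tl.arange(0, XBLOCK)              # rank-1 block of indices
    scalar_c = tl.full([], 0.12, tl.float64)   # 0-d scalar: broadcasts anywhere
    rank1_c = tl.full([1], 0.12, tl.float64)   # rank-1, 1-element tensor (shown only for contrast)
    # The scalar form composes safely with blocks of any rank; the rank-1 form
    # can clash with higher-rank blocks such as [XBLOCK, RBLOCK].
    val = scalar_c * xindex.to(tl.float64)
    tl.store(out_ptr + xindex, val.to(tl.float32))
```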
masnesral added a commit that referenced this pull request Sep 25, 2024
…ead of 1-element tensor

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

ghstack-source-id: fc004c4
Pull Request resolved: #136594
masnesral added a commit that referenced this pull request Sep 26, 2024
…ead of 1-element tensor

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Pull Request resolved: #136594
ghstack-source-id: 4f2c28d
masnesral added a commit that referenced this pull request Sep 26, 2024
… creating f64 constant instead of 1-element tensor"

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

Differential Revision: [D63360293](https://our.internmc.facebook.com/intern/diff/D63360293)

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 26, 2024
…nstant instead of 1-element tensor"

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

Differential Revision: [D63360293](https://our.internmc.facebook.com/intern/diff/D63360293)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor (#136594)

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: #136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
masnesral added a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor

This is a retry of #136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor

This is a retry of #136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

ghstack-source-id: 141efbd
Pull Request resolved: #136858
pytorchmergebot pushed a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor (#136858)

This is a retry of #136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: #136858
Approved by: https://github.com/atalman
@github-actions github-actions bot deleted the gh/masnesral/112/head branch October 12, 2024 02:06

Successfully merging this pull request may close these issues.

index out of bounds: 0 <= tmp4 < libdevice.trunc(0.120000000000000*(ks0.to(tl.float64))).to(tl.int32)

7 participants