Conversation

@masnesral (Contributor) commented Sep 5, 2024

@pytorch-bot bot commented Sep 5, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/135260

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a66a77e with merge base 58f2477:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 5, 2024
ghstack-source-id: 5e3a4af
Pull Request resolved: #135260
@masnesral (Contributor Author)

Also tested repro provided in #134720

@masnesral masnesral requested review from ezyang and lezcano September 5, 2024 22:42
f"libdevice.trunc({self._print(expr.args[0])}).to({V.kernel.index_dtype})"
)

def _print_Float(self, expr):
masnesral (Contributor Author)

I admit I don't know if it's always valid to assume constants are float64. I'm operating on the assumption that any float literal originated from Python and is technically float64.
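For context, a rough sketch of the kind of printer override being discussed; the class name and the exact emitted form are assumptions drawn from the generated code quoted later in this thread, not the actual diff (which initially emitted the `tl.full([1], ...)` form):

```python
import sympy

class TritonPrinterSketch:
    """Illustrative only: print a sympy Float as an explicitly typed float64
    Triton constant so the literal keeps Python's float64 semantics."""

    def _print_Float(self, expr: sympy.Float) -> str:
        # Treat the literal as a Python float64 and say so in the generated code.
        return f"tl.full([], {float(expr)}, tl.float64)"

print(TritonPrinterSketch()._print_Float(sympy.Float(0.12)))
# -> tl.full([], 0.12, tl.float64)
```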

@masnesral masnesral marked this pull request as ready for review September 5, 2024 22:44
def f(x):
return x * (0.12 * x.shape[0])

x = torch.ones(200, device=GPU_TYPE, dtype=torch.float64)
isuruf (Collaborator)

Can you parameterize the dtype here?

masnesral (Contributor Author)

@isuruf , I don't think so? At least, there's no dtype that I know about. We just have a sympy.core.numbers.Float object that we're printing.

masnesral (Contributor Author), Sep 9, 2024

Oh, I totally misunderstood your comment. You mean run this test for a few different dtypes?

masnesral (Contributor Author)

done
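For illustration, a dtype-parameterized version of the test above might look roughly like the following; the test class name, the `"cuda"` device string (the original uses GPU_TYPE), and the surrounding harness are assumptions rather than the exact code in the PR:

```python
import torch
from torch.testing._internal.common_utils import (
    TestCase, instantiate_parametrized_tests, parametrize, run_tests,
)

class FloatConstantTests(TestCase):
    @parametrize("dtype", [torch.float32, torch.float64])
    def test_float_constant_in_index_expr(self, dtype):
        def f(x):
            # 0.12 * x.shape[0] mixes a Python float literal with a (symbolic) size
            return x * (0.12 * x.shape[0])

        x = torch.ones(200, device="cuda", dtype=dtype)
        torch.testing.assert_close(torch.compile(f, dynamic=True)(x), f(x))

instantiate_parametrized_tests(FloatConstantTests)

if __name__ == "__main__":
    run_tests()
```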

masnesral (Contributor Author)

... except I got it wrong. omg. fix forthcoming

@ezyang ezyang requested a review from jansel September 6, 2024 03:17
@ezyang (Contributor) commented Sep 6, 2024

Unfortunately, I can't tell if this is correct. In particular, I don't know what the correct types in the Triton IR are supposed to be in this codegen case... Use of float64 here seems right, or at least it is consistent with some of the other float codegen.

@jansel (Contributor) commented Sep 6, 2024

I believe this is only for constants in indexing expressions, not all constants...

@lezcano (Collaborator) commented Sep 6, 2024

This would turn every float computation within indexing into fp64: the explicit cast makes any computation that derives from it be performed in fp64, which may not be desirable.

triton-lang/triton#4613 would fix the repro from #134720 by upcasting constants to fp64 only when they are involved in a computation with an fp64 tensor.

@masnesral (Contributor Author)
> which may not be desirable

@lezcano , what problems would you anticipate? Of the people here, I certainly know the least about it. But if there's a float constant in an indexing expression, shouldn't that constant always be treated as fp64 to match eager semantics?

@lezcano (Collaborator) commented Sep 6, 2024

GPUs have less dedicated silicon to compute fp64 than to compute fp32, even less so on consumer GPUs.

Regarding eager semantics, if you do `torch.arange(8) * 3.0` you get an fp32 tensor, not an fp64 one. With this PR, everything will be computed in fp64, though.
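A quick eager-mode check of that promotion behavior (assuming the default float dtype is left at float32):

```python
import torch

t = torch.arange(8)      # int64 tensor
# A Python float scalar promotes an integer tensor to the default float dtype.
print((t * 3.0).dtype)   # torch.float32
# An explicit fp64 tensor, by contrast, promotes the result to float64.
print((t * torch.tensor(3.0, dtype=torch.float64)).dtype)  # torch.float64
```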

@masnesral (Contributor Author)
@jansel WDYT about @lezcano's input here? How would you prefer to fix it? I've verified that triton-lang/triton#4613 indeed fixes the original repro from #134720. I don't know how big of a deal it is to get that Triton change. Do we update the pin or is it a cherry-pick situation?

@ezyang (Contributor) commented Sep 8, 2024

While @lezcano is right, I kind of think we should ship this change anyway, mostly because I don't actually see any way to recover Python-style float64 semantics when we move them to CUDA without doing the computation in float64. This is different from int32/int64, where we have a chance of optimizing via value-range analysis; I don't see any way to go from float64 to float32 and guarantee bit-for-bit equivalence in the end.
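To illustrate the bit-for-bit point with a hypothetical example (not code from this PR): the same constant-times-size expression evaluated in float32 versus float64 (Python's float) yields different values, and when such a value later feeds something like `trunc(...)` in an index expression there is no cheap analysis, unlike integer value-range analysis, to prove the narrowed result would still match.

```python
import numpy as np

ks0 = 3278  # an arbitrary hypothetical size
as_f64 = np.float64(0.12) * np.float64(ks0)   # what Python / eager effectively computes
as_f32 = np.float32(0.12) * np.float32(ks0)   # the narrowed computation
print(as_f64 == as_f32)                       # False: the two results differ
print(abs(float(as_f64) - float(as_f32)))     # small but nonzero discrepancy
```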

@masnesral (Contributor Author)
Ok, landing this now given @ezyang's guidance (and because we have an internal usage waiting on a fix). If the discussion continues and we decide this is the wrong choice, we can always revert.

@jansel (Contributor) commented Sep 10, 2024

Yeah, indexing code is usually run in Python, which uses float64. I also think this will be uncommon and won't matter much for perf. If it does, we could optimize it.
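For reference, Python floats are IEEE-754 doubles, which is why host-side indexing arithmetic is effectively float64:

```python
import sys

print(sys.float_info.mant_dig)  # 53 significand bits, i.e. double precision
print((0.12).hex())             # the exact 64-bit value behind the literal 0.12
```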

…tly"

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 10, 2024
ghstack-source-id: 832990c
Pull Request resolved: #135260
@masnesral (Contributor Author)

@pytorchbot merge

@pytorch-bot pytorch-bot bot added the ciflow/trunk Trigger trunk jobs on your pull request label Sep 10, 2024
@pytorchmergebot (Collaborator)

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging: check the merge workflow status here.

masnesral added a commit that referenced this pull request Sep 10, 2024
Summary: Landed #135260 too soon, and the test in that PR doesn't do exactly what I intended (i.e., actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 10, 2024
Summary: Landed #135260 too soon, and the test in that PR doesn't do exactly what I intended (i.e., actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

ghstack-source-id: 0fbe80d
Pull Request resolved: #135583
pytorchmergebot pushed a commit that referenced this pull request Sep 11, 2024
Summary: Landed #135260 too soon, and the test in that PR doesn't do exactly what I intended (i.e., actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: #135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
Chao1Han pushed a commit to Chao1Han/pytorch that referenced this pull request Sep 20, 2024
masnesral added a commit that referenced this pull request Sep 25, 2024
…ead of 1-element tensor

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

[ghstack-poisoned]
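A minimal sketch of the rank distinction behind that fix (an illustrative kernel, not Inductor-generated code): `tl.full([], c, dtype)` builds a 0-d scalar that broadcasts against a block of any shape, while `tl.full([1], c, dtype)` builds a rank-1 tensor, which is what produced the "rank mismatch: [1], [1, 2048]" error.

```python
import triton
import triton.language as tl

@triton.jit
def _constant_rank_demo(out_ptr, XBLOCK: tl.constexpr):
    xindex = tl.arange(0, XBLOCK)              # rank-1 block of indices
    scalar_c = tl.full([], 0.12, tl.float64)   # 0-d scalar: broadcasts anywhere
    rank1_c = tl.full([1], 0.12, tl.float64)   # rank-1, 1-element tensor (shown only for contrast)
    # The scalar form composes safely with blocks of any rank; the rank-1 form
    # can clash with higher-rank blocks such as [XBLOCK, RBLOCK].
    val = scalar_c * xindex.to(tl.float64)
    tl.store(out_ptr + xindex, val.to(tl.float32))
```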
masnesral added a commit that referenced this pull request Sep 25, 2024
…ead of 1-element tensor

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

ghstack-source-id: fc004c4
Pull Request resolved: #136594
masnesral added a commit that referenced this pull request Sep 26, 2024
…ead of 1-element tensor

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Pull Request resolved: #136594
ghstack-source-id: 4f2c28d
masnesral added a commit that referenced this pull request Sep 26, 2024
… creating f64 constant instead of 1-element tensor"

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

Differential Revision: [D63360293](https://our.internmc.facebook.com/intern/diff/D63360293)

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 26, 2024
…nstant instead of 1-element tensor"

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

cc voznesenskym penguinwu EikanWang jgong5 Guobing-Chen XiaobingSuper zhuhaozhe blzheng wenzhe-nrv jiayisunx ipiszy yf225 chenyang78 kadeng muchulee8 ColinPeppler amjames desertfire chauhang

Differential Revision: [D63360293](https://our.internmc.facebook.com/intern/diff/D63360293)

[ghstack-poisoned]
pytorchmergebot pushed a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor (#136594)

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: #136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
masnesral added a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor

This is a retry of #136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

[ghstack-poisoned]
masnesral added a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor

This is a retry of #136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

ghstack-source-id: 141efbd
Pull Request resolved: #136858
pytorchmergebot pushed a commit that referenced this pull request Sep 27, 2024
…ead of 1-element tensor (#136858)

This is a retry of #136594, which is having trouble landing.

Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:

`tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])`

#135260 is the cause, presumably because we turn a constant into a 1-element tensor with: `(tl.full([1], const, tl.float64))`. It looks like changing the syntax to `(tl.full([], const, tl.float64))` gives us what we want?

Differential Revision: [D63540693](https://our.internmc.facebook.com/intern/diff/D63540693)
Pull Request resolved: #136858
Approved by: https://github.com/atalman
@github-actions github-actions bot deleted the gh/masnesral/112/head branch October 12, 2024 02:06

Successfully merging this pull request may close these issues.

index out of bounds: 0 <= tmp4 < libdevice.trunc(0.120000000000000*(ks0.to(tl.float64))).to(tl.int32)

7 participants