
Conversation

@ipiszy
Contributor

@ipiszy ipiszy commented Aug 4, 2023

NOTE: This PR is a prototype to collect early feedback. I'll follow up by splitting it into smaller PRs and adding proper unit tests.

This PR adds CUTLASS support for simple GEMM cases (matmul and linear), without arbitrary epilogue fusions. Broader epilogue fusion support will be implemented on top of the CUTLASS epilogue visitor, which is to be released in CUTLASS 3.2.

Key changes:

  • API:
    Introduce "Cutlass" as an alternative backend in max_autotune_gemm_backends in torch/_inductor/config.py.
    This option is exposed via the torch.compile() options arg (see the usage sketch after this list).

  • CUDA env:
    Add a file torch/_inductor/codegen/cuda/cuda_env.py to get the CUDA version / arch. Not sure whether there is a better way to do this in PyTorch.

  • CUDA / CutlassGemm template:

    • IR: Create TritonTemplateBuffer and CUDATemplateBuffer subclasses. CUDATemplateBuffer keeps workspace_size, which is required by CUTLASS kernels; remembering the workspace_size is useful for memory planning.
    • Templates: Create CUDATemplate and CutlassGemmTemplate for code rendering.
    • Kernel: Create CUDAKernel and CUDATemplateKernel for code rendering.
    • Scheduling: Create CUDAScheduling, which simply inherits from TritonScheduling, since it's only used for template codegen rather than more general node fusions. Accordingly, codegen_template() and can_fuse() still rely on "triton.py", with simple if / else branches added for CUDAScheduling. This looks a bit hacky; please suggest better ways.
    • Wrapper: Add changes to support calling methods defined in the .so from Python. Also update the memory_planning part to take "workspace_size" into consideration.
  • CUDA codecache / autotuning

    • Compilation: Add CUDACodeCache for CUDA compilation. Currently it relies on nvcc. More follow-ups are needed, e.g. nvrtc, support for non-Linux platforms, and support for internal production environments.
    • Max autotune: Add TritonBenchmarkRequest and CUDABenchmarkRequest subclasses to handle benchmarking.
  • Profiling util

    • I found that the current profiling time is very inaccurate because CPU-side overhead is dominant. This PR relies on the PyTorch profiler to get accurate timing info. However, one caveat is that it may incorrectly exclude the vectorized_elementwise_kernel kernel, which is used to clear the L2 cache.
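A minimal usage sketch of the new option (the backend string "ATen,Triton,CUTLASS" is an assumption; the exact accepted values are defined by max_autotune_gemm_backends in torch/_inductor/config.py):

import torch

def mm(a, b):
    return a @ b

# Let the CUTLASS backend compete with ATen / Triton during
# max-autotune GEMM selection.
compiled_mm = torch.compile(
    mm,
    options={
        "max_autotune": True,
        "max_autotune_gemm_backends": "ATen,Triton,CUTLASS",
    },
)

a = torch.randn(256, 128, device="cuda", dtype=torch.float16)
b = torch.randn(128, 512, device="cuda", dtype=torch.float16)
y = compiled_mm(a, b)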

Tests:

  • GEMM / GEMM + bias / GEMM + matrix tests on A100 / H100 are all done.
    Static shapes only.

Things to fix in this PR:

  • Update the max-autotune cache file format to also record "workspace_size" for each CUDA kernel. Otherwise, if workspace_size > 0 and the cache hits, the current logic will break.

Things to fix in follow-up PRs:

  • Support AOTInductor.
  • Support dynamic shapes.
  • Make CUTLASS profiling more robust. It's possible that selected CUTLASS configs don't compile. We should either select configs more intelligently, or avoid crashing the current process when compilation fails.
  • Dedup CUTLASS configs. There are way too many now.
  • Parallelize compilation to improve compilation speed. It's currently very slow.
  • Parallel benchmarking using multi-GPUs.

Things to add:

  • fp8
  • epilogue visitor, general epilogue fusion
  • conv kernels

cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @Xia-Weiwen @wenzhe-nrv @jiayisunx @peterbell10 @ngimel @yf225 @chenyang78 @kadeng @muchulee8 @aakhundov

@pytorch-bot

pytorch-bot bot commented Aug 4, 2023

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/106607

Note: Links to docs will display an error until the docs builds have been completed.

❌ 4 New Failures

As of commit 8cbe612 with merge base c0b8b7b:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@ipiszy ipiszy requested a review from jansel August 4, 2023 08:21
@github-actions
Contributor

github-actions bot commented Aug 4, 2023

This PR needs a release notes: label

If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


import torch

torch.cuda.init()
Contributor

Why is this needed? Does it have to be in global scope?

) -> Callable[[], None]:
raise NotImplementedError()

def profiling_keyword(self) -> str:
Contributor

document this? not clear from name

Contributor Author

This is useless now, will remove it.



def _cutlass_include_paths() -> List[str]:
_CUTLASS_PATH = os.path.join(
Contributor

local variables should be lowercase

Comment on lines 1268 to 1314
def _dlclose(self):
    f_dlclose = None

    if is_windows():
        f_dlclose = ctypes.windll.kernel32.FreeLibrary
    elif is_linux():
        syms = ctypes.CDLL(None)
        if not hasattr(syms, "dlclose"):
            # Alpine Linux
            syms = ctypes.CDLL("libc.so")

        if hasattr(syms, "dlclose"):
            f_dlclose = syms.dlclose

    if f_dlclose is not None:
        f_dlclose.argtypes = [ctypes.c_void_p]
        f_dlclose(self.DLL._handle)
    else:
        logging.warning(
            "dll unloading function was not found, library may not be unloaded properly!"
        )
Contributor

Was this copied from somewhere? (I am assuming you aren't testing on windows.) If so add a comment explaining where it came from (or reuse it if it is within our codebase).

Contributor Author

Yeah, I copied them from cpp_extension and AITemplate code. On second thought, I think I'll just remove the "windows" logic since it's not tested. People can work on proper Windows support later.
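A rough sketch of what the Linux-only path could look like after dropping the Windows branch (an assumption about the follow-up, not the final code):

import ctypes
import logging

def _dlclose(self):
    # Resolve dlclose from the already-loaded C runtime.
    syms = ctypes.CDLL(None)
    if not hasattr(syms, "dlclose"):
        # Alpine Linux exposes dlclose via libc.so rather than the default namespace.
        syms = ctypes.CDLL("libc.so")

    if hasattr(syms, "dlclose"):
        f_dlclose = syms.dlclose
        f_dlclose.argtypes = [ctypes.c_void_p]
        f_dlclose(self.DLL._handle)
    else:
        logging.warning(
            "dlclose was not found; the shared library may not be unloaded properly!"
        )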

outer = inplaced.other_names[-1]
inner = inplaced.inner_name
dtype = buffer_types[outer]
dtype = buffer_types.get(outer, V.graph.get_dtype(outer))
Contributor

We might be able to just call V.graph.get_dtype(outer) directly and delete buffer_types from this function. I think get_dtype already handles constants (or we could easily make it handle them).

return f"[{', '.join(sizes)}]"

def call_kernel(self, name: str):
def call_kernel(self, name: str, node: IRNode = None):
Contributor

Let's make node required and update the users. We can drop name since there is node.get_name().

Contributor Author

Actually name != node.get_name(). name here is generated by self.define_kernel(src_code, node_schedule).


# finalize must be called after adding epilogue above
src_code = partial_code.finalize()
src_code = partial_code if isinstance(partial_code, str) else partial_code.finalize()
Contributor

Shouldn't be needed, see #105987

Comment on lines +788 to +797
if buffer.get_workspace_size() > 0:
    workspace_size = tuple(buffer.get_workspace_size())
    workspace_stride = tuple(1)
    res.append(
        f"{buffer.get_name()}_workspace = empty_strided("
        f"{self.codegen_shape_tuple(workspace_size)}, "
        f"{self.codegen_shape_tuple(workspace_stride)}, "
        f"device='{device.type}', dtype={torch.uint8})"
    )
return res
Contributor

Will need to make this work with dynamic shapes.

Contributor Author

Yes this is a TODO, won't be included in this PR.

Comment on lines +799 to +831
def make_buffer_free(self, buffer) -> List[str]:
res = [f"del {buffer.get_name()}"]
if buffer.get_workspace_size() > 0:
res.append(f"del {buffer.get_name()}_workspace")
return res

def make_buffer_reuse(self, old, new):
def make_buffer_reuse(self, old, new) -> List[str]:
assert old.get_dtype() == new.get_dtype()
del_line = ""
res = []
del_line0, del_line1 = ""
if old.get_name() not in V.graph.get_output_names():
del_line = f"; {self.make_buffer_free(old)}"
if old.get_size() == new.get_size() and old.get_stride() == new.get_stride():
return f"{self.declare}{new.get_name()} = {old.get_name()}{del_line} {self.comment} reuse"
del_lines = self.make_buffer_free(old)
del_line0 = "" if len(del_lines) < 1 else del_lines[0]
del_line1 = "" if len(del_lines) < 2 else del_lines[1]
if (
old.get_size() == new.get_size()
and old.get_stride() == new.get_stride()
and old.get_workspace_size() == new.get_workspace_size()
):
res.append(f"{self.declare}{new.get_name()} = {old.get_name()}{del_line0} {self.comment} reuse")
if new.get_workspace_size() > 0:
res.append(f"{self.declare}{new.get_name()}_workspace = {old.get_name()}_workspace{del_line1} {self.comment} reuse")
return res

return (
res.append(
f"{self.declare}{new.get_name()} = {self.namespace}as_strided({old.get_name()}, "
f"{self.codegen_shape_tuple(new.get_size())}, "
f"{self.codegen_shape_tuple(new.get_stride())}){del_line} {self.comment} reuse"
f"{self.codegen_shape_tuple(new.get_stride())}){del_line0} {self.comment} reuse"
)
if new.get_workspace_size() > 0:
res.append(f"{self.declare}{new.get_name()}_workspace = {old.get_name()}_workspace{del_line1} {self.comment} reuse")
return res

Contributor

I suspect it would be simpler to allocate workspaces separately from regular buffers. Doing it this way means workspaces can only be reused if the main buffer can be reused (which is often not the case).

VarRanges = Dict[sympy.Expr, sympy.Expr]


def do_bench_using_profiling(fn: Callable[[], Any], warmup=25, rep=100) -> None:
Contributor

Can we unify the way we benchmark Triton and CUTLASS?

Contributor Author

TritonTemplateKernel and CUDATemplateKernel benchmarking is unified in torch/_inductor/autotune_process.py:BenchmarkRequest::benchmark(). I introduced this method because the existing profiling method measures CPU-side kernel launch overhead and is not accurate. However, this method could also be fragile since it needs to filter out all irrelevant kernels. I'll add more comments.
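A simplified sketch of the profiling-based timing idea (the function and filter names here are assumptions; the real logic lives in do_bench_using_profiling and BenchmarkRequest::benchmark()):

import torch
from torch.profiler import profile, ProfilerActivity

def benchmark_gpu_kernel(fn, n_repeat=100):
    # Events that are bookkeeping rather than real device work.
    exclude = {"Context Sync", "Event Sync", "Stream Wait Event", "Stream Sync", "Unknown Sync"}
    fn()  # warm up
    torch.cuda.synchronize()
    with profile(activities=[ProfilerActivity.CUDA]) as prof:
        for _ in range(n_repeat):
            fn()
        torch.cuda.synchronize()
    # Sum device-side time of relevant kernels only, so CPU-side launch
    # overhead (and helper kernels such as the L2-cache-clearing one) is excluded.
    total_us = sum(
        evt.self_cuda_time_total
        for evt in prof.key_averages()
        if not evt.key.startswith("cuda") and evt.key not in exclude
    )
    return total_us / n_repeat / 1000.0  # ms per call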

Contributor

@aakhundov aakhundov left a comment

@ipiszy thanks for the e2e prototype! I've looked through some of the code. Will look further and likely add more comments.

{
    "max_autotune": True,
    "autotune_in_subproc": True,
    "max_autotune_gemm_backends": max_autotune_gemm_backends,
Contributor

Looks like max_autotune_gemm_backends doesn't exist in here. Add a @parameterized?

Contributor Author

Good catch! I'll just remove it.

+ f"create tensor {create_tensor_elapse}, bench {bench_elapse}, "
+ f"collected time {out}"
)
self.close()
Contributor

Should this be called by the caller of the benchmark function? Request closing itself looks somewhat asymmetric. Or is the semantics that a request may be benchmarked only once?

Contributor Author

I'll rename it to cleanup_run_fn() so that it's symmetric with make_run_fn().

self.workspace_size == 0
), "Autotune cache needs to be updated to support non-zero workspace_size!"

workspace = torch.empty(self.workspace_size, dtype=torch.uint8, device="cuda")
Contributor

Is the workspace Tensor going to be garbage-collected at the end of this function, as we're only using its .data_ptr() in the functools.partial(...) below? Perhaps it's better to set it as an attribute of CUDABenchmarkRequest? Then you can also get rid of the workspace_size attribute, as it's only used once during the workspace computation.

Contributor Author

Good catch! I think I still need to keep "workspace_size", as it needs to be propagated to the codegen stage for memory planning purposes. Will make workspace a member variable.
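A sketch of the suggested change (dll_run_fn and the argument plumbing are placeholders): keep the workspace alive as an attribute so it isn't garbage-collected while the run fn only holds its raw pointer.

import ctypes
import functools
import torch

def make_run_fn(self, *input_tensors):
    # Keep a reference on self so the buffer outlives the partial below.
    self.workspace = torch.empty(self.workspace_size, dtype=torch.uint8, device="cuda")
    ptrs = [ctypes.c_void_p(t.data_ptr()) for t in input_tensors]
    return functools.partial(
        self.dll_run_fn, *ptrs, ctypes.c_void_p(self.workspace.data_ptr())
    )

def cleanup_run_fn(self) -> None:
    # Drop the reference after benchmarking so the memory can be freed.
    self.workspace = None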

)

def profiling_keyword(self) -> str:
return "cutlass"
Contributor

Nit: I'd suggest either using "cuda" here or CUTLASS in the class name.

res = ""
if dst_file_suffix == "o":
    res = (
        _cuda_compiler() + " " + " ".join(options) + f" -c -o {dst_file} {src_file}"
Contributor

Nit: put the parts on separate lines for clarity? Also below.

# CUDA arch to use for CUDA template kernel compilation.
# Available options: "70", "75", "80", "90"
# When arch is None, the Inductor tries to detect CUDA arch by querying
# CUDA Python lib or "nvidia-smi" commandline.
Contributor

Super-nit: "nvidia-smi" command.

# e.g. "11.4", "12.1", etc.
# When version is None, the Inductor tries to detech CUDA version by
# querying CUDA Python lib for "nvidia-smi" commandline.
version = None
Contributor

Is it possible to compile with an earlier CUDA version than installed? If not, should we just auto-detect it from the system in all cases?


enable_ptxas_info = False

enable_debug_info = True
Contributor

We may want to make this False by default, eventually.

Comment on lines 1172 to +1177


def get_layout(self):
return self.data.get_layout()


Contributor

Nit: extra blank lines above and below. Linter will likely complain.

Comment on lines +2086 to +2088
def get_workspace_size(self):
return 0

Contributor

Should we move this to TemplateBuffer, as that seems to be the lowest common subclass where this is relevant? Or is it necessary for arbitrary Buffer instances to be able to return this?

Contributor Author

I feel it depends on the algorithm. In theory triton codegen code can also require extra workspace.

Contributor

Also, I think you're invoking this method on the abstract Buffer in some context, as I recall. So ignore the comment.

)


def use_cutlass_template(layout):
Contributor

This looks very similar to use_triton_template. Maybe extract common part into a helper?

is not None
else "percentiles"
)
return triton_do_bench
Contributor

This seems like a leftover?

Comment on lines +52 to +53
n_warmup = max(1, int(warmup / estimate_ms))
n_repeat = max(1, int(rep / estimate_ms))
Contributor

IIUC, for a 0.1 ms kernel this will yield 250 warmup and 1000 profiling iterations. Do we really need that many?

Contributor Author

I feel it doesn't matter? Is it an issue?

Contributor

Purely from an e2e time perspective. By decreasing warmup to, say, 10 iterations, you'd shave off ~1/5 of the total kernel runtime during profiling. Although maybe the total kernel runtime is not a bottleneck, so it doesn't matter, indeed.

def _should_keep(key: str) -> bool:
    if key.startswith("cuda"):
        return False
    if key == "Context Sync":
Contributor

Recently, we've also noticed other profiler events like:

        "Event Sync",
        "Stream Wait Event",
        "Stream Sync",
        "Unknown Sync",

Contributor Author

Will add them.
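A sketch of the extended filter with those events folded in (same shape as the _should_keep in the diff above):

_SYNC_EVENT_KEYS = {
    "Context Sync",
    "Event Sync",
    "Stream Wait Event",
    "Stream Sync",
    "Unknown Sync",
}

def _should_keep(key: str) -> bool:
    # Drop CUDA runtime calls (cudaLaunchKernel, ...) and host/device
    # synchronization events; keep only real device kernels.
    if key.startswith("cuda"):
        return False
    if key in _SYNC_EVENT_KEYS:
        return False
    return True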

Comment on lines -754 to +745
if len(choices) == 1:
return choices[0].output_node()
# if len(choices) == 1:
# return choices[0].output_node()
Contributor

Curious what's the reason of commenting this out?

Contributor Author

Need to make sure that workspace_size is fetched via the benchmark process.

Contributor

I see. I guess, this means that even one-option autotuning (like in the ATen-only cases) will run the kernel / invoke the subprocess etc.? I'm wondering if this can degrade compilation time. Likely the cache will be hit on subsequent attempts, though.

)
if buffer.get_workspace_size() > 0:
workspace_size = tuple(buffer.get_workspace_size())
workspace_stride = tuple(1)
Contributor

Should be tuple([1])? Throws an error otherwise (Python 3.11):

>>> tuple(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'int' object is not iterable

More importantly, this likely shows that the code path with workspace size > 1 is not verified.

Contributor Author

I'll either remove this logic or add more unit tests. Default CUTLASS GEMM kernels only have workspace_size == 0, so in the E2E tests this path is not exercised. According to Haicheng, only group_gemm and parallel split_k need this.
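For reference, a sketch of the corrected shape/stride construction if this codegen path is kept (it may instead be removed, per the reply above):

if buffer.get_workspace_size() > 0:
    # get_workspace_size() returns an int, so build 1-element tuples
    # instead of calling tuple() on the int itself.
    workspace_size = (buffer.get_workspace_size(),)
    workspace_stride = (1,)
    res.append(
        f"{buffer.get_name()}_workspace = empty_strided("
        f"{self.codegen_shape_tuple(workspace_size)}, "
        f"{self.codegen_shape_tuple(workspace_stride)}, "
        f"device='{device.type}', dtype={torch.uint8})"
    )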

return f"auto {name} = outputs[{output_idx}];"
else:
self.outputs_need_copy.add(name)
[self.outputs_need_copy.add(name)]
Contributor

[...] is not required here, but likely required on the L1295 above?

)
if (
inp_expanded.get_stride()[0] == 0
use_aten_gemm_kernels()
Contributor

Good catch!

try:
    from cuda import cuda

    _assert_cuda(cuda.cuInit(0))
Contributor

Can initializing CUDA outside torch lead to undesirable effects?



def dtype_match(torch_dtype, cutlass_dtype) -> bool:
if torch_dtype == torch.float or torch_dtype == torch.float32:
Contributor

Nit: use in? Here and below.
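The nit as a sketch, using membership tests instead of chained equality (only the f32 branch from the diff is shown; the rest would follow the same pattern):

def dtype_match(torch_dtype, cutlass_dtype) -> bool:
    if torch_dtype in (torch.float, torch.float32):
        return cutlass_dtype in (
            cutlass_lib.DataType.f32,
            cutlass_lib.DataType.tf32,
        )
    # ... analogous membership checks for the remaining dtypes ...
    return False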

Comment on lines +60 to +75
cutlass_dtype == cutlass_lib.DataType.f32
or cutlass_dtype == cutlass_lib.DataType.tf32
Contributor

Is it safe to treat both f32 and tf32 as equivalent to torch.float?

Contributor Author

tf32 is the NVIDIA way to process fp32 GEMMs using tensor cores. In CUTLASS it's not used in other places, so I think it's fine.

for i in range(len(call_args)):
    if V.graph.is_unspec_arg(call_args[i]):
        call_args[i] = call_args[i] + ".item()"
    call_args[i] = f"c_void_p({call_args[i]}.data_ptr())"
Contributor

This will break with the unspec args getting the .item() call above?

Contributor Author

Might be..
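One possible fix sketch (an assumption, not the PR's actual resolution): only wrap real tensor args in c_void_p(...), and let unspec scalars pass through via .item():

for i in range(len(call_args)):
    if V.graph.is_unspec_arg(call_args[i]):
        # Unspec args become Python scalars; don't take a data_ptr() on them.
        call_args[i] = call_args[i] + ".item()"
    else:
        call_args[i] = f"c_void_p({call_args[i]}.data_ptr())"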

Comment on lines 150 to 162
_DTYPE_TO_CUTLASS = {
    torch.float32: "float",
    torch.float64: "double",
    torch.float16: "cutlass::half_t",
    torch.int32: "int",
    torch.int8: "int8_t",
    torch.uint8: "uint8_t",
    torch.bool: "bool",
    torch.bfloat16: "cutlass::bfloat16_t",
}

def dtype(self, node: IRNode) -> str:
    return self._DTYPE_TO_CUTLASS.get(node.get_dtype())
Contributor

Would it make sense to move this to the CutlassTemplate below?

#endif
#endif
using bfloat16 = nv_bfloat16;
Contributor

Does this require #include <cuda_bf16.h>?

if filter_res is not None:
res.append(filter_res)
print(f"Got cutlass configs: {len(res)=}")
return [res[0]]
Contributor

I guess this is to speed up the profiling? But we shouldn't forget to return res here eventually, also for more thorough testing.

@ipiszy
Contributor Author

ipiszy commented Aug 17, 2023

I've finished basic A100 / H100 GEMM, GEMM+bias, GEMM+matrix tests and fixed some issues. Will work on separate PRs to address the comments, thanks!

@ipiszy ipiszy force-pushed the ipiszy/cutlass branch 2 times, most recently from 80f0c90 to 53e25a8 Compare August 23, 2023 01:13
Contributor Author

@ipiszy ipiszy left a comment

Thanks @jansel @aakhundov for the review! I've replied to some of the comments and am working on the follow-up PRs now.

raise exc.CppCompileError(cmd, error.output) from error
cls.cache[key] = CUDACodeCache.CacheEntry(input_path, output_path)

return (DLLWrapper(cls.cache[key].output_path), key, input_path)
Contributor Author

I'm worried about naming conflicts if multiple DLLs have the same symbol name, e.g. during benchmarking all GEMM kernels have the same name so that the code cache can be reused.

is_load_uint8_as_float: bool = False


@functools.lru_cache(None)
Contributor Author

@functools.cache is only available from Python 3.9. In the internal environment lots of hosts are still on Python 3.8, so I'd like to make sure this won't break there.

// When workspace_size is not a nullptr, populates requested workspace_size and returns.
// Otherwise, compuates the Gemm kernel using the given workspace ptr.
extern "C" {
{{kernel.def_kernel(inputs=[X, W, Bias], outputs=[Y], names_str="X, W, Bias, Y", input_reorder=input_reorder)}} {
Contributor Author

X, W, Bias are IRNodes. names_str represents the names of these nodes in the generated code.
I feel that passing IRNodes directly here makes the code a bit easier to reason about. Besides, GemmTemplate also uses these IRNodes to select CUTLASS configs and generate templates. It would be a bit confusing to keep the node / name mapping inside the GemmTemplate.

return True # skip checks for compatible tiling
# Only allow fusion for TritonTemplates for now.
# Fusion for CUDATemplates are not supported.
return isinstance(node1.node, TritonTemplateBuffer)
Contributor Author

isinstance(None, TritonTemplateBuffer) should return False so fusion won't happen. It should be fine?

@ipiszy
Contributor Author

ipiszy commented Aug 27, 2023

Hi @jansel @aakhundov , I created new PR stack #107933, ptal, thanks!


@ipiszy ipiszy closed this Oct 20, 2023
@github-actions github-actions bot deleted the ipiszy/cutlass branch February 21, 2025 02:04