conditionally enable hipsparse const descriptors for version >= 2.4.0, backport to release/1.13#1233

Merged
jeffdaily merged 4 commits into release/1.13 from release/1.13-hipsparse-const
Jun 1, 2023

Conversation


@jeffdaily (Collaborator) commented on Jun 1, 2023:

Backporting #1217 to release/1.13 branch. Required cherry-picking some upstream changes adding CUDA 12.0 support. Fixes SWDEV-403604.

IvanYashchuk and others added 4 commits June 1, 2023 17:52
cuSPARSE v12.0 started using const pointers for the descriptor types; from `cusparse.h` (the documentation is incorrect):
```cpp
typedef struct cusparseSpVecDescr const* cusparseConstSpVecDescr_t;
typedef struct cusparseDnVecDescr const* cusparseConstDnVecDescr_t;
typedef struct cusparseSpMatDescr const* cusparseConstSpMatDescr_t;
typedef struct cusparseDnMatDescr const* cusparseConstDnMatDescr_t;
```
The function signatures of the corresponding destructors also change to accept a const pointer. This PR adds `ConstCuSparseDescriptorDeleter`, which works with destructors of type `cusparseStatus_t (*)(const T*)`.

Some algorithm enums were deprecated in CUDA 11 and removed in CUDA 12; I replaced the following occurrences:
```
CUSPARSE_CSRMM_ALG1 -> CUSPARSE_SPMM_CSR_ALG1
CUSPARSE_COOMM_ALG1 -> CUSPARSE_SPMM_COO_ALG1
CUSPARSE_COOMM_ALG2 -> CUSPARSE_SPMM_COO_ALG2
```
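The replacements above are mechanical substitutions at call sites. An alternative pattern for keeping the old spellings compiling on new toolkits is a compatibility shim like the following (a sketch only; it is not what the PR does, and it assumes the `CUSPARSE_VERSION` macro from `cusparse.h`):

```cpp
// Compatibility-shim sketch: alias the legacy algorithm names, which were
// removed in CUDA 12, to their generic-API replacements. (Assumption:
// CUSPARSE_VERSION is 12000 or greater on a CUDA 12 toolkit.)
#include <cusparse.h>

#if CUSPARSE_VERSION >= 12000
#define CUSPARSE_CSRMM_ALG1 CUSPARSE_SPMM_CSR_ALG1
#define CUSPARSE_COOMM_ALG1 CUSPARSE_SPMM_COO_ALG1
#define CUSPARSE_COOMM_ALG2 CUSPARSE_SPMM_COO_ALG2
#endif
```

Editing call sites directly, as the PR does, is cleaner because it avoids redefining identifiers that older headers still declare.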

Pull Request resolved: pytorch#90765
Approved by: https://github.com/cpuhrsch
See pytorch#91122
Summary:
Some APIs are deprecated in newer versions of CUDA.
* cudaGraphInstantiate:
From:
```
cudaGraphInstantiate ( cudaGraphExec_t* pGraphExec, cudaGraph_t graph, cudaGraphNode_t* pErrorNode, char* pLogBuffer, size_t bufferSize )
```
To:
```
__host__​cudaError_t cudaGraphInstantiate ( cudaGraphExec_t* pGraphExec, cudaGraph_t graph, unsigned long long flags = 0 )
```
* cudaProfilerInitialize: deprecated in CUDA 11 and removed in CUDA 12
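A version-guarded call site for the signature change above could look like this (a sketch; it assumes the `CUDART_VERSION` macro from `cuda_runtime.h`, and the wrapper name is hypothetical):

```cpp
// Sketch of a version-guarded cudaGraphInstantiate call. The CUDA 12
// signature drops the error-node/log-buffer parameters in favor of a
// flags argument.
#include <cuda_runtime.h>

cudaError_t instantiateGraph(cudaGraphExec_t* exec, cudaGraph_t graph) {
#if CUDART_VERSION >= 12000
  return cudaGraphInstantiate(exec, graph, 0ULL);
#else
  return cudaGraphInstantiate(exec, graph, nullptr, nullptr, 0);
#endif
}
```

The pre-12 error-reporting parameters were rarely useful in practice, which is presumably why passing nulls for them, as above, was already the common idiom.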

Test Plan: GH CI

Differential Revision: D41469051

Pull Request resolved: pytorch#91050
Approved by: https://github.com/jianyuh
…#1217)

* conditionally enable hipsparse const descriptors

* update hipsparse const API version condition to 2.4.0
@jeffdaily (Collaborator, Author) commented:

CI is still running, but the build completed successfully; the build was the critical piece of this. The test2 suite hung during test_meta.py, which reminds me of other hangs there.

Merging since build passed.

@jeffdaily jeffdaily merged commit b4468f2 into release/1.13 Jun 1, 2023
akashveramd pushed a commit that referenced this pull request Jun 13, 2025
This PR implements a core 'real' training loop: it runs the deepseekv2 model using a number of Titan components to train on real (C4) data with AdamW, and displays initial training-loop metrics.

There is a lot more to be done but the goal here is to get a true
training loop going from which additional PRs will then improve upon it.

![Screenshot 2025-05-29 at 7 41 01 PM](https://github.com/user-attachments/assets/36ae2ff1-aa99-42c9-8b97-1e0a1ef8376e)

A couple of key highlights:
a - The model is now controllable via toml or the command line, just like Titan main. Note that expert parallel control is waiting for PR pytorch/torchtitan#1244 to land; at the moment it just manually sets ep to 2.
b - We use the HF deepseek tokenizer, and as a result I had to make a wrapper to deal with the bos and eos params passed by Titan.
c - Loss metrics, tps, etc. are displaying, but MFU and tflops need to be updated.

A lot more improvements will come shortly, but for now I want to land this to ensure our base deepseek training loop is available to iterate on.