sequence parallel default dtype #7364
Conversation
Signed-off-by: Stas Bekman <[email protected]>
So what's special about the A6000 GPU that it fails? I'm looking at the breakage. Basically it's this call that fails for older PyTorch versions, like pt-2.4, but it succeeds with pt-2.7.1.

Aha! This workflow uses: so it reports 2.6, but that's the half-baked pt-2.6 mentioned in the edit below. The caller is here:

Is there a reason why this workflow is locked onto this PyTorch version? If so, we have to make it run on an older PyTorch as well.

edit: The half-baked pt-2.6 comes from
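As a quick sanity check on what a runner actually has installed, standard `torch` introspection shows both the reported PyTorch version and the NCCL it was built against (this snippet is only for illustration, not part of the workflow):

```python
import torch

# A "half-baked" nightly/dev build shows up in torch.__version__ (e.g. 2.6.0.dev...).
print("torch:", torch.__version__)
print("cuda :", torch.version.cuda)          # CUDA toolkit version the wheel was built with
print("nccl :", torch.cuda.nccl.version())   # NCCL version bundled with this PyTorch build
```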
The newly released NCCL finally started to use fp32 accumulation for reduction ops!

* Floating point summation is always done in fp32 accumulators (with the exception of fp8 on NVLS, where it uses fp16 inside the switch). Thus, the accuracy with fp8 and fp16 data types should be much improved.

NVIDIA/nccl@72d2432

So we should change the fp32 comms default for SP to the same dtype as the inputs if `nccl>=2.27.3` - the user can still override the default.

---------

Signed-off-by: Stas Bekman <[email protected]>
Co-authored-by: Olatunji Ruwase <[email protected]>
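A minimal sketch of the version gate described above, assuming a hypothetical helper name and that `torch.cuda.nccl.version()` (which returns the bundled NCCL version as a tuple) is the version being checked; the actual DeepSpeed code path and config names may differ:

```python
import torch

# NCCL >= 2.27.3 accumulates floating-point reductions in fp32 internally,
# so communicating in the input dtype no longer loses reduction accuracy.
NCCL_FP32_ACCUM_VERSION = (2, 27, 3)


def default_sp_comm_dtype(input_dtype: torch.dtype) -> torch.dtype:
    """Hypothetical helper: pick the default dtype for sequence-parallel comms.

    A user-provided config value can still override whatever this returns.
    """
    if torch.cuda.nccl.version() >= NCCL_FP32_ACCUM_VERSION:
        # New NCCL: keep comms in the input dtype (e.g. bf16/fp16) and rely on
        # NCCL's internal fp32 accumulation for the reduction.
        return input_dtype
    # Older NCCL: upcast comms to fp32 to preserve accuracy (the previous default).
    return torch.float32
```

With a default like this, bf16/fp16 activations are reduced in their native dtype on new-enough NCCL, while older NCCL installs keep the previous fp32 comms behavior.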