
Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 #18484

Merged
jambayk merged 5 commits into main from jambayk/matmulbnb4-bf16
Nov 20, 2023

Conversation


@jambayk jambayk commented Nov 17, 2023

Description

Add bfloat16 support for MatMulBnb4 contrib op. This is useful for QLoRA fine-tuning.

  • On GPUs with SM80+ (A100, etc.), it uses the native CUDA bfloat16 dtype, `nv_bfloat16`. On other GPUs, it falls back to the onnxruntime `BFloat16` type, which uses float for compute.
  • I have validated the op in a llama2-7b training scenario. The losses match PyTorch training, and the training throughput is better.
  • A bfloat16 case cannot be added to the op unit test: casting BFloat16 to and from float multiple times during the test makes the required tolerances unachievable.
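The SM80+ dispatch criterion above can be sketched as follows. This is an illustrative helper, not the actual kernel-selection code; the function name and the suggested use of `torch.cuda.get_device_capability` are assumptions for the example.

```python
def supports_native_bf16(major: int, minor: int) -> bool:
    """SM80+ (Ampere and newer, e.g. A100) has native bfloat16 arithmetic."""
    return (major, minor) >= (8, 0)

# Illustrative usage with PyTorch (assumes a CUDA device is available):
#   major, minor = torch.cuda.get_device_capability(0)
#   use_nv_bfloat16 = supports_native_bf16(major, minor)

print(supports_native_bf16(8, 0))  # A100 (SM 8.0) -> True
print(supports_native_bf16(7, 5))  # T4 (SM 7.5)   -> False
```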

The custom autograd function exporter in onnxruntime-training is updated to support the latest version of bitsandbytes, which changed how the `quant_state` is stored.
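The exporter change can be pictured with a small adapter that accepts both layouts. The attribute name and the sequence layout below are assumptions for illustration, not the exact bitsandbytes structures:

```python
def get_absmax(quant_state):
    # Newer bitsandbytes versions wrap quantization metadata in an object
    # with named attributes; older versions stored it as a plain sequence.
    # Both layouts here are illustrative assumptions.
    if hasattr(quant_state, "absmax"):
        return quant_state.absmax   # object-style quant_state
    return quant_state[0]           # legacy sequence-style quant_state

class FakeQuantState:  # stand-in for the object form, for demonstration only
    def __init__(self, absmax):
        self.absmax = absmax

print(get_absmax(FakeQuantState([1.5])))  # [1.5]
print(get_absmax(([2.5], "nf4")))         # [2.5]
```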

Motivation and Context

Enable QLoRA fine-tuning with bfloat16.


askhade commented Nov 17, 2023

"Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable." - Can you add a test which will only run if the hardware supports the data type? We plan to move the CI pipelines on A10, once this is done we will be able to run the test.


jambayk commented Nov 17, 2023

"Cannot add a bfloat16 case in the op unit test since casting BFloat16 to and from float multiple times during the test causes the required tolerances to be unachievable." - Can you add a test which will only run if the hardware supports the data type? We plan to move the CI pipelines on A10, once this is done we will be able to run the test.

@askhade I tried this too but because of how the test is setup, the tolerance issue still remains.

  • The reference value for the matrix multiplication output is calculated in float and then cast to bfloat16.
  • The actual value is computed by:
    • casting the input to bfloat16
    • computing in bfloat16
    • accumulating in float, then casting to bfloat16

Because the cast to bfloat16 happens at different points in the two paths, the outputs differ significantly in some places. This was not an issue for float16, since casting from float to float16 loses far less precision (float16 keeps a 10-bit mantissa versus bfloat16's 7 bits).

Edit: The dequantization step also introduces differences.
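The tolerance problem can be reproduced outside the op with a quick sketch. Plain bit-mask truncation stands in for the bfloat16 cast here (the real kernels may round rather than truncate), and the matrix sizes are arbitrary:

```python
import numpy as np

def bf16(x):
    # Truncate float32 values to bfloat16 precision by zeroing the low
    # 16 bits (bfloat16 keeps float32's exponent and top 7 mantissa bits).
    u = np.asarray(x, dtype=np.float32).view(np.uint32)
    return ((u >> 16) << 16).view(np.float32)

rng = np.random.default_rng(0)
a = rng.standard_normal((64, 64)).astype(np.float32)
b = rng.standard_normal((64, 64)).astype(np.float32)

ref = bf16(a @ b)                 # reference path: matmul in float32, one final cast
actual = bf16(bf16(a) @ bf16(b))  # kernel-like path: cast inputs first, cast result

# The extra input casts perturb individual products enough that tight
# elementwise tolerances on the outputs become unachievable.
print(np.max(np.abs(actual - ref)))
```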

pengwa previously approved these changes Nov 17, 2023

@pengwa pengwa left a comment

LGTM. Thanks for fixing the exporter issue BTW.

@jambayk jambayk requested a review from pengwa November 17, 2023 21:18
@jambayk jambayk merged commit 1af0681 into main Nov 20, 2023
@jambayk jambayk deleted the jambayk/matmulbnb4-bf16 branch November 20, 2023 17:52
kleiti pushed a commit to kleiti/onnxruntime that referenced this pull request Mar 22, 2024

Bfloat16 support for MatMulBnb4, Training support bitsandbytes>=0.41.2 (microsoft#18484)

