Support block-wise quantization #779

@huningxin


Block-wise quantization divides an input tensor into smaller blocks that are quantized independently, which results in faster optimization and higher-precision quantization. It is used by popular language models, such as the int4-quantized Phi-3 mini model.
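
To illustrate why independent blocks help precision, here is a minimal sketch in plain Python. It assumes symmetric quantization (zero point of 0) to the int4 range [-8, 7] and a scale derived from each block's maximum absolute value. The issue does not prescribe how scales are chosen, so `quantize_blockwise` and its parameters are hypothetical names used only for illustration.

```python
def quantize_blockwise(values, block_size, qmin=-8, qmax=7):
    """Quantize a flat list of floats in independent blocks.

    Assumption (not from the issue): symmetric quantization, one scale per
    block derived from the block's maximum absolute value.
    """
    if block_size <= 0 or len(values) % block_size != 0:
        raise ValueError("len(values) must be a positive multiple of block_size")
    quantized, scales = [], []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        # Fall back to a scale of 1.0 for an all-zero block to avoid dividing by zero.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer and clamp into the quantized range.
        quantized.extend(max(qmin, min(qmax, round(v / scale))) for v in block)
    return quantized, scales


values = [0.5, -1.0, 3.5, 2.0, 56.0, -24.0, 8.0, 0.0]
blockwise_q, blockwise_scales = quantize_blockwise(values, block_size=4)
per_tensor_q, per_tensor_scales = quantize_blockwise(values, block_size=len(values))
```

With two blocks of four, the small values in the first block get their own scale and keep distinct quantized values. With a single block covering the whole input, the large values in the second half dictate the scale and the first four values all collapse to 0.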

Native ML API support

DML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
CoreML: constexpr_blockwise_shift_scale
TFLite: ?

Proposal

No API signature changes relative to @fdwr's proposal of the dequantizeLinear and quantizeLinear ops.

MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});

The block size is not a separate parameter: along a given dimension it is the integer implied by block_size = input_size / scale_size, which requires input_size % scale_size == 0. zeroPoint and scale should have the same shape.
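
The implied block size can be sketched as a reference computation in plain Python. This is not the WebNN implementation. It assumes that input and scale have the same rank, that the rule applies independently to every dimension, and that dequantization follows the usual linear formula output = (input - zeroPoint) * scale. The helper names `implied_block_shape` and `dequantize_linear_2d` are hypothetical.

```python
def implied_block_shape(input_shape, scale_shape):
    """Derive the per-dimension block size from the input and scale shapes."""
    if len(input_shape) != len(scale_shape):
        # Assumption for this sketch: scale has the same rank as input.
        raise ValueError("input and scale must have the same rank")
    block_shape = []
    for input_size, scale_size in zip(input_shape, scale_shape):
        if scale_size <= 0 or input_size % scale_size != 0:
            raise ValueError("input_size must be a multiple of scale_size")
        block_shape.append(input_size // scale_size)
    return block_shape


def dequantize_linear_2d(input, scale, zero_point):
    """Block-wise dequantization of a 2-D nested list (reference sketch)."""
    rows, cols = len(input), len(input[0])
    scale_shape = [len(scale), len(scale[0])]
    if scale_shape != [len(zero_point), len(zero_point[0])]:
        raise ValueError("zeroPoint and scale must have the same shape")
    block_rows, block_cols = implied_block_shape([rows, cols], scale_shape)
    # Element (r, c) uses the scale and zero point of the block that contains it.
    return [
        [
            (input[r][c] - zero_point[r // block_rows][c // block_cols])
            * scale[r // block_rows][c // block_cols]
            for c in range(cols)
        ]
        for r in range(rows)
    ]


quantized = [[1, 2, 3, 4], [5, 6, 7, 8]]  # input shape [2, 4]
scale = [[0.5, 2.0]]                      # scale shape [1, 2]
zero_point = [[1, 0]]                     # same shape as scale
block_shape = implied_block_shape([2, 4], [1, 2])
result = dequantize_linear_2d(quantized, scale, zero_point)
```

Here a [2, 4] input with a [1, 2] scale implies 2x2 blocks, so the left two columns share one scale and zero point and the right two columns share another.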
