Description
Block-wise quantization divides input tensors into smaller blocks that are quantized independently, resulting in faster optimization and higher-precision quantization. It is used by popular language models, such as the Phi-3 mini int4 quantized model.
Native ML API support
- DML: DML_OPERATOR_QUANTIZE and DML_OPERATOR_DEQUANTIZE, introduced in Feature Level 6.3
- CoreML: constexpr_blockwise_shift_scale
- TFLite: ?
Proposal
No API signature changes with respect to @fdwr's proposal of the dequantizeLinear and quantizeLinear ops.
```webidl
MLOperand dequantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
MLOperand quantizeLinear(MLOperand input, MLOperand scale, MLOperand zeroPoint, optional MLOperatorOptions options = {});
```

The block_size is an integer implied by `block_size = input_size / scale_size` (where `input_size % scale_size == 0`) along each dimension. `zeroPoint` and `scale` should have the same shape.
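To illustrate the implied-block-size semantics above, here is a minimal NumPy sketch (not WebNN API code; the function name and layout are illustrative assumptions) of block-wise dequantizeLinear: each `scale`/`zeroPoint` element covers a contiguous block of the input, with the block size along each dimension derived as `input_size / scale_size`.

```python
import numpy as np

def dequantize_linear_blockwise(x, scale, zero_point):
    """Hypothetical sketch of block-wise dequantizeLinear:
    block_size per dimension is implied as x.shape[d] // scale.shape[d];
    scale and zero_point must share the same shape."""
    assert scale.shape == zero_point.shape
    for d, (xs, ss) in enumerate(zip(x.shape, scale.shape)):
        # input_size % scale_size == 0 along every dimension
        assert xs % ss == 0, f"dim {d}: input size not divisible by scale size"
    # Expand scale/zero_point up to x's shape by repeating each block value.
    scale_full, zp_full = scale, zero_point
    for d, (xs, ss) in enumerate(zip(x.shape, scale.shape)):
        scale_full = np.repeat(scale_full, xs // ss, axis=d)
        zp_full = np.repeat(zp_full, xs // ss, axis=d)
    return (x.astype(np.float32) - zp_full.astype(np.float32)) * scale_full

# Example: a 1x8 quantized input with a 1x2 scale -> block_size = 8 / 2 = 4.
x = np.array([[0, 1, 2, 3, 4, 5, 6, 7]], dtype=np.int8)
scale = np.array([[0.5, 2.0]], dtype=np.float32)
zp = np.array([[0, 4]], dtype=np.int8)
print(dequantize_linear_blockwise(x, scale, zp))
# -> [[0.  0.5 1.  1.5 0.  2.  4.  6. ]]
```

Each of the two scale values applies to its own block of four input elements, which is what lets block-wise schemes keep precision high without storing a scale per element.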