(de)quantization behaviors on CoreML #822

@philloooo

Description

Hi folks! I've implemented some version of quantize/dequantize on CoreML, here are the findings:

CoreML constexpr_blockwise_shift_scale fully supports WebNN's dequantize, but only for constant input, scale, and zero_point.

  • It also states: "Although all parameters of this op are constants, this op is not constant-folded to a single const op at the time of model serialization. The unquantized output will be decompressed later, based on the implementation detail (either at model load time or runtime)." This suggests CoreML is deliberate about when it decompresses the constants; I assume it would want to defer decompressing until actual computation happens on some devices.
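For concreteness, the blockwise shift-and-scale that constexpr_blockwise_shift_scale performs can be sketched like this (a minimal numpy model of the semantics; the function name and block layout here are illustrative, not CoreML's actual API):

```python
import numpy as np

def dequantize_blockwise(data, scale, zero_point, block_size):
    """Dequantize int8 data with one (scale, zero_point) pair per block
    along the last axis: out = (data - zero_point) * scale."""
    # Expand each per-block parameter across its block of elements.
    scale_full = np.repeat(scale, block_size, axis=-1)
    zp_full = np.repeat(zero_point, block_size, axis=-1)
    return (data.astype(np.float32) - zp_full.astype(np.float32)) * scale_full

data = np.array([[10, 20, 30, 40]], dtype=np.int8)
scale = np.array([[0.5, 2.0]], dtype=np.float32)   # 2 blocks of size 2
zero_point = np.array([[0, 10]], dtype=np.int8)
print(dequantize_blockwise(data, scale, zero_point, block_size=2))
# → [[ 5. 10. 40. 60.]]
```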

CoreML supports dequantize and quantize for non-constant inputs, although they're quite limited:

  • These two ops fail at compile time when scale is negative: Failed to parse the model specification. Error: Unable to parse ML Program: in operation of type quantize: For operator: quantize, scale must be positive, but get -4.61708
  • It still requires scale and bias to be constants.
  • No blockwise (de)quantization.
  • Only (u)int8 is supported, no (u)int4 - and (u)int4 can't be emulated, because cast doesn't support (u)int4.
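For the emulation path, dequantize decomposes into elementwise cast → sub → mul; a hypothetical sketch with numpy stand-ins for the graph ops, which also shows why the missing (u)int4 in cast blocks emulation - the very first step needs a cast from the quantized storage type:

```python
import numpy as np

def emulated_dequantize(x_quant, scale, zero_point):
    """Emulate dequantize with elementwise graph ops: cast -> sub -> mul.
    Only possible when cast supports the quantized dtype (int8 here)."""
    x = x_quant.astype(np.float32)       # cast: no (u)int4 dtype exists to cast from
    zp = zero_point.astype(np.float32)   # cast the zero point too
    return (x - zp) * scale              # sub, then mul

x = np.array([100, -100], dtype=np.int8)
print(emulated_dequantize(x, np.float32(0.1), np.int8(4)))  # ≈ [9.6 -10.4]
```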

So for non-constant inputs, we would fall back to emulation when:

  • scale and zero-point are non-constant, or
  • it's blockwise.

The same fallback applies once we need to support dynamic quantization.

And (u)int4 will just be unsupported on CoreML for non-constant inputs until CoreML evolves.
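Putting those conditions together, the backend choice might look roughly like this (a hypothetical helper to illustrate the decision, not actual Chromium code):

```python
def coreml_can_use_native_quantize(scale_is_const, zp_is_const,
                                   is_blockwise, bits):
    """Decide how to lower (de)quantize for a NON-constant input on CoreML,
    per the limitations above. Returns True for the native ops, False for
    emulation, None for unsupported."""
    if bits == 4:
        return None  # no native (u)int4, and cast can't emulate it either
    if is_blockwise or not (scale_is_const and zp_is_const):
        return False  # fall back to emulation
    return True       # native quantize/dequantize

print(coreml_can_use_native_quantize(True, True, False, 8))  # → True
```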

I do wonder - is it necessary to support negative scale? Supporting it means we would need to manually inspect the tensor in C++ to decide whether to fall back to emulation.

If we do need to support negative scale, then since CoreML requires scale to be constant, we have the following options:
a. In the Chromium C++ code, manually scan the scale tensor for negative values; if any are present, emulate, otherwise use dequantize or quantize.
b. Don't scan the scale tensor; always emulate. 🤔
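Option (a) would amount to a one-time scan of the constant scale tensor at graph-build time; a minimal sketch of the check (the real version would walk the raw weight bytes in C++):

```python
def needs_emulation_for_scale(scale_values):
    """Option (a): scan the constant scale tensor once and fall back to
    emulation if any value is not positive, since CoreML's quantize/
    dequantize reject it at compile time ("scale must be positive")."""
    return any(s <= 0 for s in scale_values)

print(needs_emulation_for_scale([0.5, -4.61708]))  # → True
```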

Would love to get opinions from this group.
