Apropos of #3 and webmachinelearning/webmachinelearning-ethics#22, an efficient matmul implementation can be fingerprinted to determine hardware capabilities.
On pre-VNNI Intel CPUs, the only efficient way to implement 8-bit multiplication is pmaddubsw, which multiplies unsigned 8-bit by signed 8-bit values and sums adjacent pairs of products into a 16-bit result with signed saturation. One can construct probe matrices whose products trigger this saturation; observing it indicates a pre-VNNI Intel CPU, whereas ARM and NVIDIA hardware multiply signed * signed and accumulate into 32 bits, so no 16-bit saturation occurs.
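A minimal sketch of the probe, modelled in plain Python rather than actual SIMD code: one function mimics pmaddubsw's pairwise 16-bit saturating sums, the other a widening 32-bit path (VNNI, ARM sdot). The byte values below are illustrative choices, picked so each product pair overflows int16.

```python
def dot_pmaddubsw(u8s, s8s):
    """Model of SSSE3 pmaddubsw: unsigned*signed byte products,
    adjacent pairs summed into an int16 lane with signed saturation."""
    lanes = []
    for i in range(0, len(u8s), 2):
        pair = u8s[i] * s8s[i] + u8s[i + 1] * s8s[i + 1]
        lanes.append(max(-32768, min(32767, pair)))  # int16 saturation
    return sum(lanes)  # later horizontal sum is widened, no more clipping

def dot_widening(u8s, s8s):
    """Model of a VNNI/ARM-style path: products accumulate straight
    into 32 bits, nowhere near overflow for a short probe."""
    return sum(u * s for u, s in zip(u8s, s8s))

# Probe chosen so each byte pair overflows int16:
u = [255, 255]   # unsigned 8-bit activations
s = [127, 127]   # signed 8-bit weights
print(dot_pmaddubsw(u, s))  # saturates: 32767
print(dot_widening(u, s))   # exact:     64770
```

The two paths disagree on this input, so a single dot product in the output matrix distinguishes the hardware families.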
Saturating accumulation, which an implementation should use for accuracy lest overflow produce large sign errors, makes the result depend on the order of operations. So the saturation behaviour of vpdpbusds reveals the order in which the matmul accumulated its products.
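To illustrate why a saturating accumulator is order-dependent, here is a hedged sketch, again as a Python model: the `terms` are hypothetical per-step contributions (scaled up for brevity; a real probe would build them from many u8*s8 quads), and the final accumulator value differs depending on which order they are added in.

```python
I32_MAX, I32_MIN = 2**31 - 1, -(2**31)

def sat_add(acc, x):
    """Signed-saturating 32-bit add, as in the vpdpbusds accumulator."""
    return max(I32_MIN, min(I32_MAX, acc + x))

# Hypothetical per-step dot-product contributions:
terms = [I32_MAX, 1, -1]

def reduce_in_order(order):
    acc = 0
    for i in order:
        acc = sat_add(acc, terms[i])
    return acc

print(reduce_in_order([0, 1, 2]))  # +1 clips at the max, then -1: 2147483646
print(reduce_in_order([0, 2, 1]))  # -1 first, then +1 restores the max: 2147483647
```

Because the clipped intermediate is irrecoverable, the final value encodes which partial sum saturated first, i.e. the accumulation order.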
The clock-frequency reduction some CPUs apply when executing AVX-512 instructions is likely detectable through timing.
In floating point, one can also infer the order of operations from rounding, since float addition is not associative. This would reveal the SIMD register length and possibly variations in the compiler used to build the user agent. A cache-efficient (tiled) matmul implementation likewise reveals cache sizes via the floating-point order of operations.
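A small sketch of the float case, using Python's `struct` to round to binary32 as an fp32 matmul would (the 2-way pairwise tree stands in for a SIMD lane sum followed by a horizontal add; the input values are illustrative):

```python
import struct

def f32(x):
    """Round a Python float to binary32, as an fp32 SIMD matmul would."""
    return struct.unpack('f', struct.pack('f', x))[0]

vals = [0.1] * 8  # one row-times-column of a hypothetical probe matrix

# Scalar, strictly sequential accumulation:
seq = 0.0
for v in vals:
    seq = f32(seq + f32(v))

# Pairwise tree reduction, as wide SIMD lanes summed then reduced:
def pairwise(xs):
    if len(xs) == 1:
        return f32(xs[0])
    mid = len(xs) // 2
    return f32(pairwise(xs[:mid]) + pairwise(xs[mid:]))

pair = pairwise(vals)
print(seq == pair)  # False: the rounding pattern leaks the reduction order
```

The two reductions round differently, so comparing the output against precomputed values for each plausible reduction shape reveals the vector width (and, transitively, cache-blocking choices) of the implementation.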