When multiplying an MxK matrix with a KxM matrix, the current implementation in #5818 only allows a K dimension of 4 for int8 and 2 for bf16.
AMX's tiled operations should be able to scale to higher K values, as long as K spans a multiple of 4 bytes (a multiple of 4 elements for int8, or of 2 elements for bf16).
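
To illustrate the intended behaviour, here is a minimal reference sketch in plain Python. It is not the Halide implementation and uses no AMX intrinsics; the function name `matmul_int8_in_k_groups` and its `group` parameter are hypothetical. It assumes K is consumed in fixed-size groups (4 elements for int8) and shows that any K that is a multiple of the group size can be handled by accumulating one group at a time.

```python
def matmul_int8_in_k_groups(a, b, group=4):
    """Multiply a (M x K) by b (K x N), walking K in groups of `group` elements.

    Hypothetical reference model: each pass over a K-group mimics one
    accumulate step into the output, so larger K just means more passes.
    """
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions do not match")
    if k % group != 0:
        raise ValueError("K must be a multiple of the group size")
    n = len(b[0])

    # Accumulator; Python ints stand in for the wider accumulator type.
    c = [[0] * n for _ in range(m)]
    for k0 in range(0, k, group):  # one pass per K-group
        for i in range(m):
            for j in range(n):
                c[i][j] += sum(a[i][k0 + d] * b[k0 + d][j] for d in range(group))
    return c


if __name__ == "__main__":
    # K = 8, i.e. two groups of 4 int8 elements.
    a = [[1, 2, 3, 4, 5, 6, 7, 8]]
    b = [[1]] * 8
    print(matmul_int8_in_k_groups(a, b))
```

For bf16 the same sketch would be called with `group=2`, since two 2-byte elements fill the same 4 bytes.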