BLAS provides a specialized routine, syrk, for the common A @ A.T operation. We should detect this case (same data pointer, reversed shape and strides) and dispatch to syrk instead of gemm. (see)
In addition to providing a nice speedup, this should presumably avoid issues like #6793, where dot(A, A.T) returned a matrix that was not exactly symmetric, which caused confusion.
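
For illustration, here is a rough Python-level sketch of the detection check. The actual change would presumably live in the C implementation of dot, and the helper name `is_self_transpose` is made up for this example, not an existing NumPy function.

```python
import numpy as np


def is_self_transpose(a, b):
    """Return True if b looks like a.T viewed on the same memory.

    Hypothetical helper: checks for the same data pointer with
    reversed shape and strides, which is the case where a @ b
    could be handed to syrk instead of gemm.
    """
    if a.ndim != 2 or b.ndim != 2:
        return False
    if a.dtype != b.dtype:
        return False
    # Address of the first element of each array.
    a_ptr = a.__array_interface__["data"][0]
    b_ptr = b.__array_interface__["data"][0]
    return (
        a_ptr == b_ptr
        and a.shape == b.shape[::-1]
        and a.strides == b.strides[::-1]
    )


A = np.arange(6.0).reshape(2, 3)
print(is_self_transpose(A, A.T))         # view of the same buffer
print(is_self_transpose(A, A.copy().T))  # same values, different buffer
```

The check only compares array metadata, so it costs next to nothing compared to the matrix product itself. A transposed copy of A is deliberately not detected, since comparing the values would be too expensive.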