Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication #439
ref #249 #95
See #407 (comment)
In the `mul_mat()` implementation we currently have 2 main branches:

- `src0` is contiguous in memory (code)
- `src0` is not contiguous in memory (code)

In the first branch we parallelize the computation along the `src0` rows. Each thread computes the dot product of a `src0` row with a `src1` column and writes the result into a cell of `dst`.

In the second branch we parallelize along the `src1` columns. Each thread computes a multiply + add (mad) of a `src0` column with an element from `src1` and writes the result into a per-thread temporary buffer row. At the end of the multiplication, the results from all temporary buffers are accumulated into `dst`.

The second branch makes the final result vary with the number of threads used, since the value of a single `dst` cell is computed by adding a different number of floating-point terms depending on the thread count. It is a bit more efficient, but it also uses a lot more memory due to the temporary buffers.

I am thinking that, in view of having more stable results and also significantly simplifying the code in `ggml.c`, we should eliminate this second branch. The solution is to always make sure that `src0` is contiguous, which the user can always achieve with a simple `ggml_cpy()` call.

The benefits are quite a lot:

- `ggml_vec_mad_xxx()` functions - can be simply deleted
- `ggml_forward_mul_mat_xxx()` implementations

The drawbacks: