Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication #439

ggerganov · 2023-03-23T18:41:31Z

The mul_mat() implementation currently has 2 main branches:

src0 is contiguous in memory
src0 is not contiguous in memory

In the first branch we parallelize the computation along the src0 rows. Each thread computes a dot product of src0 row with src1 column and writes the result into a cell of dst.
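A minimal sketch of the first branch, not the actual ggml.c code. The names mul_mat_rows and vec_dot_f32 are hypothetical, and the layout is an assumption: src0 is row-major with rows of length k, each src1 column is stored as k contiguous floats, and dst holds one contiguous run of nrows0 cells per src1 column. Thread ith of nth takes a contiguous slice of src0 rows.

```c
#include <assert.h>
#include <stddef.h>

/* Dot product of two contiguous float vectors of length n. */
static float vec_dot_f32(size_t n, const float *x, const float *y) {
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

/* Work done by thread `ith` out of `nth` (hypothetical helper).
 * src0: nrows0 rows of length k, contiguous.
 * src1: ncols1 columns, each stored as k contiguous floats (assumed layout).
 * dst:  ncols1 * nrows0 cells; cell (c, r) lives at dst[c * nrows0 + r]. */
static void mul_mat_rows(int ith, int nth, size_t nrows0, size_t ncols1, size_t k,
                         const float *src0, const float *src1, float *dst) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    /* Split the src0 rows into nth contiguous slices. */
    const size_t per = (nrows0 + (size_t) nth - 1) / (size_t) nth;
    const size_t r0  = per * (size_t) ith;
    const size_t r1  = r0 + per < nrows0 ? r0 + per : nrows0;
    for (size_t r = r0; r < r1; ++r) {
        for (size_t c = 0; c < ncols1; ++c) {
            /* Each dst cell is written by exactly one thread. */
            dst[c * nrows0 + r] = vec_dot_f32(k, src0 + r * k, src1 + c * k);
        }
    }
}
```

Because every dst cell is produced by a single dot product on a single thread, the order of the additions inside a cell does not depend on nth.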

In the second branch we parallelize along the src1 columns. Each thread computes multiply + add (mad) of a src0 column with an element from src1 and writes the result into a per-thread temporary buffer row. At the end of the multiplication, the results from all temporary buffers are accumulated into dst.
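A rough sketch of the second branch for a single src1 column, again not the real ggml.c code. The helper names (vec_mad_f32, mul_mat_mad_partial, mul_mat_mad_reduce) are hypothetical, and it assumes the work is split over the shared dimension k, so that each thread sees only some of the terms of every dst cell. That split is what makes the per-thread buffers necessary.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply + add (mad): y[i] += x[i] * v for i in [0, n). */
static void vec_mad_f32(size_t n, float *y, const float *x, float v) {
    for (size_t i = 0; i < n; ++i) {
        y[i] += x[i] * v;
    }
}

/* Thread `ith` of `nth` accumulates its share of src0 columns into its own
 * row of tmp (hypothetical helper).
 * src0_cols: k columns of length nrows; column j starts at src0_cols + j * nrows.
 * src1:      k elements (one src1 column).
 * tmp:       nth rows of length nrows, zero-initialised by the caller. */
static void mul_mat_mad_partial(int ith, int nth, size_t nrows, size_t k,
                                const float *src0_cols, const float *src1,
                                float *tmp) {
    assert(nth > 0 && ith >= 0 && ith < nth);
    const size_t per = (k + (size_t) nth - 1) / (size_t) nth;
    const size_t j0  = per * (size_t) ith;
    const size_t j1  = j0 + per < k ? j0 + per : k;
    for (size_t j = j0; j < j1; ++j) {
        vec_mad_f32(nrows, tmp + (size_t) ith * nrows, src0_cols + j * nrows, src1[j]);
    }
}

/* Final step: dst[i] = sum over all threads of tmp[t][i]. */
static void mul_mat_mad_reduce(int nth, size_t nrows, const float *tmp, float *dst) {
    for (size_t i = 0; i < nrows; ++i) {
        float sum = 0.0f;
        for (int t = 0; t < nth; ++t) {
            sum += tmp[(size_t) t * nrows + i];
        }
        dst[i] = sum;
    }
}
```

Note that tmp needs nth times the storage of the result it feeds, which is where the extra memory goes.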

The second branch makes the final result vary with the number of threads, since the value of a single dst cell is computed by adding a different number of floating-point terms (one partial sum per thread) depending on the thread count. It is a bit more efficient, but it also uses a lot more memory due to the temporary buffers.
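The effect can be reproduced in isolation. The sketch below uses a hypothetical chunked_sum() helper, not anything from ggml: it sums the same four floats either as one running sum or as per-chunk partial sums that are then added together. It assumes the arithmetic is evaluated in IEEE-754 single precision, with no extended-precision intermediates and no -ffast-math.

```c
#include <assert.h>
#include <stddef.h>

/* Sum n floats by splitting them into nth contiguous chunks, summing each
 * chunk into its own partial, then adding the partials in chunk order.
 * nth == 1 is the plain left-to-right sum. */
static float chunked_sum(const float *x, size_t n, int nth) {
    assert(nth > 0);
    const size_t per = (n + (size_t) nth - 1) / (size_t) nth;
    float total = 0.0f;
    for (int t = 0; t < nth; ++t) {
        const size_t j0 = per * (size_t) t;
        const size_t j1 = j0 + per < n ? j0 + per : n;
        float partial = 0.0f;
        for (size_t j = j0; j < j1; ++j) {
            partial += x[j];
        }
        total += partial;
    }
    return total;
}
```

For the input {1e8f, 1.0f, -1e8f, 1.0f}, one chunk gives 1.0f (the first 1.0f is absorbed by 1e8f, the second survives), while two chunks give 0.0f (each 1.0f is absorbed inside its own chunk). The same terms yield a different result purely because of how they were grouped.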

To get more stable results and significantly simplify the code in ggml.c, I think we should eliminate the second branch. The solution is to always make sure that src0 is contiguous, which the user can always achieve with a simple ggml_cpy() call.
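Conceptually, making src0 contiguous means materialising a strided view into a dense buffer. The sketch below is a plain-C stand-in for that idea and does not use the ggml API: struct view2d, view_is_contiguous() and copy_to_contiguous() are hypothetical, and strides are counted in elements for simplicity.

```c
#include <assert.h>
#include <stddef.h>

/* A strided 2-D view over float data (hypothetical type, strides in elements). */
struct view2d {
    const float *data;
    size_t nrows, ncols;
    size_t row_stride, col_stride;
};

/* A view is dense row-major when consecutive columns are adjacent and
 * consecutive rows are exactly ncols elements apart. */
static int view_is_contiguous(const struct view2d *v) {
    return v->col_stride == 1 && v->row_stride == v->ncols;
}

/* Copy the view into `out`, which must hold nrows * ncols floats.
 * After the copy, `out` is a dense row-major matrix. */
static void copy_to_contiguous(const struct view2d *v, float *out) {
    for (size_t r = 0; r < v->nrows; ++r) {
        for (size_t c = 0; c < v->ncols; ++c) {
            out[r * v->ncols + c] = v->data[r * v->row_stride + c * v->col_stride];
        }
    }
}
```

A transposed view of a 2 x 3 matrix, for example, has row_stride 1 and col_stride 3. It fails the contiguity check, and one copy turns it into a dense 3 x 2 matrix that the row-parallel branch can consume. The price is one extra buffer of the matrix size and one pass over the data.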

The benefits are quite a lot:

no need to maintain ggml_vec_mad_xxx() functions - can be simply deleted
reproducible results for different number of threads
simpler ggml_forward_mul_mat_xxx() implementations
less memory usage

The drawbacks:

very slight performance degradation

ggerganov · 2023-03-23T19:34:14Z

@blackhole89 bringing your attention to this

IlyaBizyaev · 2023-03-24T00:04:16Z

For me, this change increases the memory usage so drastically that main crashes with ggml_new_tensor_impl: not enough space in the context's memory pool during the initial prompt (there's 64G, and before the change just 8G was enough).

Ubuntu 22.04.2 LTS, GCC 12


Avoid the transposed X branch in the Z = X * Y matrix multiplication

d90112a

ggerganov mentioned this pull request Mar 23, 2023

Add support for batch size to --perplexity #407

Merged

ggerganov changed the title ~~Avoid the transposed X branch in the Z = X * Y matrix multiplication~~ Avoid the "transposed X" branch in the Z = X * Y matrix multiplication Mar 23, 2023

ggerganov marked this pull request as ready for review March 23, 2023 19:33

ggerganov changed the title ~~Avoid the "transposed X" branch in the Z = X * Y matrix multiplication~~ Avoid the "non-contiguous X" branch in the Z = X * Y matrix multiplication Mar 23, 2023

Green-Sky mentioned this pull request Mar 23, 2023

dynamic estimate of required memory usage #438

Closed

ggerganov merged commit 483bab2 into master Mar 23, 2023

ggerganov deleted the mat-mul-stability branch March 23, 2023 21:22

ggerganov mentioned this pull request Mar 23, 2023

Eliminate ggml_forward_mul_mat_xxx() branch for non-contiguous src0 #441

Closed

KerfuffleV2 mentioned this pull request Mar 24, 2023

Making results independent from threadcount/batch size (from llama.cpp) rustformers/llm#67

Closed

KerfuffleV2 added a commit to KerfuffleV2/llama-rs that referenced this pull request Mar 24, 2023

Copy v_transposed like llama.cpp

c2d9ab3

See ggml-org/llama.cpp#439 Closes rustformers#67

KerfuffleV2 mentioned this pull request Mar 24, 2023

Copy v_transposed like llama.cpp rustformers/llm#68

Merged

KerfuffleV2 added a commit to KerfuffleV2/llama-rs that referenced this pull request Mar 26, 2023

Copy v_transposed like llama.cpp

34ba664

See ggml-org/llama.cpp#439 Closes rustformers#67

philpax pushed a commit to rustformers/llm that referenced this pull request Mar 26, 2023

Copy v_transposed like llama.cpp (#68)

b103dcd

See ggml-org/llama.cpp#439 Closes #67

cyyynthia mentioned this pull request Mar 30, 2023

Performance Discrepancy: gpt4all Faster than Optimized llama.cpp #603

Closed

sw added a commit to sw/llama.cpp that referenced this pull request Apr 4, 2023

Revert "Avoid the transposed X branch in the Z = X * Y matrix multipl…

1cecdec

…ication (ggml-org#439)" This reverts commit 483bab2.

ggerganov mentioned this pull request Apr 5, 2023

Avoid heavy V transpose operation + improvements #775

Merged

ggerganov mentioned this pull request Sep 6, 2023

KV cache quantized to q8_0 #2969

Closed

AAbushady pushed a commit to AAbushady/llama.cpp that referenced this pull request Jan 27, 2024

Reduce warnings. (ggml-org#439)

f6ba36d
