llama : add benchmark example #2626
Conversation
JohannesGaessler left a comment
My results:
| model | backend | n_gpu_layers | n_prompt | n_gen | t/s |
|---|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 512 | 0 | 2153.50 ± 24.75 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 0 | 128 | 128.65 ± 0.13 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 512 | 128 | 449.72 ± 1.27 |
It would be nice if the table was already aligned in the console (assuming a monospace font).
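A sketch of one way to do that: compute each column's maximum width first, then pad every cell when printing. This is illustrative only, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Print a markdown table with cells padded so that the columns line up
// in a monospace console, while remaining valid markdown.
static void print_aligned(const std::vector<std::vector<std::string>> & rows) {
    std::vector<size_t> widths(rows[0].size(), 0);
    for (const auto & row : rows) {
        for (size_t i = 0; i < row.size(); i++) {
            widths[i] = std::max(widths[i], row[i].size());
        }
    }
    for (size_t r = 0; r < rows.size(); r++) {
        for (size_t i = 0; i < rows[r].size(); i++) {
            printf("| %-*s ", (int) widths[i], rows[r][i].c_str());
        }
        printf("|\n");
        if (r == 0) {  // separator line after the header row
            for (size_t i = 0; i < widths.size(); i++) {
                printf("| %s ", std::string(widths[i], '-').c_str());
            }
            printf("|\n");
        }
    }
}
```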
Since you explicitly asked in the other PR, these are my particular needs:
I think we could add SQLite as an output option if it is isolated to the example. We could copy the SQLite sources into the example.
Or we could just print SQL to stdout. SQLite ships a binary that accepts SQL on stdin, so piping the benchmark output into `sqlite3` would work.
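A minimal sketch of what SQL-on-stdout could look like, assuming a hypothetical `test_result` struct rather than the PR's actual types; the printed statements could be piped straight into the `sqlite3` CLI:

```cpp
#include <cstdio>

// Hypothetical result record; the real example tracks more fields.
struct test_result {
    const char * model;
    int          n_prompt;
    int          n_gen;
    double       avg_ts;  // average tokens per second
};

// Emit SQL on stdout, e.g. `./llama-bench ... | sqlite3 bench.db`.
static void print_sql(const test_result & r) {
    printf("CREATE TABLE IF NOT EXISTS test (model TEXT, n_prompt INT, n_gen INT, avg_ts REAL);\n");
    printf("INSERT INTO test VALUES ('%s', %d, %d, %f);\n", r.model, r.n_prompt, r.n_gen, r.avg_ts);
}
```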
In the third row of the sample results table in OP, what is the meaning of the t/s value?
Yes, it is the average t/s of a prompt of 512 tokens followed by a generation of 128 tokens. The way this works currently is that the t/s value is computed from the total number of tokens processed (prompt plus generation) divided by the total time of the test.
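In code, a sketch of that computation might look like this (names are illustrative, not the PR's):

```cpp
#include <cstdint>

// Combined throughput for a test that runs a prompt followed by a generation:
// total tokens processed divided by total wall time (in microseconds).
static double tokens_per_second(int n_prompt, int n_gen, uint64_t t_total_us) {
    return 1e6 * (n_prompt + n_gen) / (double) t_total_us;
}

// Example: 512 prompt + 128 generated tokens in ~1.423 s gives ~449.7 t/s,
// matching the third row of the table above.
```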
I just remembered: I think it would be useful to print a warning when the benchmark is being run with debug settings. I previously wasted time trying to figure out why the performance was suddenly terrible when I had just forgotten to remove `LLAMA_DEBUG` from my build command.
I will add a warning if `NDEBUG` is not defined.
I think that would be good enough. The biggest factor for performance is compiling without optimizations (because otherwise the compiler may optimize out local variables), and to my knowledge this always implies that `NDEBUG` is not defined either.
How about this for the performance columns: one column for prompt t/s, one for generation t/s, and one for the total time.
Unfortunately, that doesn't fit very well with the design. There may be any number of tests with any number of different parameters; tests don't need to include both a prompt and a generation (either one can be disabled by setting it to zero), and there may be multiple values for the number of prompt and generation tokens.
A few more examples of the markdown output with CUDA:
When building without a GPU backend, the number of threads is always shown instead of the number of GPU layers:
The current defaults are these:

```cpp
static cmd_params cmd_params_defaults = {
    /* model         */ {"models/7B/ggml-model-q4_0.bin"},
    /* n_prompt      */ {512},
    /* n_gen         */ {128},
    /* n_batch       */ {512},
    /* f32_kv        */ {false},
    /* n_threads     */ {get_num_physical_cores()},
    /* n_gpu_layers  */ {99},
    /* main_gpu      */ {0},
    /* mul_mat_q     */ {true},
    /* low_vram      */ {false},
    /* tensor_split  */ {{}},
    /* reps          */ 5,
    /* verbose       */ false,
    /* output_format */ MARKDOWN
};
```

The defaults should serve as a good standard test, so we may want to change some of the values.
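As a usage note, the flag names from the Metal invocation further down map onto these fields (`-m` for the model, `-p` for `n_prompt`, `-n` for `n_gen`, `-ngl` for `n_gpu_layers`), and per the PR description each option accepts multiple values, which are expanded into the full test matrix.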
You can check whether optimizations are enabled in gcc and clang by checking if `__OPTIMIZE__` is defined; `LLAMA_DEBUG` uses `-O0`. I personally build without `NDEBUG` in case I encounter a bug during normal use.
Nice, thanks. Checking `__OPTIMIZE__` as well should cover that case.
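A sketch of what the combined warning could look like, assuming gcc/clang semantics for `__OPTIMIZE__`; treat the exact conditions as an illustration rather than the PR's final code:

```cpp
#include <cstdio>

// Warn when the benchmark was built with debug settings, since the
// resulting numbers would not be representative.
static void warn_if_debug_build() {
#if !defined(NDEBUG)
    // asserts are compiled in
    fprintf(stderr, "warning: asserts enabled, performance may be affected\n");
#endif
#if !defined(_MSC_VER) && !defined(__OPTIMIZE__)
    // gcc/clang define __OPTIMIZE__ whenever an -O level above -O0 is used
    fprintf(stderr, "warning: debug build, performance may be affected\n");
#endif
}
```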
I noticed a weird bug when testing with prompt sizes 128 and 1024 in the same run with CUDA enabled. The test with size 128 will work, but the test with 1024 will fail with an out-of-memory error in the scratch buffer. When testing only with prompt size 1024, this doesn't happen. A new context is created for each test, so the tests should be independent.
I found the problem. Turns out, there is global state in llama.cpp: the scratch buffer sizes returned by `MEM_REQ_SCRATCH0` are kept in a function-local `static` map, so they are only computed for the `n_ctx` of the first call (see the diff below).
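For illustration, a minimal standalone example of the pitfall (not llama.cpp's actual code): a function-local `static` is initialized only on the first call, so later calls with a different argument still see the first result.

```cpp
#include <cstdio>

// The static value is initialized once, using the n_ctx of the FIRST call;
// subsequent calls with a larger n_ctx get a buffer sized for the first one.
static size_t scratch_size(int n_ctx) {
    static size_t sz = (size_t) n_ctx * 1024;  // computed only once
    return sz;
}

int main() {
    printf("%zu\n", scratch_size(128));   // 131072, sized for n_ctx = 128
    printf("%zu\n", scratch_size(1024));  // still 131072 -> too small
}
```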
I have added basic CPU info (Linux only), GPU info (CUDA only), and SQL output (probably only works with SQLite). Is there anything else that could be added here? Otherwise, I consider this ready to review.
ggerganov left a comment
Very useful tool!
Here are some TG results on M2 Studio:
```sh
LLAMA_METAL=1 make -j && ./llama-bench -m models/llama-7b/ggml-model-q4_0.bin -m models/llama-13b/ggml-model-q4_0.bin -m models/llama-30b/ggml-model-q4_0.bin -m models/llama-65b/ggml-model-q4_0.bin -ngl 1 -p 0 -n 32 2> /dev/null
```
| model | backend | n_gpu_layers | test | t/s |
|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | Metal | 1 | tg 32 | 90.23 ± 0.10 |
| LLaMA 13B mostly Q4_0 | Metal | 1 | tg 32 | 55.76 ± 0.13 |
| LLaMA 30B mostly Q4_0 | Metal | 1 | tg 32 | 26.49 ± 0.04 |
| LLaMA 65B mostly Q4_0 | Metal | 1 | tg 32 | 15.08 ± 0.02 |
```diff
 static std::map<e_model, size_t> MEM_REQ_SCRATCH0(int n_ctx)
 {
-    static std::map<e_model, size_t> k_sizes = {
+    std::map<e_model, size_t> k_sizes = {
```
Nice catch
Adds an example for running performance benchmarks. Multiple values can be specified for each option, and the benchmark runs all combinations of them. Supports output to CSV, JSON, or Markdown.
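A sketch of how such a matrix expansion can work, using a reduced, hypothetical parameter set (the real `cmd_params` has more fields):

```cpp
#include <string>
#include <vector>

// Reduced, hypothetical parameter set; each option may hold several values.
struct bench_instance {
    std::string model;
    int         n_prompt;
    int         n_gen;
};

// One benchmark instance per combination of the multi-valued options.
static std::vector<bench_instance> expand_matrix(
        const std::vector<std::string> & models,
        const std::vector<int>         & n_prompts,
        const std::vector<int>         & n_gens) {
    std::vector<bench_instance> out;
    for (const auto & m : models) {
        for (int p : n_prompts) {
            for (int g : n_gens) {
                out.push_back({m, p, g});
            }
        }
    }
    return out;
}
```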
Example markdown output: