
Conversation

@slaren (Member) commented Aug 15, 2023

Adds an example for running performance benchmarks. Multiple values can be specified for each option, and it will run the matrix of all of them. Supports output to csv, json or markdown.
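
For illustration, a run that sweeps two options at once could look like the following (a sketch only: the `-m`, `-ngl`, and `-t` flags appear in the examples further down in this thread, while selecting the format with `-o csv` is an assumption):

./llama-bench -m models/7B/ggml-model-q4_0.bin -ngl 20,99 -t 8,16 -o csv

This would benchmark every combination of the listed -ngl and -t values and print the results as CSV.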

Example markdown output:

| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | pp 512 | 2242.06 ± 24.26 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | tg 128 | 43.09 ± 0.41 |

@JohannesGaessler (Collaborator) left a comment

My results:

| model | backend | n_gpu_layers | n_prompt | n_gen | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 512 | 0 | 2153.50 ± 24.75 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 0 | 128 | 128.65 ± 0.13 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 512 | 128 | 449.72 ± 1.27 |

It would be nice if the table was already aligned in the console (assuming a monospace font).

@JohannesGaessler (Collaborator)

Since you explicitly asked in the other PR, these are my particular needs:

  • Most frequently I run benchmarks to test one git revision vs. another. I usually put the git revisions in separate columns right next to each other with an extra column for the speedup since I think that this is the easiest way to compare the results. I also add some basic hardware info like the GPU. My current workflow is that I just type the numbers from the console into an Emacs org mode table and then replace + with | to get a markdown table.
  • I want to start collecting, curating, and publishing GPU data for my blog soon. This will include performance but also some perplexity benchmarks. I would want to periodically re-run those benchmarks in order to keep the numbers up-to-date. So my goal is to completely automate the process of running the benchmarks and creating the tables and plots.
  • I will use Slurm to schedule the compute jobs associated with the previous point on my server. I therefore want an easy way to collect the data from the various runs for my analysis scripts. A database would be ideal, separate self-contained files that contain all relevant information also work.

@slaren (Member, Author) commented Aug 15, 2023

I think we could add sqlite as an output option if it is isolated to the example. We could copy the sqlite3.h/c files into the example directory to avoid having to install or download it separately to build llama.cpp, or just make it optional at compile time. The hardware info seems a bit more complicated to do in a multi-platform way, but maybe we could add it only for Linux initially and expand it to other platforms later, if anyone is interested in that. Other than that, I think this should work as is.

@JohannesGaessler (Collaborator)

> I think we could add sqlite as an output option if it is isolated to the example.

Or we could just print SQL to stdout. SQLite comes with a binary that accepts SQL on stdin, so something like `llama-bench -o sql | sqlite3 llama.sqlite` would work, assuming the user has SQLite installed (and the documentation includes the command to initialize the database).
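
A rough sketch of that workflow, following the document's own command-line style (the table and column names below are purely hypothetical; the real schema would come from whatever `-o sql` actually emits and from the documentation):

sqlite3 llama.sqlite 'CREATE TABLE IF NOT EXISTS test (model TEXT, backend TEXT, n_gpu_layers INTEGER, n_prompt INTEGER, n_gen INTEGER, avg_ts REAL, stddev_ts REAL);'

./llama-bench -o sql | sqlite3 llama.sqlite

Each run would then append rows to the same database, which fits the "collect data from various runs for analysis scripts" use case mentioned earlier.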

@ggerganov (Member)

In the third row of the sample results table above, what is the meaning of the t/s column? Is it averaged across PP and TG?
If so, I don't think it is very informative - separate numbers would be much better.

@slaren (Member, Author) commented Aug 16, 2023

Yes, it is the average t/s of a prompt of 512 tokens followed by a generation of 128 tokens. The way this works currently is that n_prompt and n_gen are part of the grid, by default with the values 0,512 and 0,128, and all combinations are tested (except 0,0, which would do nothing). I agree that it is probably not very useful to do it this way, and it may be better to consider each value of n_prompt and n_gen as a separate test instead.
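
A minimal standalone sketch of how that grid expansion behaves with the current defaults (an illustration of the combinations only, not the actual llama-bench code):

```cpp
#include <cstdio>
#include <vector>

int main() {
    // Default candidate values discussed above.
    std::vector<int> n_prompt_values = {0, 512};
    std::vector<int> n_gen_values    = {0, 128};

    for (int n_prompt : n_prompt_values) {
        for (int n_gen : n_gen_values) {
            if (n_prompt == 0 && n_gen == 0) {
                continue; // nothing to benchmark
            }
            // Each remaining combination becomes one test, which is where the
            // mixed "512 prompt + 128 generation" row in the table above comes from.
            std::printf("test: n_prompt=%d n_gen=%d\n", n_prompt, n_gen);
        }
    }
    return 0;
}
```

With the change discussed here, each n_prompt and n_gen value would instead become its own test, so the mixed row would go away.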

@JohannesGaessler (Collaborator)

I just remembered: I think it would be useful to print a warning when the benchmark is being run with debug settings. I previously wasted time trying to figure out why performance was suddenly terrible, when I had simply forgotten to remove LLAMA_DEBUG=1 from the compile args (which had disabled all optimizations).

@slaren (Member, Author) commented Aug 16, 2023

I will add a warning if NDEBUG is not defined, but it is not completely reliable. I am not sure if there is a better way.

@JohannesGaessler (Collaborator)

I think that would be good enough. The biggest factor for performance is compiling without optimizations (debug builds disable them so the compiler doesn't optimize out local variables), and to my knowledge this always implies that NDEBUG is not defined.

@JohannesGaessler (Collaborator)

How about this for the performance columns: one column for prompt t/s, one for generation t/s, and one for the total time.

@slaren (Member, Author) commented Aug 16, 2023

Unfortunately, that doesn't fit very well with the design. There may be any number of tests with any number of different parameters: tests don't need to include both prompt and generation (either can be disabled by setting it to zero), and there may be multiple values for the number of prompt and generation tokens.

@slaren (Member, Author) commented Aug 16, 2023

A few more examples of the markdown output with CUDA:

./llama-bench

| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | pp 512 | 2240.84 ± 12.03 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | tg 128 | 43.30 ± 0.33 |

./llama-bench -m models/3B/ggml-model-q4_0.bin -m models/7B/ggml-model-q4_0.bin

| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | pp 512 | 3545.07 ± 194.87 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | pp 512 | 2247.62 ± 6.89 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | tg 128 | 55.09 ± 0.39 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | tg 128 | 43.74 ± 0.20 |

./llama-bench -m models/3B/ggml-model-q4_0.bin -m models/7B/ggml-model-q4_0.bin -mmq 0,1 -n 0

| model | backend | n_gpu_layers | mul_mat_q | test | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 0 | pp 512 | 2675.03 ± 92.08 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 1 | pp 512 | 3616.52 ± 16.05 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 0 | pp 512 | 1595.56 ± 32.13 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 1 | pp 512 | 2237.83 ± 13.17 |

./llama-bench -m models/3B/ggml-model-q4_0.bin -b 128,256,512 -n 0

| model | backend | n_gpu_layers | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 128 | pp 512 | 1873.98 ± 16.27 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 256 | pp 512 | 2618.88 ± 34.35 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 512 | pp 512 | 3596.80 ± 46.84 |

./llama-bench -m models/3B/ggml-model-q4_0.bin -b 128,256,512 -n 0 -mmq 0,1

| model | backend | n_gpu_layers | n_batch | mul_mat_q | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 128 | 0 | pp 512 | 1369.07 ± 27.19 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 128 | 1 | pp 512 | 1890.14 ± 14.18 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 256 | 0 | pp 512 | 2105.17 ± 4.54 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 256 | 1 | pp 512 | 2627.18 ± 18.51 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 512 | 0 | pp 512 | 2722.97 ± 4.72 |
| LLaMA 3B mostly Q4_0 | CUDA | 99 | 512 | 1 | pp 512 | 3618.90 ± 27.27 |

./llama-bench -ngl 20,30,99

| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | CUDA | 20 | pp 512 | 460.75 ± 2.28 |
| LLaMA 7B mostly Q4_0 | CUDA | 30 | pp 512 | 626.80 ± 4.26 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | pp 512 | 2249.36 ± 8.20 |
| LLaMA 7B mostly Q4_0 | CUDA | 20 | tg 128 | 19.78 ± 1.38 |
| LLaMA 7B mostly Q4_0 | CUDA | 30 | tg 128 | 31.29 ± 0.15 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | tg 128 | 43.42 ± 0.46 |

@slaren (Member, Author) commented Aug 16, 2023

When building without a GPU backend, the number of threads is always shown instead of the number of GPU layers:

./llama-bench -m models/3B/ggml-model-q4_0.bin

| model | backend | n_threads | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CPU | 16 | pp 512 | 64.20 ± 0.88 |
| LLaMA 3B mostly Q4_0 | CPU | 16 | tg 128 | 28.67 ± 0.41 |

./llama-bench -m models/3B/ggml-model-q4_0.bin -p 0 -n 32 -t 8,16,32

| model | backend | n_threads | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CPU | 8 | tg 32 | 30.12 ± 1.71 |
| LLaMA 3B mostly Q4_0 | CPU | 16 | tg 32 | 28.64 ± 0.25 |
| LLaMA 3B mostly Q4_0 | CPU | 32 | tg 32 | 26.97 ± 2.49 |

./llama-bench -m models/3B/ggml-model-q4_0.bin -p 32 -n 0 -t 8,16,32

| model | backend | n_threads | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 3B mostly Q4_0 | CPU | 8 | pp 32 | 55.18 ± 2.83 |
| LLaMA 3B mostly Q4_0 | CPU | 16 | pp 32 | 65.47 ± 0.26 |
| LLaMA 3B mostly Q4_0 | CPU | 32 | pp 32 | 99.27 ± 8.81 |

@slaren (Member, Author) commented Aug 16, 2023

The current defaults are these:

```cpp
static cmd_params cmd_params_defaults = {
    /* model         */ {"models/7B/ggml-model-q4_0.bin"},
    /* n_prompt      */ {512},
    /* n_gen         */ {128},
    /* n_batch       */ {512},
    /* f32_kv        */ {false},
    /* n_threads     */ {get_num_physical_cores()},
    /* n_gpu_layers  */ {99},
    /* main_gpu      */ {0},
    /* mul_mat_q     */ {true},
    /* low_vram      */ {false},
    /* tensor_split  */ {{}},
    /* reps          */ 5,
    /* verbose       */ false,
    /* output_format */ MARKDOWN
};
```

The default should serve as a good standard test, so we may want to change some of the values.

@cebtenzzre (Collaborator) commented Aug 16, 2023

> I will add a warning if NDEBUG is not defined, but it is not completely reliable. I am not sure if there is a better way.

You can check whether optimizations are enabled in gcc and clang by checking if `__OPTIMIZE__` is defined (LLAMA_DEBUG uses `-O0`). I personally build without `NDEBUG` in case I encounter a bug during normal use.

@slaren (Member, Author) commented Aug 16, 2023

Nice, thanks. Checking `__OPTIMIZE__` for gcc and clang and `_DEBUG` for msvc should do it. This still doesn't stop anyone from compiling ggml with different flags, but it should be better than checking for `NDEBUG` only.
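
A rough sketch of what such a check could look like, using the macros discussed above (an illustration only, not necessarily the exact code added to the example):

```cpp
#include <cstdio>

// Warn when the benchmark binary looks like a debug or unoptimized build.
static void warn_if_debug_build() {
#if !defined(NDEBUG)
    std::fprintf(stderr, "warning: NDEBUG is not defined, this may be a debug build\n");
#endif
#if defined(__GNUC__) && !defined(__OPTIMIZE__)
    // gcc and clang define __OPTIMIZE__ only when optimizations are enabled.
    std::fprintf(stderr, "warning: built without compiler optimizations\n");
#endif
#if defined(_MSC_VER) && defined(_DEBUG)
    // MSVC defines _DEBUG for debug runtime builds.
    std::fprintf(stderr, "warning: MSVC debug build\n");
#endif
}

int main() {
    warn_if_debug_build();
    return 0;
}
```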

@slaren (Member, Author) commented Aug 16, 2023

I noticed a weird bug when testing with prompt sizes 128 and 1024 in the same run with CUDA enabled. The test with size 128 works, but the test with 1024 fails with an out-of-memory error in the scratch buffer:

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 117440512, available 113246208)

When testing only with prompt size 1024 this doesn't happen. A new llama_model and llama_context are created for every test, so this shouldn't happen unless there is some global state in llama.cpp, and I don't think there is. There is global state in ggml-cuda, but I am not sure how that could affect the ggml scratch buffer.

@slaren (Member, Author) commented Aug 16, 2023

I found the problem. It turns out there is global state in llama.cpp:
https://github.com/ggerganov/llama.cpp/blob/0919a0f73d95cfb93a1646a1d1741a0615fe2c5e/llama.cpp#L118-L129
The scratch buffer size map is initialized the first time MEM_REQ_SCRATCH0 is called, based on the value of n_ctx passed in. Subsequent calls reuse the same map, built with the previous value of n_ctx.
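
A minimal standalone illustration of that pattern (simplified; the real code maps model sizes to scratch-buffer sizes, and the numbers below are hypothetical):

```cpp
#include <cstdio>
#include <map>

// The static local is initialized only on the first call, so later calls
// silently reuse the sizes computed from the first n_ctx they ever saw.
static std::map<int, size_t> mem_req_scratch0(int n_ctx) {
    static std::map<int, size_t> k_sizes = {
        { 7, 128u*1024*1024 + 32u*1024*(size_t)n_ctx }, // hypothetical 7B entry
    };
    return k_sizes;
}

int main() {
    std::printf("n_ctx=128:  %zu\n", mem_req_scratch0(128).at(7));  // computed from 128
    std::printf("n_ctx=1024: %zu\n", mem_req_scratch0(1024).at(7)); // same value again: the bug
    return 0;
}
```

Dropping the static (the change shown in the review diff further down) makes the map get rebuilt with the current n_ctx on each call.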

slaren marked this pull request as ready for review August 17, 2023 00:54
@slaren (Member, Author) commented Aug 17, 2023

I have added basic CPU info (Linux only), GPU info (CUDA only), and SQL output (which probably only works with SQLite). Is there anything else that could be added here? Otherwise, I consider this ready for review.

ggerganov added the high priority label Aug 17, 2023
ggerganov self-requested a review August 17, 2023 21:22
@ggerganov (Member) left a comment

Very useful tool!

Here are some TG results on M2 Studio:

LLAMA_METAL=1 make -j && ./llama-bench -m models/llama-7b/ggml-model-q4_0.bin -m models/llama-13b/ggml-model-q4_0.bin -m models/llama-30b/ggml-model-q4_0.bin -m models/llama-65b/ggml-model-q4_0.bin -ngl 1 -p 0 -n 32 2> /dev/null
| model | backend | n_gpu_layers | test | t/s |
| --- | --- | --- | --- | --- |
| LLaMA 7B mostly Q4_0 | Metal | 1 | tg 32 | 90.23 ± 0.10 |
| LLaMA 13B mostly Q4_0 | Metal | 1 | tg 32 | 55.76 ± 0.13 |
| LLaMA 30B mostly Q4_0 | Metal | 1 | tg 32 | 26.49 ± 0.04 |
| LLaMA 65B mostly Q4_0 | Metal | 1 | tg 32 | 15.08 ± 0.02 |

ggerganov mentioned this pull request Aug 18, 2023
```diff
 static std::map<e_model, size_t> MEM_REQ_SCRATCH0(int n_ctx)
 {
-    static std::map<e_model, size_t> k_sizes = {
+    std::map<e_model, size_t> k_sizes = {
```
A reviewer (Member) commented:

Nice catch

slaren merged commit 097e121 into master Aug 18, 2023
slaren deleted the llama-benchmark branch August 18, 2023 10:45