llama : add benchmark example #2626
Conversation
JohannesGaessler left a comment
My results:
| model | backend | n_gpu_layers | n_prompt | n_gen | t/s |
|---|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 512 | 0 | 2153.50 ± 24.75 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 0 | 128 | 128.65 ± 0.13 |
| LLaMA 7B mostly Q4_0 | CUDA | 99 | 512 | 128 | 449.72 ± 1.27 |
It would be nice if the table was already aligned in the console (assuming a monospace font).
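A sketch of one way to do that: compute each column's maximum width first, then pad every cell when printing. This is illustrative only, not the PR's actual code:

```cpp
#include <algorithm>
#include <cstdio>
#include <string>
#include <vector>

// Print a markdown table with cells padded so that the columns line up
// in a monospace console, while remaining valid markdown.
static void print_aligned(const std::vector<std::vector<std::string>> & rows) {
    std::vector<size_t> widths(rows[0].size(), 0);
    for (const auto & row : rows) {
        for (size_t i = 0; i < row.size(); i++) {
            widths[i] = std::max(widths[i], row[i].size());
        }
    }
    for (size_t r = 0; r < rows.size(); r++) {
        for (size_t i = 0; i < rows[r].size(); i++) {
            printf("| %-*s ", (int) widths[i], rows[r][i].c_str());
        }
        printf("|\n");
        if (r == 0) {  // separator line after the header row
            for (size_t i = 0; i < widths.size(); i++) {
                printf("| %s ", std::string(widths[i], '-').c_str());
            }
            printf("|\n");
        }
    }
}
```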
Since you explicitly asked in the other PR, these are my particular needs:
I think we could add SQLite as an output option if it is isolated to the example. We could copy the SQLite sources into the example.
Or we could just print SQL to stdout. SQLite ships a binary that accepts SQL on stdin, so piping the benchmark output into `sqlite3` would work.
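A minimal sketch of what SQL-on-stdout could look like, assuming a hypothetical `test_result` struct rather than the PR's actual types; the printed statements could be piped straight into the `sqlite3` CLI:

```cpp
#include <cstdio>

// Hypothetical result record; the real example tracks more fields.
struct test_result {
    const char * model;
    int          n_prompt;
    int          n_gen;
    double       avg_ts;  // average tokens per second
};

// Emit SQL on stdout, e.g. `./llama-bench ... | sqlite3 bench.db`.
static void print_sql(const test_result & r) {
    printf("CREATE TABLE IF NOT EXISTS test (model TEXT, n_prompt INT, n_gen INT, avg_ts REAL);\n");
    printf("INSERT INTO test VALUES ('%s', %d, %d, %f);\n", r.model, r.n_prompt, r.n_gen, r.avg_ts);
}
```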
In the third row of the sample results table in OP, what is the meaning of the t/s value?
Yes, it is the average t/s of a prompt of 512 tokens followed by a generation of 128 tokens. The way this works currently is that the t/s value is computed from the total number of tokens processed (prompt plus generation) divided by the total time of the test.
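In code, a sketch of that computation might look like this (names are illustrative, not the PR's):

```cpp
#include <cstdint>

// Combined throughput for a test that runs a prompt followed by a generation:
// total tokens processed divided by total wall time (in microseconds).
static double tokens_per_second(int n_prompt, int n_gen, uint64_t t_total_us) {
    return 1e6 * (n_prompt + n_gen) / (double) t_total_us;
}

// Example: 512 prompt + 128 generated tokens in ~1.423 s gives ~449.7 t/s,
// matching the third row of the table above.
```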
I just remembered: I think it would be useful to print a warning when the benchmark is being run with debug settings. I previously wasted time trying to figure out why the performance was suddenly terrible when I had just forgotten to remove `LLAMA_DEBUG` from my build command.
I will add a warning if `NDEBUG` is not defined.
I think that would be good enough. The biggest factor for performance is compiling without optimizations (because otherwise the compiler may optimize out local variables), and to my knowledge this always implies that `NDEBUG` is not defined either.
How about this for the performance columns: one column for prompt t/s, one for generation t/s, and one for the total time.
Unfortunately, that doesn't fit very well with the design. There may be any number of tests with any number of different parameters; tests don't need to include both a prompt and a generation (either one can be disabled by setting it to zero), and there may be multiple values for the number of prompt and generation tokens.
A few more examples of the markdown output with CUDA:
When building without a GPU backend, the number of threads is always shown instead of the number of GPU layers:
The current defaults are these:

```cpp
static cmd_params cmd_params_defaults = {
    /* model         */ {"models/7B/ggml-model-q4_0.bin"},
    /* n_prompt      */ {512},
    /* n_gen         */ {128},
    /* n_batch       */ {512},
    /* f32_kv        */ {false},
    /* n_threads     */ {get_num_physical_cores()},
    /* n_gpu_layers  */ {99},
    /* main_gpu      */ {0},
    /* mul_mat_q     */ {true},
    /* low_vram      */ {false},
    /* tensor_split  */ {{}},
    /* reps          */ 5,
    /* verbose       */ false,
    /* output_format */ MARKDOWN
};
```

The defaults should serve as a good standard test, so we may want to change some of the values.
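As a usage note, the flag names from the Metal invocation further down map onto these fields (`-m` for the model, `-p` for `n_prompt`, `-n` for `n_gen`, `-ngl` for `n_gpu_layers`), and per the PR description each option accepts multiple values, which are expanded into the full test matrix.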
You can check whether optimizations are enabled in gcc and clang by checking if `__OPTIMIZE__` is defined; `LLAMA_DEBUG` uses `-O0`. I personally build without `NDEBUG` in case I encounter a bug during normal use.
Nice, thanks. Checking `__OPTIMIZE__` as well should cover that case.
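A sketch of what the combined warning could look like, assuming gcc/clang semantics for `__OPTIMIZE__`; treat the exact conditions as an illustration rather than the PR's final code:

```cpp
#include <cstdio>

// Warn when the benchmark was built with debug settings, since the
// resulting numbers would not be representative.
static void warn_if_debug_build() {
#if !defined(NDEBUG)
    // asserts are compiled in
    fprintf(stderr, "warning: asserts enabled, performance may be affected\n");
#endif
#if !defined(_MSC_VER) && !defined(__OPTIMIZE__)
    // gcc/clang define __OPTIMIZE__ whenever an -O level above -O0 is used
    fprintf(stderr, "warning: debug build, performance may be affected\n");
#endif
}
```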
I noticed a weird bug when testing with prompt sizes 128 and 1024 in the same run with CUDA enabled. The test with size 128 will work, but the test with 1024 will fail with an out-of-memory error in the scratch buffer. When testing only with prompt size 1024, this doesn't happen. A new context is created for each test, so the tests should be independent.
I found the problem. Turns out, there is global state in llama.cpp: the scratch buffer sizes returned by `MEM_REQ_SCRATCH0` are kept in a function-local `static` map, so they are only computed for the `n_ctx` of the first call (see the diff below).
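For illustration, a minimal standalone example of the pitfall (not llama.cpp's actual code): a function-local `static` is initialized only on the first call, so later calls with a different argument still see the first result.

```cpp
#include <cstdio>

// The static value is initialized once, using the n_ctx of the FIRST call;
// subsequent calls with a larger n_ctx get a buffer sized for the first one.
static size_t scratch_size(int n_ctx) {
    static size_t sz = (size_t) n_ctx * 1024;  // computed only once
    return sz;
}

int main() {
    printf("%zu\n", scratch_size(128));   // 131072, sized for n_ctx = 128
    printf("%zu\n", scratch_size(1024));  // still 131072 -> too small
}
```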
I have added basic CPU info (Linux only), GPU info (CUDA only), and SQL output (probably only works with SQLite). Is there anything else that could be added here? Otherwise, I consider this ready to review.
ggerganov left a comment
Very useful tool!
Here are some TG results on M2 Studio:
```sh
LLAMA_METAL=1 make -j && ./llama-bench -m models/llama-7b/ggml-model-q4_0.bin -m models/llama-13b/ggml-model-q4_0.bin -m models/llama-30b/ggml-model-q4_0.bin -m models/llama-65b/ggml-model-q4_0.bin -ngl 1 -p 0 -n 32 2> /dev/null
```
| model | backend | n_gpu_layers | test | t/s |
|---|---|---|---|---|
| LLaMA 7B mostly Q4_0 | Metal | 1 | tg 32 | 90.23 ± 0.10 |
| LLaMA 13B mostly Q4_0 | Metal | 1 | tg 32 | 55.76 ± 0.13 |
| LLaMA 30B mostly Q4_0 | Metal | 1 | tg 32 | 26.49 ± 0.04 |
| LLaMA 65B mostly Q4_0 | Metal | 1 | tg 32 | 15.08 ± 0.02 |
```diff
 static std::map<e_model, size_t> MEM_REQ_SCRATCH0(int n_ctx)
 {
-    static std::map<e_model, size_t> k_sizes = {
+    std::map<e_model, size_t> k_sizes = {
```
Nice catch
Adds an example for running performance benchmarks. Multiple values can be specified for each option, and the benchmark runs all combinations of them. Supports output to CSV, JSON, or Markdown.
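A sketch of how such a matrix expansion can work, using a reduced, hypothetical parameter set (the real `cmd_params` has more fields):

```cpp
#include <string>
#include <vector>

// Reduced, hypothetical parameter set; each option may hold several values.
struct bench_instance {
    std::string model;
    int         n_prompt;
    int         n_gen;
};

// One benchmark instance per combination of the multi-valued options.
static std::vector<bench_instance> expand_matrix(
        const std::vector<std::string> & models,
        const std::vector<int>         & n_prompts,
        const std::vector<int>         & n_gens) {
    std::vector<bench_instance> out;
    for (const auto & m : models) {
        for (int p : n_prompts) {
            for (int g : n_gens) {
                out.push_back({m, p, g});
            }
        }
    }
    return out;
}
```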
Example markdown output: