sampling : add support for backend sampling #17004
One place this would be useful immediately is the diffusion-cli. I'm happy to test this when it's ready |
ORippler
left a comment
Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit, I think (basically, if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, with no in-between).
My thoughts are that we should keep the hybrid approach even though it does come with some additional complexity, like you say. I think there could be use cases where one might want to perform some sampling, like temp/logit_bias/top-k, on the device, have only a smaller set of logits copied to host memory, and still allow other CPU samplers, including grammars, to process those logits. This might turn out to be an incorrect assumption and not something anyone wants to use, but it feels safer to keep the ability to do hybrid sampling.
|
@danbev Let's rebase on latest |
The HIP/MUSA builds should be fixed by danbev#1 . |
|
I did a quick test for the performance using
The performance with backend sampling seems to become significantly worse. What's strange is that for
Edit: I used this command:
LLAMA_ARG_N_PARALLEL=16 LLAMA_ARG_CTX_SIZE=100000 LLAMA_ARG_MODEL=/opt/models/qwen_3-1b-q4_0.gguf LLAMA_ARG_PORT=$((8080 + $CUDA_VISIBLE_DEVICES)) py scripts/server-bench.py --path_server build/bin/llama-server
optionally with
HIP/MUSA: fix build for backend sampling
|
@JohannesGaessler Thanks for looking into this. The |
I assume you did build with |
@ORippler thank you, I forgot to enable that compilation flag. These are the numbers I get with it:
CUB 3.2 is a consistent speedup. For a single concurrent request it is still slower than CPU sampling (likely due to overhead), but for 16 concurrent requests there is now an end-to-end speedup of ~10% (presumably because more work can be done per kernel launch).
Currently, for every batch we always run a fixed number of As a workaround, a proper test for 1 slot would be to run with |
|
When using
For a single concurrent request |
Yup, I am thinking in this direction and how to improve the defaults in the best way. |
From my side, I have observed that for single-sequence inference qwen3-1.5b is too small a workload to reliably get the 5090 to clock up/enter P0 mode. Also, the default sampler chain in llama.cpp is
I personally don't know how llama.cpp's default sampler chain was constructed/determined, but the literature seems to argue that top_p should outperform top_k, and that min_p should be superior to both, as it can generate creative outputs like top_k while maintaining the coherence of top_p. So from a scientific perspective the default chain can definitely be challenged.
The long-term solution to a potential debate about "default samplers and associated hyperparameters" is, in my eyes, to honor the sampler guidance given out by model builders, which we can already embed into gguf files #17120. Doing so for gpt-oss (which guides towards omitting pre-filtering by means of a sampler-seq), one will see a much bigger perf delta between enabled/disabled
Unrelated to this, backend sampling is a prerequisite for a fully asynchronous inference-orchestration loop, which would eventually maximize backend utilization :)
Co-authored-by: Johannes Gäßler <[email protected]>
Using `tmp_vals` to store both the max values and the exponential accumulator created a potential data race: the exponential accumulator for a given CTA may have been written to `tmp_vals` before all other CTAs had read the max value from it. To avoid a third g.sync(), an additional temporary data storage was added. Given that there are syncs in place after writing to gmem, it is now guaranteed that the previous values for sums/max were read by all CTAs.

This is a work in progress to add support for backend (like GPU) sampling.
The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, allowing for some or all of the sampling to be done on the backend.
For example, the backend sampler chain might select/sample a token directly in which case only the sampled token needs to be transferred from device memory to host memory.
It is also possible for the backend samplers to perform filtering of the logits, or compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.
Currently, backend sampling works in a similar manner to pooling: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.
Backend samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:
The struct is defined as:
These sampler configs are then passed as context params:
llama_context_params cparams = llama_context_default_params();
cparams.samplers = sampler_configs.data();
cparams.n_samplers = sampler_configs.size();
When the model graph is built, the GPU samplers are called so that they can add their operations to the graph:
The llama_sampler_i interface has been extended with 4 new methods in the API, and they are currently all named with a _ggml suffix to indicate that they are for backend sampling:
The init_ggml function allows backend samplers to create input tensors that they might need. The ggml_backend_buffer_type should be used so that the tensors are created using this backend buffer type, which is the same as the output logits backend. This avoids splits in the computation graph that would require data transfer between different backends.
The set_input_ggml function is called after the computation graph has been scheduled but before it is computed. This allows the backend sampler to set any input for the tensors it created in init_ggml.
The apply_ggml function is where the backend sampler adds its operations to the graph. When the graph is built, each configured sampler's _apply function is called, which allows it to add operations/nodes to the computation graph.
The accept_ggml function allows backend samplers to update their tensor states if needed.
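The lifecycle of the four hooks can be illustrated with a small mock. This is only a sketch: `mock_sampler` and `run_one_decode` are hypothetical names, the real callbacks take ggml/llama types that are replaced here by a log of strings, and the assumption that init_ggml runs before apply_ggml and that accept_ggml runs after the graph has been computed is mine, not something stated above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a backend sampler. This is NOT the real
// llama_sampler_i interface; each hook only records that it was called.
struct mock_sampler {
    std::string                name;
    std::vector<std::string> * log;

    void init_ggml()      { log->push_back(name + ":init_ggml"); }      // create input tensors
    void apply_ggml()     { log->push_back(name + ":apply_ggml"); }     // add ops to the graph
    void set_input_ggml() { log->push_back(name + ":set_input_ggml"); } // fill input tensors
    void accept_ggml()    { log->push_back(name + ":accept_ggml"); }    // update tensor state
};

// Hypothetical driver that walks a sampler chain through one decode step
// and returns the order in which the hooks were invoked.
std::vector<std::string> run_one_decode(const std::vector<std::string> & names) {
    std::vector<std::string>  log;
    std::vector<mock_sampler> chain;
    for (const auto & n : names) {
        chain.push_back({n, &log});
    }

    for (auto & s : chain) s.init_ggml();      // assumed to happen first
    for (auto & s : chain) s.apply_ggml();     // while the graph is being built
    // ... the graph would be scheduled here ...
    for (auto & s : chain) s.set_input_ggml(); // after scheduling, before compute
    // ... the graph would be computed here ...
    for (auto & s : chain) s.accept_ggml();    // assumed to happen after compute

    return log;
}
```

Calling `run_one_decode({"top_k", "temp"})` returns eight entries, with every sampler's init_ggml first and every sampler's accept_ggml last.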
This enables the sampling to happen fully, or partially on the backend. The samplers could sample a single token in which case that is what will be transferred from the device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:
It is also possible to run a backend sampler that only filters the logits, so that only the filtered logits are transferred back to the host and sampling proceeds on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they now operate on already filtered logits.
Similar to the above handling of logits, it is possible for a GPU sampler to compute the full probability distribution and transfer that to the host. The CPU samplers can then operate on those probabilities.
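The hybrid flow can be sketched in plain C++ without any llama.cpp or ggml calls. Everything here is illustrative: `candidate`, `backend_top_k` and `cpu_sample` are hypothetical names, and the "backend" stage is simulated on the CPU. The point is only that the second stage sees k candidates instead of the full vocabulary.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// A token id together with its logit, i.e. what would be copied to the host.
struct candidate {
    int32_t id;
    float   logit;
};

// Stand-in for the backend stage: keep only the k largest logits.
std::vector<candidate> backend_top_k(const std::vector<float> & logits, size_t k) {
    std::vector<candidate> cands;
    cands.reserve(logits.size());
    for (size_t i = 0; i < logits.size(); ++i) {
        cands.push_back({(int32_t) i, logits[i]});
    }
    k = std::min(k, cands.size());
    std::partial_sort(cands.begin(), cands.begin() + k, cands.end(),
        [](const candidate & a, const candidate & b) { return a.logit > b.logit; });
    cands.resize(k);
    return cands;
}

// Stand-in for the CPU stage: softmax over the filtered candidates, then draw
// one token id. Returns -1 if there are no candidates.
int32_t cpu_sample(const std::vector<candidate> & cands, std::mt19937 & rng) {
    if (cands.empty()) {
        return -1;
    }
    float max_logit = cands.front().logit;
    for (const auto & c : cands) {
        max_logit = std::max(max_logit, c.logit);
    }
    std::vector<double> weights;
    weights.reserve(cands.size());
    for (const auto & c : cands) {
        // subtract the max for numerical stability; normalization is handled
        // by std::discrete_distribution
        weights.push_back(std::exp(c.logit - max_logit));
    }
    std::discrete_distribution<size_t> dist(weights.begin(), weights.end());
    return cands[dist(rng)].id;
}
```

With this split, only `k` (id, logit) pairs cross the device/host boundary, and any further CPU-side sampler would operate on that reduced set.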
Configuration
Backend sampling is enabled using --backend_sampling, and the sampler chain, either explicitly specified using --samplers or the default, is automatically analyzed to determine which samplers can run on the backend. The system finds the longest contiguous chain of backend supported samplers from the start of the sampler sequence.
For example:
If the chain is top-k -> temperature -> top-p, and both top-k and temperature are backend-supported but top-p is not, then top-k and temperature will run on the backend, while top-p and subsequent samplers run on the CPU.
If all configured samplers are supported, the final distribution sampling will also happen on the backend, transferring only the sampled token IDs back to the host.
If the sampler chain starts with an unsupported sampler, and that sampler is active, all sampling runs on the CPU. Note that this is currently the case with the default sampler chain, so to use backend sampling a sampler chain must be specified explicitly. See below for an example.
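The prefix rule described above can be sketched as follows. This is not the code from the PR: `split_chain` and `sampler_chain` are hypothetical names, the set of backend-supported samplers is passed in by the caller, and the check for whether a sampler is active is left out.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <utility>
#include <vector>

using sampler_chain = std::vector<std::string>;

// Splits a sampler chain into {backend part, CPU part}: the backend part is
// the longest run of supported samplers counted from the start of the chain,
// and everything from the first unsupported sampler onwards stays on the CPU.
std::pair<sampler_chain, sampler_chain> split_chain(
        const sampler_chain         & chain,
        const std::set<std::string> & backend_supported) {
    size_t n = 0;
    while (n < chain.size() && backend_supported.count(chain[n]) > 0) {
        ++n;
    }
    return {
        sampler_chain(chain.begin(), chain.begin() + n),
        sampler_chain(chain.begin() + n, chain.end()),
    };
}
```

For the chain top-k -> temperature -> top-p with only top-k and temperature supported, the backend part is {top-k, temperature} and the CPU part is {top-p}. If the first sampler is unsupported the backend part is empty, which corresponds to all sampling running on the CPU.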
llama-cli
Initial support for llama-cli has been added and can be used as follows:
To enable partial backend sampling (hybrid sampling), for example running top_k and temperature on the backend and typ_p on the CPU, the following sampler chain could be specified:
llama-server
GPU sampling can be enabled for llama-server similar to how it was done above for llama-cli
It is then possible to send GPU request parameters as follows:
Building and running the tests
Download a model for testing:
$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf
Building the test:
$ cmake --build build --target test-backend-sampler -j8
Running all tests:
The following individual tests are available:
These can be run individually, for example:
TODO
penalties samplers (to figure out/verify how accept_ggml should work)
Will be done in a follow up PR.
Implemented backend samplers
Remaining backend samplers
The list below shows the CPU samplers that currently exist. Not all of them may be appropriate as GPU samplers. These will be implemented in separate follow-up PRs.