@danbev danbev (Member) commented Nov 4, 2025

This is a work in progress to add support for backend (like GPU) sampling.

The motivation for this feature is to enable sampling to be performed directly on the backend as part of the computation graph being executed, so that some or all of the sampling work is offloaded from the CPU.

For example, the backend sampler chain might select/sample a token directly, in which case only the sampled token needs to be transferred from device memory to host memory.

It is also possible for the backend samplers to perform filtering of the logits, or to compute and filter the probability distribution, in which case only the filtered logits or probabilities need to be transferred back to system memory for further processing by CPU samplers.

Currently, backend sampling works in a similar manner to pooling: it is a function that is called by build_graph, and the sampler operations become part of the model's computation graph.

Backend samplers can be configured by creating sampler chains, where each sampler chain is associated with a specific sequence id:

    struct llama_sampler_chain_params params = llama_sampler_chain_default_params();
    struct llama_sampler * chain = llama_sampler_chain_init(params);
    llama_sampler_chain_add(chain, llama_sampler_backend_init_greedy());
    std::vector<llama_sampler_seq_config> sampler_configs = {
        { 0, chain }
    };

The struct is defined as:

    struct llama_sampler_seq_config {
        llama_seq_id           seq_id;
        struct llama_sampler * sampler;
    };

These sampler configs are then passed as context params:

    llama_context_params cparams = llama_context_default_params();
    cparams.samplers = sampler_configs.data();
    cparams.n_samplers = sampler_configs.size();
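
The configs take effect when the context is created from the model. Assuming a model has already been loaded, that is just the usual context-creation call (sketch):

    // sketch: the sampler configs are picked up when the context is created
    llama_context * ctx = llama_init_from_model(model, cparams);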

When the model graph is built, the backend samplers are called so they can add their operations to the graph:

    ggml_cgraph * llama_model::build_graph(const llm_graph_params & params) const {
        std::unique_ptr<llm_graph_context> llm;
        ...

        // add backend sampling layers (if any)
        llm->build_sampling(*this, params);

The llama_sampler_i interface has been extended with 4 new methods, which are currently all named with a _ggml suffix to indicate that they are for backend sampling:

        void (*init_ggml)     (struct llama_sampler * smpl,
                               ggml_backend_buffer_type_t buft);

        void (*set_input_ggml)(struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf);

        void (*apply_ggml)    (struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf,
                               llama_sampler_ggml_data * ggml_data);

        void (*accept_ggml)   (struct llama_sampler * smpl,
                               ggml_context * ctx,
                               ggml_cgraph * gf,
                               struct ggml_tensor * selected_token);

The init_ggml function allows backend samplers to create any input tensors they might need. The provided ggml_backend_buffer_type_t should be used so that the tensors are created with the same backend buffer type as the output logits. This avoids splits in the computation graph that would require data transfers between different backends.

The set_input_ggml function is called after the computation graph has been scheduled but before it is computed. This allows the backend sampler to set the data for the input tensors it created in init_ggml.
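
As an illustration only (not code from this PR), a logit-bias-style sampler might implement init_ggml/set_input_ggml roughly as follows. The state struct, its field names, and the allocation pattern are assumptions made for the sketch:

    // Illustrative sketch, not the PR's implementation.
    // Assumed headers: "llama.h", "ggml.h", "ggml-alloc.h", "ggml-backend.h", <vector>
    struct my_sampler_state {                     // hypothetical sampler state
        int64_t               n_vocab   = 0;
        std::vector<float>    host_bias;          // host-side values to upload later
        struct ggml_context * gctx      = nullptr;
        struct ggml_tensor  * bias      = nullptr;
        ggml_backend_buffer_t buf       = nullptr;
    };

    static void my_init_ggml(struct llama_sampler * smpl, ggml_backend_buffer_type_t buft) {
        auto * st = (my_sampler_state *) smpl->ctx;

        struct ggml_init_params ip = {
            /*.mem_size   =*/ ggml_tensor_overhead() * 8,
            /*.mem_buffer =*/ NULL,
            /*.no_alloc   =*/ true,               // tensor data lives in the backend buffer below
        };
        st->gctx = ggml_init(ip);
        st->bias = ggml_new_tensor_1d(st->gctx, GGML_TYPE_F32, st->n_vocab);

        // allocate the tensors with the same buffer type as the output logits so the
        // graph is not split across backends
        st->buf = ggml_backend_alloc_ctx_tensors_from_buft(st->gctx, buft);
    }

    static void my_set_input_ggml(struct llama_sampler * smpl, ggml_context * ctx, ggml_cgraph * gf) {
        auto * st = (my_sampler_state *) smpl->ctx;
        // upload the host-side values after scheduling, before the graph is computed
        ggml_backend_tensor_set(st->bias, st->host_bias.data(), 0, ggml_nbytes(st->bias));
    }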

The apply_ggml function is where a backend sampler adds its operations to the graph: when the graph is built, each configured sampler's apply_ggml function is called, allowing it to add operations/nodes to the computation graph.
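
For instance, a greedy backend sampler's apply_ggml would conceptually just add an argmax node over the logits. The field names on llama_sampler_ggml_data (logits, sampled_token) are assumptions for this sketch, not necessarily the PR's actual struct layout:

    // illustrative sketch of apply_ggml for a greedy sampler (field names assumed)
    static void greedy_apply_ggml(
            struct llama_sampler    * smpl,
            ggml_context            * ctx,
            ggml_cgraph             * gf,
            llama_sampler_ggml_data * ggml_data) {
        // argmax over the vocab dimension selects the most likely token on-device
        struct ggml_tensor * selected = ggml_argmax(ctx, ggml_data->logits);
        ggml_build_forward_expand(gf, selected);
        ggml_data->sampled_token = selected;  // assumed output field, read back after compute
    }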

The accept_ggml function allows backend samplers to update their tensor state if needed.

This enables the sampling to happen fully or partially on the backend. The samplers could sample a single token, in which case only that token is transferred from device memory to host memory after llama_decode has been called. The sampled token can then be retrieved using:

    llama_token id = llama_get_backend_sampled_token_ith(test_ctx.ctx, index);
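
For reference, a minimal decode/read-back loop could look roughly like the sketch below; the index convention is assumed to mirror llama_get_logits_ith:

    // sketch: decode, then read back the token sampled on the backend
    if (llama_decode(ctx, batch) == 0) {
        // -1 assumed to mean "last output", mirroring llama_get_logits_ith
        const llama_token id = llama_get_backend_sampled_token_ith(ctx, -1);
        // feed `id` into the next batch as usual
    }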

It is also possible to run a backend sampler that only filters the logits; then only the filtered logits are transferred back to the host and sampling proceeds on the CPU with the normal (CPU) sampler chain. In this case the CPU samplers are configured as usual, but they will now operate on already-filtered logits.

Similar to the above handling of logits, it is possible for a backend sampler to compute the full probability distribution and transfer that to the host. The CPU samplers can then operate on those probabilities.
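
On the CPU side nothing changes API-wise in this hybrid case: the configured CPU chain (cpu_chain below stands for whatever CPU sampler chain was set up) is sampled as usual and simply sees the already-filtered logits or probabilities:

    // the regular CPU sampling call operates on the filtered values
    llama_token id = llama_sampler_sample(cpu_chain, ctx, -1);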

Configuration

Backend sampling is enabled using --backend-sampling, and the sampler chain, either explicitly specified using --samplers or the default, is automatically analyzed to determine which samplers can run on the backend. The system finds the longest contiguous chain of backend-supported samplers from the start of the sampler sequence (a conceptual sketch follows the examples below).

For example:

  • If the chain is top-k -> temperature -> top-p, and both top-k and temperature are backend-supported but top-p is not, then top-k and temperature will run on the backend, while top-p and subsequent samplers run on the CPU.

  • If all configured samplers are supported, the final distribution sampling will also happen on the backend, transferring only the sampled token IDs back to the host.

  • If the sampler chain starts with an unsupported sampler, and that sampler is active, all sampling runs on the CPU. Note that this is currently the case with the default sampler chain, so to use backend sampling a sampler chain must be specified explicitly. See below for an example.
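
Conceptually, the chain splitting boils down to finding the longest backend-supported prefix, something like the following sketch (not the actual implementation; is_backend_supported is a hypothetical helper):

    // conceptual sketch of the sampler chain split
    size_t n_backend = 0;
    for (const auto & smpl : samplers) {
        if (!is_backend_supported(smpl)) {  // hypothetical helper
            break;
        }
        n_backend++;
    }
    // samplers[0 .. n_backend) run on the backend, the rest stay on the CPU chain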

llama-cli

Initial support for llama-cli has been added and can be used as follows:

    $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
        --prompt 'What is the capital of Sweden?' \
        -n 20 \
        -no-cnv \
        --verbose-prompt \
        -ngl 40 \
        --backend-sampling \
        --samplers 'top_k;temperature'

To enable partial backend sampling (hybrid sampling), for example running top_k and temperature on the backend and top_p on the CPU, the following sampler chain could be specified:

    $ llama-cli -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
        --prompt 'What is the capital of Sweden?' \
        -n 20 \
        -no-cnv \
        --verbose-prompt \
        -ngl 40 \
        --backend-sampling \
        --samplers 'top_k;temperature;top_p'

llama-server

Backend sampling can be enabled for llama-server similarly to llama-cli above:

gdb --args ./build-gpu-sampler/bin/llama-server \
      -m models/Qwen2.5-VL-3B-Instruct-Q8_0.gguf \
      --backend-sampling \
      --samplers 'top_k;temperature' \
      --temp 0.8 \
      --top-k 40 \
      -ngl 50 \
      -v

It is then possible to specify backend sampling request parameters as follows:

curl --request POST \
    --url http://localhost:8080/completion \
    --header "Content-Type: application/json" \
    --data '{"prompt": "What is the capital of Sweden?","n_predict": 20, "top_k": 40, "backend_sampling": true}'

Building and running the tests

Download a model for testing:

$ cd models && wget https://huggingface.co/ggml-org/models/resolve/main/tinyllamas/stories15M-q4_0.gguf

Building the test:

$ cmake --build build --target test-backend-sampler -j8

Running all tests:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R '^test-backend-sampler' -V

The following individual tests are available:

$ ctest --test-dir build-gpu-sampler/ -N -R test-backend-sampler-
Internal ctest changing into directory: /home/danbev/work/ai/llama.cpp-debug/build-gpu-sampler
Test project /home/danbev/work/ai/llama.cpp-debug/build-gpu-sampler
  Test #36: test-backend-sampler-greedy
  Test #37: test-backend-sampler-temp
  Test #38: test-backend-sampler-top_k
  Test #39: test-backend-sampler-dist
  Test #40: test-backend-sampler-dist-and-cpu
  Test #41: test-backend-sampler-logit-bias
  Test #42: test-backend-sampler-mul_seq
  Test #43: test-backend-sampler-set-sampler

Total Tests: 8

These can be run individually, for example:

$ env LLAMACPP_TEST_MODELFILE=../models/stories15M-q4_0.gguf \
    ctest --test-dir build -R 'test-backend-sampler-temp' -V

TODO

  • Allocate backend sampler tensors on the same backend as the logits (dev_output.dev)
  • Allow backend samplers to pre-allocate state tensors
  • Integrate backend samplers with llama-cli
  • Set/unset backend samplers
  • Integrate backend samplers with llama-server
  • Add more tests/assertions for the backend samplers to check more cases
  • Rename from sampling to sampler.
  • Consistent and clearer naming of backend (backend sampling) functions and data types.
  • penalties samplers (need to figure out/verify how accept_ggml should work); will be done in a follow-up PR.
  • Add ggml_cumsum operation to CUDA backend. This operation exists for Metal and CPU already.

Implemented backend samplers

  • temp
  • logit_bias
  • top_k
  • greedy
  • dist sampler
  • min_p
  • top_p
  • temp_ext

Remaining backend samplers

The list below shows the CPU samplers that currently exist. Not all of these may be appropriate as backend samplers. They will be implemented in separate follow-up PRs.

  • typical
  • xtc
  • top_n_sigma
  • mirostat/mirostat_v2
  • penalties
  • dry
  • infill

@am17an am17an (Collaborator) commented Nov 5, 2025

One place this would be useful immediately is the diffusion-cli. I'm happy to test this when it's ready

@ORippler ORippler (Contributor) left a comment

Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit I think (basically if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, and no in-between).

@danbev danbev (Member, Author) commented Nov 13, 2025

> Not sure if I have a strong opinion on this, but removing hybrid sampling would reduce the complexity a bit I think (basically if we always set --gpu-dist we only have two states: either full GPU sampling or full CPU sampling, and no in-between).

My thinking is that we should keep the hybrid approach even though it does come with some additional complexity, like you say. There could be use cases where one might want to perform some sampling, like temp/logit_bias/top-k, on the device, have only a smaller set of logits copied to host memory, and still enable other CPU samplers, including grammars, to process the logits.

This might turn out to be an incorrect assumption and not something anyone wants to use, but it feels safer to keep the ability to do hybrid sampling.

@ggerganov ggerganov (Member) commented

@danbev Let's rebase on latest master to pick up the recent changes.

@JohannesGaessler JohannesGaessler (Collaborator) commented

The HIP/MUSA builds should be fixed by danbev#1.

@JohannesGaessler JohannesGaessler (Collaborator) commented Dec 10, 2025

I did a quick test for the performance using scripts/server-bench.py and Qwen 3 0.6b:

| GPU      | Slots | Runtime CPU sampling [s] | Runtime backend sampling [s] |
|----------|-------|--------------------------|------------------------------|
| RTX 3090 | 1     | 587                      | 658                          |
| RTX 3090 | 16    | 184                      | 197                          |
| RTX 4090 | 1     | 478                      | 526                          |
| RTX 4090 | 16    | 167                      | 173                          |
| RTX 5090 | 1     | 354                      | 400                          |
| RTX 5090 | 16    | 147                      | 160                          |

The performance with backend sampling seems to become significantly worse. What's strange is that for llama-completion, when I just measure the runtime, backend sampling is slightly faster.

Edit: I used this command:

LLAMA_ARG_N_PARALLEL=16 LLAMA_ARG_CTX_SIZE=100000 LLAMA_ARG_MODEL=/opt/models/qwen_3-1b-q4_0.gguf LLAMA_ARG_PORT=$((8080 + $CUDA_VISIBLE_DEVICES)) py scripts/server-bench.py --path_server build/bin/llama-server

optionally with LLAMA_ARG_BACKEND_SAMPLING=1.

@ggerganov ggerganov (Member) commented

@JohannesGaessler Thanks for looking into this. The logit_bias sampler is currently very slow because it naively copies a full vocab of logits for each sampler chain. It is activated because of the ignore_eos: True parameter. I'll look into optimizing it soon.

@ORippler ORippler (Contributor) commented

> The performance with backend sampling seems to become significantly worse. What's strange is that for llama-completion, when I just measure the runtime, backend sampling is slightly faster.

I assume you did build with -DGGML_CUDA_CUB_3DOT2=ON? I have not dug into the server loop as much yet, but afaik there were some issues related to efficient ggml_cgraph reuse (where rebuilding the graph frequently ate up all the perf gains made executing it).


@JohannesGaessler JohannesGaessler (Collaborator) commented Dec 12, 2025

@ORippler thank you, I forgot to enable that compilation flag. These are the numbers I get with it:

| GPU      | Slots | Runtime CPU sampling [s] | Runtime backend sampling (no CUB 3.2) [s] | Runtime backend sampling (with CUB 3.2) [s] |
|----------|-------|--------------------------|-------------------------------------------|---------------------------------------------|
| RTX 3090 | 1     | 587                      | 658                                       | 608                                         |
| RTX 3090 | 16    | 184                      | 197                                       | 170                                         |
| RTX 4090 | 1     | 478                      | 526                                       | 486                                         |
| RTX 4090 | 16    | 167                      | 173                                       | 147                                         |
| RTX 5090 | 1     | 354                      | 400                                       | 360                                         |
| RTX 5090 | 16    | 147                      | 160                                       | 132                                         |

CUB 3.2 is a consistent speedup; for a single concurrent request it's still slower than CPU sampling (likely due to overhead), but for 16 concurrent requests there is now an end-to-end speedup of ~10% (presumably because more work can be done per kernel launch).

@ggerganov ggerganov (Member) commented Dec 12, 2025

> CUB 3.2 is a consistent speedup; for a single concurrent request it's still slower than CPU sampling (likely due to overhead)

Currently, for every batch we always run a fixed number of -np sampler chains, regardless of the contents of the batch. This means that if you start with -np 4 and send single concurrent requests, it would still execute 4 sampling chains (1 for the request and 3 dummy). This is needed to avoid reconstructing the graph all the time.

As a workaround, a proper test for 1 slot would be to run with LLAMA_ARG_N_PARALLEL=1 LLAMA_ARG_KV_UNIFIED=1 until the logic above is improved. You need the LLAMA_ARG_KV_UNIFIED=1 because without it, the server will actually start 4 slots, resulting in 3 extra sampler chains running unnecessarily.
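
For example, adapting the benchmark command from above, the single-slot measurement would then be run roughly as:

    LLAMA_ARG_N_PARALLEL=1 LLAMA_ARG_KV_UNIFIED=1 LLAMA_ARG_CTX_SIZE=100000 \
        LLAMA_ARG_MODEL=/opt/models/qwen_3-1b-q4_0.gguf LLAMA_ARG_BACKEND_SAMPLING=1 \
        py scripts/server-bench.py --path_server build/bin/llama-server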

@JohannesGaessler JohannesGaessler (Collaborator) commented

When using LLAMA_ARG_KV_UNIFIED=1 for a single concurrent request I am seeing a consistent speedup when also using CUB 3.2:

| GPU      | Slots | Runtime CPU sampling [s] | Runtime backend sampling (no CUB 3.2) [s] | Runtime backend sampling (with CUB 3.2) [s] |
|----------|-------|--------------------------|-------------------------------------------|---------------------------------------------|
| RTX 3090 | 1     | 378                      | 366                                       | 359                                         |
| RTX 3090 | 16    | 184                      | 197                                       | 170                                         |
| RTX 4090 | 1     | 295                      | 279                                       | 274                                         |
| RTX 4090 | 16    | 167                      | 173                                       | 147                                         |
| RTX 5090 | 1     | 311                      | 298                                       | 291                                         |
| RTX 5090 | 16    | 147                      | 160                                       | 132                                         |

For a single concurrent request LLAMA_ARG_KV_UNIFIED=1 was faster more generally; it may make sense to adjust the defaults of the server. I don't understand why the 4090 is faster than the 5090 for a single concurrent request.

@ggerganov ggerganov (Member) commented

> it may make sense to adjust the defaults of the server.

Yup, I am thinking in this direction and how to improve the defaults in the best way.

@ORippler ORippler (Contributor) commented Dec 12, 2025

> I don't understand why the 4090 is faster than the 5090 for a single concurrent request.

From my side, I have observed that for single-sequence inference qwen3-1.5b is too small a workload to reliably get the 5090 to clock up/enter P0 mode.

Also, the default sampler chain in llama.cpp is kpmt (top_k=40, top_p=0.95, min_p=0.05, temp=0.8). The first sampler thus effectively reduces the workload to 40 elements. Depending on the CPU/GPU HW config, operating inside the constraints of ggml's opset & graph for pmt will subsequently eat up part of the perf gained by a faster top_k selection.

I personally don't know how llama.cpp's default sampler chain was constructed/determined, but the literature seems to argue that top_p should outperform top_k, and that min_p should be superior to both as it can generate creative outputs like top_k while maintaining the coherence of top_p. So from a scientific perspective the default chain can definitely be challenged.

The long-term solution to a potential debate about "default samplers and associated hyperparameters" is, in my eyes, to honor the sampler-guidance given out by model-builders, which we can already embed into gguf files #17120. Doing so for gpt-oss (which guides towards omitting pre-filtering by means of a sampler-seq), one will see a much bigger perf delta between enabled/disabled --backend-sampling (and if OAI's evaluation of model quality is trusted this should also give us the best quality for the generated outputs).

Unrelated to this, backend-sampling is a pre-req for a fully asynchronous inference-orchestration loop, which would maximize backend utilization eventually :)

ORippler and others added 9 commits December 12, 2025 15:07
Co-authored-by: Johannes Gäßler <[email protected]>
By using `tmp_vals` to store both the max values and the exponential accumulator, there was a potential data race, where the exponential accumulator for a given CTA may have been written to `tmp_vals` before all other CTAs had read the max value from it.

To avoid a third g.sync(), an additional temporary data storage was added. Given that there are syncs in place after writing to gmem, it is guaranteed that the previous values for sums/max have been read by all CTAs.