Llama.cpp C API Reference
Initialization and Backend
• llama_model_default_params() – Returns a llama_model_params struct filled with
default values for loading a model 1 . Use this to get sensible defaults (e.g. no tensor overrides,
use_mmap enabled, etc.) before customizing model parameters.
• llama_context_default_params() – Returns a llama_context_params struct with
default settings for creating a new context 1 . Modify fields (like n_ctx , GPU layers, etc.) as
needed.
• llama_sampler_chain_default_params() – Returns a llama_sampler_chain_params
struct with default options for initializing a sampler chain 2 .
• llama_model_quantize_default_params() – Returns a
llama_model_quantize_params struct with default options for model quantization 2 .
• llama_backend_init() – Initialize the global llama/ggml backend (should be called once at
program start). This sets up timing and FP16 tables internally 3 .
• llama_backend_free() – Free any global llama resources (e.g. quantization tables) before
program exit 4 .
• llama_numa_init(ggml_numa_strategy numa) – Enable CPU NUMA optimizations
according to numa (or disable with GGML_NUMA_STRATEGY_DISABLED ) 5 . Call before
loading models if NUMA is needed.
• llama_time_us() – Return the current wall-clock time in microseconds, as used by ggml for
profiling 6 . Useful for custom timing.
• llama_max_devices() – Return the maximum number of GPU devices the library can use
(currently fixed at 16) 7 .
• llama_supports_mmap() , llama_supports_mlock() ,
llama_supports_gpu_offload() , llama_supports_rpc() – Return booleans indicating
whether memory-mapped IO, locked memory, GPU offload, or RPC backends are supported on
this build/platform 8 9 .
• llama_attach_threadpool(ctx, threadpool, threadpool_batch) – Attach custom
GGML threadpools to ctx for parallel decoding: threadpool is used for single-token
(generation) work and threadpool_batch for batch/prompt processing. If not attached, ggml
creates its own thread pool automatically 10 .
• llama_detach_threadpool(ctx) – Detach any previously attached threadpool from ctx ,
reverting to default behavior 11 .
• llama_print_system_info() – Return a string summarizing system and hardware
information (CPU, GPU, instruction sets, etc.) 12 .
• llama_log_set(log_callback, user_data) – Set a callback for internal logging
messages. By default, logs go to stderr . Provide a ggml_log_callback function and
optional user data to redirect or handle logs 13 .
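A minimal startup/shutdown sketch using the calls above. The log handler and its name are
illustrative, not part of the API:

    #include <stdio.h>
    #include "llama.h"

    // illustrative log handler: forward llama/ggml messages to stderr
    static void my_log(enum ggml_log_level level, const char * text, void * user_data) {
        (void) level; (void) user_data;
        fputs(text, stderr);
    }

    int main(void) {
        llama_log_set(my_log, NULL);   // redirect internal logging
        llama_backend_init();          // global init, once per process
        // llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);  // optional, before loading models

        printf("%s\n", llama_print_system_info());
        printf("mmap: %d, gpu offload: %d, max devices: %zu\n",
               llama_supports_mmap(), llama_supports_gpu_offload(), llama_max_devices());

        // ... load models, create contexts, run inference ...

        llama_backend_free();          // global cleanup before exit
        return 0;
    }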
Model Loading and Saving
• llama_model_load_from_file(path_model, params) – Load a model from a single GGUF
file at path_model , using options in params (e.g. use_mmap , n_gpu_layers ,
vocab_only , etc.) 14 . Returns a pointer to a newly allocated llama_model , or NULL on
failure. After loading, the model can be used to create contexts.
• llama_model_load_from_splits(paths, n_paths, params) – Load a model split across
multiple files. paths is an array of file paths (in correct order) of length n_paths , and
params are loading options. Returns a llama_model pointer on success 15 . Use this when
the model is split but does not follow the default naming scheme.
• llama_model_save_to_file(model, path_model) – Save the given model’s tensors to a
GGUF file at path_model . This writes out all layers and weights of model . Requires a model
created or loaded earlier 16 .
• llama_model_free(model) – Free a loaded model and all its associated memory. After
calling this, the model pointer must not be used again 17 . (If you had LoRA adapters loaded,
they will be freed automatically with the model.)
• llama_model_quantize(fname_inp, fname_out, params) – Quantize a model file. Reads
the model from fname_inp and writes a quantized model to fname_out using parameters
params (e.g. target bits, algorithm). Returns 0 on success 18 .
• llama_model_n_ctx_train(model) , llama_model_n_embd(model) ,
llama_model_n_layer(model) , llama_model_n_head(model) ,
llama_model_n_head_kv(model) – Return architecture parameters of model : training
context size, embedding dimension, number of layers, number of attention heads, and KV heads
per layer, respectively 19 . These tell you the model’s structure.
• llama_model_rope_freq_scale_train(model) – Return the RoPE (rotary position
encoding) frequency scaling factor used during training (usually 1.0 ) 20 . This can inform
position handling at inference.
• llama_model_rope_type(model) – Return the type of RoPE (rotary embeddings) used by the
model 21 . (Requires including llama_rope_type enum.)
• llama_model_n_params(model) – Return the total number of parameters (weights) in the
model 22 .
• llama_model_size(model) – Return the total size in bytes of all model tensors (roughly the
memory footprint) 23 .
• llama_model_has_encoder(model) / llama_model_has_decoder(model) – Return
true if the model has an encoder/decoder component (for encoder-decoder models) 24 .
• llama_model_decoder_start_token(model) – For encoder-decoder models, return the
token ID that should be passed to the decoder to start generation. For non-encoder-decoder
models, returns -1 25 .
• llama_model_is_recurrent(model) – Return true if the model is a recurrent (stateful)
model such as RWKV or Mamba 26 .
• llama_model_meta_val_str(model, key, buf, buf_size) – Retrieve a string metadata
value by key from the model’s GGUF metadata. Writes into buf up to buf_size bytes.
Returns the string length on success or -1 on failure 27 .
• llama_model_meta_count(model) – Return the number of metadata entries in the model
28.
• llama_model_meta_key_by_index(model, i, buf, buf_size) – Get the metadata key
name of index i (0-based) into buf . Returns length or -1 on failure 29 .
• llama_model_meta_val_str_by_index(model, i, buf, buf_size) – Get the metadata
value (as string) of index i into buf . Returns length or -1 30 .
• llama_model_desc(model, buf, buf_size) – Get a human-readable description of the
model type into buf (up to buf_size ). Returns length or -1 31 .
• llama_model_chat_template(model, name) – Get a default chat prompt template for the
model. If name is non-NULL, get a named template; if name is NULL , get the default
template. Returns a C-string (or NULL if not available) 32 .
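A sketch of loading a model and dumping its description and GGUF metadata; the file name is a
placeholder and error handling is minimal:

    #include <stdio.h>
    #include "llama.h"

    int main(void) {
        llama_backend_init();

        struct llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = 99;     // offload as many layers as possible (GPU builds)
        mparams.use_mmap     = true;

        // "model.gguf" is a placeholder path
        struct llama_model * model = llama_model_load_from_file("model.gguf", mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        char desc[128];
        llama_model_desc(model, desc, sizeof(desc));
        printf("model: %s, %llu params, n_ctx_train = %d\n",
               desc,
               (unsigned long long) llama_model_n_params(model),
               llama_model_n_ctx_train(model));

        // enumerate the GGUF metadata key/value pairs
        char key[256], val[256];
        for (int32_t i = 0; i < llama_model_meta_count(model); ++i) {
            if (llama_model_meta_key_by_index(model, i, key, sizeof(key)) >= 0 &&
                llama_model_meta_val_str_by_index(model, i, val, sizeof(val)) >= 0) {
                printf("%s = %s\n", key, val);
            }
        }

        llama_model_free(model);
        llama_backend_free();
        return 0;
    }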
Context Management and Decoding
• llama_init_from_model(model, params) – Create a new llama_context from a loaded
model. params (from llama_context_default_params() ) can customize context size
( n_ctx ), number of GPU layers, embedding mode, etc. Returns a pointer to the new context, or
NULL on error 33 .
• llama_free(ctx) – Free a llama_context and all internal buffers (activations, KV cache,
etc.). After this, ctx must not be used 34 .
• llama_get_model(ctx) – Return the llama_model* associated with ctx 35 .
• llama_get_kv_self(ctx) – Return the llama_kv_cache* used by ctx for self-attention
(the KV cache) 35 .
• llama_pooling_type(ctx) – Return the pooling type ( llama_pooling_type ) of ctx
36 .
• llama_n_ctx(ctx) , llama_n_batch(ctx) , llama_n_ubatch(ctx) ,
llama_n_seq_max(ctx) – Query context dimensions:
• n_ctx : the context size (in tokens) of this context;
• n_batch : the maximum logical batch size that can be submitted to llama_decode ;
• n_ubatch : the maximum physical (micro-)batch size;
• n_seq_max : the maximum number of distinct sequences.
These are useful for understanding the context configuration.
• llama_encode(ctx, batch) – Process a token batch with the encoder (no KV cache). For
encoder-decoder models, this runs the encoder on batch . Returns 0 on success or negative on
error 37 .
• llama_decode(ctx, batch) – Process a token batch through the decoder (requires KV
cache). Applies autoregressive attention. Returns 0 on success, positive for warnings (e.g.
context full), or negative on error 38 . On error or warning, the KV cache state is restored to
before the call.
• llama_batch_init(n_tokens, embd, n_seq_max) – Allocate a llama_batch struct for
up to n_tokens tokens. If embd != 0 , allocates an embeddings buffer of size n_tokens *
embd ; otherwise, it allocates token and logits arrays. n_seq_max is the max sequences per
token. The caller must fill the batch fields ( token , pos , n_seq_id , seq_id , logits ) before
decode (see the sketch at the end of this section).
• llama_batch_free(batch) – Free a llama_batch allocated by llama_batch_init()
39 .
• llama_set_n_threads(ctx, n_threads, n_threads_batch) – Set how many CPU
threads ctx uses: n_threads for single-token (generation) steps, n_threads_batch for
prompt/batch processing 40 .
• llama_n_threads(ctx) , llama_n_threads_batch(ctx) – Query the thread counts
currently set for ctx 41 .
• llama_set_embeddings(ctx, embeddings) – If embeddings is true , enable
embeddings-only mode: on decode, the model will compute and return embeddings but not
logits 42 .
• llama_set_causal_attn(ctx, causal_attn) – Enable ( true ) or disable causal
(autoregressive) attention for generation. By default causal is used (attend only to past tokens)
43 .
• llama_set_warmup(ctx, warmup) – If warmup is true , the first call to llama_decode
will preload all weight tensors into memory (warm up the cache), which can make the first
generation faster 44 .
• llama_set_abort_callback(ctx, abort_callback, abort_callback_data) – Set a
user callback that can abort long computations. abort_callback should return non-zero to
stop. This is checked periodically during decode 45 .
• llama_synchronize(ctx) – Wait for all pending computations in ctx to finish. Normally
not needed if you immediately access results (logits/embeddings), but can be used to
synchronize explicitly 46 .
• llama_get_logits(ctx) – After a call to llama_decode , returns a pointer to the logits
array of the last output token(s). The array has shape [n_rows][n_vocab] , where rows
correspond to batch tokens whose logits flag was set ( llama_batch.logits[i] != 0 ) 47 .
• llama_get_logits_ith(ctx, i) – Get pointer to the logits of the i -th output token
(counting from 0) in the last batch. Supports negative indexing ( -1 for last). Returns NULL if
i is out of range 48 .
• llama_get_embeddings(ctx) – After decode, returns a pointer to all output token
embeddings (if available). If pooling is NONE (or generative model), embeddings for tokens with
outputs are stored contiguously 49 . Otherwise returns NULL .
• llama_get_embeddings_ith(ctx, i) – Get the embedding vector for the i -th output
token. Returns a pointer to [n_embd] floats, or NULL if invalid index 50 .
• llama_get_embeddings_seq(ctx, seq_id) – Get the pooled embedding for a full
sequence seq_id . If pooling is RANK, returns a single float (the rank/score of the sequence).
Otherwise (e.g. MEAN, CLS, LAST pooling) returns a float array of length n_embd . Returns
NULL if the pooling type is NONE 51 .
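A sketch tying the context functions together: create a context, fill a llama_batch manually,
decode it, and read the logits of the last token. The prompt token IDs are assumed to come from
llama_tokenize (next section); sizes and thread counts are example values:

    #include <stdio.h>
    #include "llama.h"

    // assumes `model` was loaded as in the previous section
    void decode_prompt(struct llama_model * model, const llama_token * prompt, int32_t n_prompt) {
        struct llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx   = 4096;    // context window for this session
        cparams.n_batch = 512;     // max tokens submitted per llama_decode call

        struct llama_context * ctx = llama_init_from_model(model, cparams);
        if (ctx == NULL) { return; }

        llama_set_n_threads(ctx, 8, 8);   // generation / batch thread counts (example values)

        // build a batch: one sequence (id 0), request logits for the last token only
        struct llama_batch batch = llama_batch_init(n_prompt, 0, 1);
        for (int32_t i = 0; i < n_prompt; ++i) {
            batch.token   [i]    = prompt[i];
            batch.pos     [i]    = i;
            batch.n_seq_id[i]    = 1;
            batch.seq_id  [i][0] = 0;
            batch.logits  [i]    = (i == n_prompt - 1);
        }
        batch.n_tokens = n_prompt;

        if (llama_decode(ctx, batch) == 0) {
            const float * logits = llama_get_logits_ith(ctx, -1);  // logits of last output token
            printf("first logit: %f\n", logits[0]);
        }

        llama_batch_free(batch);
        llama_free(ctx);
    }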
Tokenization and Vocabulary
• llama_tokenize(vocab, text, text_len, tokens, n_tokens_max, add_special,
parse_special) – Convert input UTF-8 text into token IDs using vocab . Writes up to
n_tokens_max tokens into tokens . If add_special=true , BOS/EOS are added per
model’s settings; if parse_special=true , special/control tokens in the text (like <|
endoftext|> ) are tokenized rather than treated as raw text 52 . Returns the number of tokens
produced, or a negative number if the text would exceed n_tokens_max (absolute value of
return is required capacity).
• llama_token_to_piece(vocab, token, buf, length, lstrip, special) – Convert a
single token ID to its text piece. Writes up to length characters into buf (without null-
terminator). Skips up to lstrip leading spaces in the token text before copying (for multi-
token decoding). If special=true , then special tokens are converted to their name strings
53 . Returns the number of characters written (or negative on failure).
• llama_detokenize(vocab, tokens, n_tokens, text, text_len_max,
remove_special, unparse_special) – Convert an array of n_tokens token IDs back into
UTF-8 text into text . Writes at most text_len_max bytes. If remove_special=true , BOS/
EOS tokens (if present) are omitted from output; if unparse_special=true , special tokens
are rendered as their string form (e.g. <|endoftext|> ). Returns number of bytes written (or
negative if overflow) 54 .
• llama_vocab_get_text(vocab, token) – Return the null-terminated string for a given
token using vocab ’s decoding table 55 .
• llama_vocab_get_score(vocab, token) – Return the tokenizer score associated with
token in the vocab (e.g. the SentencePiece unigram/merge score) 56 .
• llama_vocab_get_attr(vocab, token) – Return attributes ( llama_token_attr ) of
token (e.g. whether it is a normal, control, user-defined, or byte token) 57 .
• llama_vocab_is_eog(vocab, token) – Return true if token is an end-of-generation
token (EOS, EOT, etc.) 58 .
• llama_vocab_is_control(vocab, token) – Return true if token is a control token
(special non-renderable token) 59 .
• Special token getters: Functions that return special token IDs for the given vocab :
• llama_vocab_bos(vocab) – BOS (beginning-of-sentence) token 60 .
• llama_vocab_eos(vocab) – EOS (end-of-sentence) token 61 .
• llama_vocab_eot(vocab) – EOT (end-of-turn) token 62 .
• llama_vocab_sep(vocab) – Sentence separator token 63 .
• llama_vocab_nl(vocab) – Newline token 64 .
• llama_vocab_pad(vocab) – Padding token 65 .
• llama_vocab_get_add_bos(vocab) , llama_vocab_get_add_eos(vocab) – Return
whether the model’s tokenizer automatically adds BOS or EOS tokens to input prompts 66 .
• Fill-in-the-middle (FIM) tokens:
• llama_vocab_fim_pre(vocab) , llama_vocab_fim_suf(vocab) ,
llama_vocab_fim_mid(vocab) , llama_vocab_fim_pad(vocab) ,
llama_vocab_fim_rep(vocab) , llama_vocab_fim_sep(vocab) – Special tokens used for
fill-in-the-middle tasks (prefix, suffix, middle, padding, repetition, separator) 67 . Use these if
generating with FIM.
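An illustrative round trip through the tokenizer, assuming the vocab is obtained with
llama_model_get_vocab() (declared in the same header, not listed above). The probe call with a
NULL buffer relies on the negative-return convention described for llama_tokenize:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "llama.h"

    // assumes `model` is already loaded
    void tokenize_demo(const struct llama_model * model, const char * text) {
        const struct llama_vocab * vocab = llama_model_get_vocab(model);

        // probe: with a NULL/zero buffer the return is -(required token count)
        int32_t n = -llama_tokenize(vocab, text, (int32_t) strlen(text), NULL, 0,
                                    /*add_special=*/true, /*parse_special=*/true);
        llama_token * tokens = malloc((size_t) n * sizeof(llama_token));
        n = llama_tokenize(vocab, text, (int32_t) strlen(text), tokens, n, true, true);

        for (int32_t i = 0; i < n; ++i) {
            char piece[64];
            int32_t len = llama_token_to_piece(vocab, tokens[i], piece, sizeof(piece), 0, true);
            if (len > 0) { printf("%d -> '%.*s'\n", tokens[i], len, piece); }
        }

        // and back to text
        char out[1024];
        int32_t written = llama_detokenize(vocab, tokens, n, out, sizeof(out), false, true);
        if (written >= 0) { printf("detokenized: %.*s\n", written, out); }

        free(tokens);
    }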
Chat Templates
• llama_chat_apply_template(tmpl, chat, n_msg, add_ass, buf, length) – Apply a
chat-completion template to a list of messages. tmpl is an optional custom template name (or
NULL to use model default). chat is an array of llama_chat_message structs of length
n_msg . If add_ass=true , the assistant prompt token(s) are added at the end. The formatted
prompt is written to buf (up to length bytes). Returns the number of bytes written (may
exceed length – if so, reallocate buffer) 68 .
• llama_chat_builtin_templates(output, len) – Get a list of built-in template names.
Writes up to len C-strings into the output array and returns the number of names written.
Useful to discover supported templates 69 .
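A sketch of applying the model's default chat template and growing the buffer when the return
value exceeds its size; the messages are example data:

    #include <stdio.h>
    #include <stdlib.h>
    #include "llama.h"

    // assumes `model` is loaded
    void format_chat(const struct llama_model * model) {
        const char * tmpl = llama_model_chat_template(model, NULL);  // NULL -> default template
        if (tmpl == NULL) { return; }                                // model ships no template

        struct llama_chat_message msgs[] = {
            { "system", "You are a helpful assistant." },
            { "user",   "Hello!"                       },
        };
        const size_t n_msg = sizeof(msgs) / sizeof(msgs[0]);

        int32_t cap = 1024;
        char *  buf = malloc((size_t) cap);
        int32_t len = llama_chat_apply_template(tmpl, msgs, n_msg, /*add_ass=*/true, buf, cap);
        if (len > cap) {                 // result did not fit: grow and retry
            char * tmp = realloc(buf, (size_t) len);
            if (tmp == NULL) { free(buf); return; }
            buf = tmp;
            cap = len;
            len = llama_chat_apply_template(tmpl, msgs, n_msg, true, buf, cap);
        }
        if (len >= 0) { printf("%.*s\n", len, buf); }

        free(buf);
    }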
Adapters
• llama_adapter_lora_init(model, path_lora) – Load a LoRA adapter from file
path_lora into memory, associated with model . Returns a new llama_adapter_lora* .
This adapter can then be applied to contexts without modifying the base model 70 .
• llama_adapter_lora_free(adapter) – Manually free a LoRA adapter object if needed.
(Note: adapters loaded into a model will be freed automatically when llama_model_free is
called.) 71 .
• llama_set_adapter_lora(ctx, adapter, scale) – Add the LoRA adapter adapter to
the context ctx with a multiplication scale . This applies the adapter’s weights on-the-fly to
the model for this context. Returns 0 on success or a negative code on failure 72 .
• llama_rm_adapter_lora(ctx, adapter) – Remove a LoRA adapter previously applied to
ctx . Returns 0 on success or -1 if that adapter was not found in the context 73 .
• llama_clear_adapter_lora(ctx) – Remove all LoRA adapters from ctx , restoring the
original model weights for that context 74 .
• llama_apply_adapter_cvec(ctx, data, len, n_embd, il_start, il_end) – Apply a
control vector (adapter) to ctx . data points to a float buffer of length len (should equal
n_embd * n_layers ). This adds a custom bias to each layer’s input (from layer il_start to
il_end , inclusive). If data is NULL , the current control vector is cleared. Returns 0 on
success 75 .
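An illustrative LoRA workflow, assuming a model and context already exist; the adapter path is a
placeholder:

    #include <stdio.h>
    #include "llama.h"

    // assumes `model` and `ctx` already exist; "adapter.gguf" is a placeholder path
    void use_lora(struct llama_model * model, struct llama_context * ctx) {
        struct llama_adapter_lora * lora = llama_adapter_lora_init(model, "adapter.gguf");
        if (lora == NULL) {
            fprintf(stderr, "failed to load LoRA adapter\n");
            return;
        }

        // apply the adapter to this context at 80% strength
        if (llama_set_adapter_lora(ctx, lora, 0.8f) != 0) {
            fprintf(stderr, "failed to apply adapter\n");
        }

        // ... run llama_decode with the adapter active ...

        llama_rm_adapter_lora(ctx, lora);   // or llama_clear_adapter_lora(ctx) to drop all
        // the adapter is freed together with the model, or explicitly:
        // llama_adapter_lora_free(lora);
    }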
KV Cache Operations
• llama_kv_self_clear(ctx) – Clear the entire self-attention KV cache of ctx . This erases
all stored keys/values and resets sequence info 76 .
• llama_kv_self_seq_rm(ctx, seq_id, p0, p1) – Remove tokens of sequence seq_id
with positions in [p0,p1) from the KV cache. Negative seq_id or p0/p1 act as wildcards
( seq_id<0 means all sequences, p0<0 means start at 0, etc.). Returns true if successful, or
false if a partial range couldn’t be removed. Removing a whole sequence always succeeds
77.
• llama_kv_self_seq_cp(ctx, seq_id_src, seq_id_dst, p0, p1) – Copy tokens from
sequence seq_id_src to seq_id_dst for positions in [p0,p1). This does not allocate new
memory; it just reassigns the tokens to the new sequence ID 78 .
• llama_kv_self_seq_keep(ctx, seq_id) – Remove all tokens not belonging to seq_id ,
effectively keeping only that sequence in the cache 79 .
• llama_kv_self_seq_add(ctx, seq_id, p0, p1, delta) – Add delta to the positions
of tokens belonging to seq_id in [p0,p1). This shifts their effective positions. If positional
embeddings (RoPE) are used, the cache will be updated lazily on the next decode or explicitly by
llama_kv_self_update 80 .
• llama_kv_self_seq_div(ctx, seq_id, p0, p1, d) – Divide the positions of tokens in
seq_id over [p0,p1) by integer d>1 . Also updates RoPE if needed (lazily) 81 .
• llama_kv_self_seq_pos_min(ctx, seq_id) – Return the smallest position index present
for sequence seq_id in the KV cache, or -1 if the sequence is empty 82 .
• llama_kv_self_seq_pos_max(ctx, seq_id) – Return the largest position index for
seq_id , or -1 if empty 83 .
• llama_kv_self_defrag(ctx) – Defragment the KV cache memory (repack and remove
gaps). This is applied lazily on next decode or immediately by llama_kv_self_update 84 .
• llama_kv_self_can_shift(ctx) – Return true if the context’s KV cache supports shifting
operations (some caches can’t be shifted) 85 .
• llama_kv_self_update(ctx) – Apply all pending KV cache updates now (shifts, defrags,
etc.). Otherwise, updates happen lazily on the next llama_decode 86 .
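A sketch of a common context-shift pattern built from these calls: drop part of sequence 0 and
shift the remaining tokens back. The keep/discard policy here is arbitrary, not prescribed by the
API:

    #include "llama.h"

    // keep the first n_keep tokens, discard half of the rest, shift the tail back
    void shift_context(struct llama_context * ctx, int n_keep) {
        const llama_pos pos_max   = llama_kv_self_seq_pos_max(ctx, 0);
        const int       n_discard = (pos_max - n_keep) / 2;
        if (n_discard <= 0 || !llama_kv_self_can_shift(ctx)) {
            return;
        }

        // remove positions [n_keep, n_keep + n_discard) from sequence 0
        llama_kv_self_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
        // move the remaining tokens back by n_discard positions (p1 < 0 means "to the end")
        llama_kv_self_seq_add(ctx, 0, n_keep + n_discard, -1, -n_discard);
        // shifts are applied lazily; force them now (optional)
        llama_kv_self_update(ctx);
    }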
State and Sessions
• llama_state_get_size(ctx) – Return the size in bytes needed to save the full state of ctx
(logits, embeddings, and KV cache). Use this to allocate a buffer before calling
llama_state_get_data 87 .
• llama_state_get_data(ctx, dst, size) – Copy the current model state into dst
(buffer of length size ). Returns the number of bytes written. The buffer must be large enough
(use llama_state_get_size ) 88 .
• llama_state_set_data(ctx, src, size) – Restore the model state of ctx from the
buffer src of length size . Returns the number of bytes read 89 . Use this to rewind or
switch states.
• llama_state_load_file(ctx, path_session, tokens_out, n_token_capacity,
n_token_count_out) – Load a session from file at path_session . This restores context
state (KV, etc.) and fills tokens_out with the tokens from the session, up to capacity
n_token_capacity . Writes the actual token count to n_token_count_out . Returns true
on success 90 .
• llama_state_save_file(ctx, path_session, tokens, n_token_count) – Save the
current session to path_session , writing out the tokens array of length n_token_count .
Returns true on success 91 .
• llama_state_seq_get_size(ctx, seq_id) – Get the number of bytes needed to copy the
KV cache of sequence seq_id alone. This allows saving or transferring a single sequence state
92 .
• llama_state_seq_get_data(ctx, dst, size, seq_id) – Copy the KV cache of sequence
seq_id into dst (buffer of length size ). Returns bytes written 93 .
• llama_state_seq_set_data(ctx, src, size, dest_seq_id) – Load sequence state
from src buffer (length size ) into dest_seq_id in ctx . Returns positive on success, 0
on failure 94 .
• llama_state_seq_save_file(ctx, filepath, seq_id, tokens, n_token_count) –
Save the KV and token data of sequence seq_id to filepath (similar to
state_save_file but for one sequence) 95 .
• llama_state_seq_load_file(ctx, filepath, dest_seq_id, tokens_out,
n_token_capacity, n_token_count_out) – Load a sequence state from file into
dest_seq_id , and output the token list to tokens_out . Fills n_token_count_out .
Returns number of bytes read 96 .
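A minimal in-memory snapshot/restore sketch; the helper names are illustrative:

    #include <stdlib.h>
    #include <stdint.h>
    #include "llama.h"

    // snapshot the full context state (logits, embeddings, KV cache) into a heap buffer
    uint8_t * snapshot_state(struct llama_context * ctx, size_t * out_size) {
        const size_t size = llama_state_get_size(ctx);
        uint8_t * buf = malloc(size);
        if (buf != NULL) {
            *out_size = llama_state_get_data(ctx, buf, size);
        }
        return buf;
    }

    // later: rewind the context to the snapshot
    void restore_state(struct llama_context * ctx, const uint8_t * buf, size_t size) {
        llama_state_set_data(ctx, buf, size);
    }

For persistence across runs, llama_state_save_file / llama_state_load_file provide the same
snapshot as a session file together with the token history.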
Sampling
• llama_sampler_init(iface, ctx) – Create a sampler object using a custom interface
iface (function pointers) and optional context data ctx . This is for advanced custom
samplers. Returns a new llama_sampler* 97 .
• llama_sampler_name(smpl) – Return the name of sampler smpl (as a C-string) 98 . May be
NULL if not implemented.
• llama_sampler_accept(smpl, token) – Inform sampler smpl that token has been
accepted (finalized) as next output. This allows the sampler to update internal state (e.g. for
repetition penalty) 99 .
• llama_sampler_apply(smpl, cur_p) – Apply the sampler to modify the token data array
cur_p (which contains logits/probs) according to its rules (e.g. top-k filtering). Terminal
samplers such as dist or greedy also set the selected token in cur_p .
• llama_sampler_reset(smpl) – Reset sampler state to initial (as if no tokens have been
sampled yet) 100 .
• llama_sampler_clone(smpl) – Create a copy of sampler smpl with the same settings and
state. Useful for forking a sampler without affecting the original 101 .
• llama_sampler_free(smpl) – Free a sampler object. Note: do not free a sampler that has
been added to a chain (the chain will own it) 102 .
• llama_sampler_chain_init(params) – Create a sampler chain (a composite sampler) with
given chain parameters. A chain can hold multiple samplers applied in sequence 103 .
• llama_sampler_chain_add(chain, smpl) – Add sampler smpl into chain chain . The
chain takes ownership of smpl (and will free it when the chain is freed) 104 .
• llama_sampler_chain_get(chain, i) , llama_sampler_chain_n(chain) ,
llama_sampler_chain_remove(chain, i) – Query and modify the chain: get the i -th
sampler, get the number of samplers, or remove (and return) the i -th sampler. Removed
samplers are no longer owned by the chain 105 .
• Predefined samplers (create with default parameters):
• llama_sampler_init_greedy() – Greedy decoding (choose highest-prob token).
• llama_sampler_init_dist(seed) – Random sampling with no filtering (choose according
to softmax), seeded by seed .
• llama_sampler_init_top_k(k) – Keep only top- k highest-logit tokens, set others to zero
probability 106 .
• llama_sampler_init_top_p(p, min_keep) – Nucleus (top-p) sampling: keep a subset of
tokens whose cumulative probability ≥ p 107 .
• llama_sampler_init_min_p(p, min_keep) – Min-p sampling: keep only tokens whose
probability is at least p times that of the most likely token 108 .
• llama_sampler_init_typical(p, min_keep) – Locally Typical Sampling (balances
entropy) 109 .
• llama_sampler_init_temp(t) – Temperature rescaling: scales logits by 1/t (higher t =
more random) 110 .
• llama_sampler_init_temp_ext(t, delta, exponent) – Extended dynamic temperature
(entropy) sampler 111 .
• llama_sampler_init_xtc(p, t, min_keep, seed) – XTC ("exclude top choices") sampler
112 .
• llama_sampler_init_top_n_sigma(n) – Top-nσ sampler: keeps tokens whose logits are
within n standard deviations of the maximum logit 113 .
• llama_sampler_init_mirostat(n_vocab, seed, tau, eta, m) – Mirostat 1.0 sampling
(target cross-entropy tau , learning rate eta , lookahead m ) 114 .
• llama_sampler_init_mirostat_v2(seed, tau, eta) – Mirostat 2.0 sampling (simplified
version) 115 .
• llama_sampler_init_grammar(vocab, grammar_str, grammar_root) – Grammar-
based sampler: constrains output to follow a GBNF grammar 116 .
• llama_sampler_init_grammar_lazy_patterns(vocab, grammar_str, grammar_root,
trigger_patterns, num_patterns, trigger_tokens, num_tokens) – Lazy grammar
sampler that triggers when certain patterns or tokens appear 117 .
• llama_sampler_init_penalties(penalty_last_n, penalty_repeat, penalty_freq,
penalty_present) – Repetition and frequency penalty sampler (higher penalties discourage
recent or common tokens) 118 .
• llama_sampler_init_dry(vocab, n_ctx_train, dry_multiplier, dry_base,
dry_allowed_length, dry_penalty_last_n, seq_breakers, num_breakers) – DRY
("don't repeat yourself") sequence-repetition penalty sampler 119 .
• llama_sampler_init_logit_bias(n_vocab, n_logit_bias, logit_bias) – Bias
sampler: directly add biases from a logit_bias array (length n_logit_bias ) to model logits
120 .
• llama_sampler_init_infill(vocab) – Infill sampler for fill-in-the-middle: collapses
probabilities of tokens with shared prefixes to prefer common prefixes 121 .
• llama_sampler_get_seed(smpl) – Return the random seed used by sampler smpl (if
applicable), or LLAMA_DEFAULT_SEED if none 122 .
• llama_sampler_sample(smpl, ctx, idx) – Convenience function: sample a token from
the logits of context ctx at output index idx using sampler smpl . It internally gets logits,
applies the sampler (via llama_sampler_apply ), selects the token, calls
llama_sampler_accept , and returns the token ID 123 .
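A sketch of a typical generation loop with a sampler chain (top-k, top-p, temperature, then a
seeded dist sampler). The parameter values are arbitrary, and the context is assumed to already
contain a decoded prompt:

    #include <stdio.h>
    #include "llama.h"

    // assumes `ctx` holds a decoded prompt and `vocab` came from llama_model_get_vocab()
    void generate(struct llama_context * ctx, const struct llama_vocab * vocab, int n_predict) {
        struct llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
        llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

        for (int i = 0; i < n_predict; ++i) {
            // sample from the logits of the last output token; accept() is called internally
            llama_token tok = llama_sampler_sample(chain, ctx, -1);
            if (llama_vocab_is_eog(vocab, tok)) {
                break;
            }

            char piece[64];
            int32_t len = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
            if (len > 0) { printf("%.*s", len, piece); }

            // feed the sampled token back in for the next step
            struct llama_batch batch = llama_batch_get_one(&tok, 1);
            if (llama_decode(ctx, batch) != 0) {
                break;
            }
        }

        llama_sampler_free(chain);   // also frees the samplers owned by the chain
    }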
Utility Functions
• llama_split_path(split_path, maxlen, path_prefix, split_no, split_count) –
Given a base path prefix (without the -NNNNN-of-NNNNN.gguf suffix), build a split file path.
Writes into split_path and returns its length. E.g. with prefix /models/ggml-model-q4_0 ,
split_no=2 , split_count=4 , produces /models/ggml-model-q4_0-00002-of-00004.gguf
124 . (See the sketch at the end of this section.)
• llama_split_prefix(split_prefix, maxlen, split_path, split_no, split_count)
– Given a split file path and its indices, extract the original prefix. Writes the prefix into
split_prefix and returns its length, or -1 if split_no / split_count don't match the path 125 .
• llama_perf_context(ctx) , llama_perf_context_print(ctx) ,
llama_perf_context_reset(ctx) – Performance profiling utilities for a context.
llama_perf_context(ctx) returns a struct with timing data (ms spent loading the model,
evaluating the prompt, and generating tokens) and evaluation counts. _print writes the stats
via the logger, and _reset clears them. (Used in the examples; for precise profiling you may
also measure directly.) 126 .
• llama_perf_sampler(chain) , llama_perf_sampler_print(chain) ,
llama_perf_sampler_reset(chain) – Similar performance stats for a sampler chain: timing
of sampling and number of samples processed. Works only for chains built via
llama_sampler_chain_init 127 .
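A small sketch exercising the split-path helpers, matching the example above:

    #include <stdio.h>
    #include "llama.h"

    int main(void) {
        char path[256];
        char prefix[256];

        // build the path of split 2 out of 4 for a given prefix
        llama_split_path(path, sizeof(path), "/models/ggml-model-q4_0", 2, 4);
        printf("split path: %s\n", path);

        // recover the prefix from a split path (returns -1 if the indices don't match)
        if (llama_split_prefix(prefix, sizeof(prefix), path, 2, 4) > 0) {
            printf("prefix: %s\n", prefix);
        }
        return 0;
    }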
Training and Optimization
• llama_opt_param_filter_all(tensor, userdata) – A default parameter filter that
returns true for every tensor. Useful to tell the optimizer that all model parameters are
trainable 128 .
• llama_opt_init(lctx, model, lopt_params) – Initialize optimization (training) for
context lctx and model with options lopt_params . This sets up an internal optimizer
(SGD/Adam, etc.) for fine-tuning. Must be called before training loops 129 .
• llama_opt_epoch(lctx, dataset, result_train, result_eval, idata_split,
callback_train, callback_eval) – Run one training epoch. dataset provides training
data, result_train / result_eval record metrics, idata_split can separate train/
validation, and callbacks are invoked per batch. This uses GGML’s optimizer under the hood.
Parameters follow llama_opt_params settings in lopt_params given to
llama_opt_init 130 .
Note: Functions marked DEPRECATED in the header (e.g. llama_free_model , the old context-
creation and token-helper names) are not listed above, as newer alternatives exist. Always prefer the
non-deprecated llama_model_free , llama_init_from_model , llama_vocab_* , etc.
Each function is documented based on its declaration in llama.h and behavior in [Link] 131 132 .
Please refer to those source lines for exact signatures and context. The API names and parameters are
case-sensitive and must be used as shown.
Citations 1, 2, 9–132: llama.h ([Link])
Citations 3–8: [Link]