Llama.cpp C API Reference
Initialization and Backend
• llama_model_default_params() – Returns a llama_model_params struct filled with
default values for loading a model 1 . Use this to get sensible defaults (e.g. no tensor overrides,
use_mmap enabled, etc.) before customizing model parameters.
• llama_context_default_params() – Returns a llama_context_params struct with
default settings for creating a new context 1 . Modify fields (like n_ctx , GPU layers, etc.) as
needed.
• llama_sampler_chain_default_params() – Returns a llama_sampler_chain_params
struct with default options for initializing a sampler chain 2 .
• llama_model_quantize_default_params() – Returns a
llama_model_quantize_params struct with default options for model quantization 2 .
• llama_backend_init() – Initialize the global llama/ggml backend (should be called once at
program start). This sets up timing and FP16 tables internally 3 .
• llama_backend_free() – Free any global llama resources (e.g. quantization tables) before
program exit 4 .
• llama_numa_init(ggml_numa_strategy numa) – Enable CPU NUMA optimizations
according to numa (or disable with GGML_NUMA_STRATEGY_DISABLED ) 5 . Call before
loading models if NUMA is needed.
• llama_time_us() – Return the current wall-clock time in microseconds, as used by ggml for
profiling 6 . Useful for custom timing.
• llama_max_devices() – Return the maximum number of GPU devices the library can use
(currently fixed at 16) 7 .
• llama_supports_mmap() , llama_supports_mlock() ,
llama_supports_gpu_offload() , llama_supports_rpc() – Return booleans indicating
whether memory-mapped IO, locked memory, GPU offload, or RPC backends are supported on
this build/platform 8 9 .
• llama_attach_threadpool(ctx, threadpool, threadpool_batch) – Attach custom
GGML threadpools to ctx for parallel decoding: threadpool is used for single-token
(generation) work and threadpool_batch for batch/prompt processing. If not attached, ggml
creates its own thread pool automatically 10 .
• llama_detach_threadpool(ctx) – Detach any previously attached threadpool from ctx ,
reverting to default behavior 11 .
• llama_print_system_info() – Return a string summarizing system and hardware
information (CPU, GPU, instruction sets, etc.) 12 .
• llama_log_set(log_callback, user_data) – Set a callback for internal logging
messages. By default, logs go to stderr . Provide a ggml_log_callback function and
optional user data to redirect or handle logs 13 .
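A minimal startup/shutdown sketch using the calls above. The log handler and its name are
illustrative, not part of the API:

    #include <stdio.h>
    #include "llama.h"

    // illustrative log handler: forward llama/ggml messages to stderr
    static void my_log(enum ggml_log_level level, const char * text, void * user_data) {
        (void) level; (void) user_data;
        fputs(text, stderr);
    }

    int main(void) {
        llama_log_set(my_log, NULL);   // redirect internal logging
        llama_backend_init();          // global init, once per process
        // llama_numa_init(GGML_NUMA_STRATEGY_DISTRIBUTE);  // optional, before loading models

        printf("%s\n", llama_print_system_info());
        printf("mmap: %d, gpu offload: %d, max devices: %zu\n",
               llama_supports_mmap(), llama_supports_gpu_offload(), llama_max_devices());

        // ... load models, create contexts, run inference ...

        llama_backend_free();          // global cleanup before exit
        return 0;
    }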
Model Loading and Saving
• llama_model_load_from_file(path_model, params) – Load a model from a single GGUF
file at path_model , using options in params (e.g. use_mmap , n_gpu_layers ,
vocab_only , etc.) 14 . Returns a pointer to a newly allocated llama_model , or NULL on
failure. After loading, the model can be used to create contexts.
• llama_model_load_from_splits(paths, n_paths, params) – Load a model split across
multiple files. paths is an array of file paths (in correct order) of length n_paths , and
params are loading options. Returns a llama_model pointer on success 15 . Use this when
the model is split but does not follow the default naming scheme.
• llama_model_save_to_file(model, path_model) – Save the given model’s tensors to a
GGUF file at path_model . This writes out all layers and weights of model . Requires a model
created or loaded earlier 16 .
• llama_model_free(model) – Free a loaded model and all its associated memory. After
calling this, the model pointer must not be used again 17 . (If you had LoRA adapters loaded,
they will be freed automatically with the model.)
• llama_model_quantize(fname_inp, fname_out, params) – Quantize a model file. Reads
the model from fname_inp and writes a quantized model to fname_out using parameters
params (e.g. target bits, algorithm). Returns 0 on success 18 .
• llama_model_n_ctx_train(model) , llama_model_n_embd(model) ,
llama_model_n_layer(model) , llama_model_n_head(model) ,
llama_model_n_head_kv(model) – Return architecture parameters of model : training
context size, embedding dimension, number of layers, number of attention heads, and KV heads
per layer, respectively 19 . These tell you the model’s structure.
• llama_model_rope_freq_scale_train(model) – Return the RoPE (rotary position
encoding) frequency scaling factor used during training (usually 1.0 ) 20 . This can inform
position handling at inference.
• llama_model_rope_type(model) – Return the type of RoPE (rotary embeddings) used by the
model 21 . (Requires including llama_rope_type enum.)
• llama_model_n_params(model) – Return the total number of parameters (weights) in the
model 22 .
• llama_model_size(model) – Return the total size in bytes of all model tensors (roughly the
memory footprint) 23 .
• llama_model_has_encoder(model) / llama_model_has_decoder(model) – Return
true if the model has an encoder/decoder component (for encoder-decoder models) 24 .
• llama_model_decoder_start_token(model) – For encoder-decoder models, return the
token ID that should be passed to the decoder to start generation. For non-encoder-decoder
models, returns -1 25 .
• llama_model_is_recurrent(model) – Return true if the model is a recurrent (stateful)
model such as RWKV or Mamba 26 .
• llama_model_meta_val_str(model, key, buf, buf_size) – Retrieve a string metadata
value by key from the model’s GGUF metadata. Writes into buf up to buf_size bytes.
Returns the string length on success or -1 on failure 27 .
• llama_model_meta_count(model) – Return the number of metadata entries in the model
28.
• llama_model_meta_key_by_index(model, i, buf, buf_size) – Get the metadata key
name of index i (0-based) into buf . Returns length or -1 on failure 29 .
• llama_model_meta_val_str_by_index(model, i, buf, buf_size) – Get the metadata
value (as string) of index i into buf . Returns length or -1 30 .
• llama_model_desc(model, buf, buf_size) – Get a human-readable description of the
model type into buf (up to buf_size ). Returns length or -1 31 .
• llama_model_chat_template(model, name) – Get a default chat prompt template for the
model. If name is non-NULL, get a named template; if name is NULL , get the default
template. Returns a C-string (or NULL if not available) 32 .
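A sketch of loading a model and dumping its description and GGUF metadata; the file name is a
placeholder and error handling is minimal:

    #include <stdio.h>
    #include "llama.h"

    int main(void) {
        llama_backend_init();

        struct llama_model_params mparams = llama_model_default_params();
        mparams.n_gpu_layers = 99;     // offload as many layers as possible (GPU builds)
        mparams.use_mmap     = true;

        // "model.gguf" is a placeholder path
        struct llama_model * model = llama_model_load_from_file("model.gguf", mparams);
        if (model == NULL) {
            fprintf(stderr, "failed to load model\n");
            return 1;
        }

        char desc[128];
        llama_model_desc(model, desc, sizeof(desc));
        printf("model: %s, %llu params, n_ctx_train = %d\n",
               desc,
               (unsigned long long) llama_model_n_params(model),
               llama_model_n_ctx_train(model));

        // enumerate the GGUF metadata key/value pairs
        char key[256], val[256];
        for (int32_t i = 0; i < llama_model_meta_count(model); ++i) {
            if (llama_model_meta_key_by_index(model, i, key, sizeof(key)) >= 0 &&
                llama_model_meta_val_str_by_index(model, i, val, sizeof(val)) >= 0) {
                printf("%s = %s\n", key, val);
            }
        }

        llama_model_free(model);
        llama_backend_free();
        return 0;
    }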
Context Management and Decoding
• llama_init_from_model(model, params) – Create a new llama_context from a loaded
model. params (from llama_context_default_params() ) can customize context size
( n_ctx ), number of GPU layers, embedding mode, etc. Returns a pointer to the new context, or
NULL on error 33 .
• llama_free(ctx) – Free a llama_context and all internal buffers (activations, KV cache,
etc.). After this, ctx must not be used 34 .
• llama_get_model(ctx) – Return the llama_model* associated with ctx 35 .
• llama_get_kv_self(ctx) – Return the llama_kv_cache* used by ctx for self-attention
(the KV cache) 35 .
• llama_pooling_type(ctx) – Return the pooling type ( llama_pooling_type ) of ctx
36 .
• llama_n_ctx(ctx) , llama_n_batch(ctx) , llama_n_ubatch(ctx) ,
llama_n_seq_max(ctx) – Query context dimensions:
• n_ctx : the context size (in tokens) of this context;
• n_batch : the maximum logical batch size that can be submitted to llama_decode ;
• n_ubatch : the maximum physical (micro-)batch size;
• n_seq_max : the maximum number of distinct sequences.
These are useful for understanding the context configuration.
• llama_encode(ctx, batch) – Process a token batch with the encoder (no KV cache). For
encoder-decoder models, this runs the encoder on batch . Returns 0 on success or negative on
error 37 .
• llama_decode(ctx, batch) – Process a token batch through the decoder (requires KV
cache). Applies autoregressive attention. Returns 0 on success, positive for warnings (e.g.
context full), or negative on error 38 . On error or warning, the KV cache state is restored to
before the call.
• llama_batch_init(n_tokens, embd, n_seq_max) – Allocate a llama_batch struct for
up to n_tokens tokens. If embd != 0 , allocates an embeddings buffer of size n_tokens *
embd ; otherwise, it allocates token and logits arrays. n_seq_max is the max sequences per
token. The caller must fill the batch fields ( token , pos , n_seq_id , seq_id , logits ) before
decode (see the sketch at the end of this section).
• llama_batch_free(batch) – Free a llama_batch allocated by llama_batch_init()
39 .
• llama_set_n_threads(ctx, n_threads, n_threads_batch) – Set how many CPU
threads ctx uses: n_threads for single-token (generation) steps, n_threads_batch for
prompt/batch processing 40 .
• llama_n_threads(ctx) , llama_n_threads_batch(ctx) – Query the thread counts
currently set for ctx 41 .
• llama_set_embeddings(ctx, embeddings) – If embeddings is true , enable
embeddings-only mode: on decode, the model will compute and return embeddings but not
logits 42 .
• llama_set_causal_attn(ctx, causal_attn) – Enable ( true ) or disable causal
(autoregressive) attention for generation. By default causal is used (attend only to past tokens)
43 .
• llama_set_warmup(ctx, warmup) – If warmup is true , the first call to llama_decode
will preload all weight tensors into memory (warm up the cache), which can make the first
generation faster 44 .
• llama_set_abort_callback(ctx, abort_callback, abort_callback_data) – Set a
user callback that can abort long computations. abort_callback should return non-zero to
stop. This is checked periodically during decode 45 .
• llama_synchronize(ctx) – Wait for all pending computations in ctx to finish. Normally
not needed if you immediately access results (logits/embeddings), but can be used to
synchronize explicitly 46 .
• llama_get_logits(ctx) – After a call to llama_decode , returns a pointer to the logits
array of the last output token(s). The array has shape [n_rows][n_vocab] , where rows
correspond to batch tokens whose logits flag was set ( llama_batch.logits[i] != 0 ) 47 .
• llama_get_logits_ith(ctx, i) – Get pointer to the logits of the i -th output token
(counting from 0) in the last batch. Supports negative indexing ( -1 for last). Returns NULL if
i is out of range 48 .
• llama_get_embeddings(ctx) – After decode, returns a pointer to all output token
embeddings (if available). If pooling is NONE (or generative model), embeddings for tokens with
outputs are stored contiguously 49 . Otherwise returns NULL .
• llama_get_embeddings_ith(ctx, i) – Get the embedding vector for the i -th output
token. Returns a pointer to [n_embd] floats, or NULL if invalid index 50 .
• llama_get_embeddings_seq(ctx, seq_id) – Get the pooled embedding for a full
sequence seq_id . If pooling is RANK, returns a single float (the rank/score of the sequence).
Otherwise (e.g. MEAN, CLS, LAST pooling) returns a float array of length n_embd . Returns
NULL if the pooling type is NONE 51 .
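A sketch tying the context functions together: create a context, fill a llama_batch manually,
decode it, and read the logits of the last token. The prompt token IDs are assumed to come from
llama_tokenize (next section); sizes and thread counts are example values:

    #include <stdio.h>
    #include "llama.h"

    // assumes `model` was loaded as in the previous section
    void decode_prompt(struct llama_model * model, const llama_token * prompt, int32_t n_prompt) {
        struct llama_context_params cparams = llama_context_default_params();
        cparams.n_ctx   = 4096;    // context window for this session
        cparams.n_batch = 512;     // max tokens submitted per llama_decode call

        struct llama_context * ctx = llama_init_from_model(model, cparams);
        if (ctx == NULL) { return; }

        llama_set_n_threads(ctx, 8, 8);   // generation / batch thread counts (example values)

        // build a batch: one sequence (id 0), request logits for the last token only
        struct llama_batch batch = llama_batch_init(n_prompt, 0, 1);
        for (int32_t i = 0; i < n_prompt; ++i) {
            batch.token   [i]    = prompt[i];
            batch.pos     [i]    = i;
            batch.n_seq_id[i]    = 1;
            batch.seq_id  [i][0] = 0;
            batch.logits  [i]    = (i == n_prompt - 1);
        }
        batch.n_tokens = n_prompt;

        if (llama_decode(ctx, batch) == 0) {
            const float * logits = llama_get_logits_ith(ctx, -1);  // logits of last output token
            printf("first logit: %f\n", logits[0]);
        }

        llama_batch_free(batch);
        llama_free(ctx);
    }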
Tokenization and Vocabulary
• llama_tokenize(vocab, text, text_len, tokens, n_tokens_max, add_special,
parse_special) – Convert input UTF-8 text into token IDs using vocab . Writes up to
n_tokens_max tokens into tokens . If add_special=true , BOS/EOS are added per
model’s settings; if parse_special=true , special/control tokens in the text (like <|
endoftext|> ) are tokenized rather than treated as raw text 52 . Returns the number of tokens
produced, or a negative number if the text would exceed n_tokens_max (absolute value of
return is required capacity).
• llama_token_to_piece(vocab, token, buf, length, lstrip, special) – Convert a
single token ID to its text piece. Writes up to length characters into buf (without null-
terminator). Skips up to lstrip leading spaces in the token text before copying (for multi-
token decoding). If special=true , then special tokens are converted to their name strings
53 . Returns the number of characters written (or negative on failure).
• llama_detokenize(vocab, tokens, n_tokens, text, text_len_max,
remove_special, unparse_special) – Convert an array of n_tokens token IDs back into
UTF-8 text into text . Writes at most text_len_max bytes. If remove_special=true , BOS/
EOS tokens (if present) are omitted from output; if unparse_special=true , special tokens
are rendered as their string form (e.g. <|endoftext|> ). Returns number of bytes written (or
negative if overflow) 54 .
• llama_vocab_get_text(vocab, token) – Return the null-terminated string for a given
token using vocab ’s decoding table 55 .
• llama_vocab_get_score(vocab, token) – Return the tokenizer score associated with
token in the vocab (e.g. the SentencePiece unigram/merge score) 56 .
• llama_vocab_get_attr(vocab, token) – Return attributes ( llama_token_attr ) of
token (e.g. whether it is a normal, control, user-defined, or byte token) 57 .
• llama_vocab_is_eog(vocab, token) – Return true if token is an end-of-generation
token (EOS, EOT, etc.) 58 .
• llama_vocab_is_control(vocab, token) – Return true if token is a control token
(special non-renderable token) 59 .
• Special token getters: Functions that return special token IDs for the given vocab :
• llama_vocab_bos(vocab) – BOS (beginning-of-sentence) token 60 .
• llama_vocab_eos(vocab) – EOS (end-of-sentence) token 61 .
• llama_vocab_eot(vocab) – EOT (end-of-turn) token 62 .
• llama_vocab_sep(vocab) – Sentence separator token 63 .
• llama_vocab_nl(vocab) – Newline token 64 .
• llama_vocab_pad(vocab) – Padding token 65 .
• llama_vocab_get_add_bos(vocab) , llama_vocab_get_add_eos(vocab) – Return
whether the model’s tokenizer automatically adds BOS or EOS tokens to input prompts 66 .
• Fill-in-the-middle (FIM) tokens:
• llama_vocab_fim_pre(vocab) , llama_vocab_fim_suf(vocab) ,
llama_vocab_fim_mid(vocab) , llama_vocab_fim_pad(vocab) ,
llama_vocab_fim_rep(vocab) , llama_vocab_fim_sep(vocab) – Special tokens used for
fill-in-the-middle tasks (prefix, suffix, middle, padding, repetition, separator) 67 . Use these if
generating with FIM.
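An illustrative round trip through the tokenizer, assuming the vocab is obtained with
llama_model_get_vocab() (declared in the same header, not listed above). The probe call with a
NULL buffer relies on the negative-return convention described for llama_tokenize:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "llama.h"

    // assumes `model` is already loaded
    void tokenize_demo(const struct llama_model * model, const char * text) {
        const struct llama_vocab * vocab = llama_model_get_vocab(model);

        // probe: with a NULL/zero buffer the return is -(required token count)
        int32_t n = -llama_tokenize(vocab, text, (int32_t) strlen(text), NULL, 0,
                                    /*add_special=*/true, /*parse_special=*/true);
        llama_token * tokens = malloc((size_t) n * sizeof(llama_token));
        n = llama_tokenize(vocab, text, (int32_t) strlen(text), tokens, n, true, true);

        for (int32_t i = 0; i < n; ++i) {
            char piece[64];
            int32_t len = llama_token_to_piece(vocab, tokens[i], piece, sizeof(piece), 0, true);
            if (len > 0) { printf("%d -> '%.*s'\n", tokens[i], len, piece); }
        }

        // and back to text
        char out[1024];
        int32_t written = llama_detokenize(vocab, tokens, n, out, sizeof(out), false, true);
        if (written >= 0) { printf("detokenized: %.*s\n", written, out); }

        free(tokens);
    }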
Chat Templates
• llama_chat_apply_template(tmpl, chat, n_msg, add_ass, buf, length) – Apply a
chat-completion template to a list of messages. tmpl is an optional custom template name (or
NULL to use model default). chat is an array of llama_chat_message structs of length
n_msg . If add_ass=true , the assistant prompt token(s) are added at the end. The formatted
prompt is written to buf (up to length bytes). Returns the number of bytes written (may
exceed length – if so, reallocate buffer) 68 .
• llama_chat_builtin_templates(output, len) – Get a list of built-in template names.
Writes up to len C-strings into the output array and returns the number of names written.
Useful to discover supported templates 69 .
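A sketch of applying the model's default chat template and growing the buffer when the return
value exceeds its size; the messages are example data:

    #include <stdio.h>
    #include <stdlib.h>
    #include "llama.h"

    // assumes `model` is loaded
    void format_chat(const struct llama_model * model) {
        const char * tmpl = llama_model_chat_template(model, NULL);  // NULL -> default template
        if (tmpl == NULL) { return; }                                // model ships no template

        struct llama_chat_message msgs[] = {
            { "system", "You are a helpful assistant." },
            { "user",   "Hello!"                       },
        };
        const size_t n_msg = sizeof(msgs) / sizeof(msgs[0]);

        int32_t cap = 1024;
        char *  buf = malloc((size_t) cap);
        int32_t len = llama_chat_apply_template(tmpl, msgs, n_msg, /*add_ass=*/true, buf, cap);
        if (len > cap) {                 // result did not fit: grow and retry
            char * tmp = realloc(buf, (size_t) len);
            if (tmp == NULL) { free(buf); return; }
            buf = tmp;
            cap = len;
            len = llama_chat_apply_template(tmpl, msgs, n_msg, true, buf, cap);
        }
        if (len >= 0) { printf("%.*s\n", len, buf); }

        free(buf);
    }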
Adapters
• llama_adapter_lora_init(model, path_lora) – Load a LoRA adapter from file
path_lora into memory, associated with model . Returns a new llama_adapter_lora* .
This adapter can then be applied to contexts without modifying the base model 70 .
• llama_adapter_lora_free(adapter) – Manually free a LoRA adapter object if needed.
(Note: adapters loaded into a model will be freed automatically when llama_model_free is
called.) 71 .
• llama_set_adapter_lora(ctx, adapter, scale) – Add the LoRA adapter adapter to
the context ctx with a multiplication scale . This applies the adapter’s weights on-the-fly to
the model for this context. Returns 0 on success or a negative code on failure 72 .
• llama_rm_adapter_lora(ctx, adapter) – Remove a LoRA adapter previously applied to
ctx . Returns 0 on success or -1 if that adapter was not found in the context 73 .
• llama_clear_adapter_lora(ctx) – Remove all LoRA adapters from ctx , restoring the
original model weights for that context 74 .
• llama_apply_adapter_cvec(ctx, data, len, n_embd, il_start, il_end) – Apply a
control vector (adapter) to ctx . data points to a float buffer of length len (should equal
n_embd * n_layers ). This adds a custom bias to each layer’s input (from layer il_start to
il_end , inclusive). If data is NULL , the current control vector is cleared. Returns 0 on
success 75 .
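An illustrative LoRA workflow, assuming a model and context already exist; the adapter path is a
placeholder:

    #include <stdio.h>
    #include "llama.h"

    // assumes `model` and `ctx` already exist; "adapter.gguf" is a placeholder path
    void use_lora(struct llama_model * model, struct llama_context * ctx) {
        struct llama_adapter_lora * lora = llama_adapter_lora_init(model, "adapter.gguf");
        if (lora == NULL) {
            fprintf(stderr, "failed to load LoRA adapter\n");
            return;
        }

        // apply the adapter to this context at 80% strength
        if (llama_set_adapter_lora(ctx, lora, 0.8f) != 0) {
            fprintf(stderr, "failed to apply adapter\n");
        }

        // ... run llama_decode with the adapter active ...

        llama_rm_adapter_lora(ctx, lora);   // or llama_clear_adapter_lora(ctx) to drop all
        // the adapter is freed together with the model, or explicitly:
        // llama_adapter_lora_free(lora);
    }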
KV Cache Operations
• llama_kv_self_clear(ctx) – Clear the entire self-attention KV cache of ctx . This erases
all stored keys/values and resets sequence info 76 .
• llama_kv_self_seq_rm(ctx, seq_id, p0, p1) – Remove tokens of sequence seq_id
with positions in [p0,p1) from the KV cache. Negative seq_id or p0/p1 act as wildcards
( seq_id<0 means all sequences, p0<0 means start at 0, etc.). Returns true if successful, or
false if a partial range couldn’t be removed. Removing a whole sequence always succeeds
77.
• llama_kv_self_seq_cp(ctx, seq_id_src, seq_id_dst, p0, p1) – Copy tokens from
sequence seq_id_src to seq_id_dst for positions in [p0,p1). This does not allocate new
memory; it just reassigns the tokens to the new sequence ID 78 .
• llama_kv_self_seq_keep(ctx, seq_id) – Remove all tokens not belonging to seq_id ,
effectively keeping only that sequence in the cache 79 .
• llama_kv_self_seq_add(ctx, seq_id, p0, p1, delta) – Add delta to the positions
of tokens belonging to seq_id in [p0,p1). This shifts their effective positions. If positional
embeddings (RoPE) are used, the cache will be updated lazily on the next decode or explicitly by
llama_kv_self_update 80 .
• llama_kv_self_seq_div(ctx, seq_id, p0, p1, d) – Divide the positions of tokens in
seq_id over [p0,p1) by integer d>1 . Also updates RoPE if needed (lazily) 81 .
• llama_kv_self_seq_pos_min(ctx, seq_id) – Return the smallest position index present
for sequence seq_id in the KV cache, or -1 if the sequence is empty 82 .
• llama_kv_self_seq_pos_max(ctx, seq_id) – Return the largest position index for
seq_id , or -1 if empty 83 .
• llama_kv_self_defrag(ctx) – Defragment the KV cache memory (repack and remove
gaps). This is applied lazily on next decode or immediately by llama_kv_self_update 84 .
• llama_kv_self_can_shift(ctx) – Return true if the context’s KV cache supports shifting
operations (some caches can’t be shifted) 85 .
• llama_kv_self_update(ctx) – Apply all pending KV cache updates now (shifts, defrags,
etc.). Otherwise, updates happen lazily on the next llama_decode 86 .
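A sketch of a common context-shift pattern built from these calls: drop part of sequence 0 and
shift the remaining tokens back. The keep/discard policy here is arbitrary, not prescribed by the
API:

    #include "llama.h"

    // keep the first n_keep tokens, discard half of the rest, shift the tail back
    void shift_context(struct llama_context * ctx, int n_keep) {
        const llama_pos pos_max   = llama_kv_self_seq_pos_max(ctx, 0);
        const int       n_discard = (pos_max - n_keep) / 2;
        if (n_discard <= 0 || !llama_kv_self_can_shift(ctx)) {
            return;
        }

        // remove positions [n_keep, n_keep + n_discard) from sequence 0
        llama_kv_self_seq_rm (ctx, 0, n_keep, n_keep + n_discard);
        // move the remaining tokens back by n_discard positions (p1 < 0 means "to the end")
        llama_kv_self_seq_add(ctx, 0, n_keep + n_discard, -1, -n_discard);
        // shifts are applied lazily; force them now (optional)
        llama_kv_self_update(ctx);
    }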
State and Sessions
• llama_state_get_size(ctx) – Return the size in bytes needed to save the full state of ctx
(logits, embeddings, and KV cache). Use this to allocate a buffer before calling
llama_state_get_data 87 .
• llama_state_get_data(ctx, dst, size) – Copy the current model state into dst
(buffer of length size ). Returns the number of bytes written. The buffer must be large enough
(use llama_state_get_size ) 88 .
• llama_state_set_data(ctx, src, size) – Restore the model state of ctx from the
buffer src of length size . Returns the number of bytes read 89 . Use this to rewind or
switch states.
• llama_state_load_file(ctx, path_session, tokens_out, n_token_capacity,
n_token_count_out) – Load a session from file at path_session . This restores context
state (KV, etc.) and fills tokens_out with the tokens from the session, up to capacity
n_token_capacity . Writes the actual token count to n_token_count_out . Returns true
on success 90 .
• llama_state_save_file(ctx, path_session, tokens, n_token_count) – Save the
current session to path_session , writing out the tokens array of length n_token_count .
Returns true on success 91 .
• llama_state_seq_get_size(ctx, seq_id) – Get the number of bytes needed to copy the
KV cache of sequence seq_id alone. This allows saving or transferring a single sequence state
92 .
• llama_state_seq_get_data(ctx, dst, size, seq_id) – Copy the KV cache of sequence
seq_id into dst (buffer of length size ). Returns bytes written 93 .
• llama_state_seq_set_data(ctx, src, size, dest_seq_id) – Load sequence state
from src buffer (length size ) into dest_seq_id in ctx . Returns positive on success, 0
on failure 94 .
• llama_state_seq_save_file(ctx, filepath, seq_id, tokens, n_token_count) –
Save the KV and token data of sequence seq_id to filepath (similar to
state_save_file but for one sequence) 95 .
• llama_state_seq_load_file(ctx, filepath, dest_seq_id, tokens_out,
n_token_capacity, n_token_count_out) – Load a sequence state from file into
dest_seq_id , and output the token list to tokens_out . Fills n_token_count_out .
Returns number of bytes read 96 .
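A minimal in-memory snapshot/restore sketch; the helper names are illustrative:

    #include <stdlib.h>
    #include <stdint.h>
    #include "llama.h"

    // snapshot the full context state (logits, embeddings, KV cache) into a heap buffer
    uint8_t * snapshot_state(struct llama_context * ctx, size_t * out_size) {
        const size_t size = llama_state_get_size(ctx);
        uint8_t * buf = malloc(size);
        if (buf != NULL) {
            *out_size = llama_state_get_data(ctx, buf, size);
        }
        return buf;
    }

    // later: rewind the context to the snapshot
    void restore_state(struct llama_context * ctx, const uint8_t * buf, size_t size) {
        llama_state_set_data(ctx, buf, size);
    }

For persistence across runs, llama_state_save_file / llama_state_load_file provide the same
snapshot as a session file together with the token history.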
Sampling
• llama_sampler_init(iface, ctx) – Create a sampler object using a custom interface
iface (function pointers) and optional context data ctx . This is for advanced custom
samplers. Returns a new llama_sampler* 97 .
• llama_sampler_name(smpl) – Return the name of sampler smpl (as a C-string) 98 . May be
NULL if not implemented.
• llama_sampler_accept(smpl, token) – Inform sampler smpl that token has been
accepted (finalized) as next output. This allows the sampler to update internal state (e.g. for
repetition penalty) 99 .
• llama_sampler_apply(smpl, cur_p) – Apply the sampler to modify the token data array
cur_p (which contains logits/probs) according to its rules (e.g. top-k filtering). Terminal
samplers such as dist or greedy also set the selected token in cur_p .
• llama_sampler_reset(smpl) – Reset sampler state to initial (as if no tokens have been
sampled yet) 100 .
• llama_sampler_clone(smpl) – Create a copy of sampler smpl with the same settings and
state. Useful for forking a sampler without affecting the original 101 .
• llama_sampler_free(smpl) – Free a sampler object. Note: do not free a sampler that has
been added to a chain (the chain will own it) 102 .
• llama_sampler_chain_init(params) – Create a sampler chain (a composite sampler) with
given chain parameters. A chain can hold multiple samplers applied in sequence 103 .
• llama_sampler_chain_add(chain, smpl) – Add sampler smpl into chain chain . The
chain takes ownership of smpl (and will free it when the chain is freed) 104 .
• llama_sampler_chain_get(chain, i) , llama_sampler_chain_n(chain) ,
llama_sampler_chain_remove(chain, i) – Query and modify the chain: get the i -th
sampler, get the number of samplers, or remove (and return) the i -th sampler. Removed
samplers are no longer owned by the chain 105 .
• Predefined samplers (create with default parameters):
• llama_sampler_init_greedy() – Greedy decoding (choose highest-prob token).
• llama_sampler_init_dist(seed) – Random sampling with no filtering (choose according
to softmax), seeded by seed .
• llama_sampler_init_top_k(k) – Keep only top- k highest-logit tokens, set others to zero
probability 106 .
• llama_sampler_init_top_p(p, min_keep) – Nucleus (top-p) sampling: keep a subset of
tokens whose cumulative probability ≥ p 107 .
• llama_sampler_init_min_p(p, min_keep) – Min-p sampling: keep only tokens whose
probability is at least p times that of the most likely token 108 .
• llama_sampler_init_typical(p, min_keep) – Locally Typical Sampling (balances
entropy) 109 .
• llama_sampler_init_temp(t) – Temperature rescaling: scales logits by 1/t (higher t =
more random) 110 .
• llama_sampler_init_temp_ext(t, delta, exponent) – Extended dynamic temperature
(entropy) sampler 111 .
• llama_sampler_init_xtc(p, t, min_keep, seed) – XTC ("exclude top choices") sampler
112 .
• llama_sampler_init_top_n_sigma(n) – Top-nσ sampler: keeps tokens whose logits are
within n standard deviations of the maximum logit 113 .
• llama_sampler_init_mirostat(n_vocab, seed, tau, eta, m) – Mirostat 1.0 sampling
(target cross-entropy tau , learning rate eta , lookahead m ) 114 .
• llama_sampler_init_mirostat_v2(seed, tau, eta) – Mirostat 2.0 sampling (simplified
version) 115 .
• llama_sampler_init_grammar(vocab, grammar_str, grammar_root) – Grammar-
based sampler: constrains output to follow a GBNF grammar 116 .
• llama_sampler_init_grammar_lazy_patterns(vocab, grammar_str, grammar_root,
trigger_patterns, num_patterns, trigger_tokens, num_tokens) – Lazy grammar
sampler that triggers when certain patterns or tokens appear 117 .
• llama_sampler_init_penalties(penalty_last_n, penalty_repeat, penalty_freq,
penalty_present) – Repetition and frequency penalty sampler (higher penalties discourage
recent or common tokens) 118 .
• llama_sampler_init_dry(vocab, n_ctx_train, dry_multiplier, dry_base,
dry_allowed_length, dry_penalty_last_n, seq_breakers, num_breakers) – DRY
("don't repeat yourself") sequence-repetition penalty sampler 119 .
• llama_sampler_init_logit_bias(n_vocab, n_logit_bias, logit_bias) – Bias
sampler: directly add biases from a logit_bias array (length n_logit_bias ) to model logits
120 .
• llama_sampler_init_infill(vocab) – Infill sampler for fill-in-the-middle: collapses
probabilities of tokens with shared prefixes to prefer common prefixes 121 .
• llama_sampler_get_seed(smpl) – Return the random seed used by sampler smpl (if
applicable), or LLAMA_DEFAULT_SEED if none 122 .
• llama_sampler_sample(smpl, ctx, idx) – Convenience function: sample a token from
the logits of context ctx at output index idx using sampler smpl . It internally gets logits,
applies the sampler (via llama_sampler_apply ), selects the token, calls
llama_sampler_accept , and returns the token ID 123 .
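A sketch of a typical generation loop with a sampler chain (top-k, top-p, temperature, then a
seeded dist sampler). The parameter values are arbitrary, and the context is assumed to already
contain a decoded prompt:

    #include <stdio.h>
    #include "llama.h"

    // assumes `ctx` holds a decoded prompt and `vocab` came from llama_model_get_vocab()
    void generate(struct llama_context * ctx, const struct llama_vocab * vocab, int n_predict) {
        struct llama_sampler * chain = llama_sampler_chain_init(llama_sampler_chain_default_params());
        llama_sampler_chain_add(chain, llama_sampler_init_top_k(40));
        llama_sampler_chain_add(chain, llama_sampler_init_top_p(0.95f, 1));
        llama_sampler_chain_add(chain, llama_sampler_init_temp(0.8f));
        llama_sampler_chain_add(chain, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

        for (int i = 0; i < n_predict; ++i) {
            // sample from the logits of the last output token; accept() is called internally
            llama_token tok = llama_sampler_sample(chain, ctx, -1);
            if (llama_vocab_is_eog(vocab, tok)) {
                break;
            }

            char piece[64];
            int32_t len = llama_token_to_piece(vocab, tok, piece, sizeof(piece), 0, true);
            if (len > 0) { printf("%.*s", len, piece); }

            // feed the sampled token back in for the next step
            struct llama_batch batch = llama_batch_get_one(&tok, 1);
            if (llama_decode(ctx, batch) != 0) {
                break;
            }
        }

        llama_sampler_free(chain);   // also frees the samplers owned by the chain
    }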
Utility Functions
• llama_split_path(split_path, maxlen, path_prefix, split_no, split_count) –
Given a base path prefix (without the -NNNNN-of-NNNNN.gguf suffix), build a split file path.
Writes into split_path and returns its length. E.g. with prefix /models/ggml-model-q4_0 ,
split_no=2 , split_count=4 , produces /models/ggml-model-q4_0-00002-of-00004.gguf
124 . (See the sketch at the end of this section.)
• llama_split_prefix(split_prefix, maxlen, split_path, split_no, split_count)
– Given a split file path and its indices, extract the original prefix. Writes the prefix into
split_prefix and returns its length, or -1 if split_no / split_count don't match the path 125 .
• llama_perf_context(ctx) , llama_perf_context_print(ctx) ,
llama_perf_context_reset(ctx) – Performance profiling utilities for a context.
llama_perf_context(ctx) returns a struct with timing data (ms spent loading the model,
evaluating the prompt, and generating tokens) and evaluation counts. _print writes the stats
via the logger, and _reset clears them. (Used in the examples; for precise profiling you may
also measure directly.) 126 .
• llama_perf_sampler(chain) , llama_perf_sampler_print(chain) ,
llama_perf_sampler_reset(chain) – Similar performance stats for a sampler chain: timing
of sampling and number of samples processed. Works only for chains built via
llama_sampler_chain_init 127 .
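A small sketch exercising the split-path helpers, matching the example above:

    #include <stdio.h>
    #include "llama.h"

    int main(void) {
        char path[256];
        char prefix[256];

        // build the path of split 2 out of 4 for a given prefix
        llama_split_path(path, sizeof(path), "/models/ggml-model-q4_0", 2, 4);
        printf("split path: %s\n", path);

        // recover the prefix from a split path (returns -1 if the indices don't match)
        if (llama_split_prefix(prefix, sizeof(prefix), path, 2, 4) > 0) {
            printf("prefix: %s\n", prefix);
        }
        return 0;
    }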
Training and Optimization
• llama_opt_param_filter_all(tensor, userdata) – A default parameter filter that
returns true for every tensor. Useful to tell the optimizer that all model parameters are
trainable 128 .
• llama_opt_init(lctx, model, lopt_params) – Initialize optimization (training) for
context lctx and model with options lopt_params . This sets up an internal optimizer
(SGD/Adam, etc.) for fine-tuning. Must be called before training loops 129 .
• llama_opt_epoch(lctx, dataset, result_train, result_eval, idata_split,
callback_train, callback_eval) – Run one training epoch. dataset provides training
data, result_train / result_eval record metrics, idata_split can separate train/
validation, and callbacks are invoked per batch. This uses GGML’s optimizer under the hood.
Parameters follow llama_opt_params settings in lopt_params given to
llama_opt_init 130 .
Note: Functions marked DEPRECATED in the header (e.g. llama_free_model , the old context-
creation and token-helper names) are not listed above, as newer alternatives exist. Always prefer the
non-deprecated llama_model_free , llama_init_from_model , llama_vocab_* , etc.
Each function is documented based on its declaration in llama.h and behavior in [Link] 131 132 .
Please refer to those source lines for exact signatures and context. The API names and parameters are
case-sensitive and must be used as shown.
Citations 1, 2, 9–132: llama.h ([Link])
Citations 3–8: [Link]