guide : using the new WebUI of llama.cpp #16938
Replies: 48 comments 154 replies
-
Does anyone have a neat example to share for constrained output using the custom JSON option of the WebUI? Something that would be suitable for demonstration purposes.
-
I tried this one. Inside Developer / Custom JSON, with the prompt "you feel good ?", the model answered (on the SvelteUI):
-
Not sure if this is a neat example, but something easy you can do with vision LLMs is to extract data from images in a structured way. Add this to Developer / Custom JSON:
{
"json_schema": {
"$defs": {
"Address": {
"properties": {
"street": {
"title": "Street",
"type": "string"
},
"city": {
"title": "City",
"type": "string"
},
"state": {
"title": "State",
"type": "string"
},
"zip_code": {
"title": "Zip Code",
"type": "string"
}
},
"required": [
"street",
"city",
"state",
"zip_code"
],
"title": "Address",
"type": "object"
},
"BillTo": {
"properties": {
"company_name": {
"title": "Company Name",
"type": "string"
},
"address": {
"$ref": "#/$defs/Address"
},
"attention": {
"title": "Attention",
"type": "string"
}
},
"required": [
"company_name",
"address",
"attention"
],
"title": "BillTo",
"type": "object"
},
"Company": {
"properties": {
"name": {
"title": "Name",
"type": "string"
},
"address": {
"$ref": "#/$defs/Address"
},
"phone": {
"title": "Phone",
"type": "string"
},
"email": {
"title": "Email",
"type": "string"
}
},
"required": [
"name",
"address",
"phone",
"email"
],
"title": "Company",
"type": "object"
},
"InvoiceLine": {
"properties": {
"description": {
"title": "Description",
"type": "string"
},
"quantity": {
"title": "Quantity",
"type": "integer"
},
"rate": {
"anyOf": [
{
"type": "number"
},
{
"type": "string"
}
],
"title": "Rate"
},
"amount": {
"anyOf": [
{
"type": "number"
},
{
"type": "string"
}
],
"title": "Amount"
}
},
"required": [
"description",
"quantity",
"rate",
"amount"
],
"title": "InvoiceLine",
"type": "object"
},
"PaymentMethods": {
"properties": {
"bank_account": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Bank Account"
},
"routing_number": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Routing Number"
},
"check_payable_to": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Check Payable To"
}
},
"title": "PaymentMethods",
"type": "object"
}
},
"properties": {
"invoice_number": {
"title": "Invoice Number",
"type": "string"
},
"invoice_date": {
"format": "date",
"title": "Invoice Date",
"type": "string"
},
"due_date": {
"format": "date",
"title": "Due Date",
"type": "string"
},
"company": {
"$ref": "#/$defs/Company"
},
"bill_to": {
"$ref": "#/$defs/BillTo"
},
"lines": {
"items": {
"$ref": "#/$defs/InvoiceLine"
},
"title": "Lines",
"type": "array"
},
"subtotal": {
"anyOf": [
{
"type": "number"
},
{
"type": "string"
}
],
"title": "Subtotal"
},
"tax_rate": {
"anyOf": [
{
"type": "number"
},
{
"type": "string"
}
],
"title": "Tax Rate"
},
"tax_amount": {
"anyOf": [
{
"type": "number"
},
{
"type": "string"
}
],
"title": "Tax Amount"
},
"total": {
"anyOf": [
{
"type": "number"
},
{
"type": "string"
}
],
"title": "Total"
},
"payment_terms": {
"title": "Payment Terms",
"type": "string"
},
"payment_methods": {
"$ref": "#/$defs/PaymentMethods"
},
"notes": {
"anyOf": [
{
"type": "string"
},
{
"type": "null"
}
],
"default": null,
"title": "Notes"
}
},
"required": [
"invoice_number",
"invoice_date",
"due_date",
"company",
"bill_to",
"lines",
"subtotal",
"tax_rate",
"tax_amount",
"total",
"payment_terms",
"payment_methods"
],
"title": "Invoice",
"type": "object"
}
}
Then, with a model that supports vision (Qwen3-VL-8B should work), paste this image, and it will output the invoice data without requiring any instructions:
{
"invoice_number": "INV-2024-0847",
"invoice_date": "2025-07-29",
"due_date": "2025-08-28",
"company": {
"name": "Acme Corporation",
"address": {
"street": "123 Business Street",
"city": "New York",
"state": "NY",
"zip_code": "10001"
},
"phone": "(555) 123-4567",
"email": "[email protected]"
},
"bill_to": {
"company_name": "Tech Solutions Inc.",
"address": {
"street": "456 Innovation Drive",
"city": "San Francisco",
"state": "CA",
"zip_code": "94105"
},
"attention": "John Smith"
},
"lines": [
{
"description": "Web Development Services",
"quantity": 40,
"rate": 150.00,
"amount": 6000.00
},
{
"description": "UI/UX Design",
"quantity": 20,
"rate": 125.00,
"amount": 2500.00
},
{
"description": "Database Setup",
"quantity": 8,
"rate": 100.00,
"amount": 800.00
},
{
"description": "Monthly Hosting",
"quantity": 1,
"rate": 250.00,
"amount": 250.00
}
],
"subtotal": 9550.00,
"tax_rate": 8.5,
"tax_amount": 811.75,
"total": 10361.75,
"payment_terms": "Net 30 days. 1.5% late fee per month on overdue balances.",
"payment_methods": {
"bank_account": "Account #123456789, Routing #987654321",
"check_payable_to": "Acme Corporation"
},
"notes": "Thank you for your business!"
}
One problem with this is that the output is not wrapped in json-fenced markdown blocks, so you get no syntax highlighting. This could be improved if the WebUI had native support for passing a JSON schema and, when enabled, displayed the output in a specialized JSON viewer, such as this one.
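As an aside, the same schema can also be exercised outside the WebUI through the server's OpenAI-compatible API. The sketch below builds such a request in Python; the `/v1/chat/completions` path and the `response_format`/`json_schema` field layout are assumptions to verify against your build of llama-server, and the cut-down schema is hypothetical:

```python
import json
import urllib.request

# Hypothetical minimal schema (a cut-down version of the invoice example above).
SCHEMA = {
    "title": "Invoice",
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "total": {"type": "number"},
    },
    "required": ["invoice_number", "total"],
}


def build_payload(prompt: str, schema: dict) -> dict:
    """Build an OpenAI-style chat request that constrains output to `schema`."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "response_format": {
            "type": "json_schema",
            "json_schema": {"name": schema.get("title", "output"), "schema": schema},
        },
    }


def post(payload: dict, url: str = "http://127.0.0.1:8033/v1/chat/completions") -> dict:
    """POST the payload to a running llama-server (requires a live server)."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


if __name__ == "__main__":
    # Print the request body; call post() only against a running server.
    print(json.dumps(build_payload("Extract the invoice data.", SCHEMA), indent=2))
```

The constrained decoding itself happens server-side, so any OpenAI-compatible client should work the same way.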
-
I love the look of this. Could you add a "Continue Assistant Response" kind of button? It would help to steer the AI toward a specific formatting at the beginning of a conversation if you could edit its response and then have it continue the output.
-
How do I enable parallel conversations? Do I need to use a specific parameter when launching the server?
-
Congratulations guys this looks absolutely amazing!! :D Can't wait to use it
-
🚀🚀🚀
-
Excellent work! It strikes the right balance between functionality, a simple user experience, and performance. Admittedly, this is outside the scope of the project, but I would appreciate the option of deploying this interface in standalone mode, separate from llama.cpp, with third-party OpenAI API support.
-
Implement more agents for the GUI, like mini-swe-agent, and/or make a GUI for trae: https://github.com/bytedance/trae-agent
-
Is there an option to add a search URL or something to search the web?
-
Kudos guys this rocks!
-
I created a step-by-step installation and testing video for this llama.cpp WebUI: https://youtu.be/1H1gx2A9cww?si=bJwf8-QcVSCutelf Thanks.
-
Error: "the request exceeds the available context size, try increasing it". So I can only use a chat as long as it fits in the context size? Context window shifting would be really nice, so that e.g. with a 16k context one can write on and on and the AI always knows the newest context (the earliest messages beyond context_size - max_output (e.g. 2048) are dropped from the KV cache). I tried using
Btw, I like that llama.cpp now has its own UI for chats (switching from koboldcpp). I really like llama.cpp for its VRAM efficiency (using CUDA on a consumer NVIDIA card).
-
Where can I get the whole list of commands?
-
Off the scale - thank you for all you do!
-
|
Still think we need more 'defaults' automagically determined from the
model itself, eg .gguf model type.
Of course, llama-server shares cli options with other tools, and it's
already getting too large of an ever changing CLI options..
Rather than thinking of this as an .ini, rather we should consider this
a .loc override, for what the community agrees the recommended values
are, or the model maintainers, via the information in the .gguf headers.
Of course, certain systems may be 'constrained', so need lower values,
or for experimental purposes, or because of LoRa layers at startup etc..
Suggest that we consider a more 'standard' location for config's that
override .gguf header information.
…On 2025-12-10 08:35, Pascal wrote:

./llama-server --port 8082 -ngl 999 -ctk q8_0 -ctv q8_0 -fa on --mlock -np 4 -kvu --jinja --models-max 1 --models-preset config.ini

Easy .ini format:

[MoE-Qwen3-Coder-30B-A3B-Instruct]
m = /path/to/unsloth/Qwen3-Coder-30B-A3B-Instruct-GGUF/Qwen3-Coder-30B-A3B-Instruct-Q6_K.gguf
temp = 0.7
top-p = 0.8
top-k = 20
min-p = 0
ctx-size = 131072

[MyModel]
m = /my/other.gguf
...

The command-line arguments are inherited, and you're overriding them with custom configuration!
-
Could there be an option to adjust the width of the main chat DIV on desktop? On a large monitor there's a lot of wasted space to the left and right of the chat DIV. Thanks!
-
I just packaged llama.cpp with this repo: https://github.com/oliverbob/ginto.ai for agentic UI workflows. Built everything ready for
It is not perfect yet, but the community might find it useful. I wish I had more time to create more features.
Tool-calling works best, especially for GPT-OSS / Qwen / Deepseek variants. It works very well with the Groq and Cerebras APIs.
I'll be focusing on refining the MCP client/server, todo listing, and model-planning enhancements very soon.
-
Any i18n support for the WebUI in the future?
-
How can we use a TTS GGUF model with the WebUI?
-
Is it possible to allow the model to see the real filename of an uploaded file? For example, in image identification, we could refer to a specific image by its filename.
-
Wondering if there are any near-term plans to add some native tools in the WebUI, e.g. web_search? It would be nice to have my LLM access URLs in the local UI interface :)
-
Hi. Is there any basic authentication support in the WebUI, e.g. user name and password?
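Not full username/password authentication, but llama-server can require a bearer token via its --api-key flag, which API clients (and the WebUI, in its settings) must then supply. A sketch, assuming the flag behaves as in recent builds; the key shown is a placeholder:

```shell
# Require a bearer token for all API requests (placeholder secret shown).
# Clients that omit the key receive an authorization error.
llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -c 0 \
  --host 0.0.0.0 --port 8033 \
  --api-key "change-me-long-random-string"
```

For real multi-user login you would still want a reverse proxy (e.g. basic auth at the proxy layer) in front of the server.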
-
What are the best tools for integrating conversational memory into the WebUI? Basically, I want something that will be able to remember past conversations, user preferences, etc. Are there any memory systems that have good integrations with the WebUI?
-
Hi, I added a prefix on my proxy server, but the WebUI still accesses the path without the prefix. I have tried
-
Hello, I am trying to use the , though I do not see my system prompt in the general settings of the llama.cpp WebUI. Am I configuring this wrongly?
-
Hello, will you support multiple system-prompt presets and folder-based session management?
-
Is it possible to pass the current date via the system prompt in the web UI? I have seen something like "{CURRENT_DATE}", but it does not work. Using an MCP tool just to get the date seems a bit excessive...
-
What specific version(s) is this tested on? I assume you are setting this in the llama-server web GUI (Settings -> System Message)? Are there any other caveats, e.g. specific LLMs or startup options? Trying in 'router' mode, the LLMs that respect the prompt explicitly report: "Today's date is {{current_date}}." ;)
…On 2026-04-12 02:23, Oliver Bob Lagumen wrote:

The llama.cpp server web UI does support date/time template variables, but the syntax is different from what you tried. The correct variables are:

* current_date: replaced with YYYY-MM-DD (UTC)
* current_time: replaced with HH:MM:SS (UTC)
* current_timestamp: replaced with a full UTC timestamp like 2025-05-08 22:19:33

So in your system prompt, use something like:

Today's date is {{current_date}}. You are a helpful assistant...

or just the bare variable without braces; the exact delimiter syntax can vary by build, so try both current_date and {{current_date}} if one doesn't work. The {CURRENT_DATE} you saw is likely the Open WebUI convention (which uses {{CURRENT_DATE}}), not llama.cpp's native web UI.

One caveat: these use UTC, so if your local timezone matters, you may want to hardcode the offset or just manually note your timezone in the system prompt. (Source: https://simonw.substack.com/p/trying-out-llamacpps-new-vision-support)










Uh oh!
There was an error while loading. Please reload this page.
Uh oh!
There was an error while loading. Please reload this page.
-
Overview
This guide highlights the key features of the new SvelteKit-based WebUI of llama.cpp. The new WebUI, in combination with the advanced backend capabilities of llama-server, delivers the ultimate local AI chat experience. A few characteristics that set this project ahead of the alternatives:
Getting started
Get llama.cpp: Install | Download | Build
Start the llama-server tool:

# sample server running gpt-oss-20b at http://127.0.0.1:8033
llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -c 0 --host 127.0.0.1 --port 8033

Open http://127.0.0.1:8033 and start using the WebUI in your browser:
Tip
For a simple, GUI-based setup of llama.cpp on Mac, try the new LlamaBarn application.
Features
The new WebUI is packed with many useful features to enhance your local AI experience. Following are a few examples.
Text document processing
Add multiple text files from disk or from the clipboard to the context of your conversation:
PDF document processing
Attach one or multiple PDFs to your conversation. By default, the contents of the PDFs will be converted to RAW text, excluding any visuals.
Optionally, the WebUI can process the PDFs as images when the AI model supports it.
Image inputs
When the selected AI model has vision input capabilities, the WebUI allows you to insert images into your conversation:
Images can be inserted in addition to a textual context:
Conversation branching
Branch from previous points of the conversation by editing or regenerating a message:
webui-edits-0-thumb-small.mp4
Parallel conversations
Run multiple chat conversations at the same time:
webui-parallel-0-thumb-small.mp4
Parallel image processing is also supported:
webui-parallel-1-thumb-small.mp4
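The parallel conversations shown above are handled server-side. A sketch of a launch command, assuming the -np (--parallel) flag of llama-server sets the number of simultaneous sequences, with each slot receiving a share of the total context:

```shell
# serve up to 4 simultaneous conversations from one model instance
llama-server -hf ggml-org/gpt-oss-20b-GGUF --jinja -c 0 --port 8033 -np 4
```

With this, several browser tabs (or users) can chat at the same time without queuing behind each other.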
Override default sampling parameters
Start the
llama-serverusing a set of default sampling parameters:# set the default Top-K to be 5 and the default Temperature to be 0.80 llama-server -hf ggml-org/gpt-oss-120b-GGUF --jinja -c 0 --port 8033 --alias gpt-oss-120b --top-k 5 --temp 0.80These parameters will now become the default values in the WebUI settings:
webui-parameters-0-thumb-small.mp4
More info: #16515
Render math expressions
The WebUI can render mathematical expressions:
Input via URL parameters
The WebUI supports passing input through the URL parameters:
webui-url-input-0-thumb-small.mp4
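The exact query-parameter name is not spelled out above, so treat the following as a sketch: assuming the WebUI reads an initial prompt from a parameter such as q (an assumption to verify against your build), a prefilled link can be built like this:

```python
from urllib.parse import urlencode

BASE = "http://127.0.0.1:8033/"


def prefill_url(prompt: str, base: str = BASE) -> str:
    """Build a WebUI link passing `prompt` as a URL parameter (param name assumed)."""
    return base + "?" + urlencode({"q": prompt})


print(prefill_url("Explain KV cache quantization"))
```

Such links are handy for bookmarklets or for launching the WebUI from other tools with a prompt already filled in.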
HTML/JS preview
The WebUI supports inline rendering of generated HTML/JS code:
webui-js-0-thumb-small.mp4
More info: #16757
Constrained generation
Specify a custom JSON schema to constrain the generated output to a specific format. As an example, here is generic invoice data extraction from multiple documents:
webui-constrained-0-thumb-small.mp4
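For a smaller demonstration than the invoice extraction shown above, a minimal Custom JSON entry might look like the sketch below. The json_schema top-level key follows the invoice example in the comments of this thread; the schema itself is a made-up sentiment-classification example:

```json
{
  "json_schema": {
    "title": "Sentiment",
    "type": "object",
    "properties": {
      "sentiment": { "enum": ["positive", "neutral", "negative"] },
      "confidence": { "type": "number" }
    },
    "required": ["sentiment", "confidence"]
  }
}
```

With this in place, any reply the model produces is forced into that object shape, regardless of the prompt.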
Import/Export
Use the Import/Export options to manage your private conversations directly through the WebUI:
Efficient SSM context management
The context management and prefix caching of State Space Models (SSMs, e.g. Mamba) can be tricky.
llama-server solves this problem efficiently for one or multiple users with minimal reprocessing. Here is an example of context branching using a hybrid LLM:
webui-ssm-0-thumb-small.mp4
Mobile compatibility
The new WebUI is mobile friendly:
Sample commands
A few llama-server commands used for the examples above:
Acknowledgements