Hermes Agent + llama.cpp + Qwen3.5 Integration Guide


Author: Antonio Martinez Date: March 18, 2026

End-to-end setup for running Hermes Agent + llama.cpp + Qwen3.5 locally with GPU acceleration.


System Requirements

| Component | Minimum | Recommended |
|-----------|---------|-------------|
| GPU | GTX 1060 (6GB) | RTX 30/40 series |
| VRAM | 6GB | 12GB+ |
| RAM | 16GB | 32GB |
| CPU | 4 cores | 8+ cores |
| OS | Linux / WSL2 | WSL2 / Ubuntu 22.04 |
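You can quickly check the machine you are on against the minimums above. A minimal sketch for Linux / WSL2 only (it reads /proc/meminfo, so it will not work elsewhere):

```shell
# Compare this machine against the table's minimums (4 cores, 16 GB RAM).
cores=$(nproc)
ram_kb=$(grep MemTotal /proc/meminfo | awk '{print $2}')
ram_gb=$(( ram_kb / 1024 / 1024 ))
echo "CPU cores: $cores (minimum 4)"
echo "RAM: ${ram_gb} GB (minimum 16)"
```

Note that WSL2 reports only the memory assigned to the VM, which may be less than the Windows machine's total.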

Architecture Overview

```mermaid
flowchart LR
    User --> Hermes
    Hermes -->|OpenAI API| LlamaServer
    LlamaServer --> Model[Qwen3.5 GGUF]
    Model --> GPU[GPU / CPU]
```

Automatic installation

If you are on WSL - Ubuntu, run the following script to install CUDA, Hermes, and llama.cpp (you will be asked for your sudo password):

curl -fsSL https://raw.githubusercontent.com/metantonio/hermes-wsl-ubuntu/master/setup-wsl.sh | bash

Prerequisites & Setup

If you are on Windows, read and install WSL2 - Ubuntu first.

In Ubuntu:

sudo apt update && sudo apt upgrade -y
sudo apt install -y build-essential cmake git curl wget htop

Verify GPU

nvidia-smi
nvcc --version

If both commands work, continue with the Hermes Agent installation; if not, you need the CUDA Toolkit:

CUDA Toolkits for WSL - Ubuntu

check: Link

You may also have to install nvcc with:

sudo apt install nvidia-cuda-toolkit
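If you are unsure which piece is missing, this non-destructive check reports the status of both tools (the hint messages are only suggestions):

```shell
# Report which GPU tools are on PATH; nothing is installed or modified.
if command -v nvidia-smi >/dev/null 2>&1; then
  echo "nvidia-smi: found"
else
  echo "nvidia-smi: missing (check the Windows NVIDIA driver / WSL2 GPU support)"
fi
if command -v nvcc >/dev/null 2>&1; then
  echo "nvcc: found"
else
  echo "nvcc: missing (install the CUDA toolkit as shown above)"
fi
```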

Hermes Agent Installation

curl -fsSL https://raw.githubusercontent.com/NousResearch/hermes-agent/main/scripts/install.sh | bash

Config

# Do the setup to create the necessary files
hermes setup

Then set these values:

hermes config set OPENAI_BASE_URL http://localhost:8080/v1
hermes config set OPENAI_API_KEY dummy
hermes config set LLM_MODEL Qwen3.5-9B-Q5_K_M
hermes config set TELEGRAM_BOT_TOKEN Your_Telegram_API_Token
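To confirm that the base URL Hermes will call is reachable, you can query llama-server's OpenAI-compatible model listing (this only succeeds once the server from the later sections is running; the fallback message is just illustrative):

```shell
# Lists the loaded model(s) as JSON once llama-server is up on port 8080.
curl -s http://localhost:8080/v1/models || echo "server not reachable yet"
```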

llama.cpp Installation

Recommended Path

sudo mkdir -p /opt/llama.cpp
sudo chown -R $USER:$USER /opt/llama.cpp

cd /opt/llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git .
# note the trailing "." so the repo lands in /opt/llama.cpp itself,
# matching the /opt/llama.cpp/build paths used below

Build with CUDA (you need an NVIDIA GPU)

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Build with CPU (your CPU needs AVX2 instructions, available on most x86 CPUs from roughly 2013 onward)

Inference will be slower, but still functional:

cmake -B build
cmake --build build --config Release

Note that for CPU-only builds you may want to enable a BLAS backend (e.g. -DGGML_BLAS=ON at the cmake step) to accelerate prompt processing.

Note: remove any previous failed build with:

rm -rf build

Verify

./build/bin/llama-server -h

Model Setup (Qwen3.5)

The models directory will be created in your home directory:

mkdir -p ~/models
cd ~/models

wget https://huggingface.co/unsloth/Qwen3.5-9B-GGUF/resolve/main/Qwen3.5-9B-Q5_K_M.gguf

Other model links you could consider: https://huggingface.co/Tesslate/OmniCoder-9B-GGUF/resolve/main/omnicoder-9b-q5_k_m.gguf

Note: You may want to check Hugging Face for Qwen3.5 models at 0.8B, 2B, and 4B if you want to run this on CPU at a reasonable speed.
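Before pointing the server at the download, it is worth a quick sanity check: GGUF files begin with the ASCII magic bytes GGUF, and the Q5_K_M 9B quant should be roughly 6.5 GB (see the size table below):

```shell
# Verify the downloaded model file: size and GGUF magic bytes.
f=~/models/Qwen3.5-9B-Q5_K_M.gguf
ls -lh "$f"
head -c 4 "$f"; echo   # should print: GGUF
```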

Notes

Model Sizes Comparison

| Model | Parameters | Q4_K_M | Q5_K_M | Q5_K_L | Q6_K_L |
|-------|------------|--------|--------|--------|--------|
| Qwen3.5 9B | 9B | 5.5GB | 6.5GB | 6.8GB | 7.7GB |
| Qwen3.5 14B | 14B | 9.0GB | 10.5GB | 11.0GB | 12.5GB |

Model Selection Matrix

| Use Case | Model | Quantization | VRAM | Speed | Precision Loss |
|----------|-------|--------------|------|-------|----------------|
| General Chat | Qwen3.5 9B | Q4_K_M | 5.5GB | 35-50 tok/s | ~7-8% |
| Development | Qwen3.5 9B | Q5_K_M | 6.5GB | 28-42 tok/s | ~4-5% |
| Research | Qwen3.5 9B | Q5_K_L | 6.8GB | 22-35 tok/s | ~3-4% |
| Complex Tasks | Qwen3.5 14B | Q4_K_M | 9.0GB | 25-40 tok/s | ~7-8% |
| Maximum Quality | Qwen3.5 14B | Q5_K_M | 10.5GB | 22-35 tok/s | ~4-5% |

Running the Server

/opt/llama.cpp/build/bin/llama-server \
  -m ~/models/Qwen3.5-9B-Q5_K_M.gguf \
  -c 8192 \
  -ngl 32 \
  --threads 8 \
  -fa on \
  --host 127.0.0.1 \
  --port 8080
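llama-server exposes a /health readiness endpoint; a small wait loop like this avoids pointing Hermes at a server that is still loading the model (the TRIES variable and the retry count are my own choices, not llama.cpp options):

```shell
# Poll the readiness endpoint, giving up after $tries attempts (2s apart).
tries=${TRIES:-30}
up=no
for i in $(seq 1 "$tries"); do
  if curl -sf http://127.0.0.1:8080/health >/dev/null; then
    up=yes
    break
  fi
  sleep 2
done
echo "server up: $up"
```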

Usage Examples

API (OpenAI Compatible)

curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Write a Python function",
    "max_tokens": 128
  }'
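For an instruct-tuned model, the chat endpoint is usually the better choice, since the server applies the model's chat template to the messages; same server, different route:

```shell
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Write a Python function that reverses a string"}],
    "max_tokens": 128
  }'
```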

CLI

/opt/llama.cpp/build/bin/llama-cli \
  -m ~/models/Qwen3.5-9B-Q5_K_M.gguf \
  -ngl 32 \
  -c 4096 \
  -i

Recommended Commands (for a 12 GB VRAM GPU)

Qwen3.5 9B - General Purpose (Recommended)

./llama.cpp/build/bin/llama-server \
    -m /home/antonio/models/Qwen3.5-9B-Instruct-Q5_K_M.gguf \
    -c 8192 -n 4096 -ngl 32 \
    --port 8080 \
    --host 127.0.0.1 \
    --threads 8 \
    --flash-attn 1 \
    --cache-type-k q4_0

Qwen3.5 14B - Complex Reasoning

./llama.cpp/build/bin/llama-server \
    -m /home/antonio/models/Qwen3.5-14B-Q4_K_M.gguf \
    -c 131072 -n 4096 -ngl 34 \
    -np 1 -fa on \
    --port 8080 \
    --host 127.0.0.1 \
    --threads 8 \
    --cache-type-k q4_0

Qwen3.5 9B - Maximum Speed

./llama.cpp/build/bin/llama-server \
    -m /home/antonio/models/Qwen3.5-9B-Instruct-Q4_K_M.gguf \
    -c 4096 -n 2048 -ngl 32 \
    --port 8080 \
    --host 127.0.0.1 \
    --threads 12 \
    --flash-attn 1 \
    --cache-type-k q4_0 

Integration with Hermes Agent

# Option A: Hermes using the CLI
hermes chat --model Qwen3.5-9B-Q5_K_M

# Option B: Using Telegram
hermes gateway

Performance & Memory

Memory Breakdown

```mermaid
pie
    title VRAM Usage (Qwen 9B Q5_K_M has 32 layers)
    "Model Weights" : 65
    "KV Cache" : 30
    "Overhead" : 5
```

KV Cache Rule

  • ~1.2 GB per 4096 tokens
  • Scales linearly with context

| Context | KV Cache |
|---------|----------|
| 4096 | ~1.2 GB |
| 8192 | ~2.4 GB |
| 10240 | ~3.0 GB |

Important

  • KV cache = input + output tokens
  • Total tokens = -c (llama.cpp flag)

Tips

RAM Calculation

Each transformer layer needs roughly 180–320 MB of VRAM (for GPU) or RAM (for CPU), depending on the model and its quantization (rough estimates):

  • Q4_K_M: around 200 MB per layer.
  • Q5_K_M: around 300 MB per layer.

Every 4096 tokens of context needs a KV cache of roughly 1.2 GB of VRAM / RAM.

System + overhead requires about 0.5 GB of VRAM / RAM.

Doing the math for Qwen3.5-9B-Q5_K_M.gguf:

| Component | VRAM Required (Q5_K_M) |
|-----------|------------------------|
| Model weights (32 layers) | ~9.4 GB |
| KV Cache (context 4096) | ~1.2 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.1 GB |

A little tight for a 12GB VRAM GPU, but totally functional.
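The arithmetic above can be scripted so you can experiment with layer counts and context sizes; the per-layer and KV-cache figures are this guide's rough estimates, not exact numbers:

```shell
# VRAM estimate in MB: weights + KV cache + fixed overhead.
layers=32          # layers offloaded to the GPU (-ngl)
mb_per_layer=300   # rough Q5_K_M estimate from above
ctx=4096           # context size (-c)
weights_mb=$(( layers * mb_per_layer ))
kv_mb=$(( ctx * 1200 / 4096 ))   # ~1.2 GB per 4096 tokens
total_mb=$(( weights_mb + kv_mb + 500 ))
echo "weights: ${weights_mb} MB, kv: ${kv_mb} MB, total: ${total_mb} MB"
# → total: 11300 MB, in line with the ~11.1 GB table above
```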

Now, depending on the task you may need more context. For, say, 8192 tokens of context, you can sacrifice some inference speed by loading fewer layers on the GPU:

| Component | VRAM Required (Q5_K_M) |
|-----------|------------------------|
| Model weights (28 layers) | ~8.2 GB |
| KV Cache (context 8192) | ~2.4 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.1 GB |

If you still have GPU VRAM available, you could try adding even more context (2048 tokens more):

| Component | VRAM Required (Q5_K_M) |
|-----------|------------------------|
| Model weights (28 layers) | ~8.2 GB |
| KV Cache (context 10240) | ~3.0 GB |
| System + overhead | ~0.5 GB |
| Total | ~11.7 GB |

But that leaves only about 2.5% of the GPU memory free; keeping at least 5% free is recommended.

KV cache memory depends on the total active context (-c), not separately on input or output tokens.

Note: Qwen3.5-14B has 40 layers.

Additional Parameters for llama.cpp

| Parameter | Value | Description |
|-----------|-------|-------------|
| -c | 8192 | Context size (total token buffer) |
| -n | 4096 | Max output tokens |
| -ngl | 32 | Layers offloaded to GPU (depends on the model used) |
| --port | 8080 | Server port |
| --host | 127.0.0.1 | Listen on localhost |
| --threads | 8 | CPU threads for parallelization |
| --flash-attn | 1 | Enable Flash Attention (speed boost) |

Troubleshooting

CUDA Out of Memory

-ngl 28
-c 4096

Slow Performance

--threads 12
-fa on

GPU Not Detected

nvidia-smi
wsl --update

Optimization Tips

Context vs Performance

```mermaid
graph LR
    A[Small Context] -->|Fast| B[High Speed]
    C[Large Context] -->|Slow| D[More Memory Usage]
```

Best Flags

-c 8192
-ngl 32
-fa on
--threads 8

Best Practices

Model Strategy

| Use Case | Model |
|----------|-------|
| Chat | Qwen3.5 9B Q4_K_M |
| Dev | Qwen3.5 9B Q5_K_M |
| Research | Qwen3.5 9B Q5_K_L |
| Complex | Qwen3.5 14B |

Security

chmod 700 ~/models
chmod 755 /opt/llama.cpp/build/bin/llama-server

  • Bind to 127.0.0.1 only
  • Do NOT expose the server publicly

Workflow

```mermaid
sequenceDiagram
    participant User
    participant Hermes
    participant LlamaServer
    participant Model

    User->>Hermes: Prompt
    Hermes->>LlamaServer: API Call
    LlamaServer->>Model: Inference
    Model-->>LlamaServer: Tokens
    LlamaServer-->>Hermes: Response
    Hermes-->>User: Output
```

Production Checklist

  • GPU working (nvidia-smi)
  • llama.cpp built with CUDA
  • Model downloaded
  • Server running
  • Hermes connected
  • Context optimized
  • Permissions secured
