
Mesh LLM

Mesh LLM lets you pool spare GPU capacity across machines and expose the result as one OpenAI-compatible API.

If a model fits on one machine, it runs there. If it does not, Mesh LLM automatically spreads the work across the mesh:

  • Dense models use pipeline parallelism.
  • MoE models use expert sharding with zero cross-node inference traffic.
  • Every node gets the same local API at http://localhost:9337/v1.

Why people use it

  • Run models larger than a single machine can hold.
  • Turn a few uneven boxes into one shared inference pool.
  • Give agents a local OpenAI-compatible endpoint instead of wiring each tool by hand.
  • Keep the setup simple: start one node, add more later.

Quick start

Install the latest release:

curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | bash

Then start a node:

mesh-llm --auto

That command:

  • picks a suitable bundled backend for your machine
  • downloads a model if needed
  • joins the best public mesh
  • exposes an OpenAI-compatible API at http://localhost:9337/v1
  • starts the web console at http://localhost:3131
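Startup is asynchronous (the node may still be downloading a model), so scripts should wait until the API answers before sending requests. A minimal readiness-wait sketch, assuming only that http://localhost:9337/v1/models returns 200 once the node is serving:

```python
import time
import urllib.request
import urllib.error

def wait_ready(probe, timeout=120, interval=2):
    """Poll `probe` (a zero-arg callable returning True once the API
    answers) until it succeeds or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval)
    return False

def api_up(url="http://localhost:9337/v1/models"):
    """True if the local node's OpenAI-compatible API responds."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

# wait_ready(api_up)  # blocks until the node is serving, or times out
```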

Check what is available:

curl -s http://localhost:9337/v1/models | jq '.data[].id'

Send a request:

curl http://localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
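The same request from Python, using only the standard library. This is a sketch that assumes a node is already serving the model locally:

```python
import json
import urllib.request

def build_payload(prompt, model="GLM-4.7-Flash-Q4_K_M"):
    """Assemble an OpenAI-style chat completion request body."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

def chat(prompt, model="GLM-4.7-Flash-Q4_K_M",
         base="http://localhost:9337/v1"):
    """POST to the local mesh node and return the assistant's reply text."""
    req = urllib.request.Request(
        f"{base}/chat/completions",
        data=json.dumps(build_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Because the endpoint is OpenAI-compatible, existing OpenAI client libraries should also work when pointed at http://localhost:9337/v1.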

Common workflows

1. Try the public mesh

mesh-llm --auto

This is the easiest way to see the system working end to end.

2. Start a private mesh

mesh-llm --model Qwen2.5-32B

This starts serving a model, opens the local API and console, and prints an invite token for other machines.

If you want the mesh to be discoverable via --auto, publish it:

mesh-llm --model Qwen2.5-32B --publish

3. Add another machine

mesh-llm --join <token>

Use --client if the machine should join without serving a model:

mesh-llm --client --join <token>

4. Create a named mesh for a group

mesh-llm --auto --model GLM-4.7-Flash-Q4_K_M --mesh-name "poker-night"

Everyone runs the same command. The first node creates the mesh, the rest discover and join it automatically.

5. Serve more than one model

mesh-llm --model Qwen2.5-32B --model GLM-4.7-Flash

Requests are routed by the model field:

curl localhost:9337/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"GLM-4.7-Flash-Q4_K_M","messages":[{"role":"user","content":"hello"}]}'
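When several models are served, a client can discover what is available from /v1/models and choose among them. A small illustrative helper for picking a preferred model with a fallback (the model ids here are examples; use whatever your mesh actually serves):

```python
def pick_model(available, preferred):
    """Return the first id in `preferred` that the mesh serves,
    falling back to the first available model."""
    ids = set(available)
    for model in preferred:
        if model in ids:
            return model
    if not available:
        raise RuntimeError("mesh is serving no models")
    return available[0]

# `available` would come from GET /v1/models, e.g.
# [m["id"] for m in json.load(resp)["data"]]
```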

How it works

Mesh LLM keeps the user-facing surface simple: talk to localhost:9337, pick a model, and let the mesh decide how to serve it.

  • If a model fits on one machine, it runs there with no network overhead.
  • If a dense model does not fit, layers are split across low-latency peers.
  • If an MoE model does not fit, experts are split across nodes and requests are hash-routed for cache locality.
  • Different nodes can serve different models at the same time.
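The hash-routing idea behind the MoE case can be sketched in a few lines. This is not mesh-llm's internal implementation, just an illustration of why hashing a request key gives cache locality: the same key always lands on the same node, so repeated requests reuse that node's warm caches.

```python
import hashlib

def route(key, nodes):
    """Deterministically map a request key to one node, so repeated
    requests for the same key always hit the same node."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]
```

Note that plain modulo reshuffles most keys when the node list changes; schemes like consistent hashing exist to limit that churn.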

Each node also exposes a management API and web console on port 3131.

Install notes

The installer currently targets macOS and Linux release bundles. Windows support is coming soon.

To force a specific bundled flavor during install:

curl -fsSL https://raw.githubusercontent.com/michaelneale/mesh-llm/main/install.sh | MESH_LLM_INSTALL_FLAVOR=vulkan bash

Installed release bundles use flavor-specific llama.cpp binaries:

  • macOS: metal
  • Linux: cpu, cuda, rocm, vulkan

To update a bundle install to the latest release:

mesh-llm update

If you build from source, always use the just command runner:

git clone https://github.com/michaelneale/mesh-llm
cd mesh-llm
just build

Requirements and backend-specific build notes are in CONTRIBUTING.md.

Web console

When a node is running, open:

http://localhost:3131

The console shows live topology, VRAM usage, loaded models, and built-in chat. It is backed by /api/status and /api/events.

You can also try the hosted demo:

mesh-llm-console.fly.dev

Community

Join the #mesh-llm channel on the Goose Discord for discussion and support.

About

A reference implementation of distributed inference across machines using llama.cpp, with a real end-to-end demo.
