# 🚀 FastFlowLM v0.9.34
Another small step for versioning, one giant leap for smooth inference 😎
🌍 1. New Model: Translategemma:4b
Tag: translategemma:4b
Break down language barriers effortlessly.
A powerful translation model designed to handle multilingual tasks with precision and speed — so you can translate text seamlessly and keep your workflows moving 🗣️
📖 Check out the model card for best practices and recommended use cases.
🩺 2. New Model: medgemma1.5:4b
Tag: medgemma1.5:4b
Your lightweight medical AI assistant.
A specialized model fine-tuned for the healthcare and medical domain. Perfect for quickly processing medical text, summarizing clinical notes, and extracting insights.
⚠️ Disclaimer
This tool (MedGemma + FastFlowLM) is not a diagnostic or clinical tool.
Always consult a licensed medical professional for healthcare decisions.
🖼️ 3. Auto Image Resize (VL Models)
Qwen family Vision models just got smarter about image size.
- Applies to: `qwen2vl-it`, `qwen3vl-it`, `qwen2.5vl-it`
- 📐 All input images are automatically resized to 1080p by default
- 🔧 Disable it with the `-r 0` argument:

```
flm run qwen2.5vl-it -r 0
```
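The same resize path applies in Server Mode. Below is a minimal, unofficial sketch using the OpenAI Python SDK; the base URL, placeholder API key, and image path are assumptions (check the address printed by `flm serve qwen2.5vl-it`), not an official snippet.

```python
# Minimal sketch: send an image to a VL model through the (assumed)
# OpenAI-compatible endpoint. The server resizes it to 1080p by default.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed local address; check `flm serve` output
    api_key="flm",  # local servers typically accept any placeholder key
)

# Base64-encode a local image and pass it as a data URL.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2.5vl-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```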
🔒 Stability Improvements
- Increased stability for: `qwen3vl-it`, `qwen2.5vl-it`
Fewer surprises. Smoother runs. More reliable vision inference 💪
As always, thank you for using FastFlowLM!
More speed. More models. More flow. 🚀
# 🚀 FastFlowLM v0.9.33 — New VLM Model
🖼️ New VLM Model Support
- Added: `qwen2.5vl-it:3b` – instruction-tuned VLM and ready to go!
- FYI: This model is based on Qwen2.5-VL-3B-Instruct, currently the most popular vision-text-to-text model measured by monthly downloads on Hugging Face! 🌍📥 (HF model leaderboard)
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.32
🧠 New Model Support
- Added: `qwen2.5-it:3b` – instruction-tuned and ready to go!
- FYI: This model is based on Qwen2.5-3B-Instruct, currently the most popular text-generation model measured by monthly downloads on Hugging Face! 🌍📥 (Proof: HF model leaderboard)
⚡ Performance Improvements
- `gpt-oss` and `gpt-oss-sg` got a turbo boost! 🏎️
- Up to 2.3× faster prefill speed for both short and long contexts.
- Real numbers, real fast: Benchmark results here 📊
🐛 Bug Fixes
- Fixed an issue in `medgemma` – it's now fully squashed. You're welcome 🪲🔨
That’s all for v0.9.32!
Stay fast, flow strong 💨
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.31: Long-Context Acceleration
📦 What’s New
⚡ Prefill Speed Upgrade (Qwen & Gemma Families)
A new attention engine dramatically accelerates prefill, with larger gains at longer context lengths (especially 16K+).
- 🚀 Up to 3.8× faster prefill
(qwen3:0.6b with a 32K-token prompt)
📈 Prefill Speed @ 32K Prompt (tokens/sec)
| Model | Before → After | Speedup |
|---|---|---|
| gemma3:1b | 1596 → 1755 | 1.1× |
| gemma3:4b | 673 → 926 | 1.4× |
| medgemma:4b | 673 → 926 | 1.4× |
| qwen3:0.6b | 236 → 1496 | 3.8× |
| qwen3:1.7b | 225 → 768 | 3.4× |
| qwen3:4b | 164 → 303 | 1.9× |
| qwen3-it:4b | 164 → 303 | 1.9× |
| qwen3-tk:4b | 164 → 303 | 1.9× |
| qwen3vl-it:4b | 164 → 303 | 1.9× |
| qwen3:8b | 150 → 260 | 1.7× |
| deepseek-r1-0528:8b | 150 → 260 | 1.7× |
🖼️ Note: For qwen3vl-it:4b, image understanding is also faster in this release.
🔗 Benchmark Results
- 📊 Gemma3 performance: https://fastflowlm.com/docs/benchmarks/gemma3_results/
- 📈 Qwen3 performance: https://fastflowlm.com/docs/benchmarks/qwen3_results/
🛠️ Vision Tool Calling (New)
- ✅ Tool calling is now supported on qwen3vl-it
- 🔍 Enables vision tool calling workflows
- 🎥 Demo: https://youtu.be/Rf6r0Fm1UVs?si=u45hBgFXyDeEKXxh
🔧 Client Compatibility Improvements
- Non-stream mode logic adjusted
- Improves compatibility with client applications
# 🛠️ FastFlowLM v0.9.30 — GPT-OSS Bug Squash Edition 🐛🔨
🔧 What’s Fixed
🧠 GPT‑OSS Compatibility Bug
Patched a glitch that was causing hiccups with GPT‑OSS models.
Stay bug-free and model-happy 🐞🤖
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.29: Major Prefill Speedup
Note: the `gpt-oss:20b` model was fixed in v0.9.30; please use v0.9.30 or later.
📦 What’s New
⚡ Massive Prefill Speed Upgrade
We introduced a new attention engine that dramatically accelerates prefill, with larger gains at longer context lengths (especially 16K+).
- Up to 2.6× faster prefill
- More speedup at longer prompts
- No model re-download required (seamless upgrade)
📈 Prefill Speed with 32K prompt (tok/s)
| Model | Before → After | Speedup |
|---|---|---|
| lfm2:1.2b | 1059 → 1916 | 1.8× |
| lfm2:2.6b | 654 → 1053 | 1.6× |
| lfm2-transcript:2.6b | 654 → 1053 | 1.6× |
| lfm2.5-it:1.2b | 1059 → 1916 | 1.8× |
| lfm2.5-tk:1.2b | 1059 → 1916 | 1.8× |
| llama3.2:1b | 577 → 1157 | 2.0× |
| llama3.2:3b | 214 → 500 | 2.3× |
| llama3.1:8b | 167 → 281 | 1.7× |
| deepseek-r1:8b | 167 → 281 | 1.7× |
| phi4-mini-it:4b | 173 → 447 | 2.6× |
🔜 Prefill speed upgrades for other models are on the way too — stay tuned!
📊 Detailed Benchmarks
🩹 “No Pain” Update 😎
Just update and run — long-context prefill is now much faster. 🚀
🛠️ Tool Call Bug Fix (Non‑Stream Case)
We’ve squashed a pesky bug affecting tool calls when not using streaming.
Now it behaves exactly like you thought it should. 😌
Thanks for being awesome!
Happy modeling 🤖💙
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.28: Pain-Free Performance Boost
Pain-free = no model redownload required to enjoy the speedup. Just update and go.
⚡️ 1. New Attention Engine — Gemma3 Vision Turbo
We’ve upgraded the Attention Engine to supercharge vision understanding for Gemma3-based models:
- Gemma3-4B
- MedGemma3-4B

📉 Latency Drop:
From 3.4s → 2.6s, a ~25% latency reduction.
Your VLMs just got a whole lot snappier.
🖼️ 2. Qwen3-VL-Instruct-4B Gets a Vision Head Boost
Our shiny new Attention Engine also powers up the vision head for Qwen3-VL-Instruct-4B.
It really shines at high image resolutions.
🏎️ Example:
On 4K images, vision understanding is now ~45% faster compared to previous releases.
Less waiting, more interpreting.
🧘 More pain-free speedups coming soon — stay tuned!
✅ Key Benefits
- ✅ No redownloads — instant upgrade ⚡
- ✅ Lower vision latency 🕶️
- ✅ Better efficiency for VLM workloads 💡
# 💧 FastFlowLM v0.9.27 — Special co-release with LiquidAI
🔗 Day-0 Support for LFM2.5-1.2B‑Thinking
First reasoning model from LiquidAI! More details: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
LFM2.5-1.2B-Thinking delivers strong improvements in math reasoning, instruction following, and tool use, matching or exceeding Qwen3-1.7B on most reasoning benchmarks despite using ~40% fewer parameters.
Model tag in FLM: lfm2.5-tk:1.2b
Run it in CLI Mode with:
```
flm run lfm2.5-tk:1.2b
```

Run it in Server Mode with:

```
flm serve lfm2.5-tk:1.2b
```

📊 Performance at a Glance
Kraken (Ryzen AI 340 / 350)
| Device | Inference | Framework | Model | 4K-Token Prefill Speed (tok/s) | Peak Decoding Speed (tok/s) | Memory (Full Context) |
|---|---|---|---|---|---|---|
| AMD Ryzen AI 7 350 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2032 | 63+ | 1.6 GB |
| AMD Ryzen AI 5 340 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2032 | 63+ | 1.6 GB |
Decoding Speed vs. context length:
- 59 tok/s @ 4K context
- 52 tok/s @ 16K context
Strix/Strix Halo (Ryzen AI 360 and above)
| Device | Inference | Framework | Model | 4K-Token Prefill Speed (tok/s) | Peak Decoding Speed (tok/s) | Memory (Full Context) |
|---|---|---|---|---|---|---|
| AMD Ryzen AI Max+ 395 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2226 | 60+ | 1.6 GB |
| AMD Ryzen AI 9 HX 370 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2226 | 60+ | 1.6 GB |
Decoding Speed vs. context length:
- 54 tok/s @ 4K context
- 49 tok/s @ 16K context
Detailed benchmarks: https://fastflowlm.com/docs/benchmarks/lfm2_results/
Demo: 💧LFM2.5-1.2B-Thinking (LiquidAI) — 100% Powered by AMD Ryzen™ AI NPU
🆕 Model Tag Change
To avoid confusion between variants:
| Old Tag | New Tag |
|---|---|
| `lfm2.5:1.2b` | `lfm2.5-it:1.2b` (original, instruct) |
# 🚀 FastFlowLM v0.9.26 — New Model, Tool Calling, and Infra Change
🧠 1. New Model: LiquidAI/LFM2-2.6B-Transcript
Tag: lfm2-trans:2.6b
Summarize conference notes like a pro.
A single‑turn model designed to cleanly condense long transcripts into insights — so you can spend more time sipping ☕ and less time scrolling 📜.
🎬 See it in action: https://youtu.be/hpt0EhR1_vE?si=v9OCKa7VKAzuZ-02
🛠️ 2. Tool Calling — Out of Preview
Tool calling is now officially out of preview!
Verified to work with: `qwen3:4b`, `qwen3:8b`, `qwen3-it:4b`, `qwen3-tk:4b`
📹 Watch the demo: https://youtu.be/H-i4dztSdVk?si=5keyfkHt3ii8Wlu0
📘 Setup instructions (local):
👉 https://fastflowlm.com/docs/instructions/server/tool_calling/
💽 3. Installer Upgrade — xclbins Inside!
All xclbins are now bundled in the installer, which means:
- 🆙 Faster updates: no re-downloading models unless weights change
- 🤯 Fewer user headaches
- 🚀 Lets us keep pushing performance and efficiency
More performance tuners coming soon… 🔧⚡
🔁 4. Runtime Restructure for Fine‑Tuned Models
We’ve overhauled the FastFlowLM runtime to let YOU plug in fine‑tuned models from supported families.
This is made possible by the upcoming gguf → q4nx conversion tool —
it’s almost ready and the docs are currently baking 🍳.
Stay tuned — this one will unlock a lot of flexibility.
🙌 Acknowledgements
- Huge thanks to @ItzCrazyKns from Perplexica for schooling us in the basics of tool calling and for all the help along the way!
- Huge thanks to @jeremyfowers for highlighting and helping us resolve the ambiguity in the JSON-formatted reasoning output!
# 🧠 FastFlowLM v0.9.25 — New Models + OpenAI API Fixes
We're excited to introduce FastFlowLM v0.9.25, marking a key milestone with the integration of the new LFM2.5 model, freshly unveiled at CES 2026 (Jan 5th). This release also includes improvements to API compatibility and instruction-style models.
🚀 New Model Support
- LFM2.5-1.2B-Instruct
  🔸 Debuted at CES 2026
  The newest addition to the LFM family, tuned for instruction-following. It features improved responsiveness and latency, ideal for interactive applications on the AMD NPU.
- Phi4-mini-instruct
  A compact instruction model tailored for devices with limited memory — great for summarization and low-resource tasks.
🛠️ Fixes & Improvements
- ✅ Fixed bugs where generation parameters (`top_k`, `top_p`, etc.) were not respected in OpenAI-compatible REST APIs.
- Ensures correct behavior when adjusting generation strategy through API calls.
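A minimal sketch of what the fix enables, assuming the OpenAI-compatible server at `http://localhost:11434/v1` (check your `flm serve` output for the actual address). `temperature` and `top_p` are standard OpenAI fields; `top_k` is not part of the official OpenAI schema, so it is passed through the client's `extra_body` escape hatch; whether the server reads it from the request body that way is an assumption to verify against the docs. The model tag is likewise just an example.

```python
# Minimal sketch: adjusting sampling parameters through the
# OpenAI-compatible REST API (address and model tag are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

response = client.chat.completions.create(
    model="phi4-mini-it:4b",   # example tag; use any installed model
    messages=[{"role": "user", "content": "Summarize FastFlowLM in one line."}],
    temperature=0.7,           # standard OpenAI field
    top_p=0.9,                 # standard OpenAI field
    extra_body={"top_k": 40},  # non-standard; forwarded in the JSON body
)
print(response.choices[0].message.content)
```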
🎉 With this update, FastFlowLM continues its mission to support the latest LLMs and provide an efficient, private, and developer-friendly experience on AMD Ryzen AI NPUs.