# 🚀 FastFlowLM v0.9.34
Another small step for versioning, one giant leap for smooth inference 😎
🌍 1. New Model: Translategemma:4b
Tag: translategemma:4b
Break down language barriers effortlessly.
A powerful translation model designed to handle multilingual tasks with precision and speed — so you can translate text seamlessly and keep your workflows moving 🗣️
📖 Check out the model card for best practices and recommended use cases.
🩺 2. New Model: medgemma1.5:4b
Tag: medgemma1.5:4b
Your lightweight medical AI assistant.
A specialized model fine-tuned for the healthcare and medical domain. Perfect for quickly processing medical text, summarizing clinical notes, and extracting insights.
⚠️ Disclaimer
This tool (MedGemma + FastFlowLM) is not a diagnostic or clinical tool.
Always consult a licensed medical professional for healthcare decisions.
🖼️ 3. Auto Image Resize (VL Models)
Qwen family Vision models just got smarter about image size.
- Applies to: `qwen2vl-it`, `qwen3vl-it`, `qwen2.5vl-it`
- 📐 All input images are automatically resized to 1080p by default
- 🔧 Disable it with the `-r 0` argument:

```
flm run qwen2.5vl-it -r 0
```
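The same resize path applies in Server Mode. Below is a minimal, unofficial sketch using the OpenAI Python SDK; the base URL, placeholder API key, and image path are assumptions (check the address printed by `flm serve qwen2.5vl-it`), not an official snippet.

```python
# Minimal sketch: send an image to a VL model through the (assumed)
# OpenAI-compatible endpoint. The server resizes it to 1080p by default.
import base64
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed local address; check `flm serve` output
    api_key="flm",  # local servers typically accept any placeholder key
)

# Base64-encode a local image and pass it as a data URL.
with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="qwen2.5vl-it",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```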
🔒 Stability Improvements
- Increased stability for: `qwen3vl-it`, `qwen2.5vl-it`
Fewer surprises. Smoother runs. More reliable vision inference 💪
As always, thank you for using FastFlowLM!
More speed. More models. More flow. 🚀
# 🚀 FastFlowLM v0.9.33 — New VLM Model
🖼️ New VLM Model Support
- Added: `qwen2.5vl-it:3b` – instruction-tuned VLM and ready to go!
- FYI: This model is based on Qwen2.5-VL-3B-Instruct, currently the most popular vision-text-to-text model measured by monthly downloads on Hugging Face! 🌍📥 (HF model leaderboard)
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.32
🧠 New Model Support
- Added: `qwen2.5-it:3b` – instruction-tuned and ready to go!
- FYI: This model is based on Qwen2.5-3B-Instruct, currently the most popular text-generation model measured by monthly downloads on Hugging Face! 🌍📥 (Proof: HF model leaderboard)
⚡ Performance Improvements
- `gpt-oss` and `gpt-oss-sg` got a turbo boost! 🏎️
- Up to 2.3× faster prefill speed for both short and long contexts.
- Real numbers, real fast: Benchmark results here 📊
🐛 Bug Fixes
- Fixed an issue in `medgemma` – it's now fully squashed. You're welcome 🪲🔨
That’s all for v0.9.32!
Stay fast, flow strong 💨
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.31: Long-Context Acceleration
📦 What’s New
⚡ Prefill Speed Upgrade (Qwen & Gemma Families)
A new attention engine dramatically accelerates prefill, with larger gains at longer context lengths (especially 16K+).
- 🚀 Up to 3.8× faster prefill
(qwen3:0.6b with a 32K-token prompt)
📈 Prefill Speed @ 32K Prompt (tokens/sec)
| Model | Before → After | Speedup |
|---|---|---|
| gemma3:1b | 1596 → 1755 | 1.1× |
| gemma3:4b | 673 → 926 | 1.4× |
| medgemma:4b | 673 → 926 | 1.4× |
| qwen3:0.6b | 236 → 1496 | 3.8× |
| qwen3:1.7b | 225 → 768 | 3.4× |
| qwen3:4b | 164 → 303 | 1.9× |
| qwen3-it:4b | 164 → 303 | 1.9× |
| qwen3-tk:4b | 164 → 303 | 1.9× |
| qwen3vl-it:4b | 164 → 303 | 1.9× |
| qwen3:8b | 150 → 260 | 1.7× |
| deepseek-r1-0528:8b | 150 → 260 | 1.7× |
🖼️ Note: For qwen3vl-it:4b, image understanding is also faster in this release.
🔗 Benchmark Results
- 📊 Gemma3 performance: https://fastflowlm.com/docs/benchmarks/gemma3_results/
- 📈 Qwen3 performance: https://fastflowlm.com/docs/benchmarks/qwen3_results/
🛠️ Vision Tool Calling (New)
- ✅ Tool calling is now supported on qwen3vl-it
- 🔍 Enables vision tool calling workflows
- 🎥 Demo: https://youtu.be/Rf6r0Fm1UVs?si=u45hBgFXyDeEKXxh
🔧 Client Compatibility Improvements
- Non-stream mode logic adjusted
- Improves compatibility with client applications
# 🛠️ FastFlowLM v0.9.30 — GPT-OSS Bug Squash Edition 🐛🔨
🔧 What’s Fixed
🧠 GPT‑OSS Compatibility Bug
Patched a glitch that was causing hiccups with GPT‑OSS models.
Stay bug-free and model-happy 🐞🤖
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.29: Major Prefill Speedup
Note: the `gpt-oss:20b` model was fixed in v0.9.30; please use v0.9.30 or later.
📦 What’s New
⚡ Massive Prefill Speed Upgrade
We introduced a new attention engine that dramatically accelerates prefill, with larger gains at longer context lengths (especially 16K+).
- Up to 2.6× faster prefill
- More speedup at longer prompts
- No model re-download required (seamless upgrade)
📈 Prefill Speed with 32K prompt (tok/s)
| Model | Before → After | Speedup |
|---|---|---|
| lfm2:1.2b | 1059 → 1916 | 1.8× |
| lfm2:2.6b | 654 → 1053 | 1.6× |
| lfm2-transcript:2.6b | 654 → 1053 | 1.6× |
| lfm2.5-it:1.2b | 1059 → 1916 | 1.8× |
| lfm2.5-tk:1.2b | 1059 → 1916 | 1.8× |
| llama3.2:1b | 577 → 1157 | 2.0× |
| llama3.2:3b | 214 → 500 | 2.3× |
| llama3.1:8b | 167 → 281 | 1.7× |
| deepseek-r1:8b | 167 → 281 | 1.7× |
| phi4-mini-it:4b | 173 → 447 | 2.6× |
🔜 Prefill speed upgrades for other models are on the way too — stay tuned!
📊 Detailed Benchmarks
🩹 “No Pain” Update 😎
Just update and run — long-context prefill is now much faster. 🚀
🛠️ Tool Call Bug Fix (Non‑Stream Case)
We’ve squashed a pesky bug affecting tool calls when not using streaming.
Now it behaves exactly like you thought it should. 😌
Thanks for being awesome!
Happy modeling 🤖💙
— The FastFlowLM Team
# 🚀 FastFlowLM v0.9.28: Pain-Free Performance Boost
Pain-free = no model redownload required to enjoy the speedup. Just update and go.
⚡️ 1. New Attention Engine — Gemma3 Vision Turbo
We’ve upgraded the Attention Engine to supercharge vision understanding for Gemma3-based models:
- Gemma3-4B
- MedGemma3-4B

📉 Latency Drop:
From 3.4s → 2.6s, a ~25% latency reduction.
Your VLMs just got a whole lot snappier.
🖼️ 2. Qwen3-VL-Instruct-4B Gets a Vision Head Boost
Our shiny new Attention Engine also powers up the vision head for Qwen3-VL-Instruct-4B.
It really shines at high image resolutions.
🏎️ Example:
On 4K images, vision understanding is now ~45% faster compared to previous releases.
Less waiting, more interpreting.
🧘 More pain-free speedups coming soon — stay tuned!
✅ Key Benefits
- ✅ No redownloads — instant upgrade ⚡
- ✅ Lower vision latency 🕶️
- ✅ Better efficiency for VLM workloads 💡
# 💧 FastFlowLM v0.9.27 — Special co-release with LiquidAI
🔗 Day-0 Support for LFM2.5-1.2B‑Thinking
First reasoning model from LiquidAI! More details: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking
LFM2.5-1.2B-Thinking delivers strong improvements in math reasoning, instruction following, and tool use, matching or exceeding Qwen3-1.7B on most reasoning benchmarks despite using ~40% fewer parameters.
Model tag in FLM: lfm2.5-tk:1.2b
Run it in CLI Mode with:
```
flm run lfm2.5-tk:1.2b
```

Run it in Server Mode with:

```
flm serve lfm2.5-tk:1.2b
```

📊 Performance at a Glance
Kraken (Ryzen AI 340 / 350)
| Device | Inference | Framework | Model | 4K-Token Prefill Speed (tok/s) | Peak Decoding Speed (tok/s) | Memory (Full Context) |
|---|---|---|---|---|---|---|
| AMD Ryzen AI 7 350 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2032 | 63+ | 1.6 GB |
| AMD Ryzen AI 5 340 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2032 | 63+ | 1.6 GB |
Decoding Speed vs. context length:
- 59 tok/s @ 4K context
- 52 tok/s @ 16K context
Strix/Strix Halo (Ryzen AI 360 and above)
| Device | Inference | Framework | Model | 4K-Token Prefill Speed (tok/s) | Peak Decoding Speed (tok/s) | Memory (Full Context) |
|---|---|---|---|---|---|---|
| AMD Ryzen AI Max+ 395 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2226 | 60+ | 1.6 GB |
| AMD Ryzen AI 9 HX 370 | NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2226 | 60+ | 1.6 GB |
Decoding Speed vs. context length:
- 54 tok/s @ 4K context
- 49 tok/s @ 16K context
Detailed benchmarks: https://fastflowlm.com/docs/benchmarks/lfm2_results/
Demo: 💧LFM2.5-1.2B-Thinking (LiquidAI) — 100% Powered by AMD Ryzen™ AI NPU
🆕 Model Tag Change
To avoid confusion between variants:
| Old Tag | New Tag |
|---|---|
| `lfm2.5:1.2b` | `lfm2.5-it:1.2b` (original, instruct) |
# 🚀 FastFlowLM v0.9.26 — New Model, Tool Calling, and Infra Change
🧠 1. New Model: LiquidAI/LFM2-2.6B-Transcript
Tag: lfm2-trans:2.6b
Summarize conference notes like a pro.
A single‑turn model designed to cleanly condense long transcripts into insights — so you can spend more time sipping ☕ and less time scrolling 📜.
🎬 See it in action: https://youtu.be/hpt0EhR1_vE?si=v9OCKa7VKAzuZ-02
🛠️ 2. Tool Calling — Out of Preview
Tool calling is now officially out of preview!
Verified to work with: `qwen3:4b`, `qwen3:8b`, `qwen3-it:4b`, `qwen3-tk:4b`
📹 Watch the demo: https://youtu.be/H-i4dztSdVk?si=5keyfkHt3ii8Wlu0
📘 Setup instructions (local):
👉 https://fastflowlm.com/docs/instructions/server/tool_calling/
💽 3. Installer Upgrade — xclbins Inside!
All xclbins are now bundled in the installer, which means:
- 🆙 Faster updates: no re-downloading models unless weights change
- 🤯 Fewer user headaches
- 🚀 Lets us keep pushing performance and efficiency
More performance tuners coming soon… 🔧⚡
🔁 4. Runtime Restructure for Fine‑Tuned Models
We’ve overhauled the FastFlowLM runtime to let YOU plug in fine‑tuned models from supported families.
This is made possible by the upcoming gguf → q4nx conversion tool —
it’s almost ready and the docs are currently baking 🍳.
Stay tuned — this one will unlock a lot of flexibility.
🙌 Acknowledgements
- Huge thanks to @ItzCrazyKns from Perplexica for schooling us in the basics of tool calling and for all the help along the way!
- Huge thanks to @jeremyfowers for highlighting and helping us resolve the ambiguity in the JSON-formatted reasoning output!
# 🧠 FastFlowLM v0.9.25 — New Models + OpenAI API Fixes
We're excited to introduce FastFlowLM v0.9.25, marking a key milestone with the integration of the new LFM2.5 model, freshly unveiled at CES 2026 (Jan 5th). This release also includes improvements to API compatibility and instruction-style models.
🚀 New Model Support
- LFM2.5-1.2B-Instruct
  🔸 Debuted at CES 2026
  The newest addition to the LFM family, tuned for instruction-following. It features improved responsiveness and latency, ideal for interactive applications on the AMD NPU.
- Phi4-mini-instruct
  A compact instruction model tailored for devices with limited memory — great for summarization and low-resource tasks.
🛠️ Fixes & Improvements
- ✅ Fixed bugs where generation parameters (`top_k`, `top_p`, etc.) were not respected in OpenAI-compatible REST APIs.
- Ensures correct behavior when adjusting generation strategy through API calls.
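A minimal sketch of what the fix enables, assuming the OpenAI-compatible server at `http://localhost:11434/v1` (check your `flm serve` output for the actual address). `temperature` and `top_p` are standard OpenAI fields; `top_k` is not part of the official OpenAI schema, so it is passed through the client's `extra_body` escape hatch; whether the server reads it from the request body that way is an assumption to verify against the docs. The model tag is likewise just an example.

```python
# Minimal sketch: adjusting sampling parameters through the
# OpenAI-compatible REST API (address and model tag are assumptions).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="flm")

response = client.chat.completions.create(
    model="phi4-mini-it:4b",   # example tag; use any installed model
    messages=[{"role": "user", "content": "Summarize FastFlowLM in one line."}],
    temperature=0.7,           # standard OpenAI field
    top_p=0.9,                 # standard OpenAI field
    extra_body={"top_k": 40},  # non-standard; forwarded in the JSON body
)
print(response.choices[0].message.content)
```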
🎉 With this update, FastFlowLM continues its mission to support the latest LLMs and provide an efficient, private, and developer-friendly experience on AMD Ryzen AI NPUs.