
# 🚀 FastFlowLM v0.9.34

20 Feb 21:52
1b13bd3


Another small step for versioning, one giant leap for smooth inference 😎


🌍 1. New Model: Translategemma:4b

Tag: translategemma:4b

Break down language barriers effortlessly.
A powerful translation model designed to handle multilingual tasks with precision and speed — so you can translate text seamlessly and keep your workflows moving 🗣️

📖 Check out the model card for best practices and recommended use cases.


🩺 2. New Model: medgemma1.5:4b

Tag: medgemma1.5:4b

Your lightweight medical AI assistant.
A specialized model fine-tuned for the healthcare and medical domain. Perfect for quickly processing medical text, summarizing clinical notes, and extracting insights.

⚠️ Disclaimer

This tool (MedGemma + FastFlowLM) is not a diagnostic or clinical tool.
Always consult a licensed medical professional for healthcare decisions.


🖼️ 3. Auto Image Resize (VL Models)

Qwen family Vision models just got smarter about image size.

  • Applies to: qwen2vl-it, qwen3vl-it, qwen2.5vl-it
  • 📐 All input images are automatically resized to 1080p by default
  • 🔧 Disable it by passing -r 0:
flm run qwen2.5vl-it -r 0
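To make the resize behavior concrete, here is a minimal sketch of what "auto resize to 1080p" typically means, assuming the image is scaled down to fit a 1920×1080 box while preserving aspect ratio. This is an illustration only, not FastFlowLM's actual implementation:

```python
def fit_to_1080p(width: int, height: int) -> tuple[int, int]:
    """Scale (width, height) down to fit within a 1920x1080 box,
    preserving aspect ratio. Images already inside the box are untouched.
    Illustrative only; the exact resize policy FastFlowLM applies is not
    documented in these notes."""
    MAX_W, MAX_H = 1920, 1080
    scale = min(MAX_W / width, MAX_H / height, 1.0)
    return round(width * scale), round(height * scale)

print(fit_to_1080p(3840, 2160))  # 4K frame -> (1920, 1080)
print(fit_to_1080p(1280, 720))   # already fits -> (1280, 720)
```

Smaller inputs pass through unchanged, so only large images pay the resize cost.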

🔒 Stability Improvements

  • Increased stability for:

    • qwen3vl-it
    • qwen2.5vl-it

Fewer surprises. Smoother runs. More reliable vision inference 💪


As always, thank you for using FastFlowLM!
More speed. More models. More flow. 🚀

# 🚀 FastFlowLM v0.9.33 - New VLM Model

12 Feb 16:45
d511272


🖼️ New VLM Model Support

  • Added: qwen2.5vl-it:3b – instruction-tuned VLM and ready to go!
  • FYI: This model is based on Qwen2.5-VL-3B-Instruct, currently the most popular vision-text-to-text model measured by monthly downloads on Hugging Face! 🌍📥 (HF model leaderboard)

The FastFlowLM Team

# 🚀 FastFlowLM v0.9.32

05 Feb 16:18


🧠 New Model Support

  • Added: qwen2.5-it:3b – instruction-tuned and ready to go!
  • FYI: This model is based on Qwen2.5-3B-Instruct, currently the most popular text generation model measured by monthly downloads on Hugging Face! 🌍📥
    (Proof: HF model leaderboard)

⚡ Performance Improvements

  • gpt-oss and gpt-oss-sg got a turbo boost! 🏎️
    • Up to 2.3× faster prefill speed for both short and long contexts.
    • Real numbers, real fast: Benchmark results here 📊

🐛 Bug Fixes

  • Fixed an issue in medgemma
    • It's now fully squashed. You’re welcome 🪲🔨

That’s all for v0.9.32!
Stay fast, flow strong 💨
The FastFlowLM Team

# 🚀 FastFlowLM v0.9.31: Long-Context Acceleration

02 Feb 18:39
e2986d4


📦 What’s New

⚡ Prefill Speed Upgrade (Qwen & Gemma Families)

A new attention engine dramatically accelerates prefill, with larger gains at longer context lengths (especially 16K+).

  • 🚀 Up to 3.8× faster prefill
    (qwen3:0.6b with a 32K-token prompt)

📈 Prefill Speed @ 32K Prompt (tokens/sec)

| Model | Before → After (tok/s) | Speedup |
|---|---|---|
| gemma3:1b | 1596 → 1755 | 1.1× |
| gemma3:4b | 673 → 926 | 1.4× |
| medgemma:4b | 673 → 926 | 1.4× |
| qwen3:0.6b | 236 → 1496 | 3.8× |
| qwen3:1.7b | 225 → 768 | 3.4× |
| qwen3:4b | 164 → 303 | 1.9× |
| qwen3-it:4b | 164 → 303 | 1.9× |
| qwen3-tk:4b | 164 → 303 | 1.9× |
| qwen3vl-it:4b | 164 → 303 | 1.9× |
| qwen3:8b | 150 → 260 | 1.7× |
| deepseek-r1-0528:8b | 150 → 260 | 1.7× |

🖼️ Note: For qwen3vl-it:4b, image understanding is also faster in this release.


🔗 Benchmark Results


🛠️ Vision Tool Calling (New)


🔧 Client Compatibility Improvements

  • Non-stream mode logic adjusted
  • Improves compatibility with client applications

# 🛠️ FastFlowLM v0.9.30 — GPT-OSS Bug Squash Edition 🐛🔨

28 Jan 20:54


🔧 What’s Fixed

🧠 GPT‑OSS Compatibility Bug

Patched a glitch that was causing hiccups with GPT‑OSS models.

Stay bug-free and model-happy 🐞🤖
— The FastFlowLM Team

# 🚀 FastFlowLM v0.9.29: Major Prefill Speedup

28 Jan 19:17
d8451d4


⚠️ A bug affecting the gpt-oss:20b model was fixed in v0.9.30. Please use v0.9.30 or later.


📦 What’s New

⚡ Massive Prefill Speed Upgrade

We introduced a new attention engine that dramatically accelerates prefill, with larger gains at longer context lengths (especially 16K+).

  • Up to 2.6× faster prefill
  • More speedup at longer prompts
  • No model re-download required (seamless upgrade)

📈 Prefill Speed with 32K prompt (tok/s)

| Model | Before → After (tok/s) | Speedup |
|---|---|---|
| lfm2:1.2b | 1059 → 1916 | 1.8× |
| lfm2:2.6b | 654 → 1053 | 1.6× |
| lfm2-transcript:2.6b | 654 → 1053 | 1.6× |
| lfm2.5-it:1.2b | 1059 → 1916 | 1.8× |
| lfm2.5-tk:1.2b | 1059 → 1916 | 1.8× |
| llama3.2:1b | 577 → 1157 | 2.0× |
| llama3.2:3b | 214 → 500 | 2.3× |
| llama3.1:8b | 167 → 281 | 1.7× |
| deepseek-r1:8b | 167 → 281 | 1.7× |
| Phi4-mini-it:4b | 173 → 447 | 2.6× |

🔜 Prefill speed upgrades for other models are on the way too — stay tuned!

📊 Detailed Benchmarks

🩹 “No Pain” Update 😎

Just update and run — long-context prefill is now much faster. 🚀

🛠️ Tool Call Bug Fix (Non‑Stream Case)

We’ve squashed a pesky bug affecting tool calls when not using streaming.
Now it behaves exactly like you thought it should. 😌


Thanks for being awesome!
Happy modeling 🤖💙

— The FastFlowLM Team

# 🚀 FastFlowLM v0.9.28: Pain-Free Performance Boost

22 Jan 16:45
bb4e43a


Pain-free = no model redownload required to enjoy the speedup. Just update and go.


⚡️ 1. New Attention Engine — Gemma3 Vision Turbo

We’ve upgraded the Attention Engine to supercharge vision understanding for Gemma3-based models:

  • Gemma3-4B
  • MedGemma3-4B

📉 Latency Drop:
From 3.4s → 2.6s, giving you a ~25% speedup.
Your VLMs just got a whole lot snappier.


🖼️ 2. Qwen3-VL-Instruct-4B Gets a Vision Head Boost

Our shiny new Attention Engine also powers up the vision head for Qwen3-VL-Instruct-4B.
It really shines at high image resolutions.

🏎️ Example:
On 4K images, vision understanding is now ~45% faster compared to previous releases.
Less waiting, more interpreting.


🧘 More pain-free speedups coming soon — stay tuned!


✅ Key Benefits

  • No redownloads — instant upgrade ⚡
  • Lower vision latency 🕶️
  • Better efficiency for VLM workloads 💡

# 💧 FastFlowLM v0.9.27 — Special co-release with LiquidAI

20 Jan 16:11
9d2f5cf


🔗 Day-0 Support for LFM2.5-1.2B‑Thinking

First reasoning model from LiquidAI! More details: https://huggingface.co/LiquidAI/LFM2.5-1.2B-Thinking

LFM2.5-1.2B-Thinking delivers strong improvements in math reasoning, instruction following, and tool use, matching or exceeding Qwen3-1.7B on most reasoning benchmarks despite using ~40% fewer parameters.

Model tag in FLM: lfm2.5-tk:1.2b

Run it in CLI Mode with:

flm run lfm2.5-tk:1.2b

Run it in Server Mode with:

flm serve lfm2.5-tk:1.2b

📊 Performance at a Glance

Kraken (Ryzen AI 340 / 350)

| Device | Inference Framework | Model | 4K-Token Prefill Speed (tok/s) | Peak Decoding Speed (tok/s) | Memory (Full Context) |
|---|---|---|---|---|---|
| AMD Ryzen AI 7 HX350 NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2032 | 63+ | 1.6 GB |
| AMD Ryzen AI 5 HX340 NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2032 | 63+ | 1.6 GB |

Decoding Speed vs. context length:

  • 59 tok/s @ 4K context
  • 52 tok/s @ 16K context

Strix/Strix Halo (Ryzen AI 360 and above)

| Device | Inference Framework | Model | 4K-Token Prefill Speed (tok/s) | Peak Decoding Speed (tok/s) | Memory (Full Context) |
|---|---|---|---|---|---|
| AMD Ryzen AI 395+ NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2226 | 60+ | 1.6 GB |
| AMD Ryzen AI 9 HX370 NPU | FastFlowLM | LFM2.5-1.2B-Thinking | 2226 | 60+ | 1.6 GB |

Decoding Speed vs. context length:

  • 54 tok/s @ 4K context
  • 49 tok/s @ 16K context

Detailed benchmarks: https://fastflowlm.com/docs/benchmarks/lfm2_results/


Demo: 💧LFM2.5-1.2B-Thinking (LiquidAI) — 100% Powered by AMD Ryzen™ AI NPU


🆕 Model Tag Change

To avoid confusion between variants:

| Old Tag | New Tag |
|---|---|
| lfm2.5:1.2b | lfm2.5-it:1.2b (original, instruct) |

# 🚀 FastFlowLM v0.9.26 — New Model, Tool Calling, and Infra Change

15 Jan 19:16


🧠 1. New Model: LiquidAI/LFM2‑2.6B‑Transcript

Tag: lfm2‑trans:2.6b

Summarize conference notes like a pro.
A single‑turn model designed to cleanly condense long transcripts into insights — so you can spend more time sipping ☕ and less time scrolling 📜.

🎬 See it in action: https://youtu.be/hpt0EhR1_vE?si=v9OCKa7VKAzuZ-02


🛠️ 2. Tool Calling — Out of Preview

Tool calling is now officially out of preview!
Verified to work with:

  • qwen3:4b
  • qwen3:8b
  • qwen3-it:4b
  • qwen3-tk:4b

📹 Watch the demo: https://youtu.be/H-i4dztSdVk?si=5keyfkHt3ii8Wlu0

📘 Setup instructions (local):
👉 https://fastflowlm.com/docs/instructions/server/tool_calling/
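For a sense of what a tool-calling request looks like, here is a minimal sketch of an OpenAI-style chat-completions payload for a model started with flm serve qwen3:4b. The base URL and port are assumptions (check the setup docs above for your server's actual address), and get_weather is a hypothetical tool invented for illustration:

```python
import json

# Assumed endpoint; FastFlowLM's actual host/port may differ — see the docs.
BASE_URL = "http://localhost:11434/v1/chat/completions"

payload = {
    "model": "qwen3:4b",
    "messages": [{"role": "user", "content": "What's the weather in Tokyo?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ],
    "stream": False,
}

# The payload serializes cleanly to JSON, ready to POST to BASE_URL
# (e.g. with requests.post(BASE_URL, json=payload)).
print(json.dumps(payload, indent=2))
```

The model is expected to answer with a tool_calls entry naming get_weather, which your client then executes and feeds back as a tool message.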


💽 3. Installer Upgrade — xclbins Inside!

All xclbins are now bundled in the installer, which means:

  • 🆙 Faster updates: no re-downloading models unless weights change
  • 🤯 Fewer user headaches
  • 🚀 Lets us keep pushing performance and efficiency

More performance tuners coming soon… 🔧⚡


🔁 4. Runtime Restructure for Fine‑Tuned Models

We’ve overhauled the FastFlowLM runtime to let YOU plug in fine‑tuned models from supported families.

This is made possible by the upcoming gguf → q4nx conversion tool; it's almost ready, and the docs are currently baking 🍳.

Stay tuned — this one will unlock a lot of flexibility.



# 🧠 FastFlowLM v0.9.25 — New Models + OpenAI API Fixes

08 Jan 19:35


We're excited to introduce FastFlowLM v0.9.25, marking a key milestone with the integration of the new LFM2.5 model, freshly unveiled at CES 2026 (Jan 5th). This release also includes improvements to API compatibility and instruction-style models.


🚀 New Model Support

  • LFM2.5-1.2B-Instruct
    🔸 Debuted at CES 2026
    The newest addition to the LFM family, tuned for instruction-following. It features improved responsiveness and latency, ideal for interactive applications on AMD NPU.

  • Phi4-mini-instruct
    A compact instruction model tailored for devices with limited memory — great for summarization and low-resource tasks.


🛠️ Fixes & Improvements

  • ✅ Fixed bugs related to generation parameters (top_k, top_p, etc.) not being respected in OpenAI-compatible REST APIs.
  • Ensures correct behavior when adjusting generation strategy through API calls.
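As a quick sanity check for the fix above, here is a minimal sketch of an OpenAI-compatible request body carrying the sampling parameters. The endpoint is an assumption, and top_k is an extension field (not part of the core OpenAI schema) that some OpenAI-compatible servers accept in the JSON body:

```python
import json

# Illustrative request body exercising the now-respected sampling parameters.
payload = {
    "model": "lfm2.5-it:1.2b",
    "messages": [{"role": "user", "content": "Summarize NPUs in one sentence."}],
    "temperature": 0.7,
    "top_p": 0.9,    # nucleus sampling threshold
    "top_k": 40,     # extension field; check server docs for support
    "max_tokens": 128,
}

print(json.dumps(payload, indent=2))
```

With this release, values sent this way should actually change the generation strategy rather than being silently ignored.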

🎉 With this update, FastFlowLM continues its mission to support the latest LLMs and provide an efficient, private, and developer-friendly experience on AMD Ryzen AI NPUs.