On February 2, 2026, StepFun released Step 3.5 Flash—a sparse mixture-of-experts model that achieves frontier-class reasoning with only 11B active parameters per token. The full model contains 196B parameters, but thanks to intelligent expert routing, it runs on consumer hardware like the Mac Studio M4 Max while scoring 97.3 on AIME 2025 and 74.4% on SWE-bench. This Step 3.5 Flash tutorial walks you through five deployment paths: cloud APIs starting with OpenRouter’s free tier, local deployment on your Mac, and everything in between.
The efficiency story matters. Step 3.5 Flash outperforms models 3-5x larger—including DeepSeek V3.2 (671B) and Kimi K2.5 (1T)—while maintaining 100-350 tok/s throughput. It’s available under Apache 2.0 on GitHub and HuggingFace, compatible with the OpenAI SDK for drop-in integration, and designed specifically for agentic AI workloads.
What Is Step 3.5 Flash? The 196B Model That Runs Like 11B
Step 3.5 Flash uses a sparse Mixture-of-Experts (MoE) architecture with 288 routed experts per layer plus 1 shared expert that activates on every token. The model selects only the top-8 experts per token, activating approximately 11B of its 196B total parameters—a 5.6% efficiency ratio. This is the same architectural philosophy behind DeepSeek V3.2, but executed at a smaller scale with purpose-built optimizations for speed.
Think of it like a consulting firm with 288 specialists. Each incoming query gets routed to the 8 most relevant experts instead of consulting all 288. You get specialist-level quality with generalist-level latency. For a deeper look at how MoE routing works, check out our MoE architecture deep dive.
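To make the routing concrete, here is a toy sketch of top-k gating in plain NumPy. The expert count and top-8 selection mirror the article; the hidden size, random weights, and gating math are simplified stand-ins, not StepFun's actual router.

import numpy as np

# Toy top-k MoE router: 288 experts, 8 selected per token (illustrative only)
NUM_EXPERTS, TOP_K, HIDDEN = 288, 8, 16

rng = np.random.default_rng(0)
router_weights = rng.standard_normal((HIDDEN, NUM_EXPERTS))

def route(token_hidden):
    """Return (expert_id, gate_weight) pairs for one token."""
    logits = token_hidden @ router_weights                    # score every expert
    top = np.argsort(logits)[-TOP_K:]                         # keep the 8 highest scores
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over the winners
    return list(zip(top.tolist(), gates.tolist()))

token = rng.standard_normal(HIDDEN)
print(route(token))  # only 8 of 288 experts run for this token

Only the selected experts' feed-forward weights touch each token, which is how the active parameter count stays near 11B while the full model holds 196B.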
The secret to Step 3.5 Flash’s 256K context window is hybrid attention: a 3:1 ratio of sliding window attention (SWA) to full attention layers. StepFun augmented the SWA layers with 96 query heads (up from 64) to maintain representational power without expanding the KV cache. They chose SWA over linear attention specifically to support Multi-Token Prediction (MTP-3), which predicts 4 tokens per forward pass for speculative decoding. The result: 100-350 tok/s inference speed depending on hardware.
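A minimal sketch of that 3:1 layer pattern, with a made-up window size and depth: SWA layers bound how far back each position can attend (keeping the KV cache small), while the periodic full-attention layers restore global reach.

import numpy as np

# Illustrative 3:1 hybrid attention stack (window and depth are made up)
SEQ_LEN, WINDOW = 8, 3

causal = np.tril(np.ones((SEQ_LEN, SEQ_LEN), dtype=bool))   # full causal mask
recent = np.arange(SEQ_LEN)[:, None] - np.arange(SEQ_LEN)[None, :] < WINDOW
sliding = causal & recent                                    # sliding-window mask

for layer in range(8):
    full = (layer + 1) % 4 == 0          # every 4th layer uses full attention
    mask = causal if full else sliding
    reach = mask[-1].sum()               # tokens the last position can attend to
    print(f"layer {layer}: {'full' if full else 'SWA '} reach={reach}")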
Benchmarks: How Step 3.5 Flash Stacks Up
Step 3.5 Flash scores 97.3 on AIME 2025 (versus GPT-5.2’s perfect 100), 74.4% on SWE-bench Verified (versus GPT-5.2’s 80.0% and Claude Opus 4.5’s 80.9%), and 86.4% on LiveCodeBench-V6. On agentic benchmarks, it hits 51.0% on Terminal-Bench 2.0 and 88.2 on tau-2-Bench. These numbers position it as a frontier-class model despite activating only 11B parameters per token.
| Benchmark | Step 3.5 Flash | GPT-5.2 | Claude Opus 4.5 | DeepSeek V3.2 |
|---|---|---|---|---|
| AIME 2025 | 97.3 | 100 | 92.8 | 93.1 |
| SWE-bench Verified | 74.4% | 80.0% | 80.9% | — |
| LiveCodeBench-V6 | 86.4% | — | — | — |
| Terminal-Bench 2.0 | 51.0% | — | — | — |
The gap between Step 3.5 Flash and proprietary frontier models is narrowing fast. It trails GPT-5.2 on SWE-bench (74.4% vs 80.0%) but carries no per-token API fees once self-hosted. That trade-off matters for production agentic systems where API costs scale linearly with usage. For context on whether these benchmarks translate to real-world performance, see our take on AI coding assistants.
Getting Started: OpenRouter Free Tier (Fastest Path)
The fastest way to test Step 3.5 Flash is OpenRouter’s free tier—OpenAI SDK compatible and rate-limited but sufficient for prototyping. Create an account at openrouter.ai, grab your API key, and you’re running inference in under 5 minutes.
from openai import OpenAI

# OpenRouter exposes an OpenAI-compatible API, so the standard SDK works as-is
client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
    default_headers={"HTTP-Referer": "https://your-app.com"},  # optional app attribution
)

response = client.chat.completions.create(
    model="stepfun/step-3.5-flash:free",  # the ":free" suffix selects the free tier
    messages=[
        {"role": "system", "content": "You are a helpful coding assistant."},
        {"role": "user", "content": "Write a Python function to merge two sorted arrays."},
    ],
)

print(response.choices[0].message.content)
OpenRouter’s free tier includes rate limits but works for low-volume production use. For higher throughput, switch to the paid tier (stepfun/step-3.5-flash) or connect directly to StepFun’s API at api.stepfun.ai/v1 using the same OpenAI client initialization pattern. NVIDIA also offers a free developer trial via NIM with optimized containers for DGX Spark and H100/H200 GPUs.
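Switching providers only changes the client initialization. A sketch against StepFun's first-party endpoint (the key placeholder and console workflow are assumptions; check StepFun's docs for the exact model identifier):

from openai import OpenAI

# Same OpenAI SDK, pointed at StepFun's endpoint instead of OpenRouter
client = OpenAI(
    base_url="https://api.stepfun.ai/v1",
    api_key="YOUR_STEPFUN_API_KEY",  # placeholder; issued from StepFun's console
)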

Local Deployment: vLLM on GPU Servers
For production workloads on dedicated GPU infrastructure, vLLM is the gold standard. It supports tensor parallelism across 4, 8, or 16 GPUs and achieves 100-300 tok/s throughput at scale. The FP8 quantized variant requires approximately 196GB VRAM (4x H100 80GB), while full BF16 precision needs around 392GB (8x H100 80GB).
# FP8 quantization with 4-way tensor parallelism
vllm serve stepfun-ai/Step-3.5-Flash-FP8 \
    --tensor-parallel-size 4 \
    --quantization fp8

# BF16 full precision with 8-way tensor parallelism
vllm serve stepfun-ai/Step-3.5-Flash \
    --tensor-parallel-size 8
Once running, vLLM exposes an OpenAI-compatible /v1/chat/completions endpoint. You can drop it into existing applications without changing your client code. For deployment guidance, check the vLLM documentation.
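A quick sanity check from Python (vLLM's server defaults to port 8000 and serves the model under the name it was launched with; the API key can be any string unless you set one with --api-key):

from openai import OpenAI

# Point the same client at the local vLLM server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="stepfun-ai/Step-3.5-Flash-FP8",  # must match the served model name
    messages=[{"role": "user", "content": "Summarize MoE routing in one sentence."}],
)
print(response.choices[0].message.content)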
Local Deployment: llama.cpp on Your Mac
Step 3.5 Flash runs on the Mac Studio M4 Max (128GB unified memory) via Int4 quantization. The Q4_K_S GGUF weights weigh in at approximately 111.5GB. On NVIDIA DGX Spark, the model achieves 20 tok/s with 256K context using INT8 KV cache quantization. This is frontier-class AI running on consumer hardware.
# Build llama.cpp from source or use precompiled binaries,
# then download the Q4_K_S GGUF weights from HuggingFace

# Run with 16K context
./llama-cli -m step3.5_flash_Q4_K_S.gguf \
    -c 16384 -b 2048 -ub 2048 -fa on --temp 1.0

# Run with full 256K context (smaller micro-batch to fit in memory)
./llama-cli -m step3.5_flash_Q4_K_S.gguf \
    -c 262144 -ub 1024 -fa on --temp 1.0
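For programmatic access, llama.cpp also ships llama-server, which exposes an OpenAI-compatible endpoint. A minimal sketch, assuming the same GGUF file and the default port:

from openai import OpenAI

# Launch the server first (flags mirror the llama-cli examples above):
#   ./llama-server -m step3.5_flash_Q4_K_S.gguf -c 16384 -fa on --port 8080
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="step-3.5-flash",  # with a single loaded model, llama-server accepts any name
    messages=[{"role": "user", "content": "Hello from local inference!"}],
)
print(response.choices[0].message.content)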
The trade-off: expect 5-10 tok/s on consumer hardware versus 100-300 tok/s on GPU clusters. For development, testing, or running without cloud costs, that’s acceptable. For production agentic workflows where speed matters, use vLLM or a cloud API. If you’re comparing local deployment approaches, see our DeepSeek V3.2 local setup guide for similar MoE deployment patterns.
Hardware Requirements
| Configuration | VRAM/Memory | Hardware Example | Speed |
|---|---|---|---|
| BF16 Full Precision | ~392GB | 8x A100 80GB or 8x H100 80GB | 100-350 tok/s per stream |
| FP8 Quantization | ~196GB | 4x H100 80GB | 100-300 tok/s |
| Int4 Quantization | ~120GB minimum | Mac Studio M4 Max, DGX Spark | 5-20 tok/s |
| llama.cpp GGUF Q4_K_S | 128GB unified | Mac Studio, DGX Spark | 5-20 tok/s |
CPU-only inference isn’t recommended—too slow for practical use. If you’re building agents that need to respond in real time, prioritize GPU deployment or cloud APIs. For batch processing or offline workflows, Mac deployment with llama.cpp works fine.
When to Use Step 3.5 Flash vs Alternatives
Use Step 3.5 Flash if you’re building agents, need 256K context, want open-source reliability, or can’t afford proprietary API costs at scale. Use GPT-5.2 if you need absolute frontier capability (80.0% SWE-bench vs 74.4%) and are willing to pay for proprietary support. Use DeepSeek V3.2 if you want raw parameter scale (671B) or prefer the MIT license, but expect larger GPU requirements.
For local deployment, use Step 3.5 Flash on Mac if you want zero API costs, full privacy, and can accept slower speeds (5-10 tok/s). For cloud APIs, use OpenRouter’s free tier for prototyping, then migrate to StepFun or NVIDIA NIM for production. If you’re orchestrating multi-step agent workflows, pair Step 3.5 Flash with LangGraph or similar frameworks.
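A sketch of that pairing, assuming the OpenRouter setup from earlier and LangGraph's prebuilt ReAct agent (the tool and prompt are illustrative, not a production recipe):

from langchain_core.tools import tool
from langchain_openai import ChatOpenAI
from langgraph.prebuilt import create_react_agent

# Step 3.5 Flash behind an OpenAI-compatible endpoint, driving a ReAct agent
llm = ChatOpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",
    model="stepfun/step-3.5-flash:free",
)

@tool
def word_count(text: str) -> int:
    """Count the words in a piece of text."""
    return len(text.split())

agent = create_react_agent(llm, tools=[word_count])
result = agent.invoke(
    {"messages": [("user", "How many words are in 'sparse mixture of experts'?")]}
)
print(result["messages"][-1].content)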
Conclusion
Step 3.5 Flash proves that efficiency beats scale: 196B total with 11B active outperforms models 3-5x larger. The MoE architecture with 288 experts, hybrid attention, and MTP-3 is purpose-built for agentic AI. Multiple deployment paths mean you can start immediately via OpenRouter’s free tier or go fully self-hosted with vLLM. As agent frameworks mature, models like Step 3.5 Flash will power the next generation of autonomous AI systems without requiring proprietary APIs or massive GPU clusters.