Qwen 3.5 Small Series Showcase
The New Qwen 3.5 Small Series 🚀
On March 2, 2026, Alibaba released the Qwen 3.5 Small Model Series, designed for efficient on-device and edge AI deployment. These models feature native multimodality, a 262K-token context window, and an optional “thinking mode” for extended reasoning.
We have benchmarked the entire series on our standard CPU-only server hardware to identify the performance characteristics and “sweet spots” for local deployment.
Test Hardware
| Component | Specification |
|---|---|
| CPU | AMD EPYC-Rome (16 cores / 32 threads @ 2.0 GHz) |
| RAM | 30 GB DDR4 |
| GPU | None (CPU-only) |
| OS | Linux |
| igllama | v0.3.7 |
| llama.cpp | gguf-v0.18.0 |
Performance Summary (CPU-Only)
Benchmarks were run with the `igllama api` command, using the best thread settings found for this CPU (8 threads for generation, 16 for prefill) and with thinking disabled via `--no-think`.
| Model | Weight Class | GGUF Size (Q4_K_XL) | Generation Speed | Notes |
|---|---|---|---|---|
| Qwen3.5-0.8B | Edge | 0.53 GB | 23.01 tok/s | Ultra-fast, ideal for mobile/IoT |
| Qwen3.5-2B | Mobile | 1.28 GB | 18.38 tok/s | High speed, capable reasoning |
| Qwen3.5-4B | Sweet Spot | 2.71 GB | 8.48 tok/s | Recommended for agentic tasks |
| Qwen3.5-9B | Heavy | 5.56 GB | 6.45 tok/s | Best reasoning, vision capable |
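As a rule of thumb, resident memory for a local run is the quantized weight file plus the KV cache, which grows linearly with context length. A minimal sketch of that budget, where the architecture parameters (`n_layers`, `n_kv_heads`, `head_dim`) are hypothetical placeholders rather than the real Qwen3.5-4B config — read the actual values from the GGUF metadata before relying on the result:

```python
# Rough RAM budget: quantized weights + KV cache.
# CAUTION: n_layers / n_kv_heads / head_dim below are guesses, not the
# published Qwen3.5-4B architecture; llama.cpp prints the real values
# from the GGUF metadata at load time.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Size of the KV cache: a K and a V tensor per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

weights_gb = 2.71  # Qwen3.5-4B Q4_K_XL, from the table above
cache_gb = kv_cache_gb(n_layers=36, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(f"weights {weights_gb:.2f} GB + KV cache {cache_gb:.2f} GB "
      f"= ~{weights_gb + cache_gb:.2f} GB resident")
```

With these placeholder values the estimate lands in the same ballpark as the ~3.5 GB footprint reported below; the point is the shape of the formula, not the exact number.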
The “Sweet Spot”: Qwen3.5-4B
While the 0.8B and 2B models are substantially faster, Qwen3.5-4B strikes the best balance between reasoning quality and throughput. At 8.48 tok/s it remains comfortably interactive, while offering reasoning that rivals much larger previous-generation models.
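One way to ground “fast enough for interactive use” is to compare generation speed against typical silent reading speed. Both constants below are common rules of thumb, not measurements:

```python
# Is 8.48 tok/s faster than a person reads? Rule-of-thumb constants only.
READING_WPS = 4.0      # ~240 words/min silent reading (assumption)
TOKENS_PER_WORD = 1.3  # typical for English BPE vocabularies (assumption)

reading_tok_s = READING_WPS * TOKENS_PER_WORD
gen_tok_s = 8.48  # Qwen3.5-4B, from the benchmark table
print(f"reader consumes ~{reading_tok_s:.1f} tok/s; "
      f"model produces {gen_tok_s} tok/s")
```

Under these assumptions the 4B model outpaces a reader by a comfortable margin, which is what “interactive” means in practice for streaming output.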
Comparison to Qwen3.5-35B-A3B (MoE)
| Metric | Qwen3.5-4B (Dense) | Qwen3.5-35B-A3B (MoE) |
|---|---|---|
| Active Params | 4B | 3B |
| GGUF Size (Q4) | 2.71 GB | 19.17 GB |
| Generation Speed | 8.48 tok/s | 5.56 tok/s |
| Memory Footprint | ~3.5 GB | ~22 GB |
The dense 4B model is 52% faster than the 35B-A3B MoE on the same hardware, even though it has more active parameters per token (4B vs. 3B). The likely explanation is the memory footprint: the 4B’s ~3.5 GB working set is far friendlier to CPU caches and the memory bus than the MoE’s ~22 GB, where the set of active experts changes from token to token.
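This explanation can be sanity-checked with a back-of-envelope model: if CPU decode is memory-bandwidth-bound, each generated token streams the model weights through the memory bus once, so the measured speed implies an effective weight bandwidth of roughly size × tok/s. Sizes and speeds are taken from the benchmark tables; the memory-bound assumption itself is the hypothesis being eyeballed:

```python
# Implied weight-streaming bandwidth under a memory-bound decode model:
# bandwidth ≈ gguf_size_gb * tokens_per_second.
# Note: the MoE activates only ~3B params per token, but the active
# experts change per token, so its hot set is much larger than its
# active-parameter count suggests.

models = {
    "Qwen3.5-0.8B":     (0.53, 23.01),
    "Qwen3.5-2B":       (1.28, 18.38),
    "Qwen3.5-4B":       (2.71, 8.48),
    "Qwen3.5-9B":       (5.56, 6.45),
    "Qwen3.5-35B-A3B":  (19.17, 5.56),  # MoE: see caveat above
}

for name, (size_gb, tok_s) in models.items():
    print(f"{name}: ~{size_gb * tok_s:.1f} GB/s implied weight traffic")
```

The dense models imply broadly similar effective bandwidths, while the MoE row is not directly comparable because only part of its 19 GB is read per token — consistent with cache behavior, not raw FLOPs, dominating CPU decode.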
Optimal Launch Configuration
For the 4B/9B models on 8-channel memory systems (like EPYC-Rome):
igllama api Qwen3.5-4B-UD-Q4_K_XL.gguf \
--threads 8 \
--threads-batch 16 \
--mlock \
--ctx-size 8192 \
--no-think
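On SMT hardware it can also help to pin the process so the generation threads stay on distinct physical cores. A sketch using Linux `taskset`, assuming the same `igllama` flags as above and that CPUs 0–15 map to the 16 physical cores — core numbering is topology-dependent, so verify with `lscpu -e` first:

```shell
# Pin the server to CPUs 0-15 (assumed to be the physical cores;
# check `lscpu -e` for your topology before copying this).
taskset -c 0-15 igllama api Qwen3.5-4B-UD-Q4_K_XL.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192 \
  --no-think
```

Note that `--mlock` requires a sufficient `memlock` ulimit; if the launch fails, raise it or drop the flag.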
Conclusion
The Qwen 3.5 Small Series proves that you don’t need a flagship GPU to run high-quality LLMs. For CPU-only servers or high-end workstations, the 4B model is the new standard for local AI agents, providing intelligence and speed in a compact package.