Qwen 3.5 Small Series Showcase
The New Qwen 3.5 Small Series 🚀
On March 2, 2026, Alibaba released the Qwen 3.5 Small Model Series, designed for efficient on-device and edge AI deployment. These models feature native multimodality, a 262K-token context window, and an optional “thinking mode” for extended reasoning.
We have benchmarked the entire series on our standard CPU-only server hardware to identify the performance characteristics and “sweet spots” for local deployment.
Test Hardware
| Component | Specification |
|---|---|
| CPU | AMD EPYC-Rome (16 cores / 32 threads @ 2.0 GHz) |
| RAM | 30 GB DDR4 |
| GPU | None (CPU-only) |
| OS | Linux |
| igllama | v0.3.7 |
| llama.cpp | gguf-v0.18.0 |
Performance Summary (CPU-Only)
Benchmarks were run with the `igllama api` command, using the best thread settings found for this CPU (8 threads for generation, 16 for prefill) and with thinking disabled via `--no-think`.
| Model | Weight Class | GGUF Size (Q4_K_XL) | Generation Speed | Notes |
|---|---|---|---|---|
| Qwen3.5-0.8B | Edge | 0.53 GB | 23.01 tok/s | Ultra-fast, ideal for mobile/IoT |
| Qwen3.5-2B | Mobile | 1.28 GB | 18.38 tok/s | High speed, capable reasoning |
| Qwen3.5-4B | Sweet Spot | 2.71 GB | 8.48 tok/s | Recommended for agentic tasks |
| Qwen3.5-9B | Heavy | 5.56 GB | 6.45 tok/s | Best reasoning, vision capable |
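As a rule of thumb, resident memory for a local run is the quantized weight file plus the KV cache, which grows linearly with context length. A minimal sketch of that budget, where the architecture parameters (`n_layers`, `n_kv_heads`, `head_dim`) are hypothetical placeholders rather than the real Qwen3.5-4B config — read the actual values from the GGUF metadata before relying on the result:

```python
# Rough RAM budget: quantized weights + KV cache.
# CAUTION: n_layers / n_kv_heads / head_dim below are guesses, not the
# published Qwen3.5-4B architecture; llama.cpp prints the real values
# from the GGUF metadata at load time.

def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                ctx_len: int, bytes_per_elem: int = 2) -> float:
    """Size of the KV cache: a K and a V tensor per layer, fp16 by default."""
    return 2 * n_layers * n_kv_heads * head_dim * ctx_len * bytes_per_elem / 1e9

weights_gb = 2.71  # Qwen3.5-4B Q4_K_XL, from the table above
cache_gb = kv_cache_gb(n_layers=36, n_kv_heads=8, head_dim=128, ctx_len=8192)
print(f"weights {weights_gb:.2f} GB + KV cache {cache_gb:.2f} GB "
      f"= ~{weights_gb + cache_gb:.2f} GB resident")
```

With these placeholder values the estimate lands in the same ballpark as the ~3.5 GB footprint reported below; the point is the shape of the formula, not the exact number.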
The “Sweet Spot”: Qwen3.5-4B
While the 0.8B and 2B models are substantially faster, Qwen3.5-4B strikes the best balance between reasoning quality and throughput. At 8.48 tok/s it remains comfortably interactive, while offering reasoning that rivals much larger previous-generation models.
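One way to ground “fast enough for interactive use” is to compare generation speed against typical silent reading speed. Both constants below are common rules of thumb, not measurements:

```python
# Is 8.48 tok/s faster than a person reads? Rule-of-thumb constants only.
READING_WPS = 4.0      # ~240 words/min silent reading (assumption)
TOKENS_PER_WORD = 1.3  # typical for English BPE vocabularies (assumption)

reading_tok_s = READING_WPS * TOKENS_PER_WORD
gen_tok_s = 8.48  # Qwen3.5-4B, from the benchmark table
print(f"reader consumes ~{reading_tok_s:.1f} tok/s; "
      f"model produces {gen_tok_s} tok/s")
```

Under these assumptions the 4B model outpaces a reader by a comfortable margin, which is what “interactive” means in practice for streaming output.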
Comparison to Qwen3.5-35B-A3B (MoE)
| Metric | Qwen3.5-4B (Dense) | Qwen3.5-35B-A3B (MoE) |
|---|---|---|
| Active Params | 4B | 3B |
| GGUF Size (Q4) | 2.71 GB | 19.17 GB |
| Generation Speed | 8.48 tok/s | 5.56 tok/s |
| Memory Footprint | ~3.5 GB | ~22 GB |
The dense 4B model is 52% faster than the 35B-A3B MoE on the same hardware, even though it has more active parameters per token (4B vs. 3B). The likely explanation is the memory footprint: the 4B’s ~3.5 GB working set is far friendlier to CPU caches and the memory bus than the MoE’s ~22 GB, where the set of active experts changes from token to token.
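This explanation can be sanity-checked with a back-of-envelope model: if CPU decode is memory-bandwidth-bound, each generated token streams the model weights through the memory bus once, so the measured speed implies an effective weight bandwidth of roughly size × tok/s. Sizes and speeds are taken from the benchmark tables; the memory-bound assumption itself is the hypothesis being eyeballed:

```python
# Implied weight-streaming bandwidth under a memory-bound decode model:
# bandwidth ≈ gguf_size_gb * tokens_per_second.
# Note: the MoE activates only ~3B params per token, but the active
# experts change per token, so its hot set is much larger than its
# active-parameter count suggests.

models = {
    "Qwen3.5-0.8B":     (0.53, 23.01),
    "Qwen3.5-2B":       (1.28, 18.38),
    "Qwen3.5-4B":       (2.71, 8.48),
    "Qwen3.5-9B":       (5.56, 6.45),
    "Qwen3.5-35B-A3B":  (19.17, 5.56),  # MoE: see caveat above
}

for name, (size_gb, tok_s) in models.items():
    print(f"{name}: ~{size_gb * tok_s:.1f} GB/s implied weight traffic")
```

The dense models imply broadly similar effective bandwidths, while the MoE row is not directly comparable because only part of its 19 GB is read per token — consistent with cache behavior, not raw FLOPs, dominating CPU decode.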
Optimal Launch Configuration
For the 4B/9B models on 8-channel memory systems (like EPYC-Rome):
igllama api Qwen3.5-4B-UD-Q4_K_XL.gguf \
--threads 8 \
--threads-batch 16 \
--mlock \
--ctx-size 8192 \
--no-think
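On SMT hardware it can also help to pin the process so the generation threads stay on distinct physical cores. A sketch using Linux `taskset`, assuming the same `igllama` flags as above and that CPUs 0–15 map to the 16 physical cores — core numbering is topology-dependent, so verify with `lscpu -e` first:

```shell
# Pin the server to CPUs 0-15 (assumed to be the physical cores;
# check `lscpu -e` for your topology before copying this).
taskset -c 0-15 igllama api Qwen3.5-4B-UD-Q4_K_XL.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192 \
  --no-think
```

Note that `--mlock` requires a sufficient `memlock` ulimit; if the launch fails, raise it or drop the flag.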
Conclusion
The Qwen 3.5 Small Series proves that you don’t need a flagship GPU to run high-quality LLMs. For CPU-only servers or high-end workstations, the 4B model is the new standard for local AI agents, providing intelligence and speed in a compact package.