Qwen 3.5 Small Series Showcase


The New Qwen 3.5 Small Series 🚀

On March 2, 2026, Alibaba released the Qwen 3.5 Small Model Series, specifically designed for efficient on-device and edge AI applications. These models feature native multimodality, a 262K context window, and advanced “thinking mode” capabilities.

We have benchmarked the entire series on our standard CPU-only server hardware to identify the performance characteristics and “sweet spots” for local deployment.

Test Hardware

| Component | Specification |
|---|---|
| CPU | AMD EPYC-Rome (16 cores / 32 threads @ 2.0 GHz) |
| RAM | 30 GB DDR4 |
| GPU | None (CPU-only) |
| OS | Linux |
| igllama | v0.3.7 |
| llama.cpp | gguf-v0.18.0 |

Performance Summary (CPU-Only)

All benchmarks were run with the igllama api command using the optimal thread settings for this hardware (8 threads for generation, 16 for prefill) and --no-think mode.

| Model | Weight Class | GGUF Size (Q4_K_XL) | Generation Speed | Notes |
|---|---|---|---|---|
| Qwen3.5-0.8B | Edge | 0.53 GB | 23.01 tok/s | Ultra-fast, ideal for mobile/IoT |
| Qwen3.5-2B | Mobile | 1.28 GB | 18.38 tok/s | High speed, capable reasoning |
| Qwen3.5-4B | Sweet Spot | 2.71 GB | 8.48 tok/s | Recommended for agentic tasks |
| Qwen3.5-9B | Heavy | 5.56 GB | 6.45 tok/s | Best reasoning, vision capable |
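CPU token generation is typically memory-bandwidth bound: producing each token requires streaming roughly the full set of quantized weights from RAM. Under that assumption (a back-of-envelope model, not something measured in the benchmark), tok/s × GGUF size approximates the effective bandwidth each model achieves:

```shell
# Rough effective-bandwidth estimate: tok/s * GGUF size (GB) per model,
# assuming one full pass over the weights per generated token.
awk 'BEGIN {
  printf "0.8B: %.1f GB/s\n", 23.01 * 0.53
  printf "2B:   %.1f GB/s\n", 18.38 * 1.28
  printf "4B:   %.1f GB/s\n",  8.48 * 2.71
  printf "9B:   %.1f GB/s\n",  6.45 * 5.56
}'
```

The larger models push noticeably more implied bandwidth than the tiny ones, which is consistent with the bigger weights saturating the memory subsystem while the 0.8B model leaves headroom.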

The “Sweet Spot”: Qwen3.5-4B

While the 0.8B and 2B models are incredibly fast, the Qwen3.5-4B provides the most impressive balance of reasoning density and throughput. At 8.48 tok/s, it is more than fast enough for interactive use while offering reasoning capabilities that rival much larger previous-generation models.
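To get a feel for what those speeds mean in practice, here is a quick latency estimate for a 256-token reply (the token count is an illustrative assumption, not part of the benchmark):

```shell
# Wall-clock time for a 256-token reply at each model's measured speed.
# The 256-token reply length is an assumed, illustrative figure.
awk 'BEGIN {
  printf "4B:   ~%.0f s\n", 256 / 8.48
  printf "2B:   ~%.0f s\n", 256 / 18.38
  printf "0.8B: ~%.0f s\n", 256 / 23.01
}'
```

Roughly half a minute for a full 4B reply is workable for agentic pipelines, while the 2B and 0.8B models bring the same reply down to around 14 s and 11 s.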

Comparison to Qwen3.5-35B-A3B (MoE)

| Metric | Qwen3.5-4B (Dense) | Qwen3.5-35B-A3B (MoE) |
|---|---|---|
| Active Params | 4B | 3B |
| GGUF Size (Q4) | 2.71 GB | 19.17 GB |
| Generation Speed | 8.48 tok/s | 5.56 tok/s |
| Memory Footprint | ~3.5 GB | ~22 GB |

The 4B model is roughly 52% faster than the 35B-A3B MoE on the same hardware, despite having slightly more active parameters (4B dense vs. ~3B active for the MoE). This is likely because its much smaller memory footprint reduces cache misses and memory bus contention.
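The speedup figure follows directly from the two generation speeds in the comparison table:

```shell
# Relative speedup of the dense 4B (8.48 tok/s) over the 35B-A3B MoE (5.56 tok/s)
awk 'BEGIN { printf "%.1f%% faster\n", (8.48 / 5.56 - 1) * 100 }'
```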

Optimal Launch Configuration

For the 4B/9B models on 8-channel memory systems (like EPYC-Rome):

```shell
# 8 generation threads / 16 prefill threads were optimal on this 16-core EPYC;
# --mlock pins the weights in RAM, and --no-think disables thinking mode.
igllama api Qwen3.5-4B-UD-Q4_K_XL.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192 \
  --no-think
```

Conclusion

The Qwen 3.5 Small Series proves that you don’t need a flagship GPU to run high-quality LLMs. For CPU-only servers or high-end workstations, the 4B model is the new standard for local AI agents, providing intelligence and speed in a compact package.