Benchmark Showcase

Real-World Performance 📈

igllama is optimized for high-performance inference on CPU-only systems. By combining Zig’s efficiency with llama.cpp’s proven inference core, we deliver a responsive experience even without a dedicated GPU.

Our showcase features in-depth case studies and benchmarks for the latest models in the ecosystem.

Qwen 3.5 Small Series Benchmarks

On March 2, 2026, Alibaba released the Qwen 3.5 Small Model Series (0.8B, 2B, 4B, 9B). We’ve benchmarked the entire family to find the “sweet spot” for reasoning-heavy tasks on mid-range hardware.

Highlights:

  • 0.8B: 23.01 tok/s — Ultra-fast edge AI.
  • 4B: 8.48 tok/s — The “sweet spot” for agentic reasoning.
  • 9B: 6.45 tok/s — Maximum intelligence for consumer CPUs.
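The spread across the family can be put in perspective with a quick back-of-the-envelope ratio, using the tok/s figures from the highlights above (the comparison itself is illustrative, not an additional measurement):

```shell
# Relative throughput of the 0.8B model vs. the 9B model,
# using the highlight figures above (23.01 and 6.45 tok/s).
awk 'BEGIN { printf "0.8B is %.1fx faster than 9B\n", 23.01 / 6.45 }'
```

So the smallest model trades roughly 3.6x the speed against the 9B’s stronger reasoning, which is why the 4B lands in the middle as the “sweet spot.”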

Qwen 3.5 35B MoE Showcase

A deep dive into running the 35B Mixture-of-Experts (MoE) model on a 16-core CPU server. This showcase details the thread-count tuning that yielded a 52% performance increase.

Highlights:

  • 5.56 tok/s on a 16-core AMD EPYC (Rome) server.
  • 8 threads identified as the memory-bandwidth optimum.
  • Detailed analysis of memory channel alignment.
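The thread-count tuning above amounts to a sweep: benchmark at several thread counts and keep the fastest. A minimal sketch of the selection step, with sample throughput values (only the 8-thread figure, 5.56 tok/s, comes from this showcase; the others are placeholders standing in for real timed runs):

```shell
# Pick the thread count with the highest measured throughput.
# Each "threads tok/s" pair would come from a timed benchmark run
# at that thread count; the non-8-thread values here are hypothetical.
best_t=0
best_tps=0
for pair in "4 3.90" "8 5.56" "12 5.10" "16 4.70"; do
  t=${pair%% *}
  tps=${pair##* }
  # Float comparison via awk: exit 0 (success) when tps > best_tps.
  if awk -v a="$tps" -v b="$best_tps" 'BEGIN { exit !(a > b) }'; then
    best_t=$t
    best_tps=$tps
  fi
done
echo "optimal threads: $best_t ($best_tps tok/s)"
```

Note that the optimum sits at half the core count: once memory bandwidth saturates, adding threads only adds contention.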

Performance Principles

All our benchmarks are conducted under the same core principles:

  1. CPU-Only: No GPU acceleration used.
  2. Standard Quants: Using Q4_K_XL or Q6_K GGUFs.
  3. Reproducible: Full CLI commands and flags provided.
  4. Transparent: Detailed hardware specs for every run.
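For reference, the throughput figure in every run is simply generated tokens divided by wall-clock seconds. A worked example (the token count and timing here are hypothetical, chosen to illustrate how a rate like the 4B model’s 8.48 tok/s is derived):

```shell
# tok/s = tokens generated / wall-clock seconds.
# e.g. 256 tokens generated in 30.2 seconds:
awk 'BEGIN { printf "%.2f tok/s\n", 256 / 30.2 }'
```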