Qwen 3.5 35B MoE Showcase


Why Qwen3.5 MoE Excels on CPU-Only Systems

Qwen3.5-35B-A3B is a Mixture of Experts (MoE) model with 35 billion total parameters, but only 3 billion active parameters per inference step. This sparse activation pattern means your CPU processes roughly the same FLOPs as a 3B dense model, but draws on 35B of knowledge. For CPU-only deployments with plenty of RAM, this architecture is uniquely efficient.
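The efficiency claim is simple parameter arithmetic. A quick sanity check, using only the counts from the model name:

```shell
# Per-token compute scales with active parameters, so relative to a
# hypothetical dense 35B model, the MoE does roughly 3/35 of the work.
awk 'BEGIN { total = 35; active = 3;
             printf "dense-equivalent compute saving: %.1fx\n", total / active }'
```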

The benchmarks below are real measurements on actual server hardware — not synthetic tests.

Test Hardware

Component    Specification
---------    -------------
CPU          AMD EPYC-Rome (16 cores / 32 threads @ 2.0 GHz)
RAM          30 GB DDR4
Storage      SSD
GPU          None (CPU-only)
OS           Linux
igllama      v0.3.7
llama.cpp    gguf-v0.18.0
Model        Qwen3.5-35B-A3B UD-Q4_K_XL (19.17 GB)

Thread Count Optimization

Generation throughput on CPU is memory-bandwidth bound (GEMV operations), not compute bound. Adding more threads past your CPU’s memory channel count adds contention without increasing speed. This EPYC-Rome chip has 8 memory channels, which matches the measured optimum.
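A rough roofline sketch makes the bandwidth bound concrete: generation speed is capped at memory bandwidth divided by bytes read per token. The 40 GB/s bandwidth figure below is an assumed value for a shared cloud VM, and the bytes-per-token estimate assumes weight reads scale with the active-parameter share of the 19.17 GB file; both are illustrative assumptions, not measurements.

```shell
# Roofline: tok/s ceiling = bandwidth / bytes read per token.
# bytes/token ~ file size scaled by active/total params (3B of 35B).
awk 'BEGIN { bw_gbs = 40; bytes_gb = 19.17 * 3 / 35;
             printf "ceiling at %.0f GB/s: ~%.0f tok/s\n", bw_gbs, bw_gbs / bytes_gb }'
```

The measured 5.56 tok/s sitting well below such a ceiling is consistent with a bandwidth-bound workload once routing overhead and VM bandwidth contention are factored in.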

Generation throughput vs thread count (--threads-batch 16 fixed):

 2 threads: ██████████░░░░░░░░░░░░░░░░░░░░  2.14 tok/s
 4 threads: █████████████████████░░░░░░░░░  4.40 tok/s
 6 threads: ███████████████████░░░░░░░░░░░  3.95 tok/s
 8 threads: ████████████████████████████░░  5.56 tok/s  ← optimal
12 threads: ██████████████████████░░░░░░░░  4.65 tok/s
16 threads: █████████████░░░░░░░░░░░░░░░░░  2.86 tok/s
Threads   Generation (tok/s)   vs Default (4t)
-------   ------------------   ---------------
2         2.14                 -51%
4         4.40                 baseline
6         3.95                 -10%
8         5.56                 +26%
12        4.65                 +6%
16        2.86                 -35%

The previous igllama default capped generation at 4 threads regardless of hardware. The new --threads and --threads-batch flags (v0.3.4) let you tune this independently for your system.
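Once you have run a sweep like this on your own hardware, picking the winner is a one-liner; here the measured pairs from the table above are piped through awk to select the best row:

```shell
# Select the thread count with the highest measured generation throughput.
printf '%s\n' '2 2.14' '4 4.40' '6 3.95' '8 5.56' '12 4.65' '16 2.86' |
awk '$2 > best { best = $2; t = $1 }
     END { printf "optimal: %s threads (%.2f tok/s)\n", t, best }'
```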

Prefill vs Generation

These are fundamentally different compute patterns:

Phase                          Operation                Bottleneck          Optimal threads
-----                          ---------                ----------          ---------------
Prefill (prompt processing)    GEMM (matrix multiply)   Compute-parallel    All cores (16)
Generation (token-by-token)    GEMV (matrix-vector)     Memory bandwidth    Memory channels (8)

Use --threads for generation and --threads-batch for prefill independently:

igllama api model.gguf --threads 8 --threads-batch 16

Optimal Server Launch Command

igllama api Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192 \
  --no-think

Flags explained:

  • --threads 8 — matches EPYC-Rome’s 8 memory channels for generation
  • --threads-batch 16 — all cores for parallel prompt processing
  • --mlock — pins model weights in physical RAM so the OS can't page them to disk (critical with a 19 GB model)
  • --ctx-size 8192 — larger context than the default 4096
  • --no-think — suppresses Qwen3.5’s <think> reasoning blocks for faster, cleaner responses

Quantization Options (28 GB RAM Budget)

Quantization   File Size   Quality      Tok/s (est.)   Fits in RAM?
------------   ---------   -------      ------------   ------------
UD-IQ2_XXS     9.8 GB      Low          ~7.0           Yes
UD-Q2_K_XL     12.9 GB     Medium-Low   ~6.0           Yes
UD-Q3_K_M      16.7 GB     Medium       ~5.8           Yes
UD-Q3_K_XL     17.2 GB     Medium       ~5.7           Yes
UD-Q4_K_XL     19.2 GB     High         ~5.6           Yes
UD-Q4_K_M      19.9 GB     High         ~5.5           Yes
UD-Q5_K_XL     24.9 GB     Very High    ~5.0           Yes

The UD- prefix means Unsloth Dynamic 2.0 — important layers are selectively upcast for better quality. UD-Q4_K_XL (used in these benchmarks) is the recommended sweet spot for 28–30 GB RAM systems.
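A quick fit check for the full list: each file size plus an assumed ~3 GB of KV-cache and runtime overhead (extrapolated from the ~22 GB total observed with the 19.2 GB quant) against the 28 GB budget:

```shell
# Model file size + ~3 GB assumed overhead vs. a 28 GB RAM budget.
printf '%s\n' 'UD-IQ2_XXS 9.8' 'UD-Q2_K_XL 12.9' 'UD-Q3_K_M 16.7' \
              'UD-Q3_K_XL 17.2' 'UD-Q4_K_XL 19.2' 'UD-Q4_K_M 19.9' 'UD-Q5_K_XL 24.9' |
awk '{ need = $2 + 3;
       printf "%-11s ~%.1f GB total: %s\n", $1, need, (need <= 28 ? "fits" : "tight") }'
```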

Qwen3.5 Model Family Comparison

Model                      Total Params   Active Params   CPU Suitability
-----                      ------------   -------------   ---------------
Qwen3.5-27B (dense)        27B            27B             Poor — 9× more compute per token
Qwen3.5-35B-A3B (MoE)      35B            3B              Excellent — best CPU choice
Qwen3.5-122B-A10B (MoE)    122B           10B             OK if RAM ≥ 70 GB
Qwen3.5-397B-A17B (MoE)    397B           17B             Impractical on CPU

For CPU-only systems, the 35B-A3B is the clear winner: dense 27B would need 9× more compute per token and still deliver less knowledge.
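The 9× figure follows directly from the active-parameter counts; the same normalization can be run across the whole family (model names compacted so awk can split on whitespace):

```shell
# Relative per-token compute across the family, normalized to 3B active.
printf '%s\n' 'Qwen3.5-27B-dense 27' 'Qwen3.5-35B-A3B 3' \
              'Qwen3.5-122B-A10B 10' 'Qwen3.5-397B-A17B 17' |
awk '{ printf "%-18s %4.1fx\n", $1, $2 / 3 }'
```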

Performance Summary

Metric                                  Value
------                                  -----
Generation speed                        5.56 tok/s @ 8 threads
Prefill speed                           ~16 tok/s @ 16 threads
Model size (UD-Q4_K_XL)                 19.17 GB
RAM usage (with 8K KV cache)            ~22 GB
Time to first token (short prompt)      ~0.5 s
Time to first token (3K-token prompt)   ~12 s
Context window                          8192 (default server config)

Thinking Mode: Why You Usually Want It Off

Qwen3.5 models produce <think>...</think> blocks before each response — an internal chain-of-thought that can be hundreds to thousands of tokens long. On CPU hardware this adds tens of seconds of latency before the actual answer appears.

Use --no-think to skip the reasoning phase entirely:

# With thinking (default): ~55s for 200-token think block + actual response
igllama api model.gguf

# Without thinking: response starts immediately
igllama api model.gguf --no-think

igllama implements this by pre-filling an empty <think>\n\n</think> block on the assistant turn — the same technique used by llama.cpp-based tooling. The model treats thinking as already complete and jumps straight to the answer.
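A minimal sketch of what the pre-filled prompt looks like, assuming the ChatML-style template Qwen models use (the exact special tokens may differ by template version):

```shell
# The assistant turn is pre-filled with an empty think block, so the model
# continues directly with the answer instead of opening its own <think>.
printf '<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n<think>\n\n</think>\n'
```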

Thinking mode remains useful for complex reasoning tasks (math, coding puzzles). For general-purpose API use with tools like Forge or opencode, --no-think is recommended.

mlock: Why It Matters

With a 19 GB model and only 28-30 GB of RAM, the OS may begin swapping model weights to disk under memory pressure. --mlock (new in v0.3.4) pins the model in physical RAM:

# Without mlock: potential paging during inference → 10–100× slower spikes
# With mlock: consistent throughput, no swap latency
igllama api model.gguf --mlock

Note: mlock requires sufficient free RAM and, on some systems, a raised memlock limit (ulimit -l). If your system has less free RAM than the model size, omit this flag.
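A pre-flight check before enabling the flag (a sketch for Linux; the 20 GB threshold is an assumption, the 19.17 GB model rounded up for headroom):

```shell
# Compare available RAM (GB) against the model size before using --mlock.
model_gb=20   # 19.17 GB model, rounded up for headroom (assumed threshold)
avail_gb=$(free -g | awk '/^Mem:/ { print $7 }')   # $7 = "available" column
if [ "$avail_gb" -ge "$model_gb" ]; then
  echo "enough free RAM: add --mlock"
else
  echo "only ${avail_gb} GB available: omit --mlock"
fi
```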