# Qwen3.5 35B MoE Showcase
## Why Qwen3.5 MoE Excels on CPU-Only Systems
Qwen3.5-35B-A3B is a Mixture of Experts (MoE) model with 35 billion total parameters but only 3 billion active parameters per token. Thanks to this sparse activation, your CPU performs roughly the same FLOPs per token as a 3B dense model while drawing on knowledge stored across all 35B parameters. For CPU-only deployments with plenty of RAM, this architecture is uniquely efficient: RAM holds the full weights cheaply, while compute (the scarce resource on CPU) scales only with the active 3B.
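As a back-of-the-envelope check (assuming per-token compute scales roughly linearly with active parameter count), the ratios work out as:

```shell
# Rough per-token compute comparison, assuming FLOPs scale with active params.
awk 'BEGIN {
  total = 35; active = 3          # Qwen3.5-35B-A3B: billions of parameters
  printf "active fraction: %.1f%%\n", 100 * active / total
  printf "compute vs a dense 35B model: %.1fx less\n", total / active
}'
```

Only about 8.6% of the weights participate in any given token, which is why the model generates at roughly dense-3B speed.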
The benchmarks below are real measurements taken on the server hardware described in the next section, not synthetic estimates.
## Test Hardware
| Component | Specification |
|---|---|
| CPU | AMD EPYC-Rome (16 cores / 32 threads @ 2.0 GHz) |
| RAM | 30 GB DDR4 |
| Storage | SSD |
| GPU | None (CPU-only) |
| OS | Linux |
| igllama | v0.3.7 |
| llama.cpp | gguf-v0.18.0 |
| Model | Qwen3.5-35B-A3B UD-Q4_K_XL (19.17 GB) |
## Thread Count Optimization
Generation throughput on CPU is memory-bandwidth bound (GEMV operations), not compute bound. Adding more threads past your CPU’s memory channel count adds contention without increasing speed. This EPYC-Rome chip has 8 memory channels, which matches the measured optimum.
Generation throughput vs thread count (--threads-batch 16 fixed):

```
 2 threads: ██████████░░░░░░░░░░░░░░░░░░░░ 2.14 tok/s
 4 threads: █████████████████████░░░░░░░░░ 4.40 tok/s
 6 threads: ███████████████████░░░░░░░░░░░ 3.95 tok/s
 8 threads: ████████████████████████████░░ 5.56 tok/s ← optimal
12 threads: ██████████████████████░░░░░░░░ 4.65 tok/s
16 threads: █████████████░░░░░░░░░░░░░░░░░ 2.86 tok/s
```
| Threads | Generation (tok/s) | vs Default (4t) |
|---|---|---|
| 2 | 2.14 | -51% |
| 4 | 4.40 | baseline |
| 6 | 3.95 | -10% |
| 8 | 5.56 | +26% |
| 12 | 4.65 | +6% |
| 16 | 2.86 | -35% |
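The "vs Default" column is simply each measurement relative to the 4-thread baseline of 4.40 tok/s; a shell one-liner reproduces it:

```shell
# Percent change of each measured generation speed vs the 4-thread baseline.
for tps in 2.14 4.40 3.95 5.56 4.65 2.86; do
  awk -v x="$tps" 'BEGIN { printf "%5.2f tok/s -> %+.0f%%\n", x, 100 * (x / 4.40 - 1) }'
done
```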
The previous igllama default capped generation at 4 threads regardless of hardware. The new --threads and --threads-batch flags (v0.3.4) let you tune this independently for your system.
## Prefill vs Generation
These are fundamentally different compute patterns:
| Phase | Operation | Bottleneck | Optimal threads |
|---|---|---|---|
| Prefill (prompt processing) | GEMM (matrix–matrix multiply) | Compute (parallelizes across cores) | All cores (16) |
| Generation (token-by-token) | GEMV (matrix–vector multiply) | Memory bandwidth | Memory channels (8) |
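A rough sanity check of the memory-bandwidth claim (assuming roughly 4.5 bits per weight for the Q4_K_XL quant, an approximation, so about 1.7 GB of active weights must stream from RAM per generated token):

```shell
# Implied effective memory bandwidth from the measured generation speed.
awk 'BEGIN {
  active_params   = 3.0e9   # active parameters per token
  bits_per_weight = 4.5     # rough Q4_K_XL average (assumption)
  gb_per_token    = active_params * bits_per_weight / 8 / 1e9
  tok_s           = 5.56    # measured at 8 threads
  printf "~%.1f GB read per token -> ~%.1f GB/s effective bandwidth\n",
         gb_per_token, gb_per_token * tok_s
}'
```

The implied effective bandwidth is what GEMV sustains on this (virtualized) EPYC-Rome host; adding threads beyond the memory channels cannot raise it, which matches the 8-thread optimum above.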
Use --threads for generation and --threads-batch for prefill independently:
```shell
igllama api model.gguf --threads 8 --threads-batch 16
```
## Optimal Server Launch Command
```shell
igllama api Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192 \
  --no-think
```
Flags explained:
- `--threads 8`: matches EPYC-Rome's 8 memory channels for generation
- `--threads-batch 16`: uses all 16 cores for parallel prompt processing
- `--mlock`: pins model weights in physical RAM so the OS cannot page them out (critical with a 19 GB model)
- `--ctx-size 8192`: doubles the default 4096-token context
- `--no-think`: suppresses Qwen3.5's `<think>` reasoning blocks for faster, cleaner responses
## Quantization Options (28 GB RAM Budget)
| Quantization | File Size | Quality | Tok/s (est.) | Fits in RAM? |
|---|---|---|---|---|
| UD-IQ2_XXS | 9.8 GB | Low | ~7.0 | Yes |
| UD-Q2_K_XL | 12.9 GB | Medium-Low | ~6.0 | Yes |
| UD-Q3_K_M | 16.7 GB | Medium | ~5.8 | Yes |
| UD-Q3_K_XL | 17.2 GB | Medium | ~5.7 | Yes |
| UD-Q4_K_XL | 19.2 GB | High | ~5.6 | Yes |
| UD-Q4_K_M | 19.9 GB | High | ~5.5 | Yes |
| UD-Q5_K_XL | 24.9 GB | Very High | ~5.0 | Yes |
The UD- prefix denotes Unsloth Dynamic 2.0 quantization, in which important layers are selectively upcast for better quality. UD-Q4_K_XL (used in these benchmarks) is the recommended sweet spot for 28–30 GB RAM systems.
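A rough fit check for the table above, assuming ~3 GB of overhead for KV cache and runtime on top of the file size (consistent with the ~22 GB total RAM usage reported below for the 19.17 GB quant):

```shell
# Headroom check: file size + ~3 GB assumed overhead vs a 28 GB budget.
for q in "UD-IQ2_XXS 9.8" "UD-Q2_K_XL 12.9" "UD-Q3_K_M 16.7" "UD-Q3_K_XL 17.2" \
         "UD-Q4_K_XL 19.2" "UD-Q4_K_M 19.9" "UD-Q5_K_XL 24.9"; do
  set -- $q
  awk -v name="$1" -v gb="$2" 'BEGIN {
    need = gb + 3.0
    printf "%-11s needs ~%4.1f GB -> %s\n", name, need, (need <= 28 ? "fits" : "too big")
  }'
done
```

Note that UD-Q5_K_XL fits only barely (~27.9 GB of 28 GB); leave it out if anything else runs on the box.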
## Qwen3.5 Model Family Comparison
| Model | Total Params | Active Params | CPU Suitability |
|---|---|---|---|
| Qwen3.5-27B (dense) | 27B | 27B | Poor — 9× more compute per token |
| Qwen3.5-35B-A3B (MoE) | 35B | 3B | Excellent — best CPU choice |
| Qwen3.5-122B-A10B (MoE) | 122B | 10B | OK if RAM ≥ 70 GB |
| Qwen3.5-397B-A17B (MoE) | 397B | 17B | Impractical on CPU |
For CPU-only systems, the 35B-A3B is the clear winner: dense 27B would need 9× more compute per token and still deliver less knowledge.
## Performance Summary
| Metric | Value |
|---|---|
| Generation speed | 5.56 tok/s @ 8 threads |
| Prefill speed | ~16 tok/s @ 16 threads |
| Model size (UD-Q4_K_XL) | 19.17 GB |
| RAM usage (with KV cache 8K) | ~22 GB |
| Time to first token (short prompt) | ~0.5 s |
| Time to first token (3K token prompt) | ~12 s |
| Context window | 8192 (default server config) |
## Thinking Mode: Why You Usually Want It Off
Qwen3.5 models produce `<think>...</think>` blocks before each response: an internal chain of thought that can run from hundreds to thousands of tokens. On CPU hardware this adds tens of seconds of latency before the actual answer appears.
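The cost is easy to estimate from the measured generation speed (assuming think tokens generate at the same ~5.56 tok/s as answer tokens):

```shell
# Estimated extra latency added by a think block at the measured 5.56 tok/s.
for think_tokens in 200 500 1000; do
  awk -v n="$think_tokens" 'BEGIN { printf "%4d think tokens -> +%.0f s before the answer\n", n, n / 5.56 }'
done
```

A 200-token think block alone adds about 36 s, which lines up with the ~55 s total mentioned below once the answer itself is included.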
Use --no-think to skip the reasoning phase entirely:
```shell
# With thinking (default): ~55 s for a 200-token think block plus the actual response
igllama api model.gguf

# Without thinking: the response starts immediately
igllama api model.gguf --no-think
```
igllama implements this by pre-filling an empty `<think>\n\n</think>` block on the assistant turn, the same technique used by other llama.cpp-based tooling. The model treats the thinking phase as already complete and jumps straight to the answer.
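Concretely, the rendered assistant turn then begins with an already-closed think block. The exact special tokens depend on the model's chat template; this sketch assumes Qwen's ChatML-style template:

```
<|im_start|>assistant
<think>

</think>
```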
Thinking mode remains useful for complex reasoning tasks (math, coding puzzles). For general-purpose API use with tools like Forge or opencode, --no-think is recommended.
## mlock: Why It Matters
With a 19 GB model and only 28-30 GB of RAM, the OS may begin swapping model weights to disk under memory pressure. --mlock (new in v0.3.4) pins the model in physical RAM:
```shell
# Without mlock: potential paging during inference -> 10–100× slower spikes
# With mlock: consistent throughput, no swap latency
igllama api model.gguf --mlock
```
Note: mlock requires sufficient free RAM. If your system has less free RAM than the model size, omit this flag.
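A quick Linux-only pre-flight check before enabling the flag, reading MemAvailable from /proc/meminfo (the 20 GB threshold simply rounds the 19.17 GB file up):

```shell
# Only pass --mlock when MemAvailable comfortably exceeds the model size.
model_gb=20   # 19.17 GB file, rounded up for a small safety margin
avail_gb=$(awk '/MemAvailable/ { printf "%d", $2 / 1048576 }' /proc/meminfo)
if [ "$avail_gb" -ge "$model_gb" ]; then
  echo "enough free RAM (${avail_gb} GB): safe to use --mlock"
else
  echo "only ${avail_gb} GB available: start without --mlock"
fi
```

Pinning can also be blocked by RLIMIT_MEMLOCK rather than a lack of free RAM; if locking fails despite sufficient memory, check `ulimit -l`.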