# API Server

igllama provides an OpenAI-compatible REST API server for GGUF models with streaming support. GGUF is the standard model file format used by llama.cpp.
## Quick Start

```bash
# Serve a model on the default host/port (127.0.0.1:8080)
igllama api model.gguf

# Bind to all interfaces on port 3000
igllama api model.gguf --host 0.0.0.0 --port 3000

# Offload 35 layers to the GPU
igllama api model.gguf --gpu-layers 35

# Suppress <think> blocks from reasoning models
igllama api model.gguf --no-think
```
Server Flags
| Flag | Short | Default | Description |
|---|---|---|---|
--model | -m | (required) | Path to GGUF model |
--host | -h | 127.0.0.1 | Server host |
--port | -p | 8080 | Server port |
--ctx-size | -c | 4096 | Context window size |
--max-tokens | -n | 2048 | Max tokens per response |
--gpu-layers | -ngl | 0 | GPU layers (-1 for all) |
--threads | -t | all cores | Generation threads (GEMV, memory-BW bound) |
--threads-batch | -tb | all cores | Prefill threads (GEMM, compute-parallel) |
--mlock | off | Pin model weights in RAM (prevents paging) | |
--no-think | off | Suppress <think> reasoning blocks (Qwen3-style models) | |
--temp | 0.7 | Sampling temperature |
## Thinking Mode (Qwen3 Models)

Qwen3 and similar reasoning models produce `<think>...</think>` blocks before each response. These blocks are useful for complex reasoning tasks but add significant latency and token overhead for everyday use.

Use `--no-think` to suppress them entirely:

```bash
igllama api model.gguf --no-think
```

**How it works:** When `--no-think` is set, igllama pre-fills an empty `<think>\n\n</think>` block on the assistant turn. This is the standard llama.cpp technique for disabling Qwen3-style chain-of-thought: the model sees the thinking phase as already complete and proceeds directly to the answer.

The flag has no effect on models that don't produce thinking blocks.
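The pre-fill idea can be illustrated with a minimal sketch. The template strings below are an illustrative ChatML-style layout, not igllama's actual implementation:

```python
def render_prompt(user_msg: str, no_think: bool) -> str:
    """Render one chat turn; with no_think, pre-fill an empty
    <think> block so the model skips straight to the answer."""
    prompt = (
        f"<|im_start|>user\n{user_msg}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )
    if no_think:
        # The model sees the thinking phase as already complete.
        prompt += "<think>\n\n</think>\n\n"
    return prompt

print(render_prompt("Hello!", no_think=True))
```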
## CPU Performance Tuning

For CPU-only inference, generation speed is memory-bandwidth bound, not compute-bound. Setting more threads than your CPU's memory channel count will hurt performance.

```bash
# Recommended for a 16-core server with 8 memory channels:
igllama api model.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192
```
**Why separate thread counts?**

- `--threads` controls generation (one token at a time = GEMV). Optimal = memory channel count.
- `--threads-batch` controls prefill (processing the prompt = GEMM). Scales with total cores.
- `--mlock` pins the model in physical RAM. Critical on servers where RAM is tight.
See the Benchmark Showcase for measured results on AMD EPYC-Rome hardware.
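The sizing rule above can be sketched as a small helper. This is a hypothetical illustration, not part of igllama:

```python
def suggest_threads(total_cores: int, memory_channels: int) -> dict:
    """Suggest --threads / --threads-batch per the rule above:
    generation threads ~ memory channels, prefill threads ~ all cores."""
    return {
        "threads": min(memory_channels, total_cores),  # GEMV: bandwidth-bound
        "threads_batch": total_cores,                  # GEMM: compute-parallel
    }

# 16-core server with 8 memory channels -> matches the example flags above
print(suggest_threads(16, 8))  # {'threads': 8, 'threads_batch': 16}
```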
## `/v1/chat/completions`

**Method:** POST

**Request:**

```json
{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":false}
```

**Response:**

```json
{"id":"chatcmpl-123","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"Hi!"},"finish_reason":"stop"}]}
```

**curl:**

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hi"}]}'
```
### Streaming

Set `"stream": true` to receive Server-Sent Events (SSE):

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default","messages":[{"role":"user","content":"Hi"}],"stream":true}'
```

Stream format (OpenAI-compatible, v0.3.6+):

```
data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]
```
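A minimal stdlib-only sketch of parsing this chunk format, extracting the text delta from each `data:` line:

```python
import json

def parse_sse_line(line: str) -> "str | None":
    """Extract the delta text from one 'data: ...' stream line.
    Returns None for [DONE], empty deltas, and non-data lines."""
    if not line.startswith("data: "):
        return None
    payload = line[len("data: "):]
    if payload == "[DONE]":
        return None
    delta = json.loads(payload)["choices"][0]["delta"]
    return delta.get("content") or None

chunk = ('data: {"id":"chatcmpl-123","object":"chat.completion.chunk",'
         '"created":1234567890,"model":"model.gguf","choices":[{"index":0,'
         '"delta":{"content":"Hello"},"finish_reason":null}]}')
print(parse_sse_line(chunk))  # Hello
```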
## `/v1/embeddings`

**Method:** POST

**Request:**

```json
{"model":"default","input":["Hello world"]}
```

**Response:**

```json
{"object":"list","data":[{"object":"embedding","index":0,"embedding":[0.123,-0.456]}],"usage":{"prompt_tokens":2,"total_tokens":2}}
```

**curl:**

```bash
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{"model":"default","input":"Hello"}'
```
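Embedding vectors are typically compared with cosine similarity. A small client-side helper sketch (the function name is mine, not part of the API):

```python
import math

def cosine_similarity(a: list, b: list) -> float:
    """Cosine similarity between two embedding vectors from /v1/embeddings."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Identical directions score 1.0; orthogonal directions score 0.0
print(round(cosine_similarity([1.0, 0.0], [1.0, 0.0]), 3))  # 1.0
print(round(cosine_similarity([1.0, 0.0], [0.0, 1.0]), 3))  # 0.0
```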
## `/health`

**Method:** GET

**Response:**

```json
{"status":"ok","model":"loaded"}
```

```bash
curl http://localhost:8080/health
```
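A startup script can poll `/health` until the model is loaded. A sketch using `requests`; the retry interval and timeout are arbitrary choices, not igllama defaults:

```python
import time
import requests

def wait_until_ready(base_url: str, timeout_s: float = 60.0) -> bool:
    """Poll GET /health until {"status": "ok"} or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            resp = requests.get(f"{base_url}/health", timeout=2)
            if resp.json().get("status") == "ok":
                return True
        except requests.RequestException:
            pass  # server not accepting connections yet
        time.sleep(1)
    return False
```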
## Error Handling
| Status | Description |
|---|---|
| 200 | Success |
| 400 | Invalid request |
| 404 | Not found |
| 500 | Server error |
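A client can map these status codes to exceptions. The class and function names below are hypothetical, for illustration only:

```python
class APIError(Exception):
    """Raised for the non-2xx status codes in the table above."""
    def __init__(self, status: int, message: str):
        super().__init__(f"{status}: {message}")
        self.status = status

def check_status(status: int, body: str = "") -> None:
    """Raise APIError for error responses; return None on success."""
    messages = {400: "Invalid request", 404: "Not found", 500: "Server error"}
    if status in messages:
        raise APIError(status, body or messages[status])

check_status(200)  # success: no exception raised
```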
## CORS

The server returns permissive CORS headers, so browser clients can call it directly:

```
Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS
```
## Examples

### Python (Non-streaming)

```python
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)
print(response.json()["choices"][0]["message"]["content"])
```
### Python (Streaming)

```python
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": True},
    stream=True
)
for line in response.iter_lines():
    if line:
        print(line.decode())
```
### JavaScript (Streaming)

```javascript
const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'default', messages: [{ role: 'user', content: 'Hi' }], stream: true })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value).split('\n')) {
    if (line.startsWith('data: ') && !line.includes('[DONE]')) {
      console.log(JSON.parse(line.slice(6)).choices[0].delta.content);
    }
  }
}
```
## Model Compatibility

The API server works with any GGUF model file. GGUF is the standard format for llama.cpp-compatible models, including the Llama, Mistral, Qwen, Phi, and Gemma families.

For best results, use models with chat templates for multi-turn conversations.
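Multi-turn context is carried entirely in the `messages` array: the client appends each assistant reply before sending the next user message. A sketch (the helper name is mine):

```python
def append_turn(messages: list, assistant_reply: str, next_user_msg: str) -> list:
    """Extend a /v1/chat/completions messages array with the previous
    assistant reply and the next user message."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": next_user_msg},
    ]

history = [{"role": "user", "content": "Hello!"}]
history = append_turn(history, "Hi! How can I help?", "Tell me about GGUF.")
print([m["role"] for m in history])  # ['user', 'assistant', 'user']
```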