API Server

igllama provides an OpenAI-compatible REST API server for GGUF models with streaming support.

GGUF is the standard model file format for llama.cpp; the "GG" comes from the initials of llama.cpp's author, Georgi Gerganov.

Quick Start

igllama api model.gguf
igllama api model.gguf --host 0.0.0.0 --port 3000
igllama api model.gguf --gpu-layers 35
igllama api model.gguf --no-think

Server Flags

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --model | -m | (required) | Path to GGUF model |
| --host | -h | 127.0.0.1 | Server host |
| --port | -p | 8080 | Server port |
| --ctx-size | -c | 4096 | Context window size |
| --max-tokens | -n | 2048 | Max tokens per response |
| --gpu-layers | -ngl | 0 | GPU layers (-1 for all) |
| --threads | -t | all cores | Generation threads (GEMV, memory-bandwidth bound) |
| --threads-batch | -tb | all cores | Prefill threads (GEMM, compute-parallel) |
| --mlock | | off | Pin model weights in RAM (prevents paging) |
| --no-think | | off | Suppress <think> reasoning blocks (Qwen3-style models) |
| --temp | | 0.7 | Sampling temperature |

Thinking Mode (Qwen3 Models)

Qwen3 and similar reasoning models emit <think>...</think> blocks before each response. These are useful for complex reasoning tasks but add significant latency and token overhead in everyday use.

Use --no-think to suppress them entirely:

igllama api model.gguf --no-think

How it works: When --no-think is set, igllama pre-fills an empty <think>\n\n</think> block on the assistant turn. This is the standard llama.cpp technique for disabling Qwen3-style chain-of-thought — the model sees the thinking phase as already complete and proceeds directly to the answer.
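The prefill technique can be illustrated with a minimal sketch. The template below is a simplified stand-in for a real Qwen3 chat template, not igllama's actual implementation:

```python
def render_prompt(messages, no_think=False):
    """Render a simplified Qwen3-style chat prompt.

    When no_think is set, the assistant turn opens with an empty
    <think> block, so the model treats the reasoning phase as
    already complete and answers directly.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    if no_think:
        # Pre-filled empty reasoning block (the --no-think technique).
        parts.append("<think>\n\n</think>\n")
    return "".join(parts)

prompt = render_prompt([{"role": "user", "content": "Hi"}], no_think=True)
```

Because the block is injected into the prompt rather than filtered from the output, no tokens are spent generating (or suppressing) the reasoning text.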

The flag has no effect on models that don’t produce thinking blocks.

CPU Performance Tuning

For CPU-only inference, generation speed is memory-bandwidth bound, not compute-bound. Setting more threads than your CPU’s memory channel count will hurt performance.

# Recommended for a 16-core server with 8 memory channels:
igllama api model.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192

Why separate thread counts?

  • --threads controls generation (one token at a time = GEMV). Optimal = memory channel count.
  • --threads-batch controls prefill (processing the prompt = GEMM). Scales with total cores.
  • --mlock pins the model in physical RAM. Critical on servers where RAM is tight.
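The rule of thumb above can be sketched as a small helper. The memory-channel count is hardware-specific and must be supplied by you; 8 channels below is only an example, and the function itself is illustrative, not part of igllama:

```python
import os

def suggest_threads(memory_channels, total_cores=None):
    """Suggest --threads / --threads-batch per the tuning rule above:
    generation threads match memory channels, prefill uses all cores."""
    total_cores = total_cores or os.cpu_count() or 1
    return {
        "threads": min(memory_channels, total_cores),  # GEMV: memory-BW bound
        "threads_batch": total_cores,                  # GEMM: compute-parallel
    }

print(suggest_threads(8, total_cores=16))
# → {'threads': 8, 'threads_batch': 16}
```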

See the Benchmark Showcase for measured results on AMD EPYC-Rome hardware.

/v1/chat/completions

Method: POST

Request:

{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":false}

Response:

{"id":"chatcmpl-123","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"Hi!"},"finish_reason":"stop"}]}

curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hi"}]}'

Streaming

Set stream: true for SSE streaming:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hi"}],"stream":true}'

Stream format (OpenAI-compatible, v0.3.6+):

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

/v1/embeddings

Method: POST

Request:

{"model":"default","input":["Hello world"]}

Response:

{"object":"list","data":[{"object":"embedding","index":0,"embedding":[0.123,-0.456]}],"usage":{"prompt_tokens":2,"total_tokens":2}}

curl:

curl http://localhost:8080/v1/embeddings -H "Content-Type: application/json" -d '{"model":"default","input":"Hello"}'
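Embedding vectors are typically compared with cosine similarity. A small self-contained helper (not part of igllama's API) that you can point at the vectors returned above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In practice the vectors come from POST /v1/embeddings:
#   resp["data"][0]["embedding"], resp["data"][1]["embedding"]
print(cosine([0.123, -0.456], [0.123, -0.456]))  # ≈ 1.0 (identical vectors)
```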

/health

Method: GET

Response: {"status":"ok","model":"loaded"}

curl http://localhost:8080/health
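When scripting against the server, it helps to wait for the model to finish loading before sending requests. A sketch with the HTTP call injected as a callable so any client library works (the helper is ours, not part of igllama):

```python
import time

def wait_until_ready(fetch_health, timeout=30.0, interval=0.5):
    """Poll /health until status is "ok", or give up after timeout seconds.

    fetch_health is any callable returning the parsed JSON body, e.g.
      lambda: requests.get("http://localhost:8080/health").json()
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch_health().get("status") == "ok":
                return True
        except Exception:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False
```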

Error Handling

| Status | Description |
|--------|-------------|
| 200 | Success |
| 400 | Invalid request |
| 404 | Not found |
| 500 | Server error |
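A minimal client-side helper reflecting the status table above. The APIError class and message mapping are illustrative, not something igllama ships:

```python
class APIError(Exception):
    """Raised for non-success responses from the igllama server."""

def check_response(status, body):
    """Return the body on success; raise APIError per the status table."""
    if status == 200:
        return body
    messages = {400: "Invalid request", 404: "Not found", 500: "Server error"}
    raise APIError(f"{status}: {messages.get(status, 'Unexpected status')}")

# Usage with requests:
#   resp = requests.post(url, json=payload)
#   data = check_response(resp.status_code, resp.json())
```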

CORS

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS

Examples

Python (Non-streaming)

import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)

print(response.json()["choices"][0]["message"]["content"])

Python (Streaming)

import json
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": True},
    stream=True,
)

for line in response.iter_lines():
    if not line:
        continue
    payload = line.decode()
    if not payload.startswith("data: "):
        continue
    payload = payload[len("data: "):]
    if payload == "[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)

JavaScript (Streaming)

const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'default', messages: [{ role: 'user', content: 'Hi' }], stream: true })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

// Note: for brevity this assumes each chunk contains whole lines;
// a robust client buffers partial lines across reads.
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (line.startsWith('data: ') && !line.includes('[DONE]')) {
      const delta = JSON.parse(line.slice(6)).choices[0].delta;
      if (delta.content) console.log(delta.content);
    }
  }
}

Model Compatibility

The API server works with any GGUF model file. GGUF is the standard format for llama.cpp-compatible models, including the Llama, Mistral, Qwen, Phi, and Gemma families.

For best results, use models with chat templates for multi-turn conversations.