API Server

igllama provides an OpenAI-compatible REST API server for GGUF models with streaming support.

GGUF is the standard model file format for llama.cpp; the "GG" comes from the initials of llama.cpp's author, Georgi Gerganov.

Quick Start

igllama api model.gguf
igllama api model.gguf --host 0.0.0.0 --port 3000
igllama api model.gguf --gpu-layers 35
igllama api model.gguf --no-think

Server Flags

| Flag | Short | Default | Description |
|------|-------|---------|-------------|
| --model | -m | (required) | Path to GGUF model |
| --host | -h | 127.0.0.1 | Server host |
| --port | -p | 8080 | Server port |
| --ctx-size | -c | 4096 | Context window size |
| --max-tokens | -n | 2048 | Max tokens per response |
| --gpu-layers | -ngl | 0 | GPU layers (-1 for all) |
| --threads | -t | all cores | Generation threads (GEMV, memory-bandwidth bound) |
| --threads-batch | -tb | all cores | Prefill threads (GEMM, compute-parallel) |
| --mlock | | off | Pin model weights in RAM (prevents paging) |
| --no-think | | off | Suppress <think> reasoning blocks (Qwen3-style models) |
| --temp | | 0.7 | Sampling temperature |

Thinking Mode (Qwen3 Models)

Qwen3 and similar reasoning models emit <think>...</think> blocks before each response. These are useful for complex reasoning tasks but add significant latency and token overhead in everyday use.

Use --no-think to suppress them entirely:

igllama api model.gguf --no-think

How it works: When --no-think is set, igllama pre-fills an empty <think>\n\n</think> block on the assistant turn. This is the standard llama.cpp technique for disabling Qwen3-style chain-of-thought — the model sees the thinking phase as already complete and proceeds directly to the answer.
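The prefill technique can be illustrated with a minimal sketch. The template below is a simplified stand-in for a real Qwen3 chat template, not igllama's actual implementation:

```python
def render_prompt(messages, no_think=False):
    """Render a simplified Qwen3-style chat prompt.

    When no_think is set, the assistant turn opens with an empty
    <think> block, so the model treats the reasoning phase as
    already complete and answers directly.
    """
    parts = []
    for m in messages:
        parts.append(f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n")
    parts.append("<|im_start|>assistant\n")
    if no_think:
        # Pre-filled empty reasoning block (the --no-think technique).
        parts.append("<think>\n\n</think>\n")
    return "".join(parts)

prompt = render_prompt([{"role": "user", "content": "Hi"}], no_think=True)
```

Because the block is injected into the prompt rather than filtered from the output, no tokens are spent generating (or suppressing) the reasoning text.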

The flag has no effect on models that don’t produce thinking blocks.

CPU Performance Tuning

For CPU-only inference, generation speed is memory-bandwidth bound, not compute-bound. Setting more threads than your CPU’s memory channel count will hurt performance.

# Recommended for a 16-core server with 8 memory channels:
igllama api model.gguf \
  --threads 8 \
  --threads-batch 16 \
  --mlock \
  --ctx-size 8192

Why separate thread counts?

  • --threads controls generation (one token at a time = GEMV). Optimal = memory channel count.
  • --threads-batch controls prefill (processing the prompt = GEMM). Scales with total cores.
  • --mlock pins the model in physical RAM. Critical on servers where RAM is tight.
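The rule of thumb above can be sketched as a small helper. The memory-channel count is hardware-specific and must be supplied by you; 8 channels below is only an example, and the function itself is illustrative, not part of igllama:

```python
import os

def suggest_threads(memory_channels, total_cores=None):
    """Suggest --threads / --threads-batch per the tuning rule above:
    generation threads match memory channels, prefill uses all cores."""
    total_cores = total_cores or os.cpu_count() or 1
    return {
        "threads": min(memory_channels, total_cores),  # GEMV: memory-BW bound
        "threads_batch": total_cores,                  # GEMM: compute-parallel
    }

print(suggest_threads(8, total_cores=16))
# → {'threads': 8, 'threads_batch': 16}
```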

See the Benchmark Showcase for measured results on AMD EPYC-Rome hardware.

/v1/chat/completions

Method: POST

Request:

{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":false}

Response:

{"id":"chatcmpl-123","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"Hi!"},"finish_reason":"stop"}]}

curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hi"}]}'

Streaming

Set stream: true for SSE streaming:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hi"}],"stream":true}'

Stream format (OpenAI-compatible, v0.3.6+):

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","created":1234567890,"model":"model.gguf","choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}

data: [DONE]

/v1/embeddings

Method: POST

Request:

{"model":"default","input":["Hello world"]}

Response:

{"object":"list","data":[{"object":"embedding","index":0,"embedding":[0.123,-0.456]}],"usage":{"prompt_tokens":2,"total_tokens":2}}

curl:

curl http://localhost:8080/v1/embeddings -H "Content-Type: application/json" -d '{"model":"default","input":"Hello"}'
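Embedding vectors are typically compared with cosine similarity. A small self-contained helper (not part of igllama's API) that you can point at the vectors returned above:

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# In practice the vectors come from POST /v1/embeddings:
#   resp["data"][0]["embedding"], resp["data"][1]["embedding"]
print(cosine([0.123, -0.456], [0.123, -0.456]))  # ≈ 1.0 (identical vectors)
```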

/health

Method: GET

Response: {"status":"ok","model":"loaded"}

curl http://localhost:8080/health
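When scripting against the server, it helps to wait for the model to finish loading before sending requests. A sketch with the HTTP call injected as a callable so any client library works (the helper is ours, not part of igllama):

```python
import time

def wait_until_ready(fetch_health, timeout=30.0, interval=0.5):
    """Poll /health until status is "ok", or give up after timeout seconds.

    fetch_health is any callable returning the parsed JSON body, e.g.
      lambda: requests.get("http://localhost:8080/health").json()
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if fetch_health().get("status") == "ok":
                return True
        except Exception:
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False
```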

Error Handling

| Status | Description |
|--------|-------------|
| 200 | Success |
| 400 | Invalid request |
| 404 | Not found |
| 500 | Server error |
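A minimal client-side helper reflecting the status table above. The APIError class and message mapping are illustrative, not something igllama ships:

```python
class APIError(Exception):
    """Raised for non-success responses from the igllama server."""

def check_response(status, body):
    """Return the body on success; raise APIError per the status table."""
    if status == 200:
        return body
    messages = {400: "Invalid request", 404: "Not found", 500: "Server error"}
    raise APIError(f"{status}: {messages.get(status, 'Unexpected status')}")

# Usage with requests:
#   resp = requests.post(url, json=payload)
#   data = check_response(resp.status_code, resp.json())
```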

CORS

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS

Examples

Python (Non-streaming)

import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)

print(response.json()["choices"][0]["message"]["content"])

Python (Streaming)

import json
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": True},
    stream=True,
)

for line in response.iter_lines():
    if not line:
        continue
    payload = line.decode()
    if not payload.startswith("data: "):
        continue
    payload = payload[len("data: "):]
    if payload == "[DONE]":
        break
    delta = json.loads(payload)["choices"][0]["delta"]
    print(delta.get("content", ""), end="", flush=True)

JavaScript (Streaming)

const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'default', messages: [{ role: 'user', content: 'Hi' }], stream: true })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

// Note: for brevity this assumes each chunk contains whole lines;
// a robust client buffers partial lines across reads.
while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  for (const line of decoder.decode(value, { stream: true }).split('\n')) {
    if (line.startsWith('data: ') && !line.includes('[DONE]')) {
      const delta = JSON.parse(line.slice(6)).choices[0].delta;
      if (delta.content) console.log(delta.content);
    }
  }
}

Model Compatibility

The API server works with any GGUF model file. GGUF is the standard format for llama.cpp-compatible models, including the Llama, Mistral, Qwen, Phi, and Gemma families.

For best results, use models with chat templates for multi-turn conversations.