API Server

igllama provides an OpenAI-compatible REST API server for GGUF models with streaming support.

GGUF stands for “Georgi Gerganov Unified Format”, the standard model format for llama.cpp.

Quick Start

igllama api model.gguf
igllama api model.gguf --host 0.0.0.0 --port 3000
igllama api model.gguf --gpu-layers 35

Server Flags

Flag          Short  Default     Description
------------  -----  ----------  -----------------------
--model       -m     (required)  Path to GGUF model
--host        -h     127.0.0.1   Server host
--port        -p     8080        Server port
--gpu-layers  -ngl   0           GPU layers (-1 for all)

/v1/chat/completions

Method: POST

Request:

{"model":"default","messages":[{"role":"user","content":"Hello!"}],"stream":false}

Response:

{"id":"chatcmpl-123","object":"chat.completion","choices":[{"index":0,"message":{"role":"assistant","content":"Hi!"},"finish_reason":"stop"}]}

curl:

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hi"}]}'

Streaming

Set "stream": true in the request body to receive tokens as Server-Sent Events (SSE):

curl http://localhost:8080/v1/chat/completions -H "Content-Type: application/json" -d '{"model":"default","messages":[{"role":"user","content":"Hi"}],"stream":true}'

Stream format:

data: {"id":"chatcmpl-123","object":"chat.completion.chunk","choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
data: [DONE]

/v1/embeddings

Method: POST

Request:

{"model":"default","input":["Hello world"]}

Response:

{"object":"list","data":[{"object":"embedding","index":0,"embedding":[0.123,-0.456]}],"usage":{"prompt_tokens":2,"total_tokens":2}}

curl:

curl http://localhost:8080/v1/embeddings -H "Content-Type: application/json" -d '{"model":"default","input":"Hello"}'
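A common use for the embeddings endpoint is similarity search, and batching several inputs into one request avoids per-request overhead. A minimal Python sketch — the cosine_similarity and embed helpers here are illustrative, not part of igllama:

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def embed(texts, base_url="http://localhost:8080"):
    """Fetch embeddings for a batch of texts in a single request."""
    import requests  # third-party: pip install requests

    resp = requests.post(
        f"{base_url}/v1/embeddings",
        json={"model": "default", "input": texts},
    )
    resp.raise_for_status()
    data = resp.json()["data"]
    # Preserve input order using the "index" field from the response.
    return [item["embedding"] for item in sorted(data, key=lambda d: d["index"])]


# Example (requires a running server):
# vecs = embed(["Hello world", "Goodbye world"])
# print(cosine_similarity(vecs[0], vecs[1]))
```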

/health

Method: GET

Response: {"status":"ok","model":"loaded"}

curl http://localhost:8080/health
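When scripting against the server, it helps to wait until the model is loaded before sending requests. A stdlib-only Python sketch based on the /health response shown above — the wait_until_ready helper is illustrative, not part of igllama:

```python
import json
import time
import urllib.error
import urllib.request


def is_ready(payload):
    """Interpret a parsed /health response body."""
    return isinstance(payload, dict) and payload.get("status") == "ok"


def wait_until_ready(base_url="http://localhost:8080", timeout=60.0, interval=1.0):
    """Poll /health until the server reports ok or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                if is_ready(json.load(resp)):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not accepting connections yet
        time.sleep(interval)
    return False


# Example (requires a running server):
# if wait_until_ready():
#     print("server is up")
```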

Error Handling

Status  Description
------  ---------------
200     Success
400     Invalid request
404     Not found
500     Server error
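OpenAI-compatible servers usually return error details as a JSON object with an "error" field; treat that shape as an assumption here, since igllama's exact error payload isn't documented above. A defensive sketch that degrades gracefully if the shape differs:

```python
def describe_error(status, body):
    """Build a readable message from a status code and an (assumed) error body."""
    reasons = {400: "Invalid request", 404: "Not found", 500: "Server error"}
    reason = reasons.get(status, "Unexpected status")
    # OpenAI-style servers typically nest details under {"error": {"message": ...}};
    # fall back gracefully if the body uses a different shape.
    detail = ""
    if isinstance(body, dict):
        err = body.get("error")
        if isinstance(err, dict):
            detail = err.get("message", "")
        elif isinstance(err, str):
            detail = err
    return f"{status} {reason}: {detail}" if detail else f"{status} {reason}"
```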

CORS

Access-Control-Allow-Origin: *
Access-Control-Allow-Methods: GET, POST, OPTIONS

Examples

Python (Non-streaming)

import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "default",
        "messages": [
            {"role": "system", "content": "You are helpful."},
            {"role": "user", "content": "Hello!"}
        ],
        "max_tokens": 100
    }
)

print(response.json()["choices"][0]["message"]["content"])

Python (Streaming)

import json
import requests

response = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "default", "messages": [{"role": "user", "content": "Hello"}], "stream": True},
    stream=True
)

# Each line is an SSE event; the token text lives in choices[0].delta.content.
for line in response.iter_lines():
    payload = line.decode()[len("data: "):] if line.startswith(b"data: ") else None
    if payload and payload != "[DONE]":
        content = json.loads(payload)["choices"][0]["delta"].get("content")
        if content:
            print(content, end="", flush=True)

JavaScript (Streaming)

const response = await fetch('http://localhost:8080/v1/chat/completions', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'default', messages: [{ role: 'user', content: 'Hi' }], stream: true })
});

const reader = response.body.getReader();
const decoder = new TextDecoder();

while (true) {
  const { value, done } = await reader.read();
  if (done) break;
  // Note: a JSON line can be split across reads; production code should buffer.
  for (const line of decoder.decode(value).split('\n')) {
    if (line.startsWith('data: ') && !line.includes('[DONE]')) {
      // The final chunk's delta may be empty (only finish_reason is set).
      const content = JSON.parse(line.slice(6)).choices[0].delta.content;
      if (content !== undefined) console.log(content);
    }
  }
}

Performance Tips

  1. Use --gpu-layers -1 to offload all layers to GPU for maximum speed
  2. Reduce context size for faster responses with shorter conversations
  3. Use streaming for better perceived latency in chat applications
  4. Batch multiple inputs in embeddings requests when possible

Model Compatibility

The API server works with any GGUF model file and supports llama.cpp-compatible model families including Llama, Mistral, Qwen, Phi, and Gemma.

For best results, use models with chat templates for multi-turn conversations.
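The server itself is stateless: for multi-turn conversations, the client keeps the message history and resends it with every request. A Python sketch — the Conversation helper is illustrative, not part of igllama:

```python
class Conversation:
    """Accumulates chat history for stateless multi-turn requests."""

    def __init__(self, system=None, base_url="http://localhost:8080"):
        self.base_url = base_url
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text):
        import requests  # third-party: pip install requests

        self.messages.append({"role": "user", "content": text})
        resp = requests.post(
            f"{self.base_url}/v1/chat/completions",
            json={"model": "default", "messages": self.messages},
        )
        resp.raise_for_status()
        reply = resp.json()["choices"][0]["message"]["content"]
        # Store the assistant turn so the next request carries full context.
        self.messages.append({"role": "assistant", "content": reply})
        return reply


# Example (requires a running server):
# chat = Conversation(system="You are helpful.")
# chat.ask("My name is Ada.")
# chat.ask("What is my name?")  # full history lets the model answer
```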