# CLI Reference

Complete reference for all igllama commands. Models are distributed as GGUF files, the binary model format used by llama.cpp and compatible inference engines.
## Command Overview
| Command | Description |
|---|---|
| help | Show usage information |
| version | Show version information |
| pull | Download a model from the Hugging Face Hub |
| list | List all cached models |
| run | Run single-turn inference |
| chat | Interactive multi-turn chat session |
| import | Import a local GGUF file into the cache |
| api | Start an OpenAI-compatible API server |
| show | Display GGUF file metadata |
| rm | Remove a cached model |
| serve | Manage the llama-server lifecycle |
## help

Display usage information and available commands.

```bash
igllama help
igllama --help
igllama -h
```
## version

Display the current igllama version.

```bash
igllama version
igllama --version
igllama -v
```
## pull

Download models from the Hugging Face Hub, with a progress bar and resume support.

```bash
# List available GGUF files in a repository
igllama pull bartowski/Llama-3-8B-Instruct-GGUF

# Download a specific file
igllama pull bartowski/Llama-3-8B-Instruct-GGUF --file Llama-3-8B-Instruct-Q4_K_M.gguf

# Force re-download
igllama pull bartowski/Llama-3-8B-Instruct-GGUF --file model.gguf --force
```

| Flag | Description |
|---|---|
| -f, --file | Download a specific file from the repository |
| -F, --force | Force re-download even if the file exists |
| -q, --quiet | Suppress progress output |
## list (cached models)

List all downloaded models in the cache.

```bash
igllama list
igllama ls
```
## run (inference)

Run single-turn inference on a model.

```bash
igllama run model.gguf --prompt "Hello, world!"
igllama run model.gguf -p "Explain quantum computing" --gpu-layers 35
igllama run model.gguf -p "List 3 items" --grammar-file json
```

| Flag | Description |
|---|---|
| -p, --prompt | Prompt text (required) |
| -n, --max-tokens | Max tokens to generate (default: 512) |
| -ngl, --gpu-layers | GPU layers to offload (default: 0) |
| -g, --grammar | GBNF grammar string for constrained output |
| -gf, --grammar-file | Path to a GBNF grammar file |
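GBNF is llama.cpp's grammar notation for constrained decoding. As an illustrative sketch (the file name `list.gbnf` is arbitrary), a grammar that restricts output to a JSON array of simple strings might look like:

```gbnf
# Entry rule: a JSON array of double-quoted strings
root   ::= "[" ws string (ws "," ws string)* ws "]"
string ::= "\"" [a-zA-Z0-9 ]* "\""
ws     ::= [ \t\n]*
```

Pass it with `igllama run model.gguf -p "List 3 items" --grammar-file list.gbnf`.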
## chat

Interactive multi-turn chat with conversation history and session management.

```bash
igllama chat model.gguf
igllama chat model.gguf --system "You are a helpful coding assistant"
igllama chat model.gguf --template llama3
igllama chat model.gguf --prompt "Explain recursion" --json
igllama chat model.gguf --resume session_name
igllama chat model.gguf --temp 0.8 --top-p 0.9 --top-k 40
```

| Flag | Description |
|---|---|
| -s, --system | Set the system prompt |
| -t, --template | Chat template: auto, chatml, llama3, mistral, gemma, phi3, qwen |
| -n, --max-tokens | Max tokens per response (default: 2048) |
| -c, --context-size | Context size (default: model training size) |
| -p, --prompt | Single-turn mode (non-interactive) |
| --json | Output the response as JSON |
| -q, --quiet | Suppress model-loading logs |
| -ngl, --gpu-layers | GPU layers to offload (default: 0) |
| --temp | Temperature (default: 0.7; 0 = greedy) |
| --top-p | Top-p (nucleus) sampling (default: 0.9) |
| --top-k | Top-k sampling (default: 40) |
| --repeat-penalty | Repetition penalty (default: 1.1) |
| --seed | Random seed (default: 0 = random) |
| --grammar | GBNF grammar string |
| --grammar-file | Path to a GBNF grammar file |
| --no-save | Disable session auto-save |
| --resume | Resume a session from file |
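The sampling flags correspond to standard decoding heuristics. As a rough sketch of what top-k and top-p filtering mean (illustrative Python, not igllama's actual implementation):

```python
def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens, renormalized."""
    top = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in top)
    return {tok: p / total for tok, p in top}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose cumulative mass reaches p."""
    kept, cum = {}, 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept[tok] = prob
        cum += prob
        if cum >= p:
            break
    total = sum(kept.values())
    return {tok: q / total for tok, q in kept.items()}

probs = {"the": 0.5, "a": 0.3, "an": 0.15, "rare": 0.05}
print(top_k_filter(probs, 2))    # keeps "the" and "a", renormalized
print(top_p_filter(probs, 0.9))  # drops the low-probability tail ("rare")
```

Temperature reshapes the distribution before these filters are applied; 0 disables sampling entirely (greedy decoding).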
### Chat Subcommands

In-chat commands available during interactive sessions:

| Command | Description |
|---|---|
| /help | Show available commands |
| /quit, /exit | Exit the chat session |
| /clear | Clear conversation history and the KV cache |
| /save <name> | Save the session to a file |
| /load <name> | Load a saved session |
| /sessions | List all saved sessions |
| /system <text> | Set or update the system prompt |
| /tokens | Show token usage statistics |
| /stats | Show generation statistics |
| /template <name> | Switch the chat template |
## import (cache models)

Import local GGUF files into the model cache.

```bash
igllama import /path/to/model.gguf
igllama import /path/to/model.gguf --copy
igllama import /path/to/model.gguf --alias my-model
```

| Flag | Description |
|---|---|
| --copy | Copy the file into the cache |
| --symlink, --link | Create a symlink to the source file |
| -a, --alias | Create a named alias for quick access |
## api

Start an OpenAI-compatible REST API server with streaming support.

```bash
igllama api model.gguf
igllama api model.gguf --host 0.0.0.0 --port 3000
igllama api model.gguf --gpu-layers -1
```

| Flag | Description |
|---|---|
| -m, --model | Path to the GGUF model file (required) |
| -p, --port | Server port (default: 8080) |
| -h, --host | Server host (default: 127.0.0.1) |
| -c, --ctx-size | Context size (default: 4096) |
| -n, --max-tokens | Max tokens per response (default: 2048) |
| -ngl, --gpu-layers | GPU layers to offload (default: 0) |
| --temp | Temperature (default: 0.7) |
### API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /health | GET | Health check |
| /v1/models | GET | List available models |
| /v1/chat/completions | POST | Chat completions (streaming supported) |
| /v1/embeddings | POST | Generate embeddings |
```bash
# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"stream":false}'

# Streaming
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Story"}],"stream":true}'
```
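Because the server speaks the OpenAI chat-completions wire format, any OpenAI-compatible client should work. A minimal sketch using only Python's standard library (the function names are illustrative; host and port are the defaults listed above):

```python
import json
import urllib.request

def build_payload(messages, stream=False):
    """Request body for POST /v1/chat/completions (OpenAI chat format)."""
    return {"messages": messages, "stream": stream}

def chat(messages, base_url="http://127.0.0.1:8080"):
    """Send a non-streaming chat completion and return the reply text."""
    req = urllib.request.Request(
        base_url + "/v1/chat/completions",
        data=json.dumps(build_payload(messages)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    # OpenAI-style response: the reply lives at choices[0].message.content
    return data["choices"][0]["message"]["content"]

# With a server running: chat([{"role": "user", "content": "Hello!"}])
```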
## show

Display GGUF file metadata, including format version, tensor count, and architecture.

```bash
igllama show model.gguf
igllama show ./models/llama-3-8b.gguf
```
## rm

Remove a downloaded model from the cache.

```bash
igllama rm bartowski/Llama-3-8B-Instruct-GGUF
igllama remove bartowski/Llama-3-8B-Instruct-GGUF
```
## serve

Manage the llama-server lifecycle for API access.

```bash
igllama serve start -m model.gguf
igllama serve start -m model.gguf --port 8080
igllama serve status
igllama serve logs
igllama serve logs --follow
igllama serve stop
```

| Subcommand | Description |
|---|---|
| start -m <model> | Start llama-server with a model |
| stop | Stop the running server |
| status | Show server status |
| logs [--follow] | View server logs |
| help | Show serve help |

### Serve Start Options

| Flag | Description |
|---|---|
| -m, --model | Path to the GGUF model file (required) |
| --port | Server port (default: 8080) |
| --host | Server host (default: 127.0.0.1) |
| --ctx-size | Context size (default: 2048) |
| --n-gpu-layers | Number of GPU layers to offload (default: 0) |
## Environment Variables

| Variable | Description |
|---|---|
| IGLLAMA_HOME | Base directory for models (default: ~/.cache/huggingface) |
| HF_TOKEN | Hugging Face API token for private/gated models |
| HF_HOME | Custom Hugging Face cache directory |
## Examples

```bash
# Download and run
igllama pull bartowski/Llama-3-8B-Instruct-GGUF
igllama run bartowski/Llama-3-8B-Instruct-GGUF --prompt "Hello!"

# Chat with a template
igllama chat model.gguf --template chatml --system "You are helpful"

# API server
igllama api model.gguf --port 8080
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Hi"}]}'

# Import with an alias
igllama import ~/models/custom.gguf --alias my-custom
igllama chat my-custom
```