CLI Reference

Complete reference for all igllama commands. GGUF is the model file format used by llama.cpp-compatible inference engines.

Command Overview

| Command | Description |
| --- | --- |
| `help` | Show usage information |
| `version` | Show version information |
| `pull` | Download model from HuggingFace Hub |
| `list` | List all cached models |
| `run` | Run single-turn inference |
| `chat` | Interactive multi-turn chat session |
| `import` | Import local GGUF file to cache |
| `api` | Start OpenAI-compatible API server |
| `show` | Display GGUF file metadata |
| `rm` | Remove a cached model |
| `serve` | Manage llama-server lifecycle |

help

Display usage information and available commands.

igllama help
igllama --help
igllama -h

version

Display the current igllama version.

igllama version
igllama --version
igllama -v

pull

Download models from the HuggingFace Hub, with a progress bar and resume support.

# List available GGUF files
igllama pull bartowski/Llama-3-8B-Instruct-GGUF

# Download specific file
igllama pull bartowski/Llama-3-8B-Instruct-GGUF --file Llama-3-8B-Instruct-Q4_K_M.gguf

# Force re-download
igllama pull bartowski/Llama-3-8B-Instruct-GGUF --file model.gguf --force

| Flag | Description |
| --- | --- |
| `-f, --file` | Download specific file from repository |
| `-F, --force` | Force re-download even if file exists |
| `-q, --quiet` | Suppress progress output |

list (cached models)

List all downloaded models in the cache.

igllama list
igllama ls

run (inference)

Run single-turn inference on a model.

igllama run model.gguf --prompt "Hello, world!"
igllama run model.gguf -p "Explain quantum computing" --gpu-layers 35
igllama run model.gguf -p "List 3 items" --grammar-file json

| Flag | Description |
| --- | --- |
| `-p, --prompt` | Prompt text (required) |
| `-n, --max-tokens` | Max tokens to generate (default: 512) |
| `-ngl, --gpu-layers` | GPU layers to offload (default: 0) |
| `-g, --grammar` | GBNF grammar string for constrained output |
| `-gf, --grammar-file` | Path to GBNF grammar file |
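
The `--grammar` and `--grammar-file` flags take llama.cpp-style GBNF grammars, which constrain the tokens the model may emit. As a minimal sketch, a hypothetical `yesno.gbnf` that forces the answer to be exactly "yes" or "no" could look like:

```
# yesno.gbnf -- restrict generation to a single yes/no answer
root ::= "yes" | "no"
```

It would then be passed as `igllama run model.gguf -p "Is the sky blue?" --grammar-file yesno.gbnf`.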

chat

Interactive multi-turn chat with conversation history and session management.

igllama chat model.gguf
igllama chat model.gguf --system "You are a helpful coding assistant"
igllama chat model.gguf --template llama3
igllama chat model.gguf --prompt "Explain recursion" --json
igllama chat model.gguf --resume session_name
igllama chat model.gguf --temp 0.8 --top-p 0.9 --top-k 40

| Flag | Description |
| --- | --- |
| `-s, --system` | Set system prompt |
| `-t, --template` | Chat template: auto, chatml, llama3, mistral, gemma, phi3, qwen |
| `-n, --max-tokens` | Max tokens per response (default: 2048) |
| `-c, --context-size` | Context size (default: model training size) |
| `-p, --prompt` | Single-turn mode (non-interactive) |
| `--json` | Output response as JSON |
| `-q, --quiet` | Suppress model loading logs |
| `-ngl, --gpu-layers` | GPU layers to offload (default: 0) |
| `--temp` | Temperature (default: 0.7, 0 = greedy) |
| `--top-p` | Top-p nucleus sampling (default: 0.9) |
| `--top-k` | Top-k sampling (default: 40) |
| `--repeat-penalty` | Repetition penalty (default: 1.1) |
| `--seed` | Random seed (default: 0 = random) |
| `--grammar` | GBNF grammar string |
| `--grammar-file` | Path to GBNF grammar file |
| `--no-save` | Disable auto-save sessions |
| `--resume` | Resume session from file |
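
The `--temp`, `--top-k`, and `--top-p` flags correspond to the standard token-sampling pipeline. As a rough stand-alone illustration of how the three knobs interact (not igllama's actual implementation):

```python
import math
import random

def sample_token(logits, temp=0.7, top_k=40, top_p=0.9, seed=None):
    """Pick a token index from raw logits: temperature, then top-k, then top-p."""
    if temp == 0:
        # temp=0 is greedy decoding: always take the most likely token
        return max(range(len(logits)), key=lambda i: logits[i])
    # Temperature-scaled softmax (higher temp flattens the distribution)
    m = max(l / temp for l in logits)
    exps = [math.exp(l / temp - m) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # top-k: keep only the k most probable tokens
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:top_k]
    # top-p (nucleus): keep the smallest prefix whose cumulative mass reaches top_p
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # Sample from the surviving candidates, weighted by probability
    return random.Random(seed).choices(kept, weights=[probs[i] for i in kept])[0]
```

Lower `--top-p` and `--top-k` shrink the candidate pool (more deterministic output); higher `--temp` flattens the distribution (more varied output).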

Chat Subcommands

In-chat commands available during interactive sessions:

| Command | Description |
| --- | --- |
| `/help` | Show available commands |
| `/quit`, `/exit` | Exit the chat session |
| `/clear` | Clear conversation history and KV cache |
| `/save <name>` | Save session to file |
| `/load <name>` | Load a saved session |
| `/sessions` | List all saved sessions |
| `/system <text>` | Set or update system prompt |
| `/tokens` | Show token usage statistics |
| `/stats` | Show generation statistics |
| `/template <name>` | Switch chat template |
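
Chat templates control how the conversation is serialized into the model's prompt. For example, the `chatml` template wraps each turn in ChatML markers, roughly:

```
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
```

Using the wrong template for a model typically degrades output quality, so prefer `auto` unless the model's card specifies a format.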

import (cache models)

Import local GGUF files into the model cache.

igllama import /path/to/model.gguf
igllama import /path/to/model.gguf --copy
igllama import /path/to/model.gguf --alias my-model

| Flag | Description |
| --- | --- |
| `--copy` | Copy file to cache |
| `--symlink, --link` | Create symlink to source |
| `--alias, -a` | Create named alias for quick access |

api

Start an OpenAI-compatible REST API server with streaming support.

igllama api model.gguf
igllama api model.gguf --host 0.0.0.0 --port 3000
igllama api model.gguf --gpu-layers -1

| Flag | Description |
| --- | --- |
| `-m, --model` | Path to GGUF model file (required) |
| `-p, --port` | Server port (default: 8080) |
| `-h, --host` | Server host (default: 127.0.0.1) |
| `-c, --ctx-size` | Context size (default: 4096) |
| `-n, --max-tokens` | Max tokens per response (default: 2048) |
| `-ngl, --gpu-layers` | GPU layers to offload (default: 0) |
| `--temp` | Temperature (default: 0.7) |
API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| `/health` | GET | Health check |
| `/v1/models` | GET | List available models |
| `/v1/chat/completions` | POST | Chat completions (streaming supported) |
| `/v1/embeddings` | POST | Generate embeddings |

# Chat completion
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}],"stream":false}'

# Streaming
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Story"}],"stream":true}'
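
The same endpoint can be driven from any HTTP client. A minimal Python sketch using only the standard library (it assumes a server started with `igllama api` on the default host and port):

```python
import json
import urllib.request

def chat_request(prompt, stream=False, base_url="http://localhost:8080"):
    """Build an OpenAI-style chat completion request for the local API server."""
    body = json.dumps({
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# Sending the request requires a running server:
# with urllib.request.urlopen(chat_request("Hello!")) as resp:
#     reply = json.load(resp)
#     print(reply["choices"][0]["message"]["content"])
```

Because the API is OpenAI-compatible, existing OpenAI client libraries should also work when pointed at this base URL.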

show

Display GGUF file metadata including version, tensor count, and architecture.

igllama show model.gguf
igllama show ./models/llama-3-8b.gguf

rm

Remove a downloaded model from the cache.

igllama rm bartowski/Llama-3-8B-Instruct-GGUF
igllama remove bartowski/Llama-3-8B-Instruct-GGUF

serve

Manage llama-server lifecycle for API access.

igllama serve start -m model.gguf
igllama serve start -m model.gguf --port 8080
igllama serve status
igllama serve logs
igllama serve logs --follow
igllama serve stop

| Subcommand | Description |
| --- | --- |
| `start -m <model>` | Start llama-server with model |
| `stop` | Stop running server |
| `status` | Show server status |
| `logs [--follow]` | View server logs |
| `help` | Show serve help |

Serve Start Options

| Flag | Description |
| --- | --- |
| `--model, -m` | Path to GGUF model file (required) |
| `--port` | Server port (default: 8080) |
| `--host` | Server host (default: 127.0.0.1) |
| `--ctx-size` | Context size (default: 2048) |
| `--n-gpu-layers` | Number of GPU layers (default: 0) |

Environment Variables

| Variable | Description |
| --- | --- |
| `IGLLAMA_HOME` | Base directory for models (default: `~/.cache/huggingface`) |
| `HF_TOKEN` | HuggingFace API token for private/gated models |
| `HF_HOME` | Custom HuggingFace cache directory |

Examples

# Download and run
igllama pull bartowski/Llama-3-8B-Instruct-GGUF
igllama run bartowski/Llama-3-8B-Instruct-GGUF --prompt "Hello!"

# Chat with template
igllama chat model.gguf --template chatml --system "You are helpful"

# API server
igllama api model.gguf --port 8080
curl http://localhost:8080/v1/chat/completions -d '{"messages":[{"role":"user","content":"Hi"}]}'

# Import with alias
igllama import ~/models/custom.gguf --alias my-custom
igllama chat my-custom