Quickstart Guide
Welcome to igllama, your gateway to running large language models locally. This guide gets you from zero to inference in minutes, not hours. No PhD required, no cloud dependencies, just pure local AI.
GGUF is the backbone here: a binary model file format (the successor to GGML) designed for quantized models that run efficiently on consumer hardware.
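As a concrete illustration, every GGUF file begins with a small fixed header: the magic bytes `GGUF`, a format version, a tensor count, and a metadata key-value count, all little-endian. A minimal Python sketch (the function name is ours, not part of any library) that parses that header from raw bytes:

```python
import struct

def parse_gguf_header(data: bytes) -> dict:
    """Parse the fixed-size GGUF header: magic, version, tensor count, KV count."""
    magic, version, n_tensors, n_kv = struct.unpack_from("<4sIQQ", data, 0)
    if magic != b"GGUF":
        raise ValueError("not a GGUF file")
    return {"version": version, "tensors": n_tensors, "metadata_kv": n_kv}

# Synthetic header for demonstration: version 3, 2 tensors, 5 metadata keys.
sample = struct.pack("<4sIQQ", b"GGUF", 3, 2, 5)
print(parse_gguf_header(sample))  # {'version': 3, 'tensors': 2, 'metadata_kv': 5}
```

The metadata key-value section that follows the header is what lets runtimes discover the architecture, tokenizer, and quantization type without loading any weights.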
Choose Your Path
| Track | Time | What You’ll Do |
|---|---|---|
| 5-minute sprint | 5 min | Pull a model and run a single prompt |
| Full tutorial | 10 min | Pull, run, chat, and import local models |
Prerequisites: igllama installed and built. If not, head to Installation first.
Step 1: Pull a Model (2 minutes)
First, download a model from HuggingFace. We’ll use Qwen3.5-35B-A3B, a Mixture of Experts model with 35B total parameters but only about 3B active per token. That keeps per-token compute low enough for CPU-only systems, though you still need enough RAM to hold the full weights.
igllama pull Qwen/Qwen3.5-35B-A3B-GGUF
Expected output:
Downloading Qwen/Qwen3.5-35B-A3B-GGUF...
Model: qwen3.5-35b-a3b-ud-q4_k_xl.gguf (19.2 GB)
Progress: [████████████████████] 100% - 19.2 GB / 19.2 GB
Download complete. Model cached at: ~/.cache/igllama/models/Qwen/Qwen3.5-35B-A3B-GGUF
List your cached models:
igllama list
Expected output:
Cached Models:
- Qwen/Qwen3.5-35B-A3B-GGUF
└─ qwen3.5-35b-a3b-ud-q4_k_xl.gguf (19.2 GB)
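To see why the MoE layout suits CPUs, a quick back-of-the-envelope calculation (the helper name is purely illustrative): only the active experts participate in each token’s forward pass, so per-token compute scales with the 3B active parameters, not the 35B total — even though the full ~19 GB of weights must still fit in memory.

```python
def moe_per_token_ratio(total_params_b: float, active_params_b: float) -> float:
    """Fraction of weights touched per token in a Mixture-of-Experts model."""
    return active_params_b / total_params_b

# 35B total, 3B active per token: roughly 3/35 of the dense compute cost.
print(f"{moe_per_token_ratio(35, 3):.1%}")  # 8.6%
```

In other words, the model computes like a ~3B dense model per token while retaining the capacity of a much larger one.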
Step 2: Run Inference (2 minutes)
Send a single prompt and get a response:
igllama run Qwen/Qwen3.5-35B-A3B-GGUF -p "Explain quantum entanglement in one sentence."
Expected output:
Loading model: qwen3.5-35b-a3b-ud-q4_k_xl.gguf
Model loaded: 35B parameters, 256K context
Generating response...
Quantum entanglement is a phenomenon where two or more particles become correlated in such a way that the quantum state of each particle cannot be described independently, even when separated by large distances.
Tokens: 42 | Time: 3.2s | Speed: 13.1 tok/s
Add GPU acceleration if available:
igllama run Qwen/Qwen3.5-35B-A3B-GGUF -p "Hello!" --gpu-layers 35
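The `tok/s` figure in the stats footer is simply generated tokens divided by wall-clock seconds, which you can reproduce yourself:

```python
def tokens_per_second(tokens: int, seconds: float) -> float:
    """Throughput as reported in the stats footer."""
    return tokens / seconds

# The example run above: 42 tokens in 3.2 s.
print(f"{tokens_per_second(42, 3.2):.1f} tok/s")  # 13.1 tok/s
```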
Step 3: Interactive Chat (3 minutes)
For multi-turn conversations with context memory:
igllama chat Qwen/Qwen3.5-35B-A3B-GGUF
Expected output:
Loading model: qwen3.5-35b-a3b-ud-q4_k_xl.gguf
Model loaded successfully.
Chat session started. Type '/help' for commands, '/quit' to exit.
> Hello! What can you help me with?
I can assist with a wide range of tasks including answering questions, creative writing, problem solving, coding help, analysis, and general conversation. What would you like to work on?
> Can you write a haiku about AI?
Silicon dreams wake,
Neural pathways spark and glow,
Mind from metal born.
> /tokens
Session Statistics:
Input tokens: 45
Output tokens: 128
Total tokens: 173
In-chat commands:
| Command | Description |
|---|---|
| /help | Show available commands |
| /quit | Exit chat session |
| /clear | Clear conversation history |
| /save <name> | Save session to file |
| /system <text> | Set system prompt |
| /tokens | Show token usage |
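igllama’s actual implementation isn’t shown here, but a dispatcher for slash commands like these can be sketched in a few lines of Python (all names hypothetical):

```python
def handle_command(line: str, state: dict) -> str:
    """Dispatch in-chat slash commands; anything else is treated as a prompt."""
    cmd, _, arg = line.partition(" ")
    if cmd == "/clear":
        state["history"] = []
        return "history cleared"
    if cmd == "/system":
        state["system"] = arg
        return "system prompt set"
    if cmd == "/tokens":
        t = state.get("tokens", {"in": 0, "out": 0})
        return f"Total tokens: {t['in'] + t['out']}"
    return "unknown command" if cmd.startswith("/") else "prompt"

state = {"history": ["hi"], "tokens": {"in": 45, "out": 128}}
print(handle_command("/tokens", state))  # Total tokens: 173
```

The key design point is that commands mutate session state locally and never reach the model, while ordinary lines flow into the conversation history.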
Step 4: Import Local GGUF (2 minutes)
Already have GGUF files? Import them into the cache:
# Import with symlink (default, saves disk space)
igllama import /path/to/model.gguf
# Import with copy (creates full copy)
igllama import /path/to/model.gguf --copy
# Import with custom alias
igllama import /path/to/model.gguf --alias my-model
Expected output:
Importing: /path/to/model.gguf
Creating symlink in cache...
Model imported as: my-model.gguf
Location: ~/.cache/igllama/models/my-model.gguf
Now use it like any pulled model:
igllama run my-model.gguf -p "Test prompt"
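The symlink-versus-copy trade-off above can be sketched in Python (an illustration, not igllama’s actual code; `import_model` is a hypothetical helper):

```python
import os
import shutil
import tempfile

def import_model(src: str, cache_dir: str, copy: bool = False) -> str:
    """Place a GGUF file in the cache as a symlink (default) or a full copy."""
    dest = os.path.join(cache_dir, os.path.basename(src))
    if copy:
        shutil.copy2(src, dest)  # independent copy, uses extra disk
    else:
        os.symlink(os.path.abspath(src), dest)  # near-zero extra disk
    return dest

# Demonstration in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "model.gguf")
    with open(src, "wb") as f:
        f.write(b"GGUF")
    cache = os.path.join(tmp, "cache")
    os.makedirs(cache)
    print(os.path.islink(import_model(src, cache)))  # True
```

The symlink keeps a single copy on disk, but the original file must stay in place; use a copy when the source location is temporary.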
What’s Next?
You’ve completed the core workflow. Here’s where to go next:
- CLI Reference - Full command documentation
- Chat Mode - Advanced conversation features
- API Server - OpenAI-compatible REST API
- Installation - Build options and GPU backends
Troubleshooting
Model not found: Ensure you ran pull first or use the full path with import.
Out of memory: Try a smaller quantization (Q4_K_S instead of Q4_K_XL) or use --gpu-layers to offload to GPU.
Slow inference: Enable GPU acceleration with --gpu-layers -1 for full offload, or use a smaller model variant.
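For a rough sense of what a smaller quantization buys you: a GGUF file’s size is approximately parameter count times bits per weight. The ~4.5 bits/weight figure below is an approximation typical of Q4_K-family quants; exact figures vary by quant type.

```python
def model_file_size_gb(params_b: float, bits_per_weight: float) -> float:
    """Rough GGUF size: parameters x bits per weight, ignoring metadata overhead."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

# 35B parameters at ~4.5 bits/weight vs an 8-bit quant:
print(round(model_file_size_gb(35, 4.5), 1))  # 19.7
print(round(model_file_size_gb(35, 8.0), 1))  # 35.0
```

Each bit per weight you shave off a 35B model saves roughly 4.4 GB, which is why stepping down one quant level often resolves out-of-memory errors.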
The machine is waiting. Your next prompt is yours.