Architecture
igllama is built on a Zig-first philosophy, wrapping the battle-tested llama.cpp C core with idiomatic Zig abstractions.
High-Level Overview 🏗️
The architecture follows a clean layered design, from your terminal to the neural network weights.
Component Breakdown 🔧
main.zig — The Entry Point
The main.zig file serves as the command center, parsing CLI arguments and routing to appropriate command handlers. It uses Zig’s std.ArgIterator for argument parsing and manages the application lifecycle.
Key responsibilities:
- Command routing (run, convert, info)
- Global flag parsing (`--verbose`, `--help`)
- Error handling and exit codes
- Resource cleanup via defer statements
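A minimal sketch of this routing, using `std.process.argsWithAllocator` from Zig's standard library (the `run`/`convert`/`info` names come from this page; the commented-out handler calls are hypothetical):

```zig
const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();
    const allocator = gpa.allocator();

    var args = try std.process.argsWithAllocator(allocator);
    defer args.deinit();

    _ = args.next(); // skip program name
    const command = args.next() orelse {
        std.debug.print("usage: igllama <run|convert|info>\n", .{});
        return error.MissingCommand;
    };

    if (std.mem.eql(u8, command, "run")) {
        // try run.execute(allocator, &args);     // commands/run.zig (hypothetical entry point)
    } else if (std.mem.eql(u8, command, "convert")) {
        // try convert.execute(allocator, &args); // commands/convert.zig
    } else if (std.mem.eql(u8, command, "info")) {
        // try info.execute(allocator, &args);    // commands/info.zig
    } else {
        return error.UnknownCommand;
    }
}
```

Note the `defer` statements: argument-iterator and allocator cleanup run on every exit path, including error returns, which is how the lifecycle management described above stays leak-free.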
commands/ — Modular Command Handlers
Each command lives in its own module under commands/:
| Command | Module | Purpose |
|---|---|---|
| `run` | `commands/run.zig` | Interactive chat and completion |
| `convert` | `commands/convert.zig` | Model format conversion |
| `info` | `commands/info.zig` | GGUF metadata inspection |
This modular design keeps concerns separated and makes the codebase navigable.
llama.cpp.zig — Zero-Overhead Bindings 🔄
The llama.cpp.zig module provides idiomatic Zig wrappers around llama.cpp’s C API. These bindings use Zig’s extern struct for memory layout compatibility and comptime for compile-time checks.
```zig
pub const Context = extern struct {
    ptr: *llama_context,

    pub fn decode(self: Context, batch: *Batch) !void {
        const ret = llama_decode(self.ptr, batch.*);
        if (ret != 0) return error.DecodeFailed;
    }
};
```
The bindings maintain binary compatibility while adding Zig’s type safety and error handling.
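As an illustration of the compile-time checks mentioned above, an `extern struct` follows C layout rules, so size and field offsets can be asserted at build time. The struct below is illustrative (it resembles llama.cpp's `llama_token_data`, but is not the real binding):

```zig
const std = @import("std");

// Illustrative mirror of a C struct; the real bindings mirror llama.cpp's types.
const CTokenData = extern struct {
    id: i32,
    logit: f32,
    p: f32,
};

comptime {
    // A layout drift between the Zig mirror and the C header fails the
    // build here, instead of corrupting memory at runtime.
    std.debug.assert(@sizeOf(CTokenData) == 12);
    std.debug.assert(@offsetOf(CTokenData, "logit") == 4);
}
```

This is the sense in which the bindings are "zero-overhead": the checks run entirely at compile time and emit no code.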
GGUF File Structure 📦
GGUF is the binary container format for model weights and metadata, created by Georgi Gerganov, the author of llama.cpp. It is optimized for fast loading and memory mapping.
The format prioritizes:
- Fast loading via memory mapping
- Self-describing metadata in human-readable form
- Alignment for direct memory access
- Version tolerance for forward compatibility
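Concretely, a GGUF file opens with a small fixed header, per the public GGUF specification: a 4-byte magic, a version number, and the tensor and metadata counts. A sketch of reading it in Zig (assuming a little-endian host, which matches GGUF's on-disk byte order; `readHeader` is an illustrative helper, not igllama API):

```zig
const std = @import("std");

// Fixed GGUF header layout, per the public GGUF spec.
const GgufHeader = extern struct {
    magic: [4]u8, // "GGUF"
    version: u32,
    tensor_count: u64,
    metadata_kv_count: u64,
};

pub fn readHeader(file: std.fs.File) !GgufHeader {
    var header: GgufHeader = undefined;
    const n = try file.readAll(std.mem.asBytes(&header));
    if (n != @sizeOf(GgufHeader)) return error.Truncated;
    if (!std.mem.eql(u8, &header.magic, "GGUF")) return error.BadMagic;
    return header;
}
```

Because the header is plain fixed-size data, a loader can validate a file with a single small read before committing to mapping the (potentially multi-gigabyte) tensor region.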
Memory Layout During Model Loading
When igllama loads a model, memory is organized in a specific pattern to maximize efficiency and enable zero-copy operations where possible.
Key memory regions:
- Stack: Small, fast allocations for CLI parsing and control flow
- Heap: Zig allocator manages contexts, batch buffers, and KV cache
- Memory-Mapped GGUF: Direct file mapping for zero-copy tensor reads
- C Heap: llama.cpp’s internal allocations (managed via `llama_free`)
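The ownership split between these regions can be sketched as follows. The `llama_*` calls are real llama.cpp C API names; `runOnce`, the buffer size, and the `@cImport` setup are illustrative:

```zig
const std = @import("std");
const c = @cImport(@cInclude("llama.h"));

// Hypothetical loading sequence showing which region owns what.
pub fn runOnce(allocator: std.mem.Allocator, path: [:0]const u8) !void {
    // Zig heap: buffers we own, released through the Zig allocator.
    const prompt_buf = try allocator.alloc(u8, 4096);
    defer allocator.free(prompt_buf);

    // C heap: llama.cpp allocates the model and context internally; we
    // hold opaque pointers and must release them through the C API.
    const model = c.llama_load_model_from_file(path.ptr, c.llama_model_default_params());
    if (model == null) return error.LoadFailed;
    defer c.llama_free_model(model);

    const ctx = c.llama_new_context_with_model(model, c.llama_context_default_params());
    if (ctx == null) return error.ContextFailed;
    defer c.llama_free(ctx);

    // Tensor data stays memory-mapped by llama.cpp (use_mmap in
    // llama_model_params), so weight reads are zero-copy.
}
```

The `defer` statements pair each C-side allocation with its matching free at the point of acquisition, so the C heap is cleaned up correctly on every exit path.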
The Zig-First Advantage
Why wrap C in Zig instead of writing pure Zig? Three reasons:
Performance: llama.cpp’s C core is heavily optimized with SIMD intrinsics, multi-threading, and platform-specific tweaks. We inherit all of it.
Compatibility: New llama.cpp features drop in without rewriting core inference logic. The bindings layer absorbs API changes.
Safety: Zig’s type system catches errors at compile time. The C core handles the math, Zig handles the structure.
This hybrid approach gives us the best of both worlds: C’s raw performance with Zig’s modern tooling and safety guarantees.
Data Flow
```
User Input -> CLI Parser  -> Command Handler -> llama.cpp.zig -> C Core                         -> Token Stream -> Output
              (ArgIterator)  (run.zig, ...)     (Bindings)       (llama_decode, llama_sampling)
```
The flow is intentionally linear with minimal abstraction layers. Each component has a single responsibility, making the system easy to trace and debug.
Last updated: February 2026.