Architecture

igllama is built on a Zig-first philosophy, wrapping the battle-tested llama.cpp C core with idiomatic Zig abstractions.

High-Level Overview 🏗️

The architecture follows a clean layered design, from your terminal to the neural network weights.

  USER      igllama CLI
      ↓
  ZIG       llama.cpp.zig — Zig bindings + abstractions
      ↓
  C CORE    llama.cpp — neural network inference

Component Breakdown 🔧

main.zig — The Entry Point

The main.zig file serves as the command center, parsing CLI arguments and routing to appropriate command handlers. It uses Zig’s std.ArgIterator for argument parsing and manages the application lifecycle.

Key responsibilities:

  • Command routing (run, convert, info)
  • Global flag parsing (--verbose, --help)
  • Error handling and exit codes
  • Resource cleanup via defer statements
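The routing step can be sketched in C (a hypothetical analogue of the dispatcher; the real code uses Zig's std.ArgIterator, and the names below are illustrative, not taken from the repository):

```c
#include <string.h>

/* Hypothetical analogue of main.zig's command routing: match the
   first positional argument against the known subcommands. */
typedef enum { CMD_RUN, CMD_CONVERT, CMD_INFO, CMD_UNKNOWN } Command;

Command route(const char *arg) {
    if (arg == NULL)                 return CMD_UNKNOWN;
    if (strcmp(arg, "run") == 0)     return CMD_RUN;
    if (strcmp(arg, "convert") == 0) return CMD_CONVERT;
    if (strcmp(arg, "info") == 0)    return CMD_INFO;
    return CMD_UNKNOWN; /* caller prints usage and exits non-zero */
}
```

An unknown command falls through to a usage message and a non-zero exit code, which keeps the error-handling responsibility in one place.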

commands/ — Modular Command Handlers

Each command lives in its own module under commands/:

Command    Module                  Purpose
run        commands/run.zig        Interactive chat and completion
convert    commands/convert.zig    Model format conversion
info       commands/info.zig       GGUF metadata inspection

This modular design keeps concerns separated and makes the codebase navigable.

llama.cpp.zig — Zero-Overhead Bindings 🔄

The llama.cpp.zig module provides idiomatic Zig wrappers around llama.cpp’s C API. These bindings use Zig’s extern struct for memory layout compatibility and comptime for compile-time checks.

pub const Context = extern struct {
    ptr: *llama_context, // raw handle owned by the C core

    pub fn decode(self: Context, batch: *Batch) !void {
        const ret = llama_decode(self.ptr, batch.*);
        // llama_decode returns 0 on success and a nonzero code on
        // failure; translate the C status code into a Zig error.
        if (ret != 0) return error.DecodeFailed;
    }
};

The bindings maintain binary compatibility while adding Zig’s type safety and error handling.
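For contrast, here is the raw C convention the wrapper absorbs: llama_decode reports failure through an integer status code that every call site must remember to check. The stub below stands in for the real llama.cpp function so the sketch is self-contained.

```c
#include <stdio.h>

/* Stand-in for llama_decode's status-code convention:
   returns 0 on success, nonzero on failure. */
static int llama_decode_stub(int fail) { return fail ? 1 : 0; }

/* In C the check is manual and easy to forget; the Zig wrapper
   turns the nonzero code into error.DecodeFailed, which callers
   must explicitly handle or propagate. */
int decode_checked(int fail) {
    int ret = llama_decode_stub(fail);
    if (ret != 0) {
        fprintf(stderr, "decode failed: %d\n", ret);
        return -1;
    }
    return 0;
}
```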

GGUF File Structure 📦

GGUF is the binary container format for model weights and metadata used by llama.cpp. Introduced by Georgi Gerganov, the creator of llama.cpp, it's optimized for fast loading and memory mapping.

GGUF File Layout:

  Magic: 0x46554747 ("GGUF")                 4 bytes
  Version: uint32                            4 bytes
  Tensor Count: uint64                       8 bytes
  Metadata KV Count: uint64                  8 bytes
  Metadata KV Pairs                          variable
  Tensor Info (name, type, offset)           variable
  Tensor Data (aligned, memory-mappable)     variable

The format prioritizes:

  • Fast loading via memory mapping
  • Self-describing metadata stored as typed key-value pairs
  • Alignment for direct memory access
  • Version tolerance for forward compatibility
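The fixed-size prefix of the layout above can be decoded with a few memcpy calls. This is a sketch, not llama.cpp's loader: it assumes a little-endian host (matching GGUF's on-disk integers) and stops before the variable-length metadata and tensor-info sections.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size GGUF header prefix, in file order. */
typedef struct {
    uint32_t magic;             /* 0x46554747 == "GGUF" little-endian */
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} GgufHeader;

/* Returns 0 on success, -1 if the buffer is too short or the
   magic doesn't match. memcpy avoids unaligned-access issues. */
int gguf_read_header(const uint8_t *buf, size_t len, GgufHeader *out) {
    if (len < 24) return -1;               /* 4 + 4 + 8 + 8 bytes */
    memcpy(&out->magic,             buf,      4);
    memcpy(&out->version,           buf + 4,  4);
    memcpy(&out->tensor_count,      buf + 8,  8);
    memcpy(&out->metadata_kv_count, buf + 16, 8);
    return out->magic == 0x46554747u ? 0 : -1;
}
```

Validating the magic before touching anything else is what lets the info command reject non-GGUF files immediately.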

Memory Layout During Model Loading

When igllama loads a model, memory is organized in a specific pattern to maximize efficiency and enable zero-copy operations where possible.

Memory Layout (top-down):

  Stack (CLI args, local vars)                              ↓ grows down
  Heap: Zig allocator (contexts, batch buffers, KV cache)
  Memory-mapped GGUF (read-only, zero-copy tensor access)
  C heap (llama.cpp internal)                               ↑ grows up

Key memory regions:

  1. Stack: Small, fast allocations for CLI parsing and control flow
  2. Heap: Zig allocator manages contexts, batch buffers, and KV cache
  3. Memory-Mapped GGUF: Direct file mapping for zero-copy tensor reads
  4. C Heap: llama.cpp’s internal allocations (managed via llama_free)
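Region 3 is the interesting one. A minimal POSIX sketch of mapping a model file read-only, so tensor bytes are paged in on demand rather than copied into the heap (llama.cpp wraps the same idea behind its own mmap helpers; this is not its actual code):

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only for zero-copy access. Returns a pointer to
   the mapped bytes and stores the length in *size_out, or returns
   NULL on any failure. */
const uint8_t *map_model(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps the pages alive after close */
    if (p == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return (const uint8_t *)p;
}
```

Because the mapping is PROT_READ and MAP_PRIVATE, the OS can share the same physical pages across processes loading the same model, which is why mmap-based loading is both fast and memory-frugal.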

The Zig-First Advantage

Why wrap C in Zig instead of writing pure Zig? Three reasons:

Performance: llama.cpp’s C core is heavily optimized with SIMD intrinsics, multi-threading, and platform-specific tweaks. We inherit all of it.

Compatibility: New llama.cpp features drop in without rewriting core inference logic. The bindings layer absorbs API changes.

Safety: Zig’s type system catches errors at compile time. The C core handles the math, Zig handles the structure.

This hybrid approach gives us the best of both worlds: C’s raw performance with Zig’s modern tooling and safety guarantees.

Data Flow

User Input
    ↓  CLI Parser (std.ArgIterator)
    ↓  Command Handler (run.zig, convert.zig, info.zig)
    ↓  llama.cpp.zig bindings
    ↓  C Core (llama_decode, llama_sampling)
    ↓  Token Stream
Output

The flow is intentionally linear with minimal abstraction layers. Each component has a single responsibility, making the system easy to trace and debug.


Last updated: February 2026.