Architecture

igllama is built on a Zig-first philosophy, wrapping the battle-tested llama.cpp C core with idiomatic Zig abstractions.

High-Level Overview 🏗️

The architecture follows a clean layered design, from your terminal to the neural network weights.

  USER      igllama CLI
      ↓
  ZIG       llama.cpp.zig — Zig bindings + abstractions
      ↓
  C CORE    llama.cpp — neural network inference

Component Breakdown 🔧

main.zig — The Entry Point

The main.zig file serves as the command center, parsing CLI arguments and routing to appropriate command handlers. It uses Zig’s std.ArgIterator for argument parsing and manages the application lifecycle.

Key responsibilities:

  • Command routing (run, convert, info)
  • Global flag parsing (--verbose, --help)
  • Error handling and exit codes
  • Resource cleanup via defer statements
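The routing step can be sketched in C (a hypothetical analogue of the dispatcher; the real code uses Zig's std.ArgIterator, and the names below are illustrative, not taken from the repository):

```c
#include <string.h>

/* Hypothetical analogue of main.zig's command routing: match the
   first positional argument against the known subcommands. */
typedef enum { CMD_RUN, CMD_CONVERT, CMD_INFO, CMD_UNKNOWN } Command;

Command route(const char *arg) {
    if (arg == NULL)                 return CMD_UNKNOWN;
    if (strcmp(arg, "run") == 0)     return CMD_RUN;
    if (strcmp(arg, "convert") == 0) return CMD_CONVERT;
    if (strcmp(arg, "info") == 0)    return CMD_INFO;
    return CMD_UNKNOWN; /* caller prints usage and exits non-zero */
}
```

An unknown command falls through to a usage message and a non-zero exit code, which keeps the error-handling responsibility in one place.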

commands/ — Modular Command Handlers

Each command lives in its own module under commands/:

Command    Module                  Purpose
run        commands/run.zig        Interactive chat and completion
convert    commands/convert.zig    Model format conversion
info       commands/info.zig       GGUF metadata inspection

This modular design keeps concerns separated and makes the codebase navigable.

llama.cpp.zig — Zero-Overhead Bindings 🔄

The llama.cpp.zig module provides idiomatic Zig wrappers around llama.cpp’s C API. These bindings use Zig’s extern struct for memory layout compatibility and comptime for compile-time checks.

pub const Context = extern struct {
    ptr: *llama_context, // raw handle owned by the C core

    pub fn decode(self: Context, batch: *Batch) !void {
        const ret = llama_decode(self.ptr, batch.*);
        // llama_decode returns 0 on success and a nonzero code on
        // failure; translate the C status code into a Zig error.
        if (ret != 0) return error.DecodeFailed;
    }
};

The bindings maintain binary compatibility while adding Zig’s type safety and error handling.
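For contrast, here is the raw C convention the wrapper absorbs: llama_decode reports failure through an integer status code that every call site must remember to check. The stub below stands in for the real llama.cpp function so the sketch is self-contained.

```c
#include <stdio.h>

/* Stand-in for llama_decode's status-code convention:
   returns 0 on success, nonzero on failure. */
static int llama_decode_stub(int fail) { return fail ? 1 : 0; }

/* In C the check is manual and easy to forget; the Zig wrapper
   turns the nonzero code into error.DecodeFailed, which callers
   must explicitly handle or propagate. */
int decode_checked(int fail) {
    int ret = llama_decode_stub(fail);
    if (ret != 0) {
        fprintf(stderr, "decode failed: %d\n", ret);
        return -1;
    }
    return 0;
}
```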

GGUF File Structure 📦

GGUF is the binary container format for model weights and metadata used by llama.cpp. Introduced by Georgi Gerganov, the creator of llama.cpp, it's optimized for fast loading and memory mapping.

GGUF File Layout:

  Magic: 0x46554747 ("GGUF")                 4 bytes
  Version: uint32                            4 bytes
  Tensor Count: uint64                       8 bytes
  Metadata KV Count: uint64                  8 bytes
  Metadata KV Pairs                          variable
  Tensor Info (name, type, offset)           variable
  Tensor Data (aligned, memory-mappable)     variable

The format prioritizes:

  • Fast loading via memory mapping
  • Self-describing metadata stored as typed key-value pairs
  • Alignment for direct memory access
  • Version tolerance for forward compatibility
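The fixed-size prefix of the layout above can be decoded with a few memcpy calls. This is a sketch, not llama.cpp's loader: it assumes a little-endian host (matching GGUF's on-disk integers) and stops before the variable-length metadata and tensor-info sections.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Fixed-size GGUF header prefix, in file order. */
typedef struct {
    uint32_t magic;             /* 0x46554747 == "GGUF" little-endian */
    uint32_t version;
    uint64_t tensor_count;
    uint64_t metadata_kv_count;
} GgufHeader;

/* Returns 0 on success, -1 if the buffer is too short or the
   magic doesn't match. memcpy avoids unaligned-access issues. */
int gguf_read_header(const uint8_t *buf, size_t len, GgufHeader *out) {
    if (len < 24) return -1;               /* 4 + 4 + 8 + 8 bytes */
    memcpy(&out->magic,             buf,      4);
    memcpy(&out->version,           buf + 4,  4);
    memcpy(&out->tensor_count,      buf + 8,  8);
    memcpy(&out->metadata_kv_count, buf + 16, 8);
    return out->magic == 0x46554747u ? 0 : -1;
}
```

Validating the magic before touching anything else is what lets the info command reject non-GGUF files immediately.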

Memory Layout During Model Loading

When igllama loads a model, memory is organized in a specific pattern to maximize efficiency and enable zero-copy operations where possible.

Memory Layout (top-down):

  Stack (CLI args, local vars)                              ↓ grows down
  Heap: Zig allocator (contexts, batch buffers, KV cache)
  Memory-mapped GGUF (read-only, zero-copy tensor access)
  C heap (llama.cpp internal)                               ↑ grows up

Key memory regions:

  1. Stack: Small, fast allocations for CLI parsing and control flow
  2. Heap: Zig allocator manages contexts, batch buffers, and KV cache
  3. Memory-Mapped GGUF: Direct file mapping for zero-copy tensor reads
  4. C Heap: llama.cpp’s internal allocations (managed via llama_free)
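Region 3 is the interesting one. A minimal POSIX sketch of mapping a model file read-only, so tensor bytes are paged in on demand rather than copied into the heap (llama.cpp wraps the same idea behind its own mmap helpers; this is not its actual code):

```c
#include <fcntl.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only for zero-copy access. Returns a pointer to
   the mapped bytes and stores the length in *size_out, or returns
   NULL on any failure. */
const uint8_t *map_model(const char *path, size_t *size_out) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;
    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size == 0) { close(fd); return NULL; }
    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping keeps the pages alive after close */
    if (p == MAP_FAILED) return NULL;
    *size_out = (size_t)st.st_size;
    return (const uint8_t *)p;
}
```

Because the mapping is PROT_READ and MAP_PRIVATE, the OS can share the same physical pages across processes loading the same model, which is why mmap-based loading is both fast and memory-frugal.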

The Zig-First Advantage

Why wrap C in Zig instead of writing pure Zig? Three reasons:

Performance: llama.cpp’s C core is heavily optimized with SIMD intrinsics, multi-threading, and platform-specific tweaks. We inherit all of it.

Compatibility: New llama.cpp features drop in without rewriting core inference logic. The bindings layer absorbs API changes.

Safety: Zig’s type system catches errors at compile time. The C core handles the math, Zig handles the structure.

This hybrid approach gives us the best of both worlds: C’s raw performance with Zig’s modern tooling and safety guarantees.

Data Flow

User Input
    ↓  CLI Parser (std.ArgIterator)
    ↓  Command Handler (run.zig, convert.zig, info.zig)
    ↓  llama.cpp.zig bindings
    ↓  C Core (llama_decode, llama_sampling)
    ↓  Token Stream
Output

The flow is intentionally linear with minimal abstraction layers. Each component has a single responsibility, making the system easy to trace and debug.


Last updated: February 2026.