ggml-zig
A pure Zig port of ggml — the tensor library powering llama.cpp and similar LLM inference engines.
Status: Vulkan backend only. Works with real models. Performance is not yet optimized.
Why Zig?
Porting ggml to Zig isn’t about novelty. It’s about what Zig forces into the open that C++ hides:
- Explicit memory management — no GC, no hidden allocations, no surprises. Every allocation is visible at the call site. Staging-buffer policies and lifetimes are clear rather than implicit.
- Explicit error handling — GPU and driver failures are normal control flow via error unions, not exceptions or silent side-channels. No unchecked return codes.
- Comptime type safety — push constants, specialization constants, and descriptor layouts are validated at compile time, catching ABI mismatches before they reach the driver.
- C-level performance — no runtime overhead, no GC pauses, direct control over data layout and memory placement.
- No hidden control flow — defer, errdefer, and explicit allocators make resource cleanup auditable and predictable.
- Better refactoring safety — ownership is explicit, so backend refactors (sync, pipeline, multi-GPU) are less likely to silently break lifetime invariants.
Simply: I like Zig, and Vulkan backend development is exactly where these properties matter most.
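To make a few of these properties concrete, here is a minimal sketch rather than code from this repository. The names AddPushConstants, validatePushConstants, and StagingPair are hypothetical, and host memory stands in for real Vulkan staging buffers. It shows a push-constant layout checked at compile time and an allocation pair whose cleanup on failure is spelled out with errdefer.

```zig
const std = @import("std");

// Hypothetical push-constant block for an element-wise add shader.
// `extern struct` gives a C-compatible, well-defined field layout.
const AddPushConstants = extern struct {
    n_elements: u32,
    src0_offset: u32,
    src1_offset: u32,
    dst_offset: u32,
};

// Vulkan guarantees at least 128 bytes of push-constant space
// (the minimum allowed value of maxPushConstantsSize).
const min_guaranteed_push_bytes = 128;

// Compile-time validation: an oversized or oddly sized block fails the build.
fn validatePushConstants(comptime T: type) void {
    if (@sizeOf(T) > min_guaranteed_push_bytes) {
        @compileError(@typeName(T) ++ " exceeds the guaranteed push-constant size");
    }
    if (@sizeOf(T) % 4 != 0) {
        @compileError(@typeName(T) ++ " size must be a multiple of 4 bytes");
    }
}

comptime {
    validatePushConstants(AddPushConstants);
}

// Hypothetical stand-in for a pair of staging buffers (host memory only).
const StagingPair = struct {
    upload: []u8,
    download: []u8,

    // Both allocations are visible here. If the second one fails,
    // errdefer releases the first before the error propagates.
    fn init(allocator: std.mem.Allocator, size: usize) error{OutOfMemory}!StagingPair {
        const upload = try allocator.alloc(u8, size);
        errdefer allocator.free(upload);
        const download = try allocator.alloc(u8, size);
        return .{ .upload = upload, .download = download };
    }

    fn deinit(self: StagingPair, allocator: std.mem.Allocator) void {
        allocator.free(self.download);
        allocator.free(self.upload);
    }
};
```

The caller chooses the allocator and must handle the error union, so neither the allocation nor the failure path can go unnoticed.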
Vulkan Backend
The Vulkan backend is organized into clear domains:
| Domain | Responsibility |
|---|---|
| core | Context, device bindings, constants |
| device | Instance, device selection, queues |
| memory | Buffers, staging, transfers |
| pipeline | SPIR-V shader registry, descriptors, push constants |
| compute | Graph dispatch, op routing, fusion |
Supported Operations
- Binary: add, sub, mul, div
- Unary: sqr, sqrt, sin, cos, log, scale, clamp, and more
- Normalization: rms_norm, norm, l2_norm, group_norm
- Attention: softmax, rope, flash_attn, diag_mask_inf
- Matrix multiply: mul_mat, mul_mat_id
- Tensor/Shape: repeat, concat, cpy, get_rows, set_rows
- Misc: sum, argsort, im2col, pool
Fusion support is included for multi-add and attention-adjacent patterns, reducing dispatch overhead and memory traffic.
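As a rough illustration of what detecting a multi-add pattern can look like, the sketch below scans a simplified node list for a chain of add nodes in which each add consumes the previous one's result. The Op, Node, and fusableAddChainLen names are hypothetical and the node model is deliberately reduced; the backend's actual fusion logic may differ.

```zig
const std = @import("std");

// Hypothetical, simplified node: real graph nodes carry tensors, not indices.
const Op = enum { add, mul, rms_norm };

const Node = struct {
    op: Op,
    // Index of the node whose output feeds this one, or null for a graph input.
    src0: ?usize,
};

// Returns how many consecutive `add` nodes starting at `start` form a chain
// in which each add consumes the result of the node directly before it.
// A result of 0 or 1 means there is nothing to fuse.
fn fusableAddChainLen(nodes: []const Node, start: usize) usize {
    if (start >= nodes.len or nodes[start].op != .add) return 0;
    var len: usize = 1;
    while (start + len < nodes.len) : (len += 1) {
        const next = nodes[start + len];
        if (next.op != .add) break;
        const src = next.src0 orelse break;
        if (src != start + len - 1) break;
    }
    return len;
}
```

Under this model, a chain of length n is a candidate for one fused dispatch instead of n separate ones, which is where the savings in dispatch overhead and intermediate memory traffic would come from.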
Requirements
- Zig 0.16.0-dev.3121+d34b868bc or newer master
- Vulkan-capable GPU and driver
Building
# Compile-only check (fast, no GPU required)
zig build check
# Full build
zig build
zig build check compiles the public ggml module through a dependency smoke test — a quick way to confirm your toolchain and dependencies resolve correctly.
Usage
Fetch the dependency (this writes the URL and hash into build.zig.zon automatically):
zig fetch --save git+https://github.com/zillama/ggml-zig#0.0.1
Then wire it in build.zig:
const ggml_dep = b.dependency("ggml", .{ .target = target, .optimize = optimize });
exe.root_module.addImport("ggml", ggml_dep.module("ggml"));
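For context, a complete build.zig around those two lines might look like the sketch below. The executable name my-app and the path src/main.zig are placeholders, and the dependency key "ggml" is assumed to match whatever zig fetch --save wrote into build.zig.zon; check that file if the lookup fails.

```zig
const std = @import("std");

pub fn build(b: *std.Build) void {
    const target = b.standardTargetOptions(.{});
    const optimize = b.standardOptimizeOption(.{});

    // "my-app" and src/main.zig are placeholders for your own project.
    const exe = b.addExecutable(.{
        .name = "my-app",
        .root_module = b.createModule(.{
            .root_source_file = b.path("src/main.zig"),
            .target = target,
            .optimize = optimize,
        }),
    });

    // The dependency name must match the key in build.zig.zon
    // (assumed here to be "ggml").
    const ggml_dep = b.dependency("ggml", .{ .target = target, .optimize = optimize });
    exe.root_module.addImport("ggml", ggml_dep.module("ggml"));

    b.installArtifact(exe);
}
```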
Then in your code:
const ggml = @import("ggml");
const vk = ggml.backends.vulkan;
// Access backend context, device, memory, pipeline, compute domains:
// vk.context, vk.device, vk.buffer, vk.pipeline_registry, vk.graph ...
The public API surface is rooted at src/ggml.zig. Backends are accessed via ggml.backends.vulkan.*.
Testing
Tests are split into three tiers:
| Step | Command | Requires GPU |
|---|---|---|
| Compile check | zig build check | No |
| Unit + integration | zig build test | No (hermetic) |
| End-to-end (real GPU) | zig build test-e2e | Yes |
Unit & integration tests
zig build test
Covers:
- Operation support tables, offload heuristics, shape/stride math
- Vulkan device creation and queue initialization
- Shader dispatch with real tensor data and result verification
- Pipeline/descriptor/push-constant wiring
- SPIR-V validation and generated shader compilation
- All backend modules exercised via comptime import checks
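For readers unfamiliar with the shape/stride math mentioned in the list above, here is a small sketch of the ggml convention for contiguous, non-quantized tensors: ne holds element counts per dimension, nb holds byte strides, and dimension 0 is innermost. The function names are hypothetical and not taken from this repository, and quantized block types are left out.

```zig
const std = @import("std");

const max_dims = 4; // ggml tensors have up to four dimensions

// Byte strides for a contiguous tensor of a non-quantized element type.
// ne[i] is the element count along dimension i (ne[0] is innermost);
// nb[i] is the byte distance between consecutive indices along dimension i.
fn contiguousStrides(ne: [max_dims]usize, elem_size: usize) [max_dims]usize {
    var nb: [max_dims]usize = undefined;
    nb[0] = elem_size;
    for (1..max_dims) |i| {
        nb[i] = nb[i - 1] * ne[i - 1];
    }
    return nb;
}

// Byte offset of the element at index (i0, i1, i2, i3).
fn elementOffset(nb: [max_dims]usize, idx: [max_dims]usize) usize {
    var offset: usize = 0;
    for (nb, idx) |stride, i| offset += stride * i;
    return offset;
}
```

A unit test at this tier can compare such strides against hand-computed values without touching a GPU.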
End-to-end tests
zig build test-e2e
Runs actual tensor operations on a Vulkan GPU and verifies numeric results. The inference smoke test checks model file loading when a path is provided:
GGML_TEST_MODEL_PATH=/path/to/model.gguf zig build test-e2e
Set GGML_E2E_VERBOSE=1 for detailed per-operation output.
Coverage summary
| Area | Tier |
|---|---|
| Backend registration and device selection | unit |
| Tensor type math and stride correctness | unit |
| Vulkan pipeline creation and descriptor sets | integration |
| SPIR-V shader compilation and validation | integration |
| Binary/unary/norm op dispatch on GPU | e2e |
| Matrix multiply on GPU | e2e |
| Attention ops (softmax, rope, flash_attn) | e2e |
| Full graph execution with result readback | e2e |
| Model file loading smoke | e2e (opt-in via env) |