
tairov/llama2.mojo

A pure Mojo implementation of Llama 2 model inference with SIMD and multithreading optimizations.

Velocity (7d): +2.1 ★/day · Trend: steady

This repository provides a single-file Llama 2 inference implementation written entirely in Mojo. It uses Mojo's SIMD and vectorization primitives for hardware-level optimization, and it reports outperforming the original llama2.c implementation by 30% and llama.cpp by 20% on baby-llama inference. The project supports several model sizes (260K to 110M parameters) as well as TinyLlama-1.1B, and its benchmarks on Apple M1 Max show throughput of up to 1025 tokens/second.
