tairov/llama2.mojo
A pure Mojo implementation of Llama 2 model inference with SIMD and multithreading optimizations.

This repository provides a single-file implementation of Llama 2 inference written entirely in Mojo. It uses Mojo’s SIMD and vectorization primitives for hardware-level optimization, outperforming the original llama2.c implementation by 30% and llama.cpp by 20% on baby-llama inference. The project supports several model sizes (260K to 110M parameters) as well as TinyLlama-1.1B, and its benchmarks on an Apple M1 Max show throughput of up to 1025 tokens/second.
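
To make the figures above easier to read, here is a minimal Python sketch of how a tokens/second throughput and a relative speedup percentage are typically computed. It is not taken from the repository: the helper names `tokens_per_second` and `speedup_percent` and all input numbers are hypothetical and chosen only to illustrate the arithmetic.

```python
def tokens_per_second(tokens_generated: int, elapsed_seconds: float) -> float:
    """Throughput: generated tokens divided by wall-clock time."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return tokens_generated / elapsed_seconds


def speedup_percent(new_tps: float, baseline_tps: float) -> float:
    """Relative improvement of new_tps over baseline_tps, in percent."""
    if baseline_tps <= 0:
        raise ValueError("baseline_tps must be positive")
    return 100.0 * (new_tps - baseline_tps) / baseline_tps


if __name__ == "__main__":
    # Hypothetical run: 256 tokens generated in 0.25 s of wall-clock time.
    print(tokens_per_second(256, 0.25))

    # Hypothetical comparison: 130 tok/s measured against a 100 tok/s baseline.
    print(speedup_percent(130.0, 100.0))
```

Under this definition, "outperforms by 30%" means the measured throughput is 1.3 times the baseline throughput on the same model and hardware.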