tairov/llama2.mojo

A pure Mojo implementation of Llama 2 model inference with SIMD and multithreading optimizations.

★2.1k stars Mojo Inference · Serving Language Models

Velocity (7d): +2.1 ★/day · Trend: steady

This repository provides a single-file Llama 2 inference implementation written entirely in Mojo. It uses Mojo’s SIMD and vectorization primitives for hardware-level optimization, and the project reports outperforming the original llama2.c by 30% and llama.cpp by 20% on baby-llama inference. It supports several model sizes (260K to 110M parameters) as well as TinyLlama-1.1B, and its benchmarks on an Apple M1 Max show throughput of up to 1025 tokens/second.
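
To make the SIMD idea concrete, here is a minimal sketch in plain Python of the pattern that vectorized inference kernels generally follow: a matrix-vector product whose inner dot product is consumed in fixed-width chunks with a scalar tail. It is not code from llama2.mojo, and Python itself gains no speed from this structure. The names `SIMD_WIDTH`, `dot_chunked`, and `matvec` are hypothetical, and the width of 4 is an assumption chosen for illustration.

```python
# Illustrative sketch only: a pure-Python stand-in for a SIMD-style matvec.
# SIMD_WIDTH, dot_chunked and matvec are hypothetical names, not from llama2.mojo.

SIMD_WIDTH = 4  # assumed lane count; real widths depend on dtype and CPU


def dot_chunked(a, b, width=SIMD_WIDTH):
    """Dot product that consumes `width` elements per step, then a scalar tail."""
    assert len(a) == len(b)
    n = len(a)
    lanes = [0.0] * width          # one partial sum per lane
    main = n - (n % width)         # largest multiple of width that fits in n
    for i in range(0, main, width):
        for lane in range(width):  # a SIMD unit would do these in one instruction
            lanes[lane] += a[i + lane] * b[i + lane]
    total = sum(lanes)             # horizontal reduction across lanes
    for i in range(main, n):       # leftover elements that do not fill a vector
        total += a[i] * b[i]
    return total


def matvec(w, x, rows, cols):
    """Row-major matrix-vector product: out[r] = dot(row r of w, x)."""
    # Each row is independent, which is what makes it natural to split
    # rows across threads in a multithreaded implementation.
    return [dot_chunked(w[r * cols:(r + 1) * cols], x) for r in range(rows)]


if __name__ == "__main__":
    w = [1.0, 2.0, 3.0, 4.0, 5.0,
         6.0, 7.0, 8.0, 9.0, 10.0]   # 2 x 5 matrix, row-major
    x = [1.0, 0.0, -1.0, 2.0, 0.5]
    print(matvec(w, x, rows=2, cols=5))  # [8.5, 21.0]
```

With 5 columns and a width of 4, each row takes one full chunk plus one tail element, so both code paths are exercised.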