andrewkchan/yalm

LLM inference from scratch, for people who read the comments

A bare-metal C++/CUDA inference engine built as a readable, documented exercise in performance engineering.

★ 591 stars | C++ | Inference · Serving | Language Models

What it does

yalm implements transformer inference for models like Mistral and Llama 3.2 using only C++, CUDA, and the bare minimum I/O libraries needed to load weights. It handles completion, perplexity scoring, and passkey retrieval tests, but notably not chat. The author is explicit that this is an educational codebase: the goal is readable code and documented optimizations, not a production product.
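Perplexity scoring, mentioned above, reduces to a small formula: the exponential of the negative mean log-probability the model assigned to the tokens it was shown. The following is a minimal sketch of that formula, not code from yalm; the function name and the `token_logprobs` input are hypothetical.

```cpp
#include <cmath>
#include <stdexcept>
#include <vector>

// Perplexity = exp(-(1/N) * sum_i log p(token_i | preceding tokens)).
// `token_logprobs` holds the natural-log probability the model assigned
// to each observed token. Hypothetical helper, not taken from yalm.
double perplexity(const std::vector<double>& token_logprobs) {
    if (token_logprobs.empty()) {
        throw std::invalid_argument("need at least one token");
    }
    double sum = 0.0;
    for (double lp : token_logprobs) {
        sum += lp;
    }
    return std::exp(-sum / static_cast<double>(token_logprobs.size()));
}
```

As a sanity check, a model that assigns probability 0.25 to every token has a perplexity of 4, and a model that is always certain and correct has a perplexity of 1.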

The interesting bit

Despite being framed as a learning exercise, it is competitive with llama.cpp on an RTX 4090: about 64 tok/s versus llama.cpp's 61 tok/s for short sequences on Mistral-7B, while remaining small enough to study. The project treats dynamic parallelism and other "black-magic" tricks as optional, preferring to explain and measure each optimization it does use.

Key highlights

Runs entirely without ML frameworks; only weight I/O is outsourced.
Benchmarks within shouting distance of llama.cpp and calm on Mistral-7B FP16.
Includes a test suite with kernel-level microbenchmarks and memory-bandwidth probes.
Supports CPU fallback and sliding-window context limits.
Heavily commented and paired with a detailed blog post on building fast inference from scratch.
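The sliding-window context limit in the list above can be illustrated with a ring-buffer KV cache, which is one common way to bound context memory. This is a sketch under that assumption; whether yalm implements it this way is not stated here, and both helper names are hypothetical.

```cpp
#include <cstddef>

// Slot in a fixed-size KV cache used as a ring buffer: the token at
// absolute position `pos` overwrites the entry written `window` tokens
// earlier. Precondition: window > 0.
std::size_t kv_slot(std::size_t pos, std::size_t window) {
    return pos % window;
}

// Number of cached positions attention can see while processing `pos`
// (the current token included), capped at the window size.
std::size_t visible_tokens(std::size_t pos, std::size_t window) {
    return (pos + 1 < window) ? pos + 1 : window;
}
```

With a window of 4, position 5 lands in slot 1, and from position 3 onward attention sees exactly 4 tokens no matter how long the sequence grows.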

Caveats

NVIDIA-only and single-GPU; the full model must fit in VRAM.
Chat interface is unimplemented, and only a handful of models have been tested as of late 2024.
The author explicitly warns against production use.
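The VRAM caveat can be checked with simple arithmetic before downloading a model: weights plus KV cache must fit on the card. The sketch below is an estimate, not yalm code. The Mistral-7B shape used in the note after it (32 layers, 8 KV heads, head dimension 128) and the round weight and VRAM sizes are assumptions, and real usage adds activations and allocator overhead on top.

```cpp
#include <cstdint>

// Bytes for cached keys and values across all layers; the leading 2
// accounts for storing both K and V. Hypothetical helper.
std::uint64_t kv_cache_bytes(std::uint64_t n_layers, std::uint64_t n_kv_heads,
                             std::uint64_t head_dim, std::uint64_t seq_len,
                             std::uint64_t bytes_per_elem) {
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem;
}

// True if weights plus KV cache fit within the given VRAM budget.
bool fits_in_vram(std::uint64_t weight_bytes, std::uint64_t kv_bytes,
                  std::uint64_t vram_bytes) {
    return weight_bytes + kv_bytes <= vram_bytes;
}
```

Under those assumptions, an FP16 KV cache for a 4096-token context is `kv_cache_bytes(32, 8, 128, 4096, 2)`, which is 512 MiB. Added to roughly 14.5 GB of FP16 weights, that fits on a 24 GB card but not on a 12 GB one.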

Verdict

Grab this if you want to understand how LLM inference actually works on a GPU without wading through a framework. Skip it if you need multi-GPU serving, chat APIs, or a battle-tested engine.

Frequently asked

What is andrewkchan/yalm? A bare-metal C++/CUDA inference engine built as a readable, documented exercise in performance engineering.
Is yalm open source? Yes, andrewkchan/yalm is an open-source project tracked on heatdrop.
What language is yalm written in? andrewkchan/yalm is primarily written in C++.
How popular is yalm? andrewkchan/yalm has 591 stars on GitHub.
Where can I find yalm? andrewkchan/yalm is on GitHub at https://github.com/andrewkchan/yalm.