andrewkchan/yalm
A from-scratch C++/CUDA implementation of LLM inference that runs Mistral-7B and similar models on NVIDIA GPUs.

Velocity (7d): +1.0 ★/day · Trend: steady
A custom LLM inference engine written in C++ and CUDA that loads model weights in the HuggingFace safetensors format and executes transformer forward passes on the GPU. It serves as an educational resource for understanding inference optimization techniques, including memory management, CUDA kernel implementation, and batching strategies. It supports FP16 inference with a 4k context length on an RTX 4090 and achieves throughput comparable to llama.cpp.
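
To make the FP16, 4k-context claim concrete, here is a rough memory-budget sketch in C++14. It is not taken from yalm's source: the `ModelConfig` struct and the helper names are hypothetical, the shape values are assumed from the published Mistral-7B-v0.1 configuration, and the KV cache is assumed to be stored in FP16 like the weights.

```cpp
#include <cstdint>

// Hypothetical config struct for illustration only. The values below are
// assumed from the published Mistral-7B-v0.1 config, not read from yalm.
struct ModelConfig {
  std::uint64_t n_layers;    // transformer blocks
  std::uint64_t dim;         // model (embedding) width
  std::uint64_t hidden_dim;  // feed-forward inner width
  std::uint64_t n_heads;     // query heads
  std::uint64_t n_kv_heads;  // key/value heads (grouped-query attention)
  std::uint64_t head_dim;    // width of one attention head
  std::uint64_t vocab_size;  // tokenizer vocabulary size
};

constexpr ModelConfig kMistral7B{32, 4096, 14336, 32, 8, 128, 32000};
constexpr std::uint64_t kBytesPerFp16 = 2;

// KV cache size: one K and one V vector per layer, per KV head, per token.
constexpr std::uint64_t kv_cache_bytes(const ModelConfig& c, std::uint64_t seq_len) {
  return 2 * c.n_layers * c.n_kv_heads * c.head_dim * seq_len * kBytesPerFp16;
}

// Parameter count for a Llama-style decoder with untied embeddings.
constexpr std::uint64_t param_count(const ModelConfig& c) {
  const std::uint64_t q_dim = c.n_heads * c.head_dim;
  const std::uint64_t kv_dim = c.n_kv_heads * c.head_dim;
  // Attention: Wq, Wk, Wv, Wo projections.
  const std::uint64_t attn = c.dim * q_dim + 2 * c.dim * kv_dim + q_dim * c.dim;
  // Gated feed-forward: three dim x hidden_dim matrices.
  const std::uint64_t mlp = 3 * c.dim * c.hidden_dim;
  // Two RMSNorm weight vectors per layer.
  const std::uint64_t norms = 2 * c.dim;
  // Input embedding table plus output projection.
  const std::uint64_t embeddings = 2 * c.vocab_size * c.dim;
  // The trailing c.dim is the final RMSNorm before the output projection.
  return c.n_layers * (attn + mlp + norms) + embeddings + c.dim;
}

constexpr std::uint64_t weight_bytes(const ModelConfig& c) {
  return param_count(c) * kBytesPerFp16;
}
```

Under these assumptions, `kv_cache_bytes(kMistral7B, 4096)` comes to 512 MiB and `weight_bytes(kMistral7B)` to roughly 14.5 GB, so weights plus a full 4k-token cache fit within the RTX 4090's 24 GB with room left for activations and scratch buffers.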