andrewkchan/yalm

A from-scratch C++/CUDA implementation of LLM inference that serves Mistral-7B and similar models on NVIDIA GPUs.


yalm is a custom LLM inference engine written in C++ and CUDA. It loads model weights in the Hugging Face safetensors format and executes transformer forward passes on the GPU. It serves as an educational resource for understanding inference optimization techniques, including memory management, CUDA kernel implementation, and batching strategies. It supports FP16 inference with a 4k context length on an RTX 4090 and achieves throughput comparable to llama.cpp.
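To make the weight-loading step concrete, here is a minimal sketch of how a safetensors file can be split into its header and its tensor data. It relies only on the published file layout: an 8-byte little-endian unsigned length N, followed by N bytes of JSON metadata, followed by the raw tensor bytes. The struct and function names are hypothetical and are not taken from the yalm source; a real loader would go on to parse the JSON and would typically memory-map the file rather than hold it in a vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper types for illustration only (not yalm's actual API).
struct SafetensorsHeader {
  std::string json;        // JSON text describing each tensor's dtype, shape, data_offsets
  std::size_t data_start;  // byte offset at which the raw tensor data region begins
};

// Splits an in-memory safetensors buffer into header JSON and data offset.
// Returns false if the buffer is too short to hold the declared header.
bool read_safetensors_header(const std::vector<std::uint8_t>& buf,
                             SafetensorsHeader& out) {
  if (buf.size() < 8) return false;

  // The first 8 bytes hold the header length as a little-endian u64.
  std::uint64_t n = 0;
  for (int i = 0; i < 8; ++i) {
    n |= static_cast<std::uint64_t>(buf[i]) << (8 * i);
  }

  // Reject lengths that would run past the end of the buffer.
  if (n > buf.size() - 8) return false;

  out.json.assign(reinterpret_cast<const char*>(buf.data()) + 8,
                  static_cast<std::size_t>(n));
  out.data_start = 8 + static_cast<std::size_t>(n);
  return true;
}
```

Each entry's `data_offsets` in the JSON is relative to `data_start`, so a tensor's bytes can be located by adding its begin offset to that value before copying the data to the GPU.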
