yassa9/qwen600
A minimalist single-batch CUDA inference engine for Qwen3-0.6B that runs entirely on GPU with no Python dependencies.

This repository implements a static, suckless-style inference engine for the Qwen3-0.6B language model in pure CUDA C/C++. It runs single-batch LLM inference on the GPU, uses compile-time constants for optimization, and relies on the cuBLAS and CUB libraries for efficient computation. The project's own benchmarks claim 8.5% faster token generation than llama.cpp and 292% faster than HuggingFace with flash-attn. Configuration is done directly in the source code to minimize dependencies and abstractions.
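
To illustrate what configuration through compile-time constants in source can look like, here is a minimal host-side C++ sketch. It is not taken from the repository: the namespace, constant names, and helper functions are hypothetical, and the numeric values are assumed to match the published Qwen3-0.6B configuration (verify them against the model's `config.json`). `kMaxSeqLen` is an arbitrary illustrative choice.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical compile-time configuration; names are illustrative only.
// Values are assumed to match Qwen3-0.6B and should be checked before use.
namespace cfg {
constexpr int kDim        = 1024;  // hidden size
constexpr int kNumLayers  = 28;    // transformer blocks
constexpr int kNumQHeads  = 16;    // query heads
constexpr int kNumKVHeads = 8;     // key/value heads (grouped-query attention)
constexpr int kHeadDim    = 128;   // per-head dimension
constexpr int kMaxSeqLen  = 8192;  // assumption: chosen by the user at build time
}  // namespace cfg

// Invalid configurations are rejected at compile time, not at run time.
static_assert(cfg::kNumQHeads % cfg::kNumKVHeads == 0,
              "query heads must be a multiple of KV heads");
static_assert(cfg::kDim > 0 && cfg::kHeadDim > 0, "dimensions must be positive");

// Elements stored per token in the KV cache: one K and one V vector
// per KV head, per layer.
constexpr std::size_t kv_elems_per_token() {
    return static_cast<std::size_t>(2) * cfg::kNumLayers * cfg::kNumKVHeads *
           cfg::kHeadDim;
}

// Total KV cache size in bytes for a given sequence length and element width.
inline std::size_t kv_cache_bytes(int seq_len, std::size_t bytes_per_elem) {
    assert(seq_len >= 0 && seq_len <= cfg::kMaxSeqLen);
    return kv_elems_per_token() * static_cast<std::size_t>(seq_len) *
           bytes_per_elem;
}
```

Because every dimension is a constant expression, buffer sizes can be fixed at build time and the compiler is free to unroll or specialize loops over them. Under the assumed values, `kv_cache_bytes(cfg::kMaxSeqLen, 2)` with 2-byte elements comes to 939,524,096 bytes (896 MiB). Changing the configuration means editing the constants and recompiling.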