
yassa9/qwen600

A minimalist single-batch CUDA inference engine for Qwen3-0.6B that runs entirely on GPU with no Python dependencies.

554 stars · CUDA · Inference · Serving
Velocity (7d): +2.0 ★/day · Trend: steady

This repository implements a static, suckless-style inference engine for the Qwen3-0.6B language model in pure CUDA C/C++. It runs single-batch LLM inference on the GPU, fixes model parameters as compile-time constants so the compiler can optimize around them, and relies on the cuBLAS and CUB libraries for the heavy computation. The project's benchmarks claim token generation 8.5% faster than llama.cpp and 292% faster than HuggingFace with flash-attn. Configuration is done by editing the source code directly, which keeps dependencies and abstraction layers to a minimum.
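To illustrate what configuration through compile-time constants can look like, here is a minimal sketch in C++ that also compiles as CUDA C++ host code. It is not taken from the qwen600 source: the namespace `cfg`, the constant names, and the helper `kv_cache_bytes` are hypothetical, the model dimensions are assumed to match the published Qwen3-0.6B configuration, and `SEQ_LEN` is an arbitrary illustrative choice. Verify every value against the model's `config.json` before relying on it.

```cpp
#include <cstddef>

// Hypothetical compile-time configuration; names are illustrative only.
// Dimensions are ASSUMED to match Qwen3-0.6B and should be verified.
namespace cfg {
constexpr int DIM        = 1024;  // hidden size (assumed)
constexpr int N_LAYERS   = 28;    // transformer blocks (assumed)
constexpr int N_HEADS    = 16;    // query heads (assumed)
constexpr int N_KV_HEADS = 8;     // key/value heads, grouped-query attention (assumed)
constexpr int HEAD_DIM   = 128;   // per-head dimension (assumed)
constexpr int SEQ_LEN    = 8192;  // maximum context, an arbitrary choice for this sketch

// Derived sizes are computed by the compiler, not at run time.
constexpr int Q_DIM  = N_HEADS * HEAD_DIM;
constexpr int KV_DIM = N_KV_HEADS * HEAD_DIM;

// Invalid configurations fail the build instead of failing at run time.
static_assert(N_HEADS % N_KV_HEADS == 0,
              "query heads must divide evenly across key/value heads");
static_assert(DIM > 0 && SEQ_LEN > 0, "dimensions must be positive");
}  // namespace cfg

// Bytes needed for a key cache plus a value cache across all layers,
// given the size of one stored element (for example 2 for a 16-bit type).
constexpr std::size_t kv_cache_bytes(std::size_t bytes_per_elem) {
    return static_cast<std::size_t>(2) * cfg::N_LAYERS * cfg::SEQ_LEN *
           cfg::KV_DIM * bytes_per_elem;
}

// Buffer sizes can be checked at compile time as well.
static_assert(kv_cache_bytes(2) < (static_cast<std::size_t>(1) << 30),
              "KV cache for this sketch should stay under 1 GiB");
```

Because every size is known at build time, buffers can be allocated once with fixed extents and kernel loop bounds can be unrolled by the compiler. The trade-off is that changing the model or the context length means editing the constants and recompiling.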
