Run 671B MoE models on a single RTX 4090 (sort of)
KTransformers makes CPU-GPU heterogeneous inference and fine-tuning for massive MoE models almost practical on consumer hardware.

What it does
KTransformers is a CPU-GPU heterogeneous computing framework for inference and fine-tuning of large language models, with a heavy focus on Mixture-of-Experts (MoE) architectures. It splits work between GPU and CPU, often keeping “cold” experts in system memory while “hot” ones live in VRAM, so you can run models like DeepSeek-V3/R1 on hardware that shouldn’t be able to hold them.
The project now exposes two main paths: a high-performance inference kernel (kt-kernel) and an SFT integration with LLaMA-Factory. Both lean on aggressive quantization (INT4/INT8 on CPU, GPTQ/FP8 on GPU) and Intel AMX/AVX kernel optimizations to squeeze performance out of mismatched hardware.
The interesting bit
The expert scheduling is the clever part: instead of treating all 256 experts as equally GPU-resident, it profiles which ones fire frequently and parks the rest in CPU memory or even on disk. The prefix cache goes three layers deep (GPU → CPU → disk) for reuse across turns. It’s a memory hierarchy strategy dressed up as an LLM serving framework.
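To make the hot/cold idea concrete, here is a toy sketch in Python. It is not KTransformers code. `place_experts` and `lookup_prefix` are hypothetical helpers invented for illustration, and the real scheduler presumably works on per-layer statistics and actual tensors rather than plain dicts. The first function ranks experts by how often a profiling trace routed to them and gives the GPU slots to the most frequent ones. The second walks a fastest-first list of cache tiers and promotes an entry when it is found in a slower tier.

```python
from collections import Counter


def place_experts(routing_trace, num_experts, gpu_slots):
    """Map each expert id to "gpu" or "cpu" based on how often it was routed to.

    Hypothetical helper for illustration; not part of KTransformers.
    """
    counts = Counter(routing_trace)
    # Rank by descending activation count; break ties by expert id for determinism.
    ranked = sorted(range(num_experts), key=lambda e: (-counts[e], e))
    hot = set(ranked[:gpu_slots])
    return {e: ("gpu" if e in hot else "cpu") for e in range(num_experts)}


def lookup_prefix(key, tiers):
    """Search tiers fastest-first; on a hit in a slower tier, copy the entry up.

    `tiers` is an ordered list of (name, dict) pairs, e.g. GPU, CPU, disk.
    Returns (tier_name, value), or (None, None) on a miss.
    """
    for index, (name, store) in enumerate(tiers):
        if key in store:
            value = store[key]
            if index > 0:
                tiers[0][1][key] = value  # promote to the fastest tier
            return name, value
    return None, None


# Toy routing trace: expert ids chosen by the router over a profiling run.
trace = [3, 3, 3, 1, 1, 0, 7]
placement = place_experts(trace, num_experts=8, gpu_slots=2)
print([e for e, device in placement.items() if device == "gpu"])

tiers = [("gpu", {}), ("cpu", {"system-prompt": "kv-block"}), ("disk", {})]
print(lookup_prefix("system-prompt", tiers))  # found in the CPU tier, then promoted
print(lookup_prefix("system-prompt", tiers))  # now served from the GPU tier
```

The sketch leaves out eviction, and a real system would have to bound each tier. It does show why the design reads as a memory hierarchy: placement follows access frequency, and a lookup falls through to slower storage only on a miss.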
Key highlights
- Runs DeepSeek-R1/V3 on a single 24GB GPU + 382GB DRAM, with a claimed 3–28× speedup over baseline offloading
- SFT integration with LLaMA-Factory reports 6–12× training speedup vs. ZeRO-Offload for MoE fine-tuning, using ~half the CPU memory of prior KT paths
- Intel AMX/AVX512/AVX2 kernels for quantized CPU inference; also supports AMD ROCm, Intel Arc, and Ascend NPU
- Native FP8 per-channel precision and CPU-GPU expert scheduling with NUMA awareness
- Integrates into SGLang for production serving; clean Python API for injection into existing stacks
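A quick back-of-envelope check shows why the DRAM figure and the quantization go together. The sketch below assumes, as a simplification, that all 671B parameters sit in system memory at one uniform precision. It ignores quantization scales, the KV cache, and whatever portion lives on the GPU. `weight_footprint_gb` is a hypothetical helper, not a project API.

```python
def weight_footprint_gb(n_params, bits_per_weight):
    """Raw weight storage in decimal gigabytes, ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 671e9   # parameter count from the headline
DRAM_GB = 382      # DRAM figure quoted in the highlights

for label, bits in [("INT4", 4), ("INT8", 8), ("BF16", 16)]:
    size = weight_footprint_gb(N_PARAMS, bits)
    verdict = "fits in" if size <= DRAM_GB else "exceeds"
    print(f"{label}: {size:.1f} GB {verdict} {DRAM_GB} GB of DRAM")
```

Under these assumptions only the 4-bit case (about 335.5 GB) fits in the quoted DRAM budget. That is consistent with the project's emphasis on aggressive CPU-side quantization.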
Caveats
- The “3–28× speedup” claim is from the project’s own benchmarks; no independent verification is cited
- “Day0 support” for new models (GLM-5, MiniMax-M2.5, Kimi-K2.5) suggests rapid but potentially brittle adaptation
- Original monolithic framework was archived; the current split into kt-kernel + SFT docs is recent (v0.6.1, April 2026) and may still be settling
Verdict
Worth a look if you’re trying to serve or fine-tune 100B+ MoE models on commodity hardware and don’t mind tuning quantization tradeoffs. If you have uniform H100 clusters and ample VRAM, this adds complexity you probably don’t need.