tjake/Jlama
A Java-based inference engine for running LLMs locally with quantization and SIMD acceleration

Velocity (7d): +1.2 ★ / day
Trend: steady
Jlama is an LLM inference engine written in Java that runs large language models directly inside Java applications. It supports popular model architectures, including Llama, Gemma, Mistral, and Qwen2, along with features such as paged attention, mixture of experts, and tool calling. Supported data types include F32, F16, and BF16, plus the quantized formats Q8 and Q4; SIMD acceleration and WebGPU support are optional performance features.
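To make the idea of an 8-bit quantized format concrete, here is a minimal Java sketch of symmetric block-wise quantization: each block of floats shares one scale, and each value is stored as a signed byte. This is a generic illustration, not Jlama's implementation or API. The class name `Q8BlockQuantSketch`, the `Quantized` holder, and the block size of 32 are all assumptions chosen for the example.

```java
public class Q8BlockQuantSketch {
    // Assumed block size for illustration; real formats may use a different one.
    static final int BLOCK_SIZE = 32;

    /** Hypothetical holder: one signed byte per value, one float scale per block. */
    static final class Quantized {
        final byte[] values;
        final float[] scales;

        Quantized(byte[] values, float[] scales) {
            this.values = values;
            this.scales = scales;
        }
    }

    /** Quantizes each block to int8 using scale = max(|x|) / 127. */
    static Quantized quantize(float[] input) {
        int numBlocks = (input.length + BLOCK_SIZE - 1) / BLOCK_SIZE;
        byte[] values = new byte[input.length];
        float[] scales = new float[numBlocks];

        for (int b = 0; b < numBlocks; b++) {
            int start = b * BLOCK_SIZE;
            int end = Math.min(start + BLOCK_SIZE, input.length);

            // Find the largest magnitude in this block.
            float maxAbs = 0f;
            for (int i = start; i < end; i++) {
                maxAbs = Math.max(maxAbs, Math.abs(input[i]));
            }

            float scale = maxAbs / 127f;
            scales[b] = scale;

            for (int i = start; i < end; i++) {
                if (scale == 0f) {
                    values[i] = 0; // all-zero block: nothing to scale
                } else {
                    values[i] = (byte) Math.round(input[i] / scale);
                }
            }
        }
        return new Quantized(values, scales);
    }

    /** Reconstructs approximate floats: x ≈ q * scale of the owning block. */
    static float[] dequantize(Quantized q) {
        float[] out = new float[q.values.length];
        for (int i = 0; i < out.length; i++) {
            out[i] = q.values[i] * q.scales[i / BLOCK_SIZE];
        }
        return out;
    }

    public static void main(String[] args) {
        float[] weights = {0.10f, -0.52f, 0.33f, 0.98f, -1.27f};
        float[] restored = dequantize(quantize(weights));
        for (int i = 0; i < weights.length; i++) {
            System.out.println(weights[i] + " -> " + restored[i]);
        }
    }
}
```

The trade-off this sketch shows is the general one behind formats like Q8 and Q4: storage drops from four bytes per value to roughly one (or less), at the cost of a rounding error of at most half a scale step per value.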