cactus-compute/cactus
An on-device AI inference engine optimized for mobile and wearable devices, supporting LLMs, speech recognition, and vision with ARM SIMD kernels and quantization.

Cactus provides fast, low-RAM AI inference on ARM CPUs through zero-copy memory mapping and custom SIMD kernels for Apple, Snapdragon, and Exynos chips. It supports multimodal workloads including chat, vision, speech-to-text, and RAG through an OpenAI-compatible API layer. The engine includes NPU-accelerated prefill, KV-cache quantization, and chunked prefill to minimize latency and power consumption on mobile devices.
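
As a rough illustration of the zero-copy memory mapping idea mentioned above, the sketch below maps a flat file of float32 values and reads them through a typed view, without first loading the file into a separate buffer. It is written in Python for brevity and is not Cactus code: the file layout and the helper names `write_weights` and `map_weights` are assumptions made for this example only.

```python
import mmap
import struct


def write_weights(path, values):
    """Write float32 values to a flat binary file (a stand-in for a weight file)."""
    # Native byte order, so the file matches the native "f" view used below.
    with open(path, "wb") as f:
        f.write(struct.pack(f"{len(values)}f", *values))


def map_weights(path):
    """Map the file read-only and return (mapping, float32 view) without copying."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the mapping stays valid after f is closed.
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    # The memoryview shares the mapped pages; cast reinterprets bytes as float32.
    view = memoryview(mm).cast("f")
    return mm, view
```

With this approach the operating system pages data in on demand, so only the parts of the file that are actually touched occupy physical memory. The view must be released (`view.release()`) before the mapping is closed (`mm.close()`).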