Is rkllama open source?

Yes — NotPunchnox/rkllama is open source, released under the GPL-3.0 license.

What language is rkllama written in?

NotPunchnox/rkllama is primarily written in Python.

How popular is rkllama?

NotPunchnox/rkllama has 572 stars on GitHub.

Where can I find rkllama?

NotPunchnox/rkllama is on GitHub at https://github.com/NotPunchnox/rkllama.

← all repositories

NotPunchnox/rkllama

An Ollama-compatible server that funnels LLMs through Rockchip’s NPU

It routes LLM inference through Rockchip’s 6-TOPS NPU so your Orange Pi’s CPU doesn’t have to shoulder the load alone.

★572 stars Python Inference · Serving Language Models

View on GitHub ↗

Not currently ranked — collecting fresh signals.

star history

What it does RKLLama is essentially a Python orchestration layer over Rockchip’s official rkllm and rknn C++ libraries, exposing an Ollama-compatible REST server that runs LLMs, vision encoders, and even image-generation or speech pipelines on the RK3588 and RK3576 NPUs. Existing chat front-ends can treat an Orange Pi or Radxa board like a local GPU server because the API mimics Ollama and partially mimics OpenAI. The project handles model loading, prompt caching, and multimodal preprocessing so you do not have to write the NPU glue yourself.

The interesting bit The clever angle is the breadth of workloads squeezed onto a 6-TOPS NPU—LLMs, vision models, TTS, STT, and even Stable Diffusion—while keeping an Ollama-compatible API so standard clients never know they are talking to an embedded board. Dynamic model unloading and prompt-cache persistence across sessions are quality-of-life features usually missing from vendor SDK demos.

Key highlights

Runs inference on the Rockchip NPU rather than leaving the ARM CPU to do all the work.
Ollama-compatible API surface (/api/chat, /api/pull, etc.) plus partial OpenAI compatibility for chat, embeddings, audio, and image generation.
Supports multimodal models such as Qwen-VL, MiniCPM-V, and InternVL, plus TTS, STT, and image generation via RKNN pipelines.
Dynamic model lifecycle with inactivity timeouts and prompt-cache files that survive model eviction for up to seven days.
Experimental .GGUF support through a community llama.cpp fork.

Caveats

OpenAI API coverage is partial, and some endpoints (like audio translations) are currently limited.
Vision and multimodal setups require manual Modelfile configuration with encoder-specific properties.
Experimental .GGUF support relies on a third-party llama.cpp fork, not the main project.

Verdict Worth a look if you are building an offline AI appliance on an Orange Pi 5 or Radxa Rock 4D and want an Ollama-like experience without ignoring the NPU. Skip it if you need mature x86 GPU performance or a fully polished OpenAI API drop-in; this is version 0.0.69 and squarely aimed at tinkerers.

Frequently asked

What is NotPunchnox/rkllama?: It routes LLM inference through Rockchip’s 6-TOPS NPU so your Orange Pi’s CPU doesn’t have to shoulder the load alone.
Is rkllama open source?: Yes — NotPunchnox/rkllama is open source, released under the GPL-3.0 license.
What language is rkllama written in?: NotPunchnox/rkllama is primarily written in Python.
How popular is rkllama?: NotPunchnox/rkllama has 572 stars on GitHub.
Where can I find rkllama?: NotPunchnox/rkllama is on GitHub at https://github.com/NotPunchnox/rkllama.