UbiquitousLearning/mllm

Running Qwen3 on your phone via an in-app Go server

mllm exists to run quantized multimodal LLMs locally on phones and edge hardware without cloud round-trips.

★1.6k stars C++ Inference · Serving Language Models

What it does

mllm is a C++ inference framework that takes standard PyTorch and SafeTensors checkpoints, quantizes them into its own format, and runs the result on mobile and edge hardware. It supports a broad catalog of models (Qwen3, DeepSeek-OCR, SmolLM3, LLaVA, Gemma, and others) across Arm CPUs, OpenCL GPUs, Qualcomm Hexagon NPUs, and Ascend NPUs. For Android, it offers a reference demo that wraps the engine inside an on-device Go server to keep inference off the main thread.

The interesting bit

Rather than the usual JNI shim, the Android demo spins up an in-app Go server (mllm_server.aar) so the UI talks to inference over a local client-server protocol. It is a deliberately heavyweight decoupling for a resource-constrained device. The framework also acts as a middleman between high-level optimizations like speculative decoding and pruning and lower-level backends and compiler stacks such as CANN, CUDA, and MLIR.
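The decoupling described above, a UI that talks to an inference process over a loopback socket instead of calling into it directly, can be illustrated with a small sketch. This is not mllm's server or its protocol: it is written in Python rather than Go for brevity, and the `/generate` path, the JSON payload shape, and the `fake_generate` stand-in are all hypothetical.

```python
import http.client
import json
import threading
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def fake_generate(prompt: str) -> str:
    """Stand-in for the inference engine (hypothetical, not an mllm call)."""
    return prompt[::-1]


class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Hypothetical endpoint; the real demo's protocol may differ.
        if self.path != "/generate":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        request = json.loads(self.rfile.read(length))
        body = json.dumps({"text": fake_generate(request["prompt"])}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # silence per-request logging


def start_server() -> ThreadingHTTPServer:
    # Bind to loopback only so other devices cannot reach the engine;
    # port 0 asks the OS for any free port.
    server = ThreadingHTTPServer(("127.0.0.1", 0), InferenceHandler)
    # Serve on a background thread so the caller (the "UI") never blocks.
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def ask(port: int, prompt: str) -> str:
    """Client side: what the UI layer would do for each request."""
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    try:
        conn.request(
            "POST",
            "/generate",
            body=json.dumps({"prompt": prompt}),
            headers={"Content-Type": "application/json"},
        )
        return json.loads(conn.getresponse().read())["text"]
    finally:
        conn.close()


if __name__ == "__main__":
    server = start_server()
    print(ask(server.server_address[1], "hello"))
    server.shutdown()
    server.server_close()
```

The cost of this pattern is an extra serialization and socket hop per request; the benefit is that the UI and the engine share only a wire format, so either side can be replaced or restarted independently.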

Key highlights

Ingests PyTorch and SafeTensors directly via mllm-convertor, quantizing to w4a8, INT4, or INT8.
v2 offers Pythonic eager execution, compilation support for NPU integration, and parallel model execution.
Hardware targets include Arm CPU, OpenCL GPU, QNN NPU, and Ascend NPU; experimental CUDA support for Jetson is available through pymllm.
Provides an SDK, CLI inference tool, and an Android reference architecture using the Go server.
Model coverage in v2 focuses on recent small-to-mid-size multimodal models like Qwen2-VL and Qwen2.5-VL.
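To make the quantization labels in the list above concrete: INT8 and INT4 refer to storing values as 8-bit or 4-bit integers plus a scale, and w4a8 conventionally means 4-bit weights combined with 8-bit activations. The sketch below shows symmetric per-tensor quantization in plain Python under those assumptions. It is not mllm-convertor's algorithm or file format (real schemes often use per-group scales and packed storage), and the function names are made up for illustration.

```python
def quantize_symmetric(values, bits):
    """Quantize floats to signed integers that share one scale.

    Integers fall in [-qmax, qmax] with qmax = 2**(bits - 1) - 1,
    i.e. 7 for 4-bit and 127 for 8-bit.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    # An all-zero tensor has no range; any non-zero scale works.
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from integers and their scale."""
    return [q * scale for q in quantized]


def dot_w4a8(weights, activations):
    """Dot product with 4-bit weights and 8-bit activations.

    The multiply-accumulate runs on integers; the two scales are
    applied once at the end.
    """
    qw, weight_scale = quantize_symmetric(weights, bits=4)
    qa, act_scale = quantize_symmetric(activations, bits=8)
    accumulator = sum(w * a for w, a in zip(qw, qa))
    return accumulator * weight_scale * act_scale


if __name__ == "__main__":
    weights = [0.7, -0.3, 0.0, 0.1]
    activations = [1.0, 0.5, -0.25, 2.0]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("float dot:", exact)
    print("w4a8 dot: ", dot_w4a8(weights, activations))
```

With only 15 representable levels at 4 bits, the rounding error per weight can be as large as half the scale, which is one reason production formats tend to use a separate scale per small group of weights rather than one per tensor.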

Caveats

v1 is being retired; the README warns that v1 support is ending soon and new features are landing on the v2 branch.
NPU support is still spotty: only specific model variants have Hexagon or Ascend binaries, and Jetson CUDA is labeled experimental.
The documentation shows extensive model tables but offers no concrete latency, memory, or power measurements, so real-world efficiency remains an open question.

Verdict

A solid candidate if you need a single pipeline from PyTorch weights to offline phone inference and are willing to tolerate some experimental hardware support. Look elsewhere if you need battle-tested server GPU throughput or published benchmarks to guide hardware selection.

Frequently asked

What is UbiquitousLearning/mllm?: mllm exists to run quantized multimodal LLMs locally on phones and edge hardware without cloud round-trips.
Is mllm open source?: Yes — UbiquitousLearning/mllm is open source, released under the MIT license.
What language is mllm written in?: UbiquitousLearning/mllm is primarily written in C++.
How popular is mllm?: UbiquitousLearning/mllm has 1.6k stars on GitHub.
Where can I find mllm?: UbiquitousLearning/mllm is on GitHub at https://github.com/UbiquitousLearning/mllm.