microsoft/onnxruntime-genai

Microsoft open-sources the busywork of on-device LLM inference

It exists to stop developers from hand-rolling tokenizers, KV caches, and sampling loops every time they want to run an LLM on a phone or laptop.

★1.1k stars C++ Inference · Serving Language Models

What it does

ONNX Runtime GenAI is a cross-platform inference engine that runs generative AI models, primarily LLMs and vision-language variants, on top of ONNX Runtime. It handles the entire generative loop: pre- and post-processing, logits processing, search and sampling, KV cache management, and grammar specification for tool calling. The project ships bindings for Python, C#, C/C++, and Java (Java currently requires building from source) and targets Linux, Windows, macOS, and Android on x86, x64, and arm64.
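To make "the entire generative loop" concrete, here is a minimal sketch of the kind of loop a developer would otherwise write by hand. It is a toy under stated assumptions, not the library's API: `VOCAB`, `toy_logits`, and `generate` are hypothetical names, and the "model" is a stand-in function rather than a real ONNX model.

```python
from typing import Callable, List

# Toy vocabulary (assumption); a real tokenizer maps subword pieces to ids.
VOCAB = ["<eos>", "hello", "world", "on", "device"]
EOS_ID = 0


def toy_logits(token_ids: List[int]) -> List[float]:
    """Stand-in for a model forward pass: favors the id after the last token."""
    next_id = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == next_id else 0.0 for i in range(len(VOCAB))]


def generate(
    prompt_ids: List[int],
    logits_fn: Callable[[List[int]], List[float]],
    max_new_tokens: int = 8,
) -> List[int]:
    """Greedy decoding: score, pick the best token, append, repeat."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = logits_fn(ids)  # forward pass over the sequence so far
        next_id = max(range(len(logits)), key=logits.__getitem__)  # greedy search
        if next_id == EOS_ID:  # stop condition
            break
        ids.append(next_id)
    return ids


output_ids = generate([1], toy_logits)
text = " ".join(VOCAB[i] for i in output_ids)  # post-processing (detokenize)
```

A real loop would also reuse cached attention state between steps instead of rescoring the whole sequence, which is the part a library like this takes off your hands.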

The interesting bit

The value is in the unglamorous plumbing. While everyone talks about model weights, this library focuses on the messy orchestration around them: how tokens are cached, how sampling strategies are applied, and how tool-calling grammars constrain output. It is also the same runtime that powers Microsoft’s Foundry Local, Windows ML, and the VS Code AI Toolkit, so it is not a side experiment.
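As one example of that plumbing, the sketch below shows a common sampling strategy, temperature scaling followed by top-k sampling, in plain Python. The function name `sample_top_k` and its parameters are hypothetical and chosen for illustration; this is not how the library itself implements sampling.

```python
import math
import random
from typing import List, Optional


def sample_top_k(
    logits: List[float],
    k: int,
    temperature: float = 1.0,
    rng: Optional[random.Random] = None,
) -> int:
    """Sample a token id from the k highest-scoring logits."""
    if k < 1 or temperature <= 0:
        raise ValueError("k must be >= 1 and temperature must be > 0")
    rng = rng or random.Random()

    # Keep only the ids of the k largest logits.
    top_ids = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]

    # Temperature below 1 sharpens the distribution, above 1 flattens it.
    scaled = [logits[i] / temperature for i in top_ids]

    # Softmax weights, shifted by the max for numerical stability.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]

    return rng.choices(top_ids, weights=weights, k=1)[0]


# With k=1 the choice is deterministic and equals greedy decoding.
greedy_id = sample_top_k([0.1, 2.0, 0.5], k=1)
```

Passing a seeded `random.Random` makes runs reproducible, which helps when comparing sampling settings.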

Key highlights

Supports a wide range of model architectures, including Llama, Mistral, DeepSeek, Phi, Qwen, Gemma, Whisper, and vision-language variants.
Hardware acceleration spans CPU, CUDA, DirectML, OpenVINO, Qualcomm QNN, WebGPU, and NVIDIA TensorRT-RTX.
Features like Multi-LoRA, continuous decoding, and constrained decoding are already implemented.
Stable Diffusion support is under active development, with speculative decoding and broader multi-modal support on the roadmap.
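Constrained decoding, one of the features listed above, can be pictured as masking out every token the grammar does not allow before picking the next one. The sketch below is a toy illustration under assumptions: `VOCAB`, `GRAMMAR`, and `constrained_greedy` are hypothetical, the grammar is a hand-written state machine, and none of this reflects the library's internal implementation.

```python
import math
from typing import List

VOCAB = ["{", "}", '"name"', ":", '"search"', '"weather"', "hello"]

# Hypothetical grammar as a state machine: state -> {allowed token: next state}.
GRAMMAR = {
    "start": {"{": "key"},
    "key": {'"name"': "colon"},
    "colon": {":": "value"},
    "value": {'"search"': "close", '"weather"': "close"},
    "close": {"}": "done"},
}


def constrained_greedy(logits_per_step: List[List[float]]) -> str:
    """Greedy decoding where disallowed tokens are masked to -inf."""
    state = "start"
    pieces = []
    for logits in logits_per_step:
        if state == "done":
            break
        allowed = GRAMMAR[state]
        # Mask every token the grammar forbids in the current state.
        masked = [
            score if VOCAB[i] in allowed else -math.inf
            for i, score in enumerate(logits)
        ]
        best = max(range(len(masked)), key=masked.__getitem__)
        token = VOCAB[best]
        pieces.append(token)
        state = allowed[token]
    return "".join(pieces)


# The toy model always prefers "hello", but the grammar never allows it.
step_logits = [0.0, 0.0, 0.0, 0.0, 0.1, 0.3, 9.0]
result = constrained_greedy([step_logits] * 5)
```

Even though the unconstrained favorite is `hello` at every step, the mask forces the output into the shape the grammar describes.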

Caveats

The README warns that examples on the main branch may drift ahead of the latest stable release, so match the examples you follow to the release you have installed.
Java bindings exist but still require building from source.
iOS and AMD GPU support are on the roadmap but not yet available.
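For the version-matching caveat, a small check like the one below can help when using the Python package. It is a sketch under assumptions: it assumes the package is installed from PyPI under the name `onnxruntime-genai` and that release tags look like `v` followed by the version number; the helper names and the tag value are hypothetical.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str = "onnxruntime-genai") -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_release(installed: Optional[str], release_tag: str) -> bool:
    """Compare an installed version with a tag such as 'v0.5.2' (assumed format)."""
    if installed is None:
        return False
    return installed == release_tag.lstrip("v")


# Hypothetical tag of the examples you checked out; replace with your own.
EXAMPLES_TAG = "v0.5.2"
in_sync = matches_release(installed_version(), EXAMPLES_TAG)
```

If `in_sync` is false, either install the release that matches the examples or check out the examples at the tag that matches the installed release.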

Verdict

Worth a look if you are shipping ONNX-based LLMs to edge devices and would rather not maintain your own inference loop. Skip it if you are already locked into PyTorch-native tooling or need iOS support today.

Frequently asked

What is microsoft/onnxruntime-genai? It exists to stop developers from hand-rolling tokenizers, KV caches, and sampling loops every time they want to run an LLM on a phone or laptop.
Is onnxruntime-genai open source? Yes, microsoft/onnxruntime-genai is open source, released under the MIT license.
What language is onnxruntime-genai written in? microsoft/onnxruntime-genai is primarily written in C++.
How popular is onnxruntime-genai? microsoft/onnxruntime-genai has 1.1k stars on GitHub.
Where can I find onnxruntime-genai? microsoft/onnxruntime-genai is on GitHub at https://github.com/microsoft/onnxruntime-genai.