microsoft/onnxruntime-genai
ONNX Runtime GenAI is a C++ runtime for efficiently running large language models on device with support for CUDA, DirectML, TensorRT, and other hardware accelerators.

This repository provides a specialized runtime for executing generative AI models in the ONNX format. It implements the complete generative AI loop, including model preprocessing, ONNX Runtime-based inference, logits processing, search and sampling, KV cache management, and grammar-based constrained decoding for tool calling. The project supports a wide range of model architectures, including Llama, Gemma, Mistral, Phi, Qwen, Whisper, DeepSeek, and Granite, across multiple hardware backends.
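
To make the stages of that loop concrete, here is a minimal, self-contained Python sketch. It does not use the onnxruntime-genai API. Every name in it (`VOCAB`, `encode`, `decode`, `toy_model`, `sample_top_k`, `generate`) is hypothetical, and `toy_model` is a hand-written stand-in for an ONNX Runtime inference call. The sketch assumes a non-empty prompt and leaves out grammar-based constrained decoding.

```python
import math
import random

# Hypothetical five-token vocabulary; index 0 marks end of sequence.
VOCAB = ["<eos>", "hello", "world", "on", "device"]
EOS_ID = 0


def encode(text):
    """Preprocessing: map whitespace-separated words to token ids."""
    return [VOCAB.index(word) for word in text.split()]


def decode(token_ids):
    """Postprocessing: map token ids back to text."""
    return " ".join(VOCAB[i] for i in token_ids)


def toy_model(token_id, kv_cache):
    """Stand-in for inference: consume one token, return next-token logits.

    The cache here is just the list of tokens seen so far. A real KV cache
    stores per-layer key/value tensors so earlier tokens are not recomputed.
    """
    kv_cache.append(token_id)
    step = len(kv_cache)
    logits = [0.0] * len(VOCAB)
    logits[(token_id % 4) + 1] = 5.0    # favor the "next" word in VOCAB
    logits[EOS_ID] = 2.0 * step - 2.5   # EOS grows more likely each step
    return logits


def sample_top_k(logits, top_k, temperature, rng):
    """Logits processing and sampling: keep the top_k logits, then sample."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Unnormalized softmax weights, shifted by the max logit for stability.
    weights = [math.exp((logits[i] - logits[top[0]]) / temperature) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


def generate(prompt, max_new_tokens=8, top_k=1, temperature=1.0, seed=0):
    rng = random.Random(seed)
    kv_cache = []
    prompt_ids = encode(prompt)
    # Prefill: run the prompt through the model to populate the cache.
    for token_id in prompt_ids:
        logits = toy_model(token_id, kv_cache)
    generated = []
    # Decode: pick a token, stop on EOS, otherwise feed it back in.
    for _ in range(max_new_tokens):
        next_id = sample_top_k(logits, top_k, temperature, rng)
        if next_id == EOS_ID:
            break
        generated.append(next_id)
        logits = toy_model(next_id, kv_cache)
    return decode(prompt_ids + generated)


print(generate("hello"))
```

With `top_k=1` the sampler reduces to greedy search, so the output is deterministic. Raising `top_k` and `temperature` turns the same loop into stochastic sampling. Constrained decoding would fit in as an extra step that masks disallowed token logits before `sample_top_k` runs.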