OpenNMT/CTranslate2
A C++ and Python library providing an optimized inference runtime for Transformer models on CPU and GPU.

CTranslate2 implements a custom runtime that applies performance optimizations such as weight quantization, layer fusion, and batch reordering to accelerate Transformer inference and reduce memory usage. It converts models trained with frameworks including OpenNMT, Fairseq, Marian, and Hugging Face Transformers into an optimized format, then runs them on CPU or GPU. Encoder-decoder, decoder-only, and encoder-only architectures are supported.
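
To make the idea of weight quantization concrete, here is a minimal pure-Python sketch of symmetric 8-bit quantization. It is an illustration of the general technique only, not CTranslate2's actual implementation: the function names `quantize_int8` and `dequantize_int8` are hypothetical, and the per-tensor scaling scheme is an assumption chosen for simplicity.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of float weights to the int8 range.

    Hypothetical helper for illustration; not part of any library API.
    """
    max_abs = max((abs(w) for w in weights), default=0.0)
    # Map the largest magnitude to 127; use a scale of 1.0 for all-zero input
    # so that the division below is always defined.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to [-127, 127].
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from int8 values and their scale."""
    return [q * scale for q in quantized]


if __name__ == "__main__":
    weights = [1.0, -0.25, 0.0]
    q, scale = quantize_int8(weights)
    restored = dequantize_int8(q, scale)
    print("int8 values:", q)
    print("restored:   ", restored)
```

Each weight is stored as one signed byte plus a shared floating-point scale, which is where the memory saving comes from: roughly a 4x reduction relative to 32-bit floats. The cost is a rounding error of at most half the scale per weight.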