OpenNMT/CTranslate2

Faster Transformer inference without the framework baggage

A custom C++ and Python runtime that re-implements Transformer inference to trade framework flexibility for raw speed and smaller memory footprints.

★ 4.6k stars · C++ · Inference · Serving

What it does

This is a custom inference engine for Transformer models. It converts checkpoints from popular training frameworks—OpenNMT, Fairseq, Marian, Hugging Face Transformers, and others—into an optimized format, then runs them with aggressive tricks like layer fusion, padding removal, batch reordering, and in-place operations on both CPU and GPU.
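
To make the batch reordering and padding removal ideas concrete, here is a minimal sketch in plain Python. It only illustrates why grouping sequences of similar length wastes fewer padded slots; it is not CTranslate2's implementation, and the helper names and sample lengths are hypothetical.

```python
def padded_slots(batch):
    """Token slots a padded batch occupies: rows x longest sequence."""
    return len(batch) * max(len(seq) for seq in batch)


def make_batches(seqs, batch_size, sort_by_length):
    """Split sequences into fixed-size batches, optionally sorted by length first."""
    order = list(range(len(seqs)))
    if sort_by_length:
        order.sort(key=lambda i: len(seqs[i]))
    return [
        [seqs[i] for i in order[start:start + batch_size]]
        for start in range(0, len(order), batch_size)
    ]


# Hypothetical workload: short and long inputs interleaved.
lengths = [3, 40, 5, 38, 4, 41, 6, 39]
seqs = [["tok"] * n for n in lengths]
real_tokens = sum(lengths)

naive_slots = sum(padded_slots(b) for b in make_batches(seqs, 4, False))
reordered_slots = sum(padded_slots(b) for b in make_batches(seqs, 4, True))

print("real tokens:", real_tokens)
print("padded slots, arrival order:", naive_slots)
print("padded slots, sorted by length:", reordered_slots)
```

With these sample lengths, sorting before batching leaves far fewer padded slots to compute over, which is the effect the engine is after.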

The interesting bit

Quantization and CPU dispatch are first-class citizens, not afterthoughts. The runtime supports FP16, BF16, INT16, INT8, and even INT4 weights; automatically picks the best backend—Intel MKL, oneDNN, OpenBLAS, Ruy, or Apple Accelerate—based on the host CPU; and can shard huge models across multiple GPUs with tensor parallelism.
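
As a rough illustration of what INT8 weight quantization means, the sketch below applies symmetric quantization with one scale per row of weights. This is a generic textbook scheme written in plain Python under that assumption; the function names are hypothetical and the engine's actual kernels and scaling rules may differ.

```python
def quantize_int8(row):
    """Map floats to integers in [-127, 127] using a single scale for the row."""
    max_abs = max(abs(x) for x in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(x / scale))) for x in row]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the stored integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.0, 1.0]  # hypothetical weight row
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

print("int8 values:", q)
print("scale:", scale)
print("max error:", max(abs(a - b) for a, b in zip(weights, restored)))
```

Each weight is stored in one byte instead of four, at the cost of a rounding error bounded by half the scale.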

Key highlights

Covers encoder-decoder (T5, Whisper, BART), decoder-only (Llama, Mistral, GPT-NeoX, Qwen2), and encoder-only (BERT, XLM-RoBERTa) architectures.
Benchmarks show large speedups over standard frameworks: on an Amazon EC2 c5.2xlarge instance (CPU), an OpenNMT-py model reached 275 tokens/sec under PyTorch versus 658 to 1,126 tokens/sec with this engine, depending on quantization.
GPU memory shrinks dramatically as well: the same model used 2,973 MB under PyTorch but only 813 MB with combined INT8 and FP16 settings on an NVIDIA A10G.
A single binary bundles multiple CPU instruction sets and backends, dispatching at runtime so you do not need separate AVX and AVX2 builds.
The project promises backward compatibility while also shipping experimental model compression and inference acceleration features.
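
The benchmark figures quoted above can be turned into ratios with a few lines of arithmetic. The numbers are copied from the highlights; the variable names are only for illustration.

```python
# Throughput in tokens/sec on the CPU benchmark quoted above.
pytorch_tps = 275
engine_tps_low, engine_tps_high = 658, 1126

# GPU memory in MB on the NVIDIA A10G benchmark quoted above.
pytorch_mb, engine_mb = 2973, 813

speedup_low = engine_tps_low / pytorch_tps
speedup_high = engine_tps_high / pytorch_tps
memory_saving = 1 - engine_mb / pytorch_mb

print(f"CPU speedup: {speedup_low:.2f}x to {speedup_high:.2f}x")
print(f"GPU memory saved: {memory_saving:.1%}")
```

That works out to roughly 2.4x to 4.1x the CPU throughput and about 73% less GPU memory, for the tested configuration only.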

Caveats

You must convert models into an optimized format first; it will not run raw PyTorch or TensorFlow checkpoints directly.
The README cautions that published benchmarks are only valid for the specific configuration tested, so absolute numbers may shift on different hardware.
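
The conversion step might look like the sketch below, which uses the project's Python package. It assumes ctranslate2, transformers, and torch are installed. The function name, model name, and output directory are hypothetical placeholders, and the input must already be tokenized with the model's own tokenizer.

```python
def convert_and_translate(hf_model_name, output_dir, source_tokens):
    """Convert a Hugging Face checkpoint, then translate one tokenized sentence.

    Imports are done lazily so this file loads even without the package installed.
    """
    import ctranslate2
    from ctranslate2.converters import TransformersConverter

    # One-time conversion into the optimized format, with INT8 weights.
    converter = TransformersConverter(hf_model_name)
    converter.convert(output_dir, quantization="int8", force=True)

    # Load the converted model and run inference on CPU.
    translator = ctranslate2.Translator(output_dir, device="cpu")
    results = translator.translate_batch([source_tokens])
    return results[0].hypotheses[0]


# Hypothetical usage (downloads a model, so it is left commented out):
# tokens = convert_and_translate("some-org/some-translation-model", "ct2_model", ["▁Hello", "▁world", "</s>"])
```

Conversion only has to happen once per model; afterwards the output directory can be loaded directly.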

Verdict

Teams serving Transformer models in production who need lower latency and memory usage should evaluate this. Researchers rapidly iterating on new architectures may find the conversion step and fixed supported-model list limiting.

Frequently asked

What is OpenNMT/CTranslate2? A custom C++ and Python runtime that re-implements Transformer inference to trade framework flexibility for raw speed and smaller memory footprints.
Is CTranslate2 open source? Yes. OpenNMT/CTranslate2 is open source, released under the MIT license.
What language is CTranslate2 written in? OpenNMT/CTranslate2 is primarily written in C++.
How popular is CTranslate2? OpenNMT/CTranslate2 has 4.6k stars on GitHub.
Where can I find CTranslate2? OpenNMT/CTranslate2 is on GitHub at https://github.com/OpenNMT/CTranslate2.