T5 inference that won't make you wait for coffee
A Python library that exports, quantizes, and ONNX-ifies T5 models so your CPU can finally catch its breath.

What it does
fastT5 takes any pretrained HuggingFace T5 model, exports it to ONNX, quantizes it down to 8-bit, and wraps it in an ONNX Runtime session. The returned object still speaks HuggingFace’s generate() API, so your calling code barely changes. One-liner convenience or step-by-step pipeline, your choice.
The interesting bit
T5’s encoder-decoder architecture and variable-length past_key_values make it a pain to ONNX-export cleanly. The workaround: split the decoder into two ONNX graphs—one for the first step without cached states, one for subsequent steps with them. That trick, plus quantization, is where the 3× size reduction and claimed 5× speedup come from.
Key highlights
- Single-call export:
export_and_get_onnx_model("t5-small")handles conversion, quantization, and runtime setup - Benchmarks (T5-base, English→French, AMD EPYC 7B12 CPU): ~5.7× faster for greedy search, 3–4× for beam search vs. PyTorch
- Quantized models trade a sliver of BLEU/ROUGE for the speed—scores are within ~1–5% of original per the provided tables
- Supports private HuggingFace Hub models via token authentication
- Custom output paths for caching exported models between runs
Caveats
- CPU-only for now; GPU support is listed as not yet implemented
- Speed gains taper off for longer sequence lengths (visible in the benchmark heatmaps)
- Results were generated on a specific server-grade CPU; your laptop may differ
Verdict Worth a look if you’re running T5 inference on CPU and need to cut latency without rewriting your pipeline. Skip it if you’re already GPU-bound or need the absolute best accuracy—quantization isn’t free.