BERT without the TensorFlow tax
A stripped-down C++ inference engine that trades framework overhead for raw CUDA and MKL speed.

What it does
cuBERT runs BERT inference directly on NVIDIA GPUs (via CUDA and cuBLAS) or Intel CPUs (via MKL), skipping TensorFlow entirely. It reads frozen BERT protobufs and hands back logits, probabilities, pooled output, or sequence embeddings through a plain C API. Python and Java wrappers exist but are thin; the work happens in C++.
The interesting bit
The threading model is where the project shows its scars. The vanilla Bert class isn’t thread-safe, so BertM round-robins requests across multiple locked instances—one per GPU, or CUBERT_NUM_CPU_MODELS on CPU. Then you still need to tune OMP_NUM_THREADS for MKL’s own parallelism. The README admits this balancing act “diffs from model seq_length, batch_size, your CPU cores, your server QPS, and many many other things.” No hand-waving; just a warning that you’ll be benchmarking.
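To make the round-robin idea concrete, here is a minimal sketch of the pattern rather than cuBERT's actual code. ToyModel, RoundRobinPool, and calls_on are hypothetical names invented for illustration, and the "model" just doubles an integer. The point is the shape: several instances, each guarded by its own mutex, with an atomic counter choosing which one takes the next request.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a model that is not thread-safe.
// This is NOT cuBERT's Bert class; it only mimics unsynchronized state.
struct ToyModel {
    int calls = 0;
    int infer(int x) { ++calls; return 2 * x; }
};

// Round-robin dispatcher over several mutex-guarded instances.
class RoundRobinPool {
public:
    explicit RoundRobinPool(std::size_t n) {
        assert(n > 0 && "pool needs at least one instance");
        for (std::size_t i = 0; i < n; ++i) {
            slots_.push_back(std::make_unique<Slot>());
        }
    }

    int infer(int x) {
        // The atomic counter makes slot selection safe across caller threads.
        const std::size_t i =
            next_.fetch_add(1, std::memory_order_relaxed) % slots_.size();
        Slot& slot = *slots_[i];
        // Only one caller at a time may touch a given instance.
        std::lock_guard<std::mutex> lock(slot.mu);
        return slot.model.infer(x);
    }

    // Number of requests instance i has served (for inspection only).
    int calls_on(std::size_t i) {
        Slot& slot = *slots_.at(i);
        std::lock_guard<std::mutex> lock(slot.mu);
        return slot.model.calls;
    }

private:
    struct Slot {
        std::mutex mu;
        ToyModel model;
    };
    std::vector<std::unique_ptr<Slot>> slots_;
    std::atomic<std::size_t> next_{0};
};
```

With three instances and six sequential requests, each instance serves two. The trade-off the README warns about is visible even here: more instances mean less lock contention but more memory, and on CPU each instance also competes for the same cores that MKL's threads want.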
Key highlights
- Mixed-precision fp16 storage with fp32 compute on Volta/Turing Tensor Cores, claiming >2× speedup with <1% accuracy loss
- GPU benchmark (Tesla P4, seq_length=32), cuBERT vs TensorFlow: 184.6 ms vs 255.2 ms at batch 128; 54.5 ms vs 70.0 ms at batch 32
- CPU benchmark (Xeon E5-2680 v4), cuBERT vs TensorFlow: 984.9 ms vs 1504.0 ms at batch 128; 24.0 ms vs 69.9 ms at batch 1
- Uses protobuf-c specifically to avoid TensorFlow protobuf version conflicts
- Pre-built Python wheels available, but only MKL-on-Linux; GPU builds require compiling from source
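The latency pairs above are easier to compare as ratios. The sketch below only does the arithmetic on the four quoted rows, assuming the first figure in each pair is cuBERT's and the second is TensorFlow's. Row, kRows, speedup, and print_speedups are names made up for this example.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark row as quoted above; latencies are in milliseconds.
struct Row {
    const char* label;
    double cubert_ms;
    double tensorflow_ms;
};

// Speedup is the baseline latency divided by the candidate latency.
constexpr double speedup(double baseline_ms, double candidate_ms) {
    return baseline_ms / candidate_ms;
}

// The four data points from the highlights list, nothing more.
constexpr Row kRows[] = {
    {"GPU, batch 128", 184.6, 255.2},
    {"GPU, batch 32", 54.5, 70.0},
    {"CPU, batch 128", 984.9, 1504.0},
    {"CPU, batch 1", 24.0, 69.9},
};

// Print one line per row with TensorFlow as the baseline.
inline void print_speedups() {
    for (const Row& r : kRows) {
        std::printf("%-16s %.2fx\n", r.label,
                    speedup(r.tensorflow_ms, r.cubert_ms));
    }
}
```

Called from your own main, print_speedups() should show roughly 1.4x and 1.3x on GPU and roughly 1.5x and 2.9x on CPU. The largest gain is the single-request CPU case, which is where framework overhead is the biggest share of the total time.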
Caveats
- Only BERT (Transformer) is supported; no other architectures
- CUDA libraries are not cross-version compatible, so you’ll need matching toolchains
- GPU wheel packages aren’t provided; pre-built binaries are CPU-only MKL builds
Verdict
Worth a look if you’re serving BERT at scale and can stomach C++ build systems. Skip it if you need model flexibility, easy GPU deployment, or aren’t prepared to spend time tuning thread counts for your specific hardware.