Is cuBERT open source?

Yes — zhihu/cuBERT is open source, released under the MIT license.

What language is cuBERT written in?

zhihu/cuBERT is primarily written in C++.

How popular is cuBERT?

zhihu/cuBERT has 547 stars on GitHub.

Where can I find cuBERT?

zhihu/cuBERT is on GitHub at https://github.com/zhihu/cuBERT.

zhihu/cuBERT

BERT without the TensorFlow tax

A stripped-down C++ inference engine that trades framework overhead for raw CUDA and MKL speed.

★ 547 stars · C++ · Inference · Serving

What it does

cuBERT runs BERT inference directly on NVIDIA GPUs (via CUDA and cuBLAS) or Intel CPUs (via MKL), skipping TensorFlow entirely. It reads frozen BERT protobufs and returns logits, probabilities, pooled output, or sequence embeddings through a plain C API. Python and Java wrappers exist but are thin; the work happens in C++.

The interesting bit

The threading model is where the project shows its scars. The vanilla Bert class isn’t thread-safe, so BertM round-robins requests across multiple locked instances: one per GPU, or CUBERT_NUM_CPU_MODELS of them on CPU. On CPU you still need to tune OMP_NUM_THREADS, which controls MKL’s own parallelism inside each instance. The README admits this balancing act “diffs from model seq_length, batch_size, your CPU cores, your server QPS, and many many other things.” No hand-waving; just a warning that you’ll be benchmarking.
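The round-robin arrangement described above can be illustrated with a minimal sketch. This is not cuBERT's BertM code: FakeModel, RoundRobinPool, and their methods are hypothetical names invented for the example, and the "model" just doubles an integer. The only thing it tries to show is the pattern of picking the next instance with an atomic counter and holding that instance's own lock for the duration of the call.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <vector>

// Hypothetical stand-in for a model instance that is NOT thread-safe.
struct FakeModel {
    int calls = 0;
    int infer(int x) { ++calls; return x * 2; }
};

// Hypothetical dispatcher: N instances, each guarded by its own mutex.
// Requests are assigned to instances in round-robin order.
class RoundRobinPool {
public:
    explicit RoundRobinPool(std::size_t n) : slots_(n) {
        assert(n > 0);
        for (auto& s : slots_) s = std::make_unique<Slot>();
    }

    int infer(int x) {
        // The atomic counter picks the next slot without a global lock.
        std::size_t i =
            next_.fetch_add(1, std::memory_order_relaxed) % slots_.size();
        Slot& s = *slots_[i];
        // Only callers that land on the same slot wait for each other.
        std::lock_guard<std::mutex> lock(s.mu);
        return s.model.infer(x);
    }

    // Number of requests a given slot has served so far.
    int calls(std::size_t i) {
        Slot& s = *slots_[i];
        std::lock_guard<std::mutex> lock(s.mu);
        return s.model.calls;
    }

private:
    struct Slot {
        std::mutex mu;
        FakeModel model;
    };
    std::vector<std::unique_ptr<Slot>> slots_;
    std::atomic<std::size_t> next_{0};
};
```

With two slots and six sequential calls, each slot serves three requests. The tuning problem the README warns about shows up in the choice of slot count: more slots mean less lock contention, but on CPU each slot also runs its own MKL threads, so the product of instance count and OMP_NUM_THREADS has to fit the available cores.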

Key highlights

Mixed-precision fp16 storage with fp32 compute on Volta/Turing Tensor Cores, claiming >2× speedup with <1% accuracy loss
GPU benchmark (Tesla P4, seq_length=32): 184.6 ms vs TensorFlow’s 255.2 ms at batch 128; 54.5 ms vs 70.0 ms at batch 32
CPU benchmark (Xeon E5-2680 v4): 984.9 ms vs 1504.0 ms at batch 128; 24.0 ms vs 69.9 ms at batch 1
Uses protobuf-c specifically to avoid TensorFlow protobuf version conflicts
Pre-built Python wheels available, but only MKL-on-Linux; GPU builds require compiling from source
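To put the benchmark figures above on a common scale, the short sketch below divides the TensorFlow latency by the cuBERT latency for each quoted configuration. The Bench struct, kBenches table, and function names are made up for this illustration; the millisecond values are copied from the highlights as listed.

```cpp
#include <cassert>
#include <cstdio>

// One benchmark row: latencies in milliseconds as quoted in the highlights.
struct Bench {
    const char* label;
    double cubert_ms;
    double tf_ms;
};

// Speedup of cuBERT over TensorFlow (values above 1.0 mean cuBERT is faster).
inline double speedup(const Bench& b) {
    assert(b.cubert_ms > 0.0);
    return b.tf_ms / b.cubert_ms;
}

static const Bench kBenches[] = {
    {"GPU Tesla P4, batch 128", 184.6, 255.2},
    {"GPU Tesla P4, batch 32", 54.5, 70.0},
    {"CPU Xeon E5-2680 v4, batch 128", 984.9, 1504.0},
    {"CPU Xeon E5-2680 v4, batch 1", 24.0, 69.9},
};

// Print one line per configuration with the computed ratio.
inline void print_speedups() {
    for (const Bench& b : kBenches) {
        std::printf("%-32s %.2fx\n", b.label, speedup(b));
    }
}
```

The ratios work out to roughly 1.38× and 1.28× on the GPU rows, and roughly 1.53× and 2.91× on the CPU rows, so in these numbers the largest gain is single-request CPU inference.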

Caveats

Only BERT (Transformer) is supported; no other architectures
CUDA libraries are not cross-version compatible, so you’ll need matching toolchains
GPU wheel packages aren’t provided; pre-built binaries are CPU-only MKL builds

Verdict

Worth a look if you’re serving BERT at scale and can stomach C++ build systems. Skip it if you need model flexibility, easy GPU deployment, or aren’t prepared to spend time tuning thread counts for your specific hardware.