autoliuweijie/FastBERT

BERT learns to quit while it's ahead

A BERT variant that stops at an early transformer layer once it is confident in its prediction, cutting FLOPs by roughly 4× with minimal accuracy loss.

What it does

FastBERT attaches an early-exit classifier to each layer of a standard BERT stack. During inference, easy samples exit at shallow layers; only hard cases run the full depth. The early-exit classifiers are trained by self-distillation from the final layer, so no separate teacher model is needed.
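A minimal sketch of the exit rule, in plain Python. It assumes confidence is measured as the normalized entropy of each layer's class probabilities and compared against a threshold, which is how the FastBERT paper describes it; the function names and the `layer_probs` input are hypothetical stand-ins, not the package's API.

```python
import math

def normalized_entropy(probs):
    """Entropy of a distribution, scaled to [0, 1] by dividing by log(num_classes)."""
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(probs))

def early_exit_predict(layer_probs, speed):
    """Return (predicted_class, exit_layer) for one sample.

    layer_probs: one class-probability list per layer, shallowest first
                 (hypothetical stand-in for the per-layer classifier outputs).
    speed: uncertainty threshold in [0, 1]; higher values exit sooner.
    """
    for depth, probs in enumerate(layer_probs, start=1):
        if normalized_entropy(probs) < speed:
            break  # confident enough: skip the remaining layers
    # If no layer was confident, the loop ends on the final layer's output.
    return probs.index(max(probs)), depth

easy = [[0.97, 0.03], [0.99, 0.01], [0.99, 0.01]]
hard = [[0.55, 0.45], [0.60, 0.40], [0.30, 0.70]]
print(early_exit_predict(easy, speed=0.5))  # (0, 1): exits at the first layer
print(easy is not hard and early_exit_predict(hard, speed=0.5))  # (1, 3): runs full depth
```

In a real model the deeper layers would only be computed when the check fails; here the per-layer outputs are precomputed to keep the sketch self-contained.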

The interesting bit

The “speed” parameter lets you dial the accuracy/compute tradeoff explicitly. On the Chinese book review task, the baseline burns 21.8B FLOPs for 86.88% accuracy; FastBERT at speed 0.5 drops to 5.2B FLOPs and 86.64% accuracy. That’s a 76% compute cut for a 0.24-point accuracy drop. The English AG News results are even starker: a roughly 10× FLOPs reduction for about 1.3 points of accuracy lost.
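The book review numbers can be checked with a few lines of arithmetic; the variable names are just for illustration.

```python
baseline_flops = 21.8e9   # baseline BERT, Chinese book review task
fastbert_flops = 5.2e9    # FastBERT at speed 0.5
baseline_acc = 86.88      # accuracy in percent
fastbert_acc = 86.64

speedup = baseline_flops / fastbert_flops
compute_cut = 1 - fastbert_flops / baseline_flops
acc_drop = baseline_acc - fastbert_acc

print(f"{speedup:.1f}x fewer FLOPs")      # 4.2x fewer FLOPs
print(f"{compute_cut:.0%} compute cut")   # 76% compute cut
print(f"{acc_drop:.2f} point drop")       # 0.24 point drop
```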

Key highlights

  • Self-distillation trains the early exits using the final layer’s own outputs as targets—no external teacher required
  • Installable with pip install fastbert; pre-trained Chinese and English BERT weights are provided
  • ACL 2020 paper with a follow-up journal version (FastPLM) accepted to IEEE TNNLS in 2021
  • Reference implementation includes runnable examples for Chinese book reviews and English AG News classification
  • Alternative PyTorch implementation available from BitVoyage/FastBERT
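To make the self-distillation bullet concrete, here is a rough sketch of the training signal in plain Python. It assumes the loss is a KL divergence between each early-exit classifier's output (student) and the final layer's output (teacher), summed over exits; the divergence direction and the function names are assumptions for illustration, not taken from the repository.

```python
import math

def kl_divergence(student, teacher):
    """KL(student || teacher) for two probability lists over the same classes.

    Assumes every teacher probability is strictly positive.
    """
    return sum(s * math.log(s / t) for s, t in zip(student, teacher) if s > 0)

def self_distillation_loss(branch_probs, final_probs):
    """Sum the KL divergence of every early-exit branch against the final layer."""
    return sum(kl_divergence(p, final_probs) for p in branch_probs)

final = [0.9, 0.1]                       # final-layer prediction, used as the target
branches = [[0.9, 0.1], [0.6, 0.4]]      # two hypothetical early-exit outputs
print(round(self_distillation_loss(branches, final), 4))  # 0.3112
```

The first branch already matches the final layer and contributes zero; only the second one is pushed toward the target, and no labels or external teacher enter the computation.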

Caveats

  • Pre-trained model weights live on Weiyun (Chinese cloud storage); access may vary by region
  • README is thin on architecture details—no diagram of where the branch classifiers sit or how the speed threshold gates exits
  • Only classification tasks shown; unclear how well early-exit logic transfers to token-level tasks like NER or QA

Verdict

Worth a look if you’re serving BERT at scale and can tolerate small accuracy tradeoffs for latency wins. Skip it if you need latency guarantees: with adaptive inference, the worst case is still full-model latency.
