A 2017 SQuAD model, rebuilt with Transformer-era attention
This is a faithful TensorFlow re-implementation of Microsoft's R-Net that quietly swaps in scaled multiplicative attention and CuDNN GRUs to make a memory-hungry architecture trainable.

What it does
R-Net is a neural reading comprehension model: given a passage and a question, it locates the answer span. This repo targets the SQuAD dataset specifically and reproduces the original paper’s EM/F1 scores almost exactly (71.07/79.51 vs. 71.1/79.5).
The interesting bit
The author didn’t just port the paper. Additive attention, as originally specified, is a memory brute; this implementation substitutes the scaled multiplicative attention from “Attention Is All You Need” and layers in variational dropout and residual-style concatenation to keep stacked RNNs from degrading. CuDNN GRU plus bucketing drops training time on a TITAN X from 2.56 s/it to 0.28 s/it—though bucketing costs you 0.3% F1, which feels like a fair trade.
Key highlights
- Reproduces original R-Net scores on SQuAD v1.1 to within 0.03 EM
- Scaled multiplicative attention replaces the memory-intensive additive variant
- CuDNN GRU + bucketing yields ~9× speedup over naive CPU training
- Learning rate halves automatically when dev loss plateaus
- Optional extensions: pretrained GloVe character embeddings, FastText vectors (reported +1% F1)
Caveats
- Locked to TensorFlow ≥1.5.0 and spaCy ≥2.0.0; the README warns of “a lot of known problems caused by using different software versions”
- No code or documentation beyond the README; you’ll need to read
config.pyto toggle optional extensions - Bucketing is faster but demonstrably slightly less accurate
Verdict
Worth a look if you’re studying the SQuAD leaderboard’s 2017 vintage or need a working, optimized R-Net baseline. Skip it if you want a maintained, modern transformer-based reader—this is a period piece with some thoughtful engineering retrofits.