Teaching seq2seq to put 'the' back where it belongs
A TensorFlow grammar corrector that learns by deliberately mangling movie dialog.

What it does
Deep Text Corrector trains sequence-to-sequence models to fix small grammatical errors in short, conversational text—think SMS messages or chat. It starts with clean English samples, randomly strips articles, breaks contractions, and swaps homophones to create synthetic training pairs, then trains an attention-based LSTM to reverse the damage.
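To make the perturbation step concrete, here is a minimal sketch of how a clean token list could be turned into a (noisy, clean) training pair. It is not the project's code: the word lists, the default probabilities, and the names `perturb` and `make_pair` are all illustrative assumptions, and contractions are assumed to be tokenized so that a suffix like "'ve" is its own token.

```python
import random

ARTICLES = {"a", "an", "the"}
# Assumed for illustration: contraction suffixes arrive as separate tokens.
CONTRACTION_SUFFIXES = {"'ve", "'ll", "'m", "'re"}
# Assumed, deliberately tiny confusion set; the real list may differ.
HOMOPHONES = {"their": "there", "there": "their", "than": "then", "then": "than"}


def perturb(tokens, rng, p_article=0.25, p_contraction=0.25, p_homophone=0.25):
    """Return a noisy copy of `tokens`. Probabilities are placeholders."""
    noisy = []
    for tok in tokens:
        low = tok.lower()
        if low in ARTICLES and rng.random() < p_article:
            continue  # drop the article
        if low in CONTRACTION_SUFFIXES and rng.random() < p_contraction:
            continue  # break the contraction by dropping its suffix
        if low in HOMOPHONES and rng.random() < p_homophone:
            noisy.append(HOMOPHONES[low])  # swap in the wrong homophone
            continue
        noisy.append(tok)
    return noisy


def make_pair(tokens, rng):
    """Build one (source, target) example: noisy input, clean output."""
    return perturb(tokens, rng), list(tokens)


if __name__ == "__main__":
    rng = random.Random(0)
    clean = ["I", "'ve", "seen", "the", "movie", "more", "than", "once"]
    print(make_pair(clean, rng))
```

Passing the random generator in explicitly keeps the corruption reproducible, which helps when comparing runs on the same synthetic data.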
The interesting bit
The clever part is the decoding constraint: the model is forbidden from inventing words. It can only emit tokens that appear in the input or in a small “corrective” set (words like “the” or “than” that fixes typically insert). The constraint is enforced with a hard binary mask on the logits. A separate OOV trick assumes that out-of-vocabulary words appear in the same order in input and output, which is reasonable when the synthetic errors only touch common words such as articles and never rare vocabulary.
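The masking idea can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the project's implementation: it sets disallowed logits to negative infinity before taking the argmax, which is one common way to apply a hard binary mask, and the names `build_mask` and `masked_argmax` are hypothetical.

```python
import math


def build_mask(input_ids, corrective_ids, vocab_size):
    """1.0 for tokens the decoder may emit, 0.0 for everything else."""
    # In practice special tokens such as end-of-sequence would also need to
    # be allowed; here they are assumed to be part of `corrective_ids`.
    allowed = set(input_ids) | set(corrective_ids)
    return [1.0 if i in allowed else 0.0 for i in range(vocab_size)]


def masked_argmax(logits, mask):
    """Pick the highest-scoring token among those the mask allows."""
    masked = [l if m else -math.inf for l, m in zip(logits, mask)]
    return max(range(len(masked)), key=masked.__getitem__)


if __name__ == "__main__":
    vocab_size = 6
    mask = build_mask(input_ids=[2, 4], corrective_ids=[1], vocab_size=vocab_size)
    logits = [0.1, 0.2, 0.3, 9.0, 0.5, 0.0]  # token 3 scores highest overall
    print(masked_argmax(logits, mask))
```

In the toy example the unconstrained winner (token 3) is neither in the input nor in the corrective set, so the decoder falls back to the best allowed token instead.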
Key highlights
- Synthetic data generation from the Cornell Movie-Dialogs Corpus (300k+ lines), with perturbation rates loosely based on CoNLL 2014 shared task figures
- Biased decoding via logit masking, applied only at inference time so that training keeps the full learning signal
- Straightforward OOV resolution: assumes input and output OOV sequences match one-to-one
- Outperforms an identity-function baseline on accuracy across all sentence lengths; BLEU results are mixed
- Ships as an extension of TensorFlow’s 2016-era seq2seq tutorial code, with an IPython notebook for interactive training
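The one-to-one OOV assumption from the highlights above lends itself to a short sketch. The version below assumes the decoder emits a placeholder string for unknown words and replaces each placeholder with the next unknown word from the input, in order. The placeholder `<unk>` and the function name `resolve_oov` are hypothetical, not taken from the project.

```python
UNK = "<unk>"  # assumed placeholder for out-of-vocabulary tokens


def resolve_oov(input_tokens, output_tokens, vocab):
    """Replace each UNK in the output with the next OOV word from the input."""
    oov_queue = [t for t in input_tokens if t not in vocab]
    resolved = []
    next_oov = 0
    for tok in output_tokens:
        if tok == UNK and next_oov < len(oov_queue):
            resolved.append(oov_queue[next_oov])
            next_oov += 1
        else:
            # Known token, or more UNKs than input OOVs: leave it unchanged.
            resolved.append(tok)
    return resolved


if __name__ == "__main__":
    vocab = {"i", "saw", "the", "movie", "with", UNK}
    source = ["i", "saw", "movie", "with", "zelda"]
    decoded = ["i", "saw", "the", "movie", "with", UNK]
    print(resolve_oov(source, decoded, vocab))
```

The approach breaks down as soon as a correction reorders or drops rare words, which is why it only suits a narrow error model like this one.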
Caveats
- Requires TensorFlow >= 0.11, which dates the project to roughly 2016–2017; modern TF compatibility is unclear
- Error types are narrowly scoped: missing articles, broken contractions, and a handful of homophone swaps—don’t expect it to fix subject-verb agreement or comma splices
- The README notes the Cornell corpus was chosen because it was “the largest collection of conversational written English I could find that was mostly grammatically correct,” which is a telling constraint
Verdict
Worth a look if you’re studying constrained seq2seq decoding or grammar correction as a controlled generation problem. Skip it if you need a production-ready corrector; this is a research demonstration with a narrow error model and dated dependencies.