When seq2seq models need a nudge from reinforcement learning
A 2018 TensorFlow toolkit that bolts reinforcement learning onto encoder-decoder models to fix exposure bias and metric mismatch in summarization.

What it does
RLSeq2Seq is a research codebase that trains abstractive text summarizers by combining standard sequence-to-sequence models with reinforcement learning. It implements scheduled sampling variants, policy-gradient training with a self-critic baseline, and actor-critic methods built on DDQN and dueling networks, all targeted at the CNN/Daily Mail and Newsroom datasets.
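To make the self-critic idea concrete, here is a minimal sketch in plain Python. It is not code from the repository. It assumes the log-probabilities of a sampled summary and two scalar rewards (for example ROUGE scores) have already been computed, and the function and argument names are hypothetical.

```python
import math


def self_critical_loss(sample_log_probs, sample_reward, greedy_reward):
    """Policy-gradient loss that uses the greedy decode's reward as the baseline.

    sample_log_probs: log p(token) for each token of a *sampled* summary.
    sample_reward:    reward of the sampled summary (e.g. a ROUGE score).
    greedy_reward:    reward of the greedy summary, used as the baseline.
    """
    advantage = sample_reward - greedy_reward
    # Minimizing this raises the likelihood of samples that beat the
    # greedy baseline and lowers it for samples that fall short.
    return -advantage * sum(sample_log_probs)


# Illustrative numbers only: a two-token sample that scores slightly
# better than the greedy output.
example_loss = self_critical_loss(
    [math.log(0.5), math.log(0.25)], sample_reward=0.4, greedy_reward=0.3
)
```

In a real training loop the log-probabilities would be differentiable tensors and the rewards would be treated as constants. The sketch only shows how the sign and scale of the loss follow from the reward gap.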
The interesting bit
The project treats text generation as a decision-making problem rather than pure supervised learning. It bundles methods from several RL papers into one training framework, so you can swap between Bengio’s scheduled sampling, Ranzato’s end-to-end backprop, and actor-critic architectures without rewriting the model from scratch.
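As a rough illustration of scheduled sampling, the sketch below walks a decoder through a target sequence and, at each step, flips a coin to decide whether the next input is the gold token or the model's own guess. It is a toy in plain Python, not the repository's implementation. `predict_next` is a hypothetical stand-in for a real decoder step, and all names are assumptions.

```python
import random


def decode_with_scheduled_sampling(gold, predict_next, sampling_prob, rng,
                                   start_token="<s>"):
    """Return (inputs fed to the decoder, predictions made at each step).

    gold:          target tokens the decoder should produce.
    predict_next:  hypothetical model step, maps previous token -> next token.
    sampling_prob: chance of feeding the model's own guess instead of gold.
    rng:           a random.Random instance, passed in for reproducibility.
    """
    prev = start_token
    fed, predicted = [], []
    for target in gold:
        fed.append(prev)
        guess = predict_next(prev)
        predicted.append(guess)
        # Coin flip: expose the model to its own output, or teacher-force.
        prev = guess if rng.random() < sampling_prob else target
    return fed, predicted


# Toy "model" that just upper-cases its input token.
toy_model = lambda token: token.upper()
fed, predicted = decode_with_scheduled_sampling(
    ["a", "b", "c"], toy_model, sampling_prob=0.25, rng=random.Random(0)
)
```

With `sampling_prob=0.0` this reduces to ordinary teacher forcing. Raising the probability over the course of training gradually exposes the decoder to its own mistakes, which is the exposure-bias fix the technique aims for.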
Key highlights
- Supports TensorFlow 1.10.1 (yes, the TF 1.x era)
- Implements three major RL families: scheduled sampling, policy-gradient with self-critic, and actor-critic via DDQN/dueling networks
- Ships with preprocessing scripts that the authors claim boost ROUGE scores on CNN/Daily Mail and Newsroom
- Includes a pointer-generator mechanism with coverage, plus intra-decoder attention
- Directly tied to arXiv:1805.09461 with a citation request baked into the README
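For readers unfamiliar with the DDQN and dueling pieces named in the list above, the following sketch shows the two formulas in plain Python. It is a generic illustration, not the repository's code. It assumes that in the seq2seq setting each "action" is a vocabulary token and that Q-values arrive as simple lists; all function names are hypothetical.

```python
def double_dqn_target(reward, gamma, next_q_online, next_q_target, done=False):
    """Double DQN target: the online net picks the action, the target net scores it."""
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


def dueling_q(state_value, advantages):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean over a' of A(s, a')."""
    mean_advantage = sum(advantages) / len(advantages)
    return [state_value + a - mean_advantage for a in advantages]


# Illustrative numbers only: three candidate tokens at the next step.
target = double_dqn_target(
    reward=1.0, gamma=0.5,
    next_q_online=[0.2, 0.9, 0.1],   # online net prefers token 1
    next_q_target=[3.0, 2.0, 5.0],   # target net's value for token 1 is used
)
q_values = dueling_q(state_value=1.0, advantages=[1.0, 2.0, 3.0])
```

Separating action selection from action evaluation is what distinguishes double DQN from plain DQN, which would take the maximum of the target network's values directly.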
Caveats
- Explicitly noted as “no longer actively maintained”
- Requires Python 2.7, CUDA 9, and cuDNN 7.1—a stack that is now archaeological
- README is thorough on paper references but sparse on architectural details or current benchmark standings
Verdict
Worth a look if you’re reproducing 2018 summarization papers or studying how RL was grafted onto seq2seq before transformers took over. Skip it if you need production code or modern PyTorch/TF 2.x support.