BERT's kryptonite: swapping words until it breaks
A 2019 adversarial attack that fools text classifiers by replacing words with semantically similar synonyms—no model retraining required.

What it does
TextFooler generates adversarial examples for text classification and natural language inference models. It picks words in an input sentence, finds synonyms via counter-fitted word embeddings, and swaps them until the target model (BERT, LSTM, CNN) changes its prediction—while keeping the sentence meaning intact to human readers.
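The swap-until-it-flips loop can be sketched in a few lines. This is a simplified illustration, not the repository's code: `toy_predict`, `WEIGHTS`, and `SYNONYMS` are hypothetical stand-ins for a real classifier and a real embedding-based synonym table, and the sketch leaves out the word ranking and similarity filtering that the real attack applies.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "model": word weights in percentage points toward "positive".
WEIGHTS = {"great": 40, "fun": 30, "good": 10, "enjoyable": 5}


def toy_predict(words: List[str]) -> Tuple[str, float]:
    """Return (label, confidence in that label) for a tokenized sentence."""
    p_pos = min(30 + sum(WEIGHTS.get(w, 0) for w in words), 100)
    if p_pos >= 50:
        return "positive", p_pos / 100
    return "negative", (100 - p_pos) / 100


def greedy_synonym_attack(
    words: List[str],
    predict: Callable[[List[str]], Tuple[str, float]],
    synonyms: Dict[str, List[str]],
) -> List[str]:
    """Swap words for synonyms, left to right, until the predicted label flips."""
    original_label, best_conf = predict(words)
    current = list(words)
    for i, word in enumerate(words):
        best_swap = None
        for candidate in synonyms.get(word, []):
            trial = current[:i] + [candidate] + current[i + 1:]
            label, conf = predict(trial)
            if label != original_label:
                return trial  # prediction changed: attack succeeded
            if conf < best_conf:
                # Same label, but the model is less sure: remember this swap.
                best_conf, best_swap = conf, candidate
        if best_swap is not None:
            current[i] = best_swap
    return current  # no flip found; return the most damaging rewrite


# Hypothetical synonym table; the real attack derives candidates from embeddings.
SYNONYMS = {"great": ["good"], "fun": ["enjoyable"]}

adversarial = greedy_synonym_attack(["a", "great", "fun", "movie"], toy_predict, SYNONYMS)
print(adversarial)  # ['a', 'good', 'enjoyable', 'movie']
```

Note that the loop only ever calls `predict`, which is what makes this style of attack usable without gradients.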
The interesting bit
The attack is entirely black-box: it needs no access to model gradients or architecture, only the ability to query the model for predictions. It uses Universal Sentence Encoder to filter synonym candidates, so semantic similarity is checked automatically rather than by a human in the loop. The paper’s title asks “Is BERT Really Robust?” Spoiler: the answer was no.
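To make the filtering step concrete, here is a minimal sketch of a sentence-level similarity filter. It assumes a deliberately crude encoder: `toy_encode` is a hypothetical bag-of-words stand-in, not Universal Sentence Encoder, and the `threshold` value is an illustrative assumption rather than the setting used by the tool.

```python
import math
from collections import Counter
from typing import Callable, List


def toy_encode(sentence: str) -> Counter:
    """Hypothetical stand-in for a sentence encoder: bag-of-words counts."""
    return Counter(sentence.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # Counter returns 0 for missing keys
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_candidates(
    original: str,
    candidates: List[str],
    encode: Callable[[str], Counter] = toy_encode,
    threshold: float = 0.7,  # assumed value, for illustration only
) -> List[str]:
    """Keep only rewrites whose encoding stays close to the original's."""
    reference = encode(original)
    return [c for c in candidates if cosine(reference, encode(c)) >= threshold]


kept = filter_candidates(
    "the movie was great",
    ["the movie was good", "the film seemed good"],
)
print(kept)
```

With a real sentence encoder plugged in as `encode`, the same shape of check would reject rewrites that drift too far in meaning, even when every individual swap looked like a synonym.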
Key highlights
- Works against BERT, LSTM, and CNN classifiers on 7 datasets
- Pre-computed cosine similarity matrices speed up synonym lookup
- Includes pre-trained target model parameters and generated adversarial examples for direct benchmarking
- Supports both text classification and NLI tasks with separate scripts (attack_classification.py, attack_nli.py)
- Published code for an AAAI 2020 paper (arXiv preprint 2019), with ~530 stars
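The pre-computed similarity matrix mentioned above trades memory for lookup speed: every pairwise cosine is computed once, and finding synonyms becomes a row scan. A minimal sketch, assuming a hypothetical four-word vocabulary with made-up 2-d vectors (real counter-fitted embeddings are far larger, and `k` and `threshold` here are illustrative values):

```python
import math
from typing import List

# Hypothetical toy vocabulary and 2-d embeddings, for illustration only.
VOCAB = ["good", "great", "bad", "awful"]
VECTORS = [[1.0, 0.1], [0.9, 0.3], [-1.0, 0.1], [-0.9, 0.3]]


def build_cos_sim_matrix(vectors: List[List[float]]) -> List[List[float]]:
    """Compute all pairwise cosine similarities once, up front."""
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    return [
        [sum(a * b for a, b in zip(u, v)) / (nu * nv) for v, nv in zip(vectors, norms)]
        for u, nu in zip(vectors, norms)
    ]


def nearest_synonyms(
    word: str,
    vocab: List[str],
    sim: List[List[float]],
    k: int = 2,
    threshold: float = 0.5,
) -> List[str]:
    """Return up to k most similar words from the pre-computed row for `word`."""
    row = sim[vocab.index(word)]
    ranked = sorted(
        ((s, w) for s, w in zip(row, vocab) if w != word and s >= threshold),
        reverse=True,
    )
    return [w for _, w in ranked[:k]]


SIM = build_cos_sim_matrix(VECTORS)
print(nearest_synonyms("good", VOCAB, SIM))
print(nearest_synonyms("bad", VOCAB, SIM))
```

Plain lists keep the sketch dependency-free; at real vocabulary sizes an array library would be the more practical choice.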
Caveats
- Setup requires installing a separate esim package manually and downloading ~1GB of counter-fitted embeddings
- README is sparse on how the attack actually selects which words to perturb; you’ll need the paper for algorithmic details
- Google Drive links for datasets and models may rot over time
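On the word-selection question raised in the caveats: as the paper describes it, words are ranked by how much the model's confidence drops when each one is deleted, and the most influential words are attacked first. The sketch below shows only that core idea, under stated assumptions: `toy_confidence` and `WEIGHTS` are hypothetical, and the paper's extra handling for deletions that flip the label is omitted.

```python
from typing import Callable, List, Tuple

# Hypothetical toy scorer: percentage-point weights toward the predicted class.
WEIGHTS = {"great": 40, "fun": 30}


def toy_confidence(words: List[str]) -> int:
    """Confidence (0-100) in the originally predicted class."""
    return min(30 + sum(WEIGHTS.get(w, 0) for w in words), 100)


def rank_words_by_importance(
    words: List[str],
    confidence: Callable[[List[str]], int],
) -> List[Tuple[str, int]]:
    """Score each word by the confidence lost when it is deleted."""
    base = confidence(words)
    scored = []
    for i, word in enumerate(words):
        without = words[:i] + words[i + 1:]  # sentence with word i removed
        scored.append((word, base - confidence(without)))
    # Largest drop first; Python's sort is stable, so ties keep sentence order.
    return sorted(scored, key=lambda pair: -pair[1])


ranking = rank_words_by_importance(["a", "great", "fun", "movie"], toy_confidence)
print(ranking)
```

Each score costs one model query, so ranking a sentence of n words takes n + 1 queries and needs nothing beyond the model's output scores.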
Verdict
Worth a look if you’re building NLP defenses or benchmarking model robustness—this was an influential early attack. Skip it if you need something production-ready; the tooling is research-grade and the field has moved on to more sophisticated attacks.