DIY a famous NLP dataset before it vanishes again
A scraper that rebuilds BookCorpus from Smashwords after the original disappeared, with the author openly warning it might not work anymore.

What it does
This is a Python crawler that reconstructs BookCorpus, a once-standard text corpus for training sentence encoders. It scrapes free ebooks from Smashwords—the same source as the original—downloads txt or epub files, extracts text, and formats it into sentence-per-line output. The repo includes a snapshot of ~11,000 book URLs from January 2019 to get you started without re-crawling.
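The last step, sentence-per-line output, is easy to picture with a minimal sketch. This is not the repo's code: `to_sentence_lines` is a hypothetical helper, and the regex splitter is a naive stand-in for whatever segmenter the postprocessing scripts actually use.

```python
import re

# Hypothetical helper, not taken from the repo. Splits after ., ! or ?
# when followed by whitespace; real segmenters handle far more edge cases.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(text: str) -> str:
    """Return text with one sentence per line and blank lines dropped."""
    lines = []
    for paragraph in text.splitlines():
        paragraph = paragraph.strip()
        if not paragraph:
            continue  # skip empty lines between paragraphs
        lines.extend(s for s in _SENTENCE_END.split(paragraph) if s)
    return "\n".join(lines)


if __name__ == "__main__":
    print(to_sentence_lines("It was late. She left!\n\nNobody noticed?"))
```

A splitter this simple will break on abbreviations such as "Dr." or on quoted dialogue, which is one reason a dedicated tokenizer is worth having as an option.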
The interesting bit
The README spends more space warning you not to use this than explaining how. The author flags that crawling may be broken, points to three alternative sources (including a HuggingFace dataset), and cites a research paper on the corpus’s “deficiencies.” It’s a refreshingly honest take on maintenance mode: half tool, half historical document.
Key highlights
- Fetches txt when possible, falls back to epub extraction with optional word-count validation (--trash-bad-count)
- Includes postprocessing scripts for sentence segmentation and optional BlingFire tokenization
- Ships with a frozen 2019 URL list so you can skip the fragile discovery step
- epub2txt.py adapted from an existing project, not reinvented
- Straightforward dependencies: BeautifulSoup, lxml, html2text, progressbar2
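The txt-first, epub-fallback logic with a word-count check might look roughly like the sketch below. Everything here is an illustrative assumption rather than the repo's implementation: the function name, the callable parameters, and especially the tolerance rule, which is just one guess at what "validation" could mean (comparing the extracted word count against a count listed for the book).

```python
from typing import Callable, Optional


def fetch_book_text(
    fetch_txt: Callable[[], Optional[str]],
    fetch_epub_text: Callable[[], Optional[str]],
    expected_words: Optional[int] = None,
    tolerance: float = 0.5,  # assumed threshold, not from the repo
) -> Optional[str]:
    """Try plain text first; fall back to text extracted from an epub.

    Returns None when both sources fail or the epub text looks wrong.
    """
    text = fetch_txt()
    if text:
        return text  # txt download succeeded, no validation needed

    text = fetch_epub_text()
    if not text:
        return None  # both formats failed

    if expected_words:
        actual = len(text.split())
        # Discard extractions whose length is far from the listed count,
        # since a large mismatch suggests the extraction went wrong.
        if abs(actual - expected_words) > tolerance * expected_words:
            return None
    return text
```

Passing the two fetchers in as callables keeps the sketch free of network code, so the fallback order and the validation rule can be read (and tested) on their own.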
Caveats
- The author explicitly warns “clawling could be difficult due to some issues of the website” [sic]
- Smashwords terms of service apply; the disclaimer disavows responsibility for “plagiarism or legal implication”
- Expected error noise (“Failed: epub and txt”, “File is not a zip file”) is normal per the README
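The "File is not a zip file" message is what Python's standard `zipfile` module raises when handed something that is not a zip archive, and an epub is a zip container. One plausible reading (an assumption, not something the README states) is that some downloads come back as a non-epub payload such as an HTML error page. A minimal sketch that reproduces the message, with `try_open_epub` as a hypothetical helper:

```python
import io
import zipfile


def try_open_epub(payload: bytes) -> str:
    """Return 'ok' if payload opens as a zip archive, else the error text."""
    try:
        with zipfile.ZipFile(io.BytesIO(payload)):
            return "ok"
    except zipfile.BadZipFile as exc:
        return str(exc)


# An HTML error page stands in for a failed epub download.
print(try_open_epub(b"<html><body>404 Not Found</body></html>"))
```

Catching `zipfile.BadZipFile` per book and moving on is a reasonable way for a crawler to treat this as noise rather than a fatal error.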
Verdict
Useful if you need to understand how BookCorpus was built, or if the alternative distributions don’t fit your licensing needs. For most practitioners, the HuggingFace dataset or Shawn Presser’s 2020 crawl will probably be less of a headache. Skip this if you want something that just works today.