Is bookcorpus open source?

Yes — soskek/bookcorpus is open source, released under the MIT license.

What language is bookcorpus written in?

soskek/bookcorpus is primarily written in Python.

How popular is bookcorpus?

soskek/bookcorpus has 863 stars on GitHub.

Where can I find bookcorpus?

soskek/bookcorpus is on GitHub at https://github.com/soskek/bookcorpus.

soskek/bookcorpus

DIY a famous NLP dataset before it vanishes again

A scraper that rebuilds BookCorpus from Smashwords after the original disappeared, with the author openly warning it might not work anymore.

★ 863 stars · Python · Data Tooling · Language Models

What it does

This is a Python crawler that reconstructs BookCorpus, a once-standard text corpus for training sentence encoders. It scrapes free ebooks from Smashwords (the same source as the original): it downloads each book as txt or epub, extracts the text, and writes it out one sentence per line. The repo includes a snapshot of ~11,000 book URLs from January 2019 to get you started without re-crawling.
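
To make "sentence-per-line output" concrete, here is a minimal sketch of that last formatting step. It is not the repository's code: the function name `to_sentence_lines` is hypothetical, and the regex splitter is a deliberately naive stand-in for whatever segmentation the repo's postprocessing scripts actually use.

```python
import re

# Hypothetical helper, not taken from the repository.
# Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def to_sentence_lines(text: str) -> str:
    """Collapse all whitespace runs, then emit one sentence per line."""
    flat = " ".join(text.split())
    sentences = [s for s in _SENTENCE_END.split(flat) if s]
    return "\n".join(sentences)


if __name__ == "__main__":
    sample = "It was late.  The rain\nhad stopped! Was anyone home?"
    print(to_sentence_lines(sample))
```

A rule this simple will split incorrectly on abbreviations such as "Dr." or "e.g.", which is one reason a real pipeline would lean on a dedicated sentence segmenter.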

The interesting bit

The README spends more space warning you not to use this than explaining how to use it. The author flags that crawling may be broken, points to three alternative sources (including a HuggingFace dataset), and cites a research paper on the corpus’s “deficiencies.” It’s a refreshingly honest maintenance mode: half tool, half historical document.

Key highlights

Fetches txt when possible, falls back to epub extraction with optional word-count validation (--trash-bad-count)
Includes postprocessing scripts for sentence segmentation and optional BlingFire tokenization
Ships with a frozen 2019 URL list so you can skip the fragile discovery step
epub2txt.py adapted from an existing project, not reinvented
Straightforward dependencies: BeautifulSoup, lxml, html2text, progressbar2
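
The txt-first, epub-fallback behavior in the first highlight can be sketched as below. This is an illustration under stated assumptions, not the repo's implementation: `pick_text`, `count_words`, the injected fetcher callables, and the `tolerance` rule are all hypothetical, and the real `--trash-bad-count` check may compare word counts differently.

```python
from typing import Callable, Optional


def count_words(text: str) -> int:
    """Whitespace-delimited word count."""
    return len(text.split())


def pick_text(
    fetch_txt: Callable[[], Optional[str]],
    fetch_epub_text: Callable[[], Optional[str]],
    expected_words: Optional[int] = None,
    tolerance: float = 0.5,  # assumed threshold, chosen only for illustration
) -> Optional[str]:
    """Prefer the txt download; fall back to text extracted from the epub.

    When expected_words is given, epub text whose word count falls outside
    expected_words * (1 +/- tolerance) is discarded.
    """
    text = fetch_txt()
    if text:
        return text

    text = fetch_epub_text()
    if not text:
        return None  # neither format produced any text

    if expected_words is not None:
        low = expected_words * (1 - tolerance)
        high = expected_words * (1 + tolerance)
        if not (low <= count_words(text) <= high):
            return None  # extraction looks truncated or padded
    return text
```

Passing the fetchers in as callables keeps the decision logic separate from the network code, so the fallback rule can be exercised without touching Smashwords.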

Caveats

The author explicitly warns “clawling could be difficult due to some issues of the website” [sic]
Smashwords terms of service apply; the disclaimer disavows responsibility for “plagiarism or legal implication”
Expected error noise (Failed: epub and txt, File is not a zip file) is normal per the README
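
The "File is not a zip file" text matches the message of Python's standard `zipfile.BadZipFile` exception. An epub is a zip archive, so one plausible source of this noise (an assumption, not something the README confirms) is a download that returned something other than an epub, such as an HTML error page. The sketch below shows how that surfaces; `list_epub_members` is a hypothetical helper.

```python
import io
import zipfile
from typing import List, Optional


def list_epub_members(data: bytes) -> Optional[List[str]]:
    """Return the archive member names, or None if data is not a zip file."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            return archive.namelist()
    except zipfile.BadZipFile as err:
        # For non-zip input, str(err) is "File is not a zip file".
        print(f"skipped: {err}")
        return None


if __name__ == "__main__":
    list_epub_members(b"<html>this is an error page, not an epub</html>")
```

Catching the exception per book and moving on is what lets a long crawl keep running while individual downloads fail.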

Verdict

Useful if you need to understand how BookCorpus was built, or if the alternative distributions don’t fit your licensing needs. For most practitioners, the HuggingFace dataset or Shawn Presser’s 2020 crawl is probably less of a headache. Skip this if you want something that just works today.
