News summarization by brute-force journalism
A Python library that reverse-engineers the 5W1H structure from news articles, because someone finally decided to treat reporters' training as a spec.

What it does
Giveme5W1H parses news articles and extracts phrases answering the classic journalistic questions: who, what, when, where, why, and how. It exposes both a Python 3.6+ library and a RESTful API, and expects input in a JSON format matching the companion news-please crawler’s output.
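To make the input side concrete, here is a minimal sketch that builds one article record and serializes it to JSON. The field names (`title`, `description`, `text`, `date_publish`, `url`) and the timestamp layout are assumptions modeled on typical news-please output, not a confirmed schema, and `make_article` is a hypothetical helper; check the example files in the repository before relying on any of it.

```python
import json
from datetime import datetime


def make_article(title, description, text, date_publish, url=None):
    """Build a minimal article record (hypothetical helper, assumed field names)."""
    if not title or not text:
        raise ValueError("an article needs at least a title and body text")
    return {
        "title": title,
        "description": description,
        "text": text,
        # Assumed timestamp layout; verify against real crawler output.
        "date_publish": date_publish.strftime("%Y-%m-%d %H:%M:%S"),
        "url": url,
    }


article = make_article(
    title="Example headline",
    description="Example lead paragraph.",
    text="Example body text.",
    date_publish=datetime(2019, 9, 20, 8, 30),
)
payload = json.dumps(article, indent=2)
```

The resulting `payload` string is what you would write to disk or send to an API, assuming the schema matches.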
The interesting bit
The system leans on Stanford CoreNLP for heavy linguistic lifting, but wraps it in a scoring pipeline that ranks candidate phrases per question rather than treating extraction as a single-shot classification problem. The “learn weights” tooling also suggests the authors acknowledge their heuristics need tuning per domain.
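The rank-the-candidates idea can be illustrated with a toy weighted-sum scorer. This is a sketch of the general technique, not the library's actual scoring code: the `Candidate` class, the feature names, and the weights are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    phrase: str
    features: dict  # feature name -> value in [0, 1] (hypothetical features)


def score(candidate, weights):
    """Weighted sum of the candidate's features; missing features count as 0."""
    return sum(w * candidate.features.get(name, 0.0) for name, w in weights.items())


def rank(candidates, weights):
    """Return candidates sorted best-first by weighted score."""
    return sorted(candidates, key=lambda c: score(c, weights), reverse=True)


# Hypothetical weights for the "who" question; a tuning step would adjust these per domain.
who_weights = {"position": 0.5, "frequency": 0.3, "is_named_entity": 0.2}

who_candidates = [
    Candidate("the committee", {"position": 0.5, "frequency": 0.5, "is_named_entity": 0.0}),
    Candidate("Jane Doe", {"position": 1.0, "frequency": 0.5, "is_named_entity": 1.0}),
    Candidate("officials", {"position": 0.0, "frequency": 1.0, "is_named_entity": 0.0}),
]

ranked = rank(who_candidates, who_weights)
best = ranked[0].phrase
```

Keeping the weights in a plain dictionary per question is what makes per-domain tuning cheap: changing the weights reorders the candidates without touching the feature extraction.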
Key highlights
- Requires running a separate Stanford CoreNLP Server (port 9000), which initializes lazily and can take minutes on first use
- REST API runs on port 9099 and includes a browser playground for testing articles
- Caches CoreNLP and enhancer output to disk to avoid reprocessing
- Ships with file-handler utilities for batch-processing JSON article folders
- Academic lineage: published at INRA 2019, Apache 2.0 licensed
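The caching and batch-processing highlights combine naturally. The sketch below shows one way to do it under stated assumptions: results are cached on disk keyed by a hash of the article text, and a folder of JSON articles is processed with the expensive step skipped on cache hits. The function names, the `text` field, and the cache layout are hypothetical and do not reflect the library's own handlers.

```python
import hashlib
import json
from pathlib import Path


def cache_path(cache_dir, text):
    """Derive a stable cache file name from the article text."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return Path(cache_dir) / f"{digest}.json"


def process_cached(text, analyze, cache_dir):
    """Return the cached result for text, calling analyze only on a cache miss."""
    path = cache_path(cache_dir, text)
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))
    result = analyze(text)  # expensive step, e.g. a call to an NLP server
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(result), encoding="utf-8")
    return result


def process_folder(input_dir, analyze, cache_dir):
    """Run process_cached over every *.json article in input_dir, keyed by file name."""
    results = {}
    for article_file in sorted(Path(input_dir).glob("*.json")):
        article = json.loads(article_file.read_text(encoding="utf-8"))
        # "text" is an assumed field name for the article body.
        results[article_file.name] = process_cached(article["text"], analyze, cache_dir)
    return results
```

Here `analyze` is any callable that takes text and returns a JSON-serializable result, so a second run over the same folder never reaches the slow step.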
Caveats
- The README warns that some “Additional Information” is outdated, which is… not ideal for a documentation section
- Manual CoreNLP server management is mandatory; the authors explicitly rejected transparent integration due to startup latency
- No GPU acceleration mentioned; this is CPU-bound NLP from the CoreNLP era
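Since the server has to be managed by hand, a quick pre-flight check can save a confusing stack trace. The sketch below only tests whether something accepts TCP connections on the expected port; `server_reachable` is a hypothetical helper, and a successful connection does not prove that CoreNLP has finished its lazy initialization.

```python
import socket


def server_reachable(host="localhost", port=9000, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

Calling `server_reachable()` before starting a batch lets the script fail fast with a clear message instead of stalling on the first article.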
Verdict
Worth a look if you’re building news analysis pipelines and need structured event summaries without training your own models. Skip it if you want modern transformer-based extraction or a fully self-contained library: this is a 2019-vintage system with 2019-vintage dependencies.