Social science for people who'd rather code than theorize
A scikit-learn-flavored toolkit that turns messy conversations into measurable social signals.

What it does
ConvoKit packages conversational analysis as a Python toolkit with a scikit-learn-compatible interface. It bundles a dozen research-backed feature extractors (politeness strategies, linguistic coordination, hypergraph structure, conversational forecasting) with ready-to-download datasets spanning Supreme Court arguments, Reddit threads, Wikipedia talk pages, and movie dialogues.
The interesting bit
The toolkit doesn’t just give you bag-of-words; it bakes in published social science. The “linguistic coordination” feature, for instance, measures power dynamics through function-word mimicry. The “Expected Conversational Context Framework” lets you characterize utterances by what typically surrounds them. These are specific, citable methods from Cornell NLP papers, not generic NLP utilities.
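To make the function-word mimicry idea concrete, here is a minimal pure-Python sketch. It is not ConvoKit's implementation or API: the `MARKERS` lexicon, `has_marker`, and `coordination` are hypothetical names made up for illustration, and the word lists are a tiny toy subset. The score follows the usual formulation of coordination as I understand it: how much more likely a reply is to use a function-word class when the message it answers used that class, compared with the reply's baseline rate.

```python
# Toy function-word classes (illustrative subset only, NOT ConvoKit's lexicon).
MARKERS = {
    "article": {"a", "an", "the"},
    "pronoun": {"i", "you", "we", "they", "it"},
    "preposition": {"in", "on", "at", "of", "to", "with"},
}


def has_marker(text, words):
    """Return True if any whitespace token (punctuation stripped) is in `words`."""
    return any(tok.strip(".,!?;:") in words for tok in text.lower().split())


def coordination(exchanges, marker):
    """Estimate P(reply has marker | prompt has marker) - P(reply has marker).

    `exchanges` is a list of (prompt, reply) string pairs. Returns None when
    the score is undefined (no exchanges, or no prompt contains the marker).
    """
    words = MARKERS[marker]
    prompt_has = [has_marker(prompt, words) for prompt, _ in exchanges]
    reply_has = [has_marker(reply, words) for _, reply in exchanges]

    n_exchanges = len(exchanges)
    n_prompt = sum(prompt_has)
    if n_exchanges == 0 or n_prompt == 0:
        return None

    baseline = sum(reply_has) / n_exchanges
    both = sum(p and r for p, r in zip(prompt_has, reply_has))
    conditional = both / n_prompt
    return conditional - baseline


exchanges = [
    ("Where is the report?", "The report is ready."),
    ("Can you send the file?", "Sending the file now."),
    ("Are you done?", "Almost done."),
    ("Why so late?", "Traffic was bad."),
]
score = coordination(exchanges, "article")
print(score)
```

A positive score means replies echo the marker class more often than their baseline rate would predict. In the published work, comparing such scores across groups of speakers is what links mimicry to power differences; this sketch only computes the score for one set of exchanges.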
Key highlights
- Ships with 10+ curated corpora (Supreme Court, Parliament Q&A, 900k subreddits, etc.) via convokit.download()
- Implements published methods: politeness strategies, redirection detection, pivotal moment identification, CRAFT forecasting model
- Scikit-learn-inspired unified interface; includes interactive Colab tutorials
- Active maintenance: v4.1.1 released May 2026, 37 contributors, Discord community
Caveats
- Some features (prompt types, surface motifs) appear commented out in the README—status unclear
- Several dataset download links point to a Cornell server (zissou.infosci.cornell.edu); long-term availability not guaranteed
- Heavy tilt toward academic research use cases; production deployment guidance is sparse
Verdict
Researchers studying online discourse, power dynamics, or conversation derailment should start here. Engineers building chatbots or generic conversational AI will find useful pieces but may need to bridge gaps themselves.