A Python toolkit for mining materials data without reinventing the wheel
Matminer collects scattered materials-science datasets and featurizers into one library so researchers can stop writing the same data-prep scripts.

What it does
Matminer is a Python library that gathers datasets, data-retrieval methods, and featurizers for materials science into a single package. It handles the tedious work of finding, formatting, and citing community-developed data so you can focus on analysis rather than wrangling. It requires Python 3.11 or newer.
The interesting bit
The library tracks provenance for you: datasets, retrievers, and featurizers carry citations() methods that return BibTeX entries. It’s a small feature that solves a real pain point: academic papers whose data sources get no more than a vague wave in a footnote.
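To make the citation workflow concrete, here is a minimal sketch of how you might gather BibTeX entries from several objects into one deduplicated bibliography. It assumes only that each object has a citations() method returning a list of BibTeX strings, which is how the README describes the feature. The helper collect_bibtex and the stand-in class FakeFeaturizer are hypothetical names invented for this illustration, not part of matminer, and the BibTeX entries are placeholders.

```python
from typing import Iterable, List, Protocol


class Citable(Protocol):
    """Anything exposing citations() -> list of BibTeX strings (assumed shape)."""

    def citations(self) -> List[str]: ...


def collect_bibtex(objects: Iterable[Citable]) -> List[str]:
    """Merge citations from many objects, dropping duplicates but keeping order."""
    seen = set()
    entries: List[str] = []
    for obj in objects:
        for entry in obj.citations():
            key = entry.strip()  # ignore stray whitespace when comparing entries
            if key and key not in seen:
                seen.add(key)
                entries.append(key)
    return entries


class FakeFeaturizer:
    """Hypothetical stand-in for a real featurizer; not a matminer class."""

    def __init__(self, entries: List[str]) -> None:
        self._entries = list(entries)

    def citations(self) -> List[str]:
        return list(self._entries)


if __name__ == "__main__":
    # Placeholder entries; real ones would come from the library's objects.
    a = FakeFeaturizer(["@misc{placeholder_a, note={hypothetical entry}}"])
    b = FakeFeaturizer([
        "@misc{placeholder_a, note={hypothetical entry}}",
        "@misc{placeholder_b, note={hypothetical entry}}",
    ])
    print("\n\n".join(collect_bibtex([a, b])))
```

With matminer installed, the same helper should accept real featurizer instances in place of FakeFeaturizer, provided their citations() methods return lists of strings as assumed here.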
Key highlights
- Bundles community datasets and featurizers in one importable library
- Built-in citation tracking via citations() methods on datasets, retrievers, and featurizers
- Companion projects for automation (automatminer) and benchmarking (matbench)
- Active since at least 2018 with a dedicated help forum
- Separate examples repo with worked demonstrations
Caveats
- The README is thin on specifics: no dataset counts, no performance claims, no architecture overview
- The examples and deeper docs live in separate repositories, so you’ll be clicking around
Verdict
Materials scientists doing ML on structure-property relationships should grab this to skip boilerplate data loading. Everyone else can pass; there’s nothing generic here worth repurposing.