An open-source AI lakehouse that actually admits it's heavy
Hopsworks bundles a feature store, MLOps pipeline tooling, and team governance into one Java-heavy platform you can run anywhere—or pay them to manage.
What it does
Hopsworks is a self-described “Real-Time AI Lakehouse” built around a Python-centric feature store. It gives ML teams a shared workspace for feature engineering, model registry, training pipelines, and model serving, with project-based multi-tenancy so different teams can safely share a single cluster. The platform also bundles Jupyter notebooks, Airflow for pipeline orchestration, and support for Spark, Flink, and GPU training.
The interesting bit
The project-based sandbox model is the unusual angle: it treats ML assets (features, models, training data) as governed, versioned resources that can be shared across team boundaries without dumping everyone into the same namespace. That’s the governance pitch—sensitive data stays isolated, collaboration happens anyway.
Key highlights
- Modular by design: usable as standalone feature store, full MLOps platform, or anything between
- Multi-platform: managed cloud (AWS/Azure/GCP), on-prem Linux installs, or serverless app at app.hopsworks.ai
- AGPL-3.0 licensed: copyleft, so anyone who distributes a modified version or runs one as a network service must make the modified source available to its users
- Integrates with Databricks, SageMaker, and Kubeflow per the README
- On-prem requires 32GB RAM, 8 CPUs, and direct engagement with Hopsworks engineering for setup
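Before starting that on-prem conversation, it can help to check whether a candidate host clears the stated hardware bar. The following is a minimal sketch, assuming the 32 GB and 8 CPU figures are minimums and that the host runs Linux. The function names are hypothetical and nothing here comes from Hopsworks tooling.

```python
import os

# Figures quoted in the on-prem requirement; treated here as hard minimums (assumption).
MIN_RAM_GB = 32
MIN_CPUS = 8


def meets_minimum(ram_gb: float, cpus: int) -> bool:
    """Return True when both the RAM and CPU counts reach the stated figures."""
    return ram_gb >= MIN_RAM_GB and cpus >= MIN_CPUS


def detect_host() -> tuple:
    """Read physical RAM in GiB and the logical CPU count (Linux sysconf names)."""
    ram_bytes = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES")
    return ram_bytes / 2**30, os.cpu_count() or 0


if __name__ == "__main__":
    ram_gb, cpus = detect_host()
    verdict = "ok" if meets_minimum(ram_gb, cpus) else "below the stated requirement"
    print(f"{ram_gb:.1f} GiB RAM, {cpus} CPUs: {verdict}")
```

A host sold as 32 GB usually reports slightly less usable memory than that, so a strict comparison like this one can flag a machine that nominally qualifies. Treat the result as a prompt for a closer look, not a verdict.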
Caveats
- On-premise installation is explicitly not self-serve: “each infrastructure is unique and requires a tailored approach” — expect professional services
- The serverless app is labeled beta
- Java repo with 1,299 stars; the actual Python APIs live in separate repositories
Verdict
Worth evaluating if your team has outgrown ad-hoc feature storage and needs governed collaboration across multiple projects. Skip it if you want a lightweight, drop-in feature store without the full-platform commitment.