nuclia/nucliadb

A database that actually gets unstructured data

NucliaDB stores and searches text, files, vectors, and annotations with hybrid search built for RAG pipelines.

716 stars · Python · RAG · Search · Data Tooling
Velocity (7d): +0.5 ★ / day · Trend: steady

What it does

NucliaDB is an open-source database for storing and searching unstructured data. It combines vector search, full-text search, and graph indexes in one system, with a storage layer backed by PostgreSQL and blob support for S3, GCS, and Azure. It’s written in Rust and Python, designed for multi-tenant deployments and large datasets.

The interesting bit

The project is explicitly built around a commercial ecosystem: Nuclia’s cloud “Understanding API” handles the messy work of data extraction and AI enrichment, while NucliaDB serves as the queryable storage layer. The AGPLv3 license means you can use it freely, but if you modify it and distribute it or offer it to users as a network service, you must publish your changes. That is a deliberate fence around the hosted service business model.

Key highlights

  • Hybrid search: vector, keyword, and graph indexes in one query layer
  • Field types cover text, files, links, and conversations with paragraph-level indexing
  • Exports to HuggingFace datasets and PyTorch-compatible formats
  • Distributed search with index replication and cloud-native deployment options
  • Role-based security with upstream proxy authentication
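
To make the hybrid search highlight concrete, here is a minimal sketch of how ranked results from separate indexes can be merged into one list. It uses reciprocal rank fusion, a common generic technique; this is an assumption chosen for illustration, not NucliaDB's documented algorithm or API. The function name, document IDs, and the constant k=60 are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document IDs into one ranked list.

    Illustrative only: not taken from NucliaDB. Each document scores
    1 / (k + rank) per list it appears in, and the scores are summed.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from a vector index and a keyword index.
vector_hits = ["doc_a", "doc_b", "doc_c"]
keyword_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, which is the practical appeal of a unified query layer over gluing two separate databases together.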

Caveats

  • The README has rough edges: typos (“multi-teanant”), awkward phrasing (“utilizing the power”), and vague feature claims (“Cloud data and insight extraction” is listed without detail)
  • The cloud API integration is clearly the revenue path; self-hosted users get the database engine without the automatic NLP enrichment

Verdict

Worth evaluating if you’re building RAG infrastructure and want a unified search backend rather than gluing vector and text databases together. Skip it if you need a mature, standalone vector database without the surrounding Nuclia cloud ecosystem.
