Neo4j's LLM-powered ETL pipeline for the graph-curious
Feed it PDFs, YouTube links, or web pages; get back a queryable knowledge graph in Neo4j.

What it does
This is a full-stack application, with a FastAPI backend and a React frontend, that ingests unstructured data (PDFs, docs, YouTube transcripts, web pages, S3/GCS buckets) and uses LLMs via LangChain to extract entities and relationships. It writes the results to Neo4j as a structured knowledge graph, then lets you chat with the data through several retrieval modes (vector, graph, hybrid, and others).
The interesting bit
The project treats “graph construction” as an ETL problem with LLMs as the transformer. It supports a wide model zoo (OpenAI, Gemini, Anthropic, Groq, Ollama, DeepSeek, Bedrock, Fireworks, plus OpenAI-compatible endpoints) and includes token-usage tracking with per-user and per-database metering, which is unusual for an open-source tool.
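To make the "LLM as transformer" framing concrete, here is a minimal sketch of the extract, transform, load shape. It is not the project's code: `Triple`, `chunk_text`, `extract`, `to_cypher`, and `run_pipeline` are hypothetical names, the LLM call is replaced by a fixed rule so the example runs offline, and the load step only builds Cypher `MERGE` statements instead of sending them to a database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """One extracted fact: (source)-[relation]->(target), with node labels."""
    source: str
    source_label: str
    relation: str
    target: str
    target_label: str


def chunk_text(text: str, size: int = 200) -> list[str]:
    """Extract step: split raw text into fixed-size chunks."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def extract(chunk: str) -> list[Triple]:
    """Transform step. Hypothetical stand-in for the LLM call:
    a real pipeline would prompt a model and parse its answer."""
    if "Neo4j" in chunk and "Cypher" in chunk:
        return [Triple("Neo4j", "Product", "USES", "Cypher", "Language")]
    return []


def to_cypher(t: Triple) -> tuple[str, dict]:
    """Load step: build an idempotent MERGE statement plus parameters.
    Labels and relationship types cannot be Cypher parameters, so they are
    interpolated (backtick-quoted); values are passed as parameters."""
    query = (
        f"MERGE (a:`{t.source_label}` {{id: $source}}) "
        f"MERGE (b:`{t.target_label}` {{id: $target}}) "
        f"MERGE (a)-[:`{t.relation}`]->(b)"
    )
    return query, {"source": t.source, "target": t.target}


def run_pipeline(text: str) -> list[tuple[str, dict]]:
    """Chain the three steps and return the statements to execute."""
    statements = []
    for chunk in chunk_text(text):
        for triple in extract(chunk):
            statements.append(to_cypher(triple))
    return statements
```

Using `MERGE` rather than `CREATE` means re-running the pipeline over the same document does not duplicate nodes or relationships. In a real deployment the interpolated labels should come from a trusted or validated source, since they bypass parameter binding.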
Key highlights
- Schema-aware extraction: define custom node/relationship labels or let the LLM infer them
- Multiple chat/retrieval modes: vector, graph, graph+vector, fulltext, entity_vector, global_vector
- Built-in visualization via Neo4j Bloom; standalone chat UI at /chat-only
- Token usage tracking with daily/monthly limits and a dedicated API endpoint
- Docker Compose for local deployment; GCP Cloud Run configs included
- Supports Neo4j Aura (free tier works) and Neo4j Desktop (manual split deploy)
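The schema-aware extraction highlight can be pictured as a filter applied to whatever the LLM returns. The sketch below is an illustration under assumptions, not the project's implementation: `Schema`, `Triple`, and `apply_schema` are hypothetical names, and passing `None` stands in for the "let the LLM infer them" mode.

```python
from typing import NamedTuple, Optional


class Triple(NamedTuple):
    """Labels of one extracted relationship: (source)-[relation]->(target)."""
    source_label: str
    relation: str
    target_label: str


class Schema(NamedTuple):
    """User-defined allow-lists for node labels and relationship types."""
    node_labels: frozenset
    relationship_types: frozenset


def apply_schema(triples: list[Triple], schema: Optional[Schema]) -> list[Triple]:
    """Keep only triples that fit the schema.
    With schema=None, everything the extractor produced is kept,
    which models the free-form (LLM-inferred) mode."""
    if schema is None:
        return list(triples)
    return [
        t for t in triples
        if t.source_label in schema.node_labels
        and t.target_label in schema.node_labels
        and t.relation in schema.relationship_types
    ]


# Example: the second triple uses a label outside the schema and is dropped.
extracted = [
    Triple("Person", "WORKS_AT", "Company"),
    Triple("Person", "LIVES_IN", "City"),
]
schema = Schema(frozenset({"Person", "Company"}), frozenset({"WORKS_AT"}))
kept = apply_schema(extracted, schema)
```

Filtering after extraction is only one possible design; a pipeline can also put the allowed labels into the prompt so the model is steered toward the schema in the first place.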
Caveats
- Several LLM providers (Anthropic, Groq, Bedrock, Fireworks, Ollama, DeepSeek) are marked “dev deployed version”; it is unclear whether they are fully supported or experimental
- Diffbot API key is listed as mandatory even if you only want to use OpenAI
- Neo4j Desktop users must run backend and frontend separately; docker-compose is explicitly not supported
- README is truncated partway through the environment-variable table, so the full configuration surface is not visible
Verdict
Worth a look if you’re building GraphRAG pipelines and want a working reference architecture rather than wiring LangChain to Neo4j yourself. Skip it if you need a lightweight library; this is a full application with frontend, backend, and considerable deployment surface area.