Open-Source-Legal/cite

Open-source citation graph for humans and AI to share

A document repository that stores the relationships between files, not just the files themselves.

★ 1.3k stars · Python · RAG · Search · Agents · Data Tooling

Velocity (7d): +1.0 ★ / day · Trend: steady

What it does

cite (formerly OpenContracts) turns a pile of documents into a navigable citation graph. Humans annotate documents with precise spans and custom labels; AI agents traverse those annotations via a Model Context Protocol endpoint. Same underlying graph, two interfaces: GraphQL/REST for people, MCP for machines.
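
To make the idea of a citation graph concrete, here is a minimal in-memory sketch. It is not the project's data model: the `Annotation` and `CitationGraph` names, the character-offset span fields, and the sample file names are all assumptions made for illustration. It only shows labeled spans as nodes, citations as edges, and a traversal of the kind an agent might perform.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Annotation:
    """A labeled span inside one document (hypothetical shape)."""
    id: str
    document: str
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    label: str


@dataclass
class CitationGraph:
    """Annotations are nodes; a citation is a directed edge between two of them."""
    annotations: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

    def add(self, annotation):
        self.annotations[annotation.id] = annotation
        self.edges.setdefault(annotation.id, [])

    def cite(self, source_id, target_id):
        """Record that one annotation cites another."""
        if source_id not in self.annotations or target_id not in self.annotations:
            raise KeyError("both annotations must exist before linking them")
        self.edges[source_id].append(target_id)

    def walk(self, start_id):
        """Breadth-first walk over citation edges, visiting each node once."""
        seen = {start_id}
        queue = deque([start_id])
        order = []
        while queue:
            current = queue.popleft()
            order.append(self.annotations[current])
            for nxt in self.edges[current]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order


graph = CitationGraph()
graph.add(Annotation("a1", "brief.pdf", 120, 168, "citation"))
graph.add(Annotation("a2", "statute.pdf", 40, 95, "definition"))
graph.add(Annotation("a3", "ruling.pdf", 10, 60, "holding"))
graph.cite("a1", "a2")
graph.cite("a2", "a3")
visited_documents = [a.document for a in graph.walk("a1")]
```

Starting from the span in `brief.pdf`, the walk reaches the statute and then the ruling purely by following stored edges, without reading any raw text.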

The interesting bit

The project inverts the usual AI document workflow: instead of generating citations from raw text and risking hallucinated ones, agents walk edges that humans have already drawn. Agents can propose new annotations, but humans review and accept them. The graph compounds over time: fork a public corpus, build on someone else’s work, contribute back. The README even addresses LLM-based agents directly, pointing them to /mcp/, /llms.txt, and /.well-known/mcp.json.
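
The propose-then-review loop can be sketched as a small state machine. This is an illustration under assumptions, not the project's implementation: `ReviewQueue`, `Proposal`, the three status values, and the reviewer and agent names are hypothetical. The sketch only enforces the rule described above, that agents may propose but only humans can accept.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Status(Enum):
    PROPOSED = "proposed"
    ACCEPTED = "accepted"
    REJECTED = "rejected"


@dataclass
class Proposal:
    annotation_id: str
    proposed_by: str
    status: Status = Status.PROPOSED
    reviewed_by: Optional[str] = None


class ReviewQueue:
    """Hypothetical gate: anyone may propose, only listed humans may decide."""

    def __init__(self, human_reviewers):
        self.human_reviewers = set(human_reviewers)
        self.proposals = {}

    def propose(self, annotation_id, proposed_by):
        proposal = Proposal(annotation_id, proposed_by)
        self.proposals[annotation_id] = proposal
        return proposal

    def review(self, annotation_id, reviewer, accept):
        if reviewer not in self.human_reviewers:
            raise PermissionError(f"{reviewer} is not a human reviewer")
        proposal = self.proposals[annotation_id]
        if proposal.status is not Status.PROPOSED:
            raise ValueError("proposal has already been reviewed")
        proposal.status = Status.ACCEPTED if accept else Status.REJECTED
        proposal.reviewed_by = reviewer
        return proposal

    def accepted_ids(self):
        """Only accepted annotations would become part of the shared graph."""
        return [
            p.annotation_id
            for p in self.proposals.values()
            if p.status is Status.ACCEPTED
        ]


queue = ReviewQueue(human_reviewers={"alice"})
queue.propose("a7", proposed_by="agent-1")
queue.propose("a8", proposed_by="agent-1")
queue.review("a7", reviewer="alice", accept=True)
queue.review("a8", reviewer="alice", accept=False)
accepted = queue.accepted_ids()
```

In this sketch, an agent calling `review` on its own proposal raises `PermissionError`, so nothing enters the accepted set without a human decision.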

Key highlights

Version-controlled corpuses with full history and fork support — “git for the citation graph”
PDF annotation with precise text-to-coordinate mapping via PAWLS, including multi-page spans
Threaded discussions, @mentions, and voting at corpus, document, and global levels
Vector + full-text search across documents and annotations
Docker Compose setup for local development; production deployment documented
MIT licensed, with a JSON-driven content pack system so deployers can retarget messaging without forking
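
The text-to-coordinate mapping in the PDF annotation highlight can be pictured as follows. This sketch is loosely modeled on per-page token boxes of the kind PAWLS produces, but the `Token` and `Box` field names, the `span_boxes` helper, and the sample coordinates are assumptions for illustration, not the project's or PAWLS's actual schema. It shows how a span that crosses a page break resolves to one bounding box per page.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """One word on a page with its position (hypothetical field names)."""
    text: str
    x: float
    y: float
    width: float
    height: float


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float


def span_boxes(pages, token_refs):
    """Return one bounding box per page for a span of (page, token) index pairs."""
    per_page = defaultdict(list)
    for page_index, token_index in token_refs:
        per_page[page_index].append(pages[page_index][token_index])
    boxes = {}
    for page_index, tokens in per_page.items():
        # The box is the smallest rectangle covering every selected token.
        boxes[page_index] = Box(
            left=min(t.x for t in tokens),
            top=min(t.y for t in tokens),
            right=max(t.x + t.width for t in tokens),
            bottom=max(t.y + t.height for t in tokens),
        )
    return boxes


# Two pages; the span starts on the last token of page 0 and ends on page 1.
pages = [
    [Token("See", 72.0, 700.0, 20.0, 12.0), Token("Section", 96.0, 700.0, 40.0, 12.0)],
    [Token("4.2", 72.0, 50.0, 18.0, 12.0), Token("above", 94.0, 50.0, 30.0, 12.0)],
]
span = [(0, 1), (1, 0)]
boxes = span_boxes(pages, span)
span_text = " ".join(pages[p][t].text for p, t in span)
```

Storing token references rather than raw pixel rectangles keeps the annotation tied to the text itself, so the highlighted region and the quoted string can both be derived from the same span.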

Caveats

The repository still carries the OpenContracts name through v3 to avoid breaking existing forks and CI; the rebrand to cite is cosmetic, not a rewrite
The README is long on vision and short on architectural specifics: it is unclear how the MCP endpoint handles auth, or how annotation schemas are defined in practice
No benchmark numbers or performance claims are made

Verdict

Worth a look if you’re building research infrastructure, legal tech, or any system where documents reference each other and you need both human curation and agent consumption. Skip it if you just need a quick RAG pipeline over unstructured PDFs — the annotation overhead is the point, not a bug, but it’s overhead nonetheless.