datachain-ai/datachain
A Python library that turns unstructured files in S3, GCS, and Azure into versioned, typed datasets, with optional LLM-enriched knowledge bases and MCP server integrations for AI coding agents.

DataChain provides a data context layer for unstructured data. It runs parallel and distributed Python processing over files, with Pydantic-based schema versioning and lineage tracking. Its Knowledge Base feature derives markdown summaries from datasets and enriches them via LLMs. Its Agent Harness exposes skills and an MCP server that plug into Claude Code, Cursor, Codex, GitHub Copilot, and other AI coding assistants, so that they understand your data.
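
To make the idea of a "versioned, typed dataset" over files concrete, here is a minimal sketch that uses only the Python standard library. It does not use DataChain's API: `FileRecord`, `describe`, `build_dataset`, and `dataset_version` are hypothetical names invented for illustration, a frozen dataclass stands in for a Pydantic model, and a local directory stands in for a cloud bucket.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import asdict, dataclass
import hashlib
import json
from pathlib import Path


@dataclass(frozen=True)
class FileRecord:
    """Typed row describing one file (hypothetical schema, not DataChain's)."""
    path: str    # path relative to the dataset root, POSIX style
    size: int    # size in bytes
    sha256: str  # content hash of the file


def describe(root: Path, path: Path) -> FileRecord:
    """Read one file and turn it into a typed record."""
    data = path.read_bytes()
    return FileRecord(
        path=path.relative_to(root).as_posix(),
        size=len(data),
        sha256=hashlib.sha256(data).hexdigest(),
    )


def build_dataset(root: Path, workers: int = 4) -> list[FileRecord]:
    """Describe every file under root in parallel, in a stable order."""
    files = sorted(p for p in root.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so the result is deterministic.
        return list(pool.map(lambda p: describe(root, p), files))


def dataset_version(records: list[FileRecord]) -> str:
    """Derive a short version id from the records' content.

    Assumption for this sketch: a version is a hash of the rows, so the id
    changes whenever any file is added, removed, or modified.
    """
    payload = json.dumps([asdict(r) for r in records], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]
```

Calling `dataset_version(build_dataset(Path("some/dir")))` twice on an unchanged directory returns the same id, and editing any file changes it. A real system would also persist the records and the parent version to track lineage, which this sketch leaves out.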