Turn your PDFs into LLM training data without writing a scraper
A desktop app that ingests documents, auto-generates Q&A pairs, and exports fine-tuning datasets in standard formats.

What it does Easy Dataset is a desktop and web application that converts unstructured documents—PDFs, Word files, Markdown, EPUB, images—into structured datasets for fine-tuning LLMs, building RAG systems, or evaluating models. You upload files, it splits and cleans the text, generates questions and answers via LLM APIs, and exports to formats like Alpaca, ShareGPT, and JSONL. It also runs evaluation workflows including blind model comparisons and automated judging.
The interesting bit The project treats dataset creation as a pipeline rather than a script. It bundles document parsing, chunking strategies (code-aware, recursive, fixed-length), domain-specific label trees, and even a “data distillation” mode that generates questions from topics without source documents. The built-in evaluation system—complete with arena-style blind testing and LLaMA Factory integration—suggests the authors actually ship models, not just notebooks.
Key highlights
- Supports OpenAI-format APIs plus Ollama, Zhipu, MiniMax, OpenRouter, and vision models (Gemini, Claude) for PDF parsing and image Q&A
- Exports to Alpaca, ShareGPT, Multilingual-Thinking, and JSON/JSONL; one-click LLaMA Factory config generation; direct Hugging Face upload
- Desktop clients for Windows, macOS (Intel/Apple Silicon), and Linux AppImage, plus Docker and raw npm install
- AGPL 3.0 licensed with an associated arXiv paper (2507.04009)
- Multi-language UI: Chinese, English, Turkish, Portuguese
Caveats
- The “intelligent” features (splitting, cleaning, question generation) are LLM-powered and thus API-cost sensitive; token consumption is tracked in a dashboard, but costs scale with document volume
- No mention of offline/fully local operation without any API keys; Ollama support exists but unclear if all features work without cloud LLMs
Verdict Worth a look if you’re regularly building domain-specific datasets and tired of stitching together Python scripts. Skip it if you need a fully automated, zero-cost pipeline or prefer code-first workflows you can version-control.