microsoft/markitdown

Microsoft's 147k-star tool that turns your document chaos into LLM food

A Python utility that converts PDFs, PowerPoints, Excel files, images, audio, and even YouTube videos into Markdown—optimized for feeding text to language models, not for pretty human reading.

147.4k stars · Python · Data Tooling · Velocity (7d): +258 ★/day · Trend: steady

What it does

MarkItDown is a Python utility that ingests almost any office document or media file and spits out Markdown. PDFs, Word docs, PowerPoints, Excel sheets, images (with OCR), audio (with transcription), HTML, ZIP archives, YouTube URLs, EPubs—it handles the lot. The output preserves structure like headings, lists, and tables, but the README is explicit: this is for LLM consumption, not high-fidelity human-readable conversion.
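
A minimal sketch of the conversion described above, assuming the `markitdown` package is installed. `MarkItDown`, `convert()`, and `text_content` come from the project's Python API; the `to_markdown` helper and its `converter` parameter are hypothetical names added here so the function can be exercised with a stand-in object.

```python
from typing import Any, Optional


def to_markdown(source: str, converter: Optional[Any] = None) -> str:
    """Convert one file (or URL) to Markdown text.

    `converter` is any object with a `convert(source)` method whose result
    exposes `text_content`. When omitted, a real MarkItDown instance is used.
    """
    if converter is None:
        # Imported lazily so this module still loads without markitdown installed.
        from markitdown import MarkItDown

        converter = MarkItDown()
    result = converter.convert(source)
    return result.text_content
```

Calling `to_markdown("report.xlsx")` (a placeholder file name) would return the Markdown as a plain string, ready to paste into a prompt or write to disk.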

The interesting bit

The project bets that Markdown is the optimal LLM ingestion format: the README argues that mainstream models like GPT-4o natively "speak" Markdown, presumably because they were trained on large amounts of it, and that Markdown conventions are highly token-efficient. It also offers a clever plugin architecture: third-party plugins like markitdown-ocr can inject LLM Vision into converters without adding heavy ML dependencies, and the Azure Content Understanding integration can extract structured YAML front matter (invoice amounts, contract clauses) alongside the Markdown body.
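
To make the token-efficiency argument concrete, here is a rough illustration that renders the same small table as HTML and as Markdown and compares their lengths. Character count is only a crude proxy for tokens (real counts depend on the model's tokenizer), and the helper names and sample data are invented for this sketch.

```python
from typing import List


def html_table(headers: List[str], rows: List[List[str]]) -> str:
    """Render a table as compact HTML with no whitespace between tags."""
    head = "".join(f"<th>{h}</th>" for h in headers)
    body = "".join(
        "<tr>" + "".join(f"<td>{cell}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def markdown_table(headers: List[str], rows: List[List[str]]) -> str:
    """Render the same table as a pipe-delimited Markdown table."""
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in rows]
    return "\n".join(lines)


HEADERS = ["Item", "Qty"]
ROWS = [["Apple", "3"], ["Pear", "5"]]

html = html_table(HEADERS, ROWS)
md = markdown_table(HEADERS, ROWS)

# Character counts stand in for token counts here; treat them as indicative only.
print(f"HTML: {len(html)} chars, Markdown: {len(md)} chars")
```

Even with the HTML stripped of all whitespace, the Markdown version is shorter because it carries no closing tags.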

Key highlights

  • Broad format coverage: PDF, DOCX, PPTX, XLSX, images, audio, video (via Azure CU), HTML, CSV, JSON, XML, ZIP, YouTube, EPub
  • Optional dependency groups so you only install what you need (e.g., pip install 'markitdown[pdf,docx]')
  • Plugin system, with the #markitdown-plugin hashtag for discovery; the OCR plugin reuses the existing llm_client/llm_model pattern
  • Azure Content Understanding integration for higher-quality cloud extraction, structured fields, and video support
  • Azure Document Intelligence as a middle-tier option for cloud-based layout analysis
  • CLI, Python API, and Docker support
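
Several of the highlights above map onto constructor options in the Python API. The sketch below assumes the `enable_plugins`, `llm_client`, `llm_model`, and `docintel_endpoint` keyword arguments shown in the project's README; `converter_kwargs` and `build_converter` are hypothetical helpers, and the Azure Content Understanding options are left out because their exact parameters are not given here.

```python
from typing import Any, Dict, Optional


def converter_kwargs(
    enable_plugins: bool = False,
    llm_client: Optional[Any] = None,
    llm_model: Optional[str] = None,
    docintel_endpoint: Optional[str] = None,
) -> Dict[str, Any]:
    """Collect only the options that were actually configured."""
    kwargs: Dict[str, Any] = {"enable_plugins": enable_plugins}
    if llm_client is not None and llm_model is not None:
        # Used for LLM image descriptions (and by plugins that follow the same pattern).
        kwargs["llm_client"] = llm_client
        kwargs["llm_model"] = llm_model
    if docintel_endpoint is not None:
        # Routes conversion through Azure Document Intelligence.
        kwargs["docintel_endpoint"] = docintel_endpoint
    return kwargs


def build_converter(**options: Any) -> Any:
    """Create a MarkItDown instance from the options above."""
    # Imported lazily so the helper above works without markitdown installed.
    from markitdown import MarkItDown

    return MarkItDown(**converter_kwargs(**options))
```

For example, `build_converter(enable_plugins=True, llm_client=client, llm_model="gpt-4o")` would enable installed plugins and pass an already-constructed client object (here assumed to be an OpenAI-style client) for image descriptions.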

Caveats

  • Built-in audio transcription is basic; video requires billable Azure Content Understanding calls
  • LLM image descriptions currently only work for PPTX and image files, not all formats
  • Security note: it performs I/O with the privileges of the current process; the README warns to sanitize inputs and to use the narrowest convert_* function in untrusted environments
  • Each Azure Content Understanding convert() call is a billable API call; costs can accumulate quickly
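
One way to act on the security note is to confine user-supplied paths to a known directory and then call `convert_local()` rather than the general-purpose `convert()`, which also accepts URLs. This is a sketch under those assumptions, not the project's own recommendation in code form; `resolve_inside` and `convert_untrusted` are hypothetical helpers.

```python
from pathlib import Path
from typing import Any, Optional


def resolve_inside(base_dir: str, user_path: str) -> Path:
    """Resolve user_path under base_dir, rejecting anything that escapes it."""
    base = Path(base_dir).resolve()
    candidate = (base / user_path).resolve()
    try:
        # relative_to raises ValueError when candidate is not under base,
        # which covers "../" traversal and absolute paths alike.
        candidate.relative_to(base)
    except ValueError:
        raise ValueError(f"path escapes {base}: {user_path!r}") from None
    return candidate


def convert_untrusted(
    base_dir: str, user_path: str, converter: Optional[Any] = None
) -> str:
    """Convert a local file only if it lives under base_dir."""
    safe_path = resolve_inside(base_dir, user_path)
    if converter is None:
        # Imported lazily so the path check is usable without markitdown installed.
        from markitdown import MarkItDown

        converter = MarkItDown(enable_plugins=False)
    # convert_local only reads local files; it does not fetch URLs.
    return converter.convert_local(str(safe_path)).text_content
```

The path check is deliberately separate from the conversion so it can be unit-tested on its own and reused in front of other file-handling code.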

Verdict

Grab this if you’re building RAG pipelines, document Q&A systems, or any workflow that needs to feed heterogeneous file formats into an LLM context window. Skip it if you need pixel-perfect document reproduction for human readers—Microsoft itself says that’s not the goal.
