Forget HTML parsers: this RAG reads screenshots
PixelRAG renders documents into screenshot tiles and retrieves them visually, preserving tables and layout that HTML parsers strip away.

What it does
PixelRAG turns web pages, PDFs, and images into screenshot tiles, then indexes and searches them with a vision-language embedding model. Instead of parsing HTML into text chunks that lose tables and charts, it keeps the visual layout intact for retrieval. A hosted API already holds an index of 8.28 million Wikipedia pages and accepts both text and image queries.
The interesting bit
The project ships a LoRA-fine-tuned Qwen3-VL-Embedding-2B model trained specifically on webpage screenshots, so the retriever actually understands page layout rather than just captioning it. There is also a Claude Code plugin that lets Claude screenshot and read pages directly, giving the agent eyes without an MCP server.
Key highlights
- Searches a live hosted index of 8.28M Wikipedia pages via an open API with no key required.
- Queries can be text or images, so you can literally search by picture.
- Pipeline stages (
pixelshot,chunk,embed,build-index,serve) run independently; mix and match what you need. - Ships as a Claude Code plugin (
pixelbrowse) that renders pages locally through Playwright/CDP. - Training adapters and the full fine-tuning dataset are published on Hugging Face.
Caveats
- Data curation reproduction steps are currently marked “TBD” in the docs.
- The fine-tuning code lives in a separate
uvproject with pinned CUDA and PyTorch dependencies, so expect environment gymnastics if you plan to retrain.
Verdict
Worth a look if you run RAG over visually complex documents and are tired of watching your parser mangle tables. Skip it if your corpus is already clean plain text and you don’t need another GPU-bound embedding model.