StarTrail-org/PixelRAG

Forget HTML parsers: this RAG reads screenshots

PixelRAG renders documents into screenshot tiles and retrieves them visually, preserving tables and layout that HTML parsers strip away.

★1.3k stars Python RAG · Search Image · Video · Audio

View on GitHub ↗ Homepage ↗

Collecting fresh signals — velocity needs a few days of history.

collecting data…

star history

What it does

PixelRAG turns web pages, PDFs, and images into screenshot tiles, then indexes and searches them with a vision-language embedding model. Instead of parsing HTML into text chunks that lose tables and charts, it keeps the visual layout intact for retrieval. A hosted API already holds an index of 8.28 million Wikipedia pages and accepts both text and image queries.

The interesting bit

The project ships a LoRA-fine-tuned Qwen3-VL-Embedding-2B model trained specifically on webpage screenshots, so the retriever actually understands page layout rather than just captioning it. There is also a Claude Code plugin that lets Claude screenshot and read pages directly, giving the agent eyes without an MCP server.

Key highlights

Searches a live hosted index of 8.28M Wikipedia pages via an open API with no key required.
Queries can be text or images, so you can literally search by picture.
Pipeline stages (pixelshot, chunk, embed, build-index, serve) run independently; mix and match what you need.
Ships as a Claude Code plugin (pixelbrowse) that renders pages locally through Playwright/CDP.
Training adapters and the full fine-tuning dataset are published on Hugging Face.

Caveats

Data curation reproduction steps are currently marked “TBD” in the docs.
The fine-tuning code lives in a separate uv project with pinned CUDA and PyTorch dependencies, so expect environment gymnastics if you plan to retrain.

Verdict

Worth a look if you run RAG over visually complex documents and are tired of watching your parser mangle tables. Skip it if your corpus is already clean plain text and you don’t need another GPU-bound embedding model.