Shrink your AI agent's reading list by 60–95% without changing the answers
A local compression layer that sits between your agent and the LLM, cutting token counts by routing tool outputs, logs, and code through type-specific compressors.

What it does
Headroom intercepts everything an AI agent consumes — tool outputs, logs, RAG chunks, files, conversation history — and compresses it before the LLM sees it. It runs as a Python/TypeScript library, a zero-code proxy, or an MCP server. The project claims 60–95% token reduction with preserved accuracy, backed by benchmark tables for GSM8K, TruthfulQA, SQuAD v2, and BFCL.
The interesting bit
The architecture treats compression as a routing problem. A ContentRouter detects whether incoming data is JSON, code (AST), or prose, then dispatches to SmartCrusher, CodeCompressor, or a HuggingFace model called Kompress-base. A reversible layer (CCR) keeps originals locally so the LLM can retrieve them on demand — useful when the compressed version isn’t enough.
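To make the routing idea concrete, here is a minimal sketch of type detection, dispatch, and a local store for originals. It is an illustration under stated assumptions, not Headroom's code: every function name below (detect_kind, crush_json, compress_code, compress_prose, route, retrieve) is hypothetical, the prose step is a plain word truncation standing in for the Kompress-base model, and the code path handles only Python source.

```python
import ast
import hashlib
import json

# Hypothetical stand-ins; none of these names come from Headroom itself.
_originals: dict[str, str] = {}  # local store that makes compression reversible


def detect_kind(text: str) -> str:
    """Classify input as 'json', 'code', or 'prose' using cheap parse attempts."""
    stripped = text.strip()
    if stripped[:1] in ("{", "["):
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass
    try:
        tree = ast.parse(text)
    except (SyntaxError, ValueError):
        return "prose"
    # Treat it as code only if it defines a function or class at top level.
    defs = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return "code" if any(isinstance(n, defs) for n in tree.body) else "prose"


def crush_json(text: str, keep: int = 2) -> str:
    """Keep the first `keep` items of every list and note how many were dropped."""
    def shrink(value):
        if isinstance(value, list):
            kept = [shrink(v) for v in value[:keep]]
            if len(value) > keep:
                kept.append(f"... {len(value) - keep} more items")
            return kept
        if isinstance(value, dict):
            return {k: shrink(v) for k, v in value.items()}
        return value
    return json.dumps(shrink(json.loads(text)), separators=(",", ":"))


def compress_code(text: str) -> str:
    """Keep the first signature line of each top-level function or class."""
    lines = text.splitlines()
    defs = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
    return "\n".join(
        lines[node.lineno - 1].rstrip() + " ..."
        for node in ast.parse(text).body
        if isinstance(node, defs)
    )


def compress_prose(text: str, max_words: int = 40) -> str:
    """Placeholder for a learned text compressor: truncate to max_words words."""
    words = text.split()
    suffix = " ..." if len(words) > max_words else ""
    return " ".join(words[:max_words]) + suffix


COMPRESSORS = {"json": crush_json, "code": compress_code, "prose": compress_prose}


def route(text: str) -> tuple[str, str]:
    """Return (compressed_text, key); the key retrieves the original later."""
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
    _originals[key] = text
    return COMPRESSORS[detect_kind(text)](text), key


def retrieve(key: str) -> str:
    """Return the uncompressed original stored under key."""
    return _originals[key]
```

In this sketch the agent would send the compressed text plus the key to the LLM, and call retrieve(key) only if the model asks for the full original.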
Key highlights
- Six compression algorithms: JSON crusher, AST-aware code compressor, text model, image router, cache aligner, and reversible CCR storage
- headroom wrap claude|codex|cursor|aider|copilot: one-command agent integration
- Cross-agent memory with auto-dedup, so Claude and Codex can share context
- headroom learn mines failed sessions and writes corrections to CLAUDE.md/AGENTS.md
- Local-first: runs on your machine, data doesn’t leave
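As a rough illustration of what cross-agent memory with auto-dedup could look like, the sketch below fingerprints each note after normalising case and whitespace, so a second agent writing an equivalent note does not create a duplicate. The SharedMemory class and its methods are hypothetical and assume hash-based deduplication; the source does not say how Headroom detects duplicates.

```python
import hashlib


class SharedMemory:
    """Hypothetical store that several agents write to; duplicates are dropped."""

    def __init__(self) -> None:
        self._entries: dict[str, dict] = {}

    @staticmethod
    def _fingerprint(text: str) -> str:
        # Normalise case and whitespace so trivially different copies collide.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def remember(self, agent: str, text: str) -> bool:
        """Store a note; return False if an equivalent note already exists."""
        fp = self._fingerprint(text)
        if fp in self._entries:
            # Record that another agent contributed the same note.
            self._entries[fp]["agents"].add(agent)
            return False
        self._entries[fp] = {"text": text, "agents": {agent}}
        return True

    def recall(self) -> list[str]:
        """Return every stored note once, in insertion order."""
        return [entry["text"] for entry in self._entries.values()]
```

Exact-match hashing only catches near-verbatim repeats; catching paraphrases would need a similarity measure, which this sketch leaves out.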
Caveats
- Requires Python 3.10+ and a local process — won’t work in fully sandboxed environments
- The “60B+ tokens saved” community metric is displayed prominently but lacks methodology detail in the README
- Some integrations (Cursor, Copilot CLI) are not fully automatic: they need a manual config paste or a proxy startup
Verdict
Worth a look if you’re burning through context windows with coding agents or multi-step RAG pipelines. Skip if you’re already happy with a single provider’s native compaction and don’t need cross-agent memory or reversible retrieval.