JSON for LLMs that actually fits in the context window
A new serialization format that trades braces for whitespace and turns uniform arrays into schema-aware tables, cutting token counts by ~40% without losing the JSON data model.

What it does
TOON is a lossless encoding of the JSON data model optimized for LLM prompts. It keeps objects, arrays, and primitives intact but replaces JSON’s brace-heavy syntax with YAML-like indentation for nested structures and CSV-style tables for uniform arrays of objects. You use JSON in your code; you ship TOON to the model.
The interesting bit
The format embeds guardrails directly in the syntax: array headers declare exact lengths with [N] and field names with {fields}, giving models an explicit schema to validate against. The benchmarks are refreshingly honest — they show TOON losing to XML on Gemini and note that deeply nested data is often still better served by compact JSON.
Key highlights
- ~40% fewer tokens than standard JSON in mixed-structure benchmarks across 4 models (2,759 vs 4,587 tokens)
- 76.4% retrieval accuracy vs JSON’s 75.0% on 209 test questions — slightly better comprehension with significantly less verbosity
- Deterministic round-trips: encode JSON → TOON → JSON without data loss
- Multi-language ecosystem with spec-driven implementations (TypeScript, Python, Go, Rust, .NET)
- Provisional
text/toonmedia type and.toonfile extension
Caveats
- Deeply nested or non-uniform structures see diminishing returns; JSON-compact can be more token-efficient
- Some local/quantized models (e.g., Ollama) may process compact JSON faster despite TOON’s lower token count — the README explicitly advises benchmarking TTFT and tokens/sec for your specific deployment
- Format is stable but explicitly labeled “an idea in progress”; spec is open to revision
Verdict
Worth a look if you’re routinely stuffing large tabular datasets into prompts and watching your token budget evaporate. Skip it if your data is deeply nested, your pipeline is JSON-locked with no translation layer, or you’re running latency-sensitive local inference where token count doesn’t map cleanly to wall-clock time.