gkamradt/needle-in-a-haystack
Benchmarks LLMs on their ability to retrieve a hidden 'needle' fact from long 'haystack' contexts at varying positions.

Needle-in-a-Haystack is an evaluation framework for pressure-testing LLM long-context retrieval. It runs controlled experiments sweeping context length against needle depth, scores each model response, and writes structured results to JSONL. Built-in tasks include single-fact lookup, multi-fact recall, and UUID-chain hops for multi-step reasoning. It supports OpenAI, Anthropic, and Cohere providers out of the box, with a plugin system for adding more.
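
The sweep described above can be sketched in a few lines of Python. This is an illustrative sketch, not the framework's actual API: every name here (`insert_needle`, `build_haystack`, `score`, `run_sweep`, `fake_model`) is hypothetical, lengths are measured in characters rather than tokens for simplicity, and the scorer is a plain substring check standing in for whatever scoring the framework really uses.

```python
import io
import json
from itertools import product


def build_haystack(filler: str, length: int) -> str:
    """Repeat filler text until it covers `length` characters, then truncate."""
    reps = length // len(filler) + 1
    return (filler * reps)[:length]


def insert_needle(haystack: str, needle: str, depth: float) -> str:
    """Insert the needle at a fractional depth (0.0 = start, 1.0 = end)."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    cut = int(len(haystack) * depth)
    return haystack[:cut] + needle + haystack[cut:]


def score(response: str, expected: str) -> float:
    """Toy scorer (assumption): 1.0 if the expected answer appears, else 0.0."""
    return 1.0 if expected.lower() in response.lower() else 0.0


def run_sweep(ask_model, lengths, depths, needle, expected, filler, out):
    """Run every (context length, depth) pair and write one JSON object per line."""
    for length, depth in product(lengths, depths):
        context = insert_needle(build_haystack(filler, length), needle, depth)
        response = ask_model(context)
        record = {
            # Haystack size before the needle is inserted, in characters.
            "context_length": length,
            "depth": depth,
            "score": score(response, expected),
        }
        out.write(json.dumps(record) + "\n")


def fake_model(context: str) -> str:
    """Stand-in for a real provider call; 'finds' the needle by string search."""
    marker = "The magic number is 7421."
    return marker if marker in context else "I don't know."


if __name__ == "__main__":
    buffer = io.StringIO()
    run_sweep(
        fake_model,
        lengths=[100, 200],
        depths=[0.0, 0.5, 1.0],
        needle=" The magic number is 7421. ",
        expected="7421",
        filler="The quick brown fox jumps over the lazy dog. ",
        out=buffer,
    )
    print(buffer.getvalue(), end="")
```

Swapping `fake_model` for a function that calls a real provider, and the `StringIO` buffer for an open file, would yield a JSONL file with one row per cell of the length-by-depth grid.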