
gkamradt/needle-in-a-haystack

Benchmarks LLMs on their ability to retrieve a hidden 'needle' fact from long 'haystack' contexts at varying positions.

2.3k stars · Jupyter Notebook · LLMOps · Eval
Velocity (7d): +2.5 ★ / day · Trend: steady

Needle-in-a-Haystack is an evaluation framework for pressure-testing LLM long-context retrieval. It runs controlled experiments that sweep context length against needle depth, scores each model response, and writes structured results to JSONL. Built-in tasks include single-fact lookup, multi-fact recall, and UUID-chain hops for multi-step reasoning. It supports OpenAI, Anthropic, and Cohere providers out of the box and has a plugin system for adding more.
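To make the sweep concrete, here is a minimal Python sketch of the general idea: place a needle at a given depth in filler text, ask a model for it, score the answer, and emit one JSONL row per (context length, depth) pair. It is not the repository's actual API. Every name here (`build_context`, `stub_model`, `score`, `run_sweep`), the filler and needle strings, the substring-match scoring, and the JSONL field names are assumptions made for illustration, and the "model" is a local stub rather than a real provider call.

```python
import json
from itertools import product

# Illustrative data only; not taken from the repository.
FILLER = "The quick brown fox jumps over the lazy dog. "
NEEDLE = "The secret passphrase is 'blue-harbor-42'. "
EXPECTED = "blue-harbor-42"


def build_context(length_chars, depth_pct):
    """Return filler text of length_chars with NEEDLE inserted at depth_pct.

    depth_pct is 0 for the very start of the context and 100 for the very end.
    Length is measured in characters here; a real harness would likely count tokens.
    """
    body_len = max(length_chars - len(NEEDLE), 0)
    repeats = body_len // len(FILLER) + 1
    body = (FILLER * repeats)[:body_len]
    cut = int(body_len * depth_pct / 100)
    return body[:cut] + NEEDLE + body[cut:]


def stub_model(context, question):
    """Hypothetical stand-in for a provider call: it just searches the context."""
    return EXPECTED if EXPECTED in context else "I don't know."


def score(response):
    """Assumed scoring rule: 1.0 if the expected fact appears in the response."""
    return 1.0 if EXPECTED in response else 0.0


def run_sweep(lengths, depths, model=stub_model, out=None):
    """Evaluate every (length, depth) pair; optionally write JSONL rows to out."""
    rows = []
    for length, depth in product(lengths, depths):
        context = build_context(length, depth)
        response = model(context, "What is the secret passphrase?")
        row = {
            "context_length": length,
            "depth_pct": depth,
            "score": score(response),
        }
        rows.append(row)
        if out is not None:
            out.write(json.dumps(row) + "\n")
    return rows
```

Calling `run_sweep([1000, 2000, 4000], [0, 25, 50, 75, 100], out=sys.stdout)` (after `import sys`) would print one JSON object per line. Swapping `stub_model` for a function that calls a real provider is the only change needed to turn the sketch into an actual measurement, and the resulting rows can be pivoted into a length-by-depth grid.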
