openai/human-eval

Benchmark harness for evaluating the functional correctness of code completions generated by large language models.

3.3k stars · Python · LLMOps · Eval · Language Models
Velocity (7d): +1.8 ★ / day · Trend: steady

This repository contains the evaluation harness for the HumanEval dataset, a benchmark for assessing code-generating language models. It provides tooling to run model-generated Python code in a sandboxed environment and to measure functional correctness against hand-written test cases. The dataset and methodology were introduced in the paper ‘Evaluating Large Language Models Trained on Code’ and are widely used to compare code-assistance models such as Codex.
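
To make the idea of "functional correctness against test cases" concrete, here is a minimal sketch of how such a check could work. It is not the repository's actual implementation: the function name `passes_tests` is hypothetical, and the problem layout (a prompt, a completion, test code defining `check(candidate)`, and an entry-point name) is an assumption made for illustration. It also runs the code in a plain subprocess with a timeout, which isolates crashes and hangs but is not a real security sandbox.

```python
import subprocess
import sys


def passes_tests(prompt: str, completion: str, test: str, entry_point: str,
                 timeout: float = 5.0) -> bool:
    """Return True if prompt + completion passes the given test code.

    Hypothetical helper for illustration only. A subprocess with a timeout
    guards against crashes and infinite loops, but it does NOT protect
    against malicious code; a real harness needs stronger isolation.
    """
    # Assemble one program: the function under test, the test code,
    # and a call that runs the tests against the generated function.
    program = f"{prompt}{completion}\n\n{test}\n\ncheck({entry_point})\n"
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        # Treat a hang as a failure.
        return False
    # A failed assertion or any other exception gives a non-zero exit code.
    return result.returncode == 0


# Toy problem (made up for this example, not taken from the dataset).
prompt = "def add(a, b):\n"
test = "def check(candidate):\n    assert candidate(2, 3) == 5\n"

good = passes_tests(prompt, "    return a + b\n", test, "add")
bad = passes_tests(prompt, "    return a - b\n", test, "add")
```

In this toy case the first completion should be judged correct and the second incorrect. The point of the design is that a completion counts as correct only if the tests actually pass when it is executed, rather than because its text resembles a reference solution.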
