GitHub's answer to 'find me code that does X'
A benchmark dataset and baseline models for teaching machines to search code by natural language intent, not just keyword matching.

What it does
CodeSearchNet is a research corpus and benchmark for semantic code search: matching a human query like “extract video ID from URL” to the actual function that does it. It ships 2 million (docstring, code) pairs across six languages, baseline neural models, and a leaderboard hosted on Weights & Biases. The challenge itself has concluded, but the data and evaluation tools remain open.
The interesting bit
The dataset treats docstrings as search queries and entire functions as documents to retrieve, which neatly sidesteps the chicken-and-egg problem of building code search without labeled query-code pairs. GitHub and Microsoft Research Cambridge built this together, which explains the unusually clean data pipeline for academic ML infrastructure.
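To make the docstring-as-query idea concrete, here is a minimal sketch that mines (query, code) pairs from Python source. It is an illustration only: it uses the standard library's ast module rather than the project's own extraction pipeline, and the function name extract_pairs, the sample source, and the choice to keep only the first docstring line are all assumptions made for this example.

```python
import ast


def extract_pairs(source: str):
    """Yield (query, code) pairs for every documented function in `source`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        doc = ast.get_docstring(node)
        if not doc:
            continue  # undocumented functions produce no pair
        # Assumption: the first docstring line stands in for a search query.
        query = doc.strip().splitlines()[0]
        # The full function text is the "document" to retrieve (Python 3.8+).
        code = ast.get_source_segment(source, node)
        yield query, code


# Hypothetical sample input: one documented and one undocumented function.
SAMPLE = '''
def video_id(url):
    """Extract video ID from URL."""
    return url.rsplit("=", 1)[-1]

def helper(x):
    return x + 1
'''

pairs = list(extract_pairs(SAMPLE))
for query, code in pairs:
    print(repr(query), "->", code.splitlines()[0])
```

Only the documented function yields a pair, which is why this approach needs no hand-labeled queries: any function with a docstring becomes a training example for free.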
Key highlights
- 2 million pairs from Python, JavaScript, Ruby, Go, Java, and PHP
- Data split by repository, not by file, so no test-set leakage through sibling functions
- Evaluation uses NDCG (normalized discounted cumulative gain) against 99 manually annotated general queries
- Pre-built Docker environment with CUDA 9.0+ GPU support; ~3.5 GB download
- Baseline models and pre-trained weights included, with W&B integration for experiment tracking
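Two of the highlights above, repository-level splitting and NDCG scoring, can be sketched in a few lines. This is a hedged illustration, not the benchmark's own code: partition_for, its hash-bucket scheme, and the 10% validation and test shares are hypothetical, and the NDCG below uses the common exponential-gain formulation, which may differ in detail from the official scoring script.

```python
import hashlib
import math


def partition_for(repo: str, valid_pct: int = 10, test_pct: int = 10) -> str:
    """Assign a whole repository to one partition using a stable hash.

    Every function from the same repository lands in the same partition,
    so sibling functions cannot leak from train into test.
    """
    digest = hashlib.sha256(repo.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"


def dcg(relevances):
    """Discounted cumulative gain with exponential gain (an assumed variant)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)
        for rank, rel in enumerate(relevances)
    )


def ndcg(ranked_relevances):
    """Normalize DCG by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades for one query, in the order a model ranked them.
print(partition_for("example-org/example-repo"))
print(ndcg([3, 2, 1, 0]))  # already in ideal order
print(ndcg([0, 3]))        # the relevant result is ranked second
```

Hashing the repository name rather than the file path is what keeps the split leak-free: the partition is a function of the repository alone, so no file-level shuffle can separate sibling functions.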
Caveats
- The challenge is closed; no new leaderboard submissions accepted
- Setup requires Nvidia-Docker and a GPU, so developers without access to a GPU machine may hit friction
- The README notes a TODO on the sha field, and some fields are explicitly unused, suggesting the schema carries a bit of historical baggage
Verdict
Grab this if you’re doing research on code representation learning or building semantic search for an internal codebase. Skip it if you want a production drop-in search engine; this is a benchmark and starting point, not a finished product.