github/CodeSearchNet

GitHub's answer to 'find me code that does X'

A benchmark dataset and baseline models for teaching machines to search code by natural language intent, not just keyword matching.

2.4k stars · Jupyter Notebook · Coding Assistants · Data Tooling
Velocity · 7d: +0.9 ★ / day
Trend: steady

What it does CodeSearchNet is a research corpus and benchmark for semantic code search: matching a human query like “extract video ID from URL” to the actual function that does it. It ships 2 million (docstring, code) pairs across six languages, baseline neural models, and a leaderboard hosted on Weights & Biases. The challenge itself has concluded, but the data and evaluation tools remain open.

The interesting bit The dataset treats docstrings as search queries and entire functions as the documents to retrieve, which neatly sidesteps the chicken-and-egg problem of building code search without labeled query-code pairs. GitHub and Microsoft Research Cambridge built this together, which may explain why the data pipeline is unusually clean for academic ML infrastructure.
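The docstring-as-query idea can be sketched in a few lines of Python. This is a minimal illustration, not the project's own preprocessing: the field names `docstring` and `code`, the sample record, and the choice to keep only the first docstring paragraph are all assumptions made for the example.

```python
import json


def to_query_document_pair(record: dict) -> tuple[str, str]:
    """Turn one corpus record into a (query, document) pair.

    The docstring stands in for a natural-language query and the
    function body is the document a search model should retrieve.
    """
    # Assumption: keep only the first paragraph of the docstring,
    # since it tends to read most like a short search query.
    summary = record["docstring"].strip().split("\n\n")[0]
    # Collapse internal whitespace so the query is a single line.
    query = " ".join(summary.split())
    return query, record["code"]


# Hypothetical record, serialized the way one line of a JSONL file might look.
SAMPLE_JSONL_LINE = json.dumps({
    "language": "python",
    "docstring": "Extract video ID from URL.\n\nHandles long and short links.",
    "code": "def get_vid_from_url(url):\n    return url.rsplit('=', 1)[-1]\n",
})

record = json.loads(SAMPLE_JSONL_LINE)
query, document = to_query_document_pair(record)
print(query)
print(document)
```

Because every documented function yields a pair for free, no human has to label training data; the manually annotated queries are only needed for evaluation.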

Key highlights

  • 2 million pairs from Python, JavaScript, Ruby, Go, Java, and PHP
  • Data split by repository, not by file, so no test-set leakage through sibling functions
  • Evaluation uses NDCG against 99 manually annotated general queries
  • Pre-built Docker environment with GPU support (CUDA 9.0 or later); the data download is about 3.5 GB
  • Baseline models and pre-trained weights included, with W&B integration for experiment tracking
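Two of the highlights above, repository-level splitting and NDCG scoring, can be sketched briefly in Python. Treat this as a rough illustration under stated assumptions rather than the benchmark's own code: the hash-based bucketing, the 80/10/10 proportions, the function names, and the exponential-gain NDCG variant are all choices made for this example, and the official evaluation script may compute things differently.

```python
import hashlib
import math


def partition_for_repo(repo: str, valid_pct: int = 10, test_pct: int = 10) -> str:
    """Assign a whole repository to one partition.

    Hashing the repository name (an assumed approach) guarantees that
    every function from the same repository lands in the same split,
    so sibling functions cannot leak from train into test.
    """
    bucket = int(hashlib.sha256(repo.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + valid_pct:
        return "valid"
    return "train"


def dcg(relevances: list[float]) -> float:
    """Discounted cumulative gain with exponential gain (2^rel - 1)."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based
        for rank, rel in enumerate(relevances)
    )


def ndcg(ranked_relevances: list[float]) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ranking."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Hypothetical graded annotations for one query, in the order a model ranked
# the results (higher means more relevant; the scale is assumed).
model_ranking = [1, 3, 0, 2]
print(f"NDCG for this ranking: {ndcg(model_ranking):.3f}")
print(partition_for_repo("example-org/example-repo"))
```

A ranking that places the most relevant functions first scores 1.0; pushing a highly relevant result down the list lowers the score, with the penalty shrinking logarithmically by position.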

Caveats

  • The challenge is closed; no new leaderboard submissions accepted
  • Setup requires Nvidia-Docker and a GPU, so developers without access to Nvidia hardware may hit friction
  • The README notes a TODO on the sha field and marks some fields as unused, suggesting the schema carries a bit of historical baggage

Verdict Grab this if you’re doing research on code representation learning or building semantic search for an internal codebase. Skip it if you want a production drop-in search engine; this is a benchmark and starting point, not a finished product.
