Is CodeSearchNet open source?

Yes — github/CodeSearchNet is open source, released under the MIT license.

What language is CodeSearchNet written in?

GitHub's language statistics list Jupyter Notebook as the primary language for github/CodeSearchNet; the baseline model code itself is Python.

How popular is CodeSearchNet?

github/CodeSearchNet has 2.4k stars on GitHub.

Where can I find CodeSearchNet?

github/CodeSearchNet is on GitHub at https://github.com/github/CodeSearchNet.

github/CodeSearchNet

GitHub's answer to 'find me code that does X'

A benchmark dataset and baseline models for teaching machines to search code by natural language intent, not just keyword matching.

★ 2.4k stars · Jupyter Notebook · Coding Assistants · Data Tooling

What it does

CodeSearchNet is a research corpus and benchmark for semantic code search: matching a human query like “extract video ID from URL” to the actual function that does it. It ships 2 million (docstring, code) pairs across six languages, baseline neural models, and a leaderboard hosted on Weights & Biases. The challenge itself has concluded, but the data and evaluation tools remain open.

The interesting bit

The dataset treats docstrings as search queries and entire functions as documents to retrieve, which neatly sidesteps the chicken-and-egg problem of building code search without labeled query-code pairs. GitHub and Microsoft Research Cambridge built this together, which may help explain the unusually clean data pipeline for academic ML infrastructure.
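The docstring-as-query idea can be illustrated with a minimal sketch. It is not the project's own extraction pipeline: the function name `docstring_code_pairs` and the `SAMPLE` source are hypothetical, and the sketch handles Python only, using the standard `ast` module. It takes the first line of each docstring as the query and the function's source text as the document to retrieve.

```python
import ast


def docstring_code_pairs(source: str):
    """Yield (query, document) pairs for every documented function in `source`.

    Hypothetical helper for illustration; not part of CodeSearchNet.
    """
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node)
            if not doc:
                continue  # undocumented functions provide no query
            query = doc.strip().splitlines()[0]  # first docstring line as the query
            code = ast.get_source_segment(source, node)  # full function source
            yield query, code


# Hypothetical sample module: one documented function, one undocumented.
SAMPLE = '''
def extract_video_id(url):
    """Extract video ID from URL."""
    return url.rsplit("=", 1)[-1]

def helper(x):
    return x + 1
'''

pairs = list(docstring_code_pairs(SAMPLE))
for query, code in pairs:
    print(repr(query))
    print(code)
```

Only `extract_video_id` produces a pair, since `helper` has no docstring. Note that this toy version leaves the docstring inside the retrieved text; a real pipeline would presumably separate the two so a model cannot match a query against a copy of itself.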

Key highlights

2 million pairs from Python, JavaScript, Ruby, Go, Java, and PHP
Data split by repository, not by file, so no test-set leakage through sibling functions
Evaluation uses NDCG against 99 manually annotated general queries
Pre-built Docker environment with CUDA 9.0+ GPU support; ~3.5 GB download
Baseline models and pre-trained weights included, with W&B integration for experiment tracking
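For readers unfamiliar with the NDCG metric mentioned above, here is a minimal sketch of one common formulation (exponential gain, logarithmic rank discount). The exact variant and relevance scale used by the benchmark's evaluation scripts are not stated here, so the 0 to 3 grades below are assumptions for illustration, and the function names `dcg` and `ndcg` are hypothetical.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain for relevance grades listed in ranked order."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 1)
        for rank, rel in enumerate(relevances, start=1)
    )


def ndcg(ranked_relevances):
    """DCG normalised by the DCG of the best possible ordering of the same grades."""
    ideal = dcg(sorted(ranked_relevances, reverse=True))
    return dcg(ranked_relevances) / ideal if ideal > 0 else 0.0


# Assumed relevance grades (0 = irrelevant, 3 = exact match) for the
# results a model returned for one query, in the order it ranked them.
perfect = ndcg([3, 2, 1, 0])   # already in the ideal order
swapped = ndcg([0, 3])         # best result ranked second
print(perfect, swapped)
```

A score of 1.0 means the model ranked results in the best possible order for that query; pushing a highly relevant result further down the list lowers the score. Averaging the per-query score over all annotated queries gives a single benchmark number.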

Caveats

The challenge is closed; no new leaderboard submissions accepted
Setup requires Nvidia-Docker and a GPU, so developers without access to a machine with an Nvidia GPU may hit friction
The README notes a TODO on the sha field and some fields are explicitly unused, suggesting the schema carries a bit of historical baggage

Verdict

Grab this if you’re doing research on code representation learning or building semantic search for an internal codebase. Skip it if you want a production drop-in search engine; this is a benchmark and starting point, not a finished product.
