Open-source Copilot clone admits: most of our models score zero
A community effort to replicate GitHub Copilot that publishes its training recipes, its failures, and its honest confusion about which model to use.

What it does
GPT-Code-Clippy fine-tunes GPT-2 and GPT-Neo on scraped GitHub code to generate code completions. It ships a VS Code extension, a Hugging Face demo, and a 159GB deduplicated training dataset built from SEART GitHub Search plus The Pile. The project is explicitly framed as an open-source answer to GitHub Copilot.
The interesting bit
The README’s candor is the feature. The authors publish HumanEval results in which several of their fine-tuned models score 0.00% at every reported k from pass@1 through pass@10, note that “None improve on the standard GPT-Neo 125M model except for APPs specific models,” and leave TODOs asking which model is recommended and how to train properly. This is less a product than a public lab notebook.
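For context on what a 0.00% row means: HumanEval scores are commonly reported with the unbiased pass@k estimator introduced alongside the benchmark, where n samples are drawn per problem and c of them pass the unit tests. The sketch below assumes that estimator (the project's own evaluation script may differ), and the sample counts are hypothetical. It illustrates that a problem contributes zero at every k only when none of its samples pass.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: 1 - C(n - c, k) / C(n, k).

    n: samples generated, c: samples that passed the tests, k: budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any k-subset contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts: four problems, ten samples each.
no_passes = [(10, 0)] * 4               # nothing ever passes
one_pass = [(10, 0)] * 3 + [(10, 1)]    # a single passing sample overall

print(benchmark_pass_at_k(no_passes, 1), benchmark_pass_at_k(no_passes, 10))
print(benchmark_pass_at_k(one_pass, 1), benchmark_pass_at_k(one_pass, 10))
```

Under this estimator, even one passing sample on one problem lifts pass@10 above zero, so a row of zeros implies that no generated sample passed any problem's tests.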
Key highlights
- Dataset deduplicated using a regex over alphanumeric “variables,” with the processing source code and a datasheet available
- Training hyperparameters fully documented: AdamW with GPT-3-style cosine decay for CodeClippy, Adafactor for 1.3B APPS fine-tuning “in part determined by hardware limitations”
- VS Code extension exists but relies on the Hugging Face Inference API
- Multiple model variants on the Hugging Face Hub, including 125M- and 1.3B-parameter sizes
- Active issue tracking a data bug where wrong filenames may have corrupted language filtering
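The deduplication highlight can be read in more than one way. One plausible reading, sketched here as an assumption rather than as the project's actual pipeline, is that each file is reduced to the set of alphanumeric tokens a regex finds in it, and files with identical token sets are treated as duplicates. The regex, the function names, and the choice of SHA-256 are all hypothetical.

```python
import hashlib
import re

# Assumed definition of a "variable": any run of letters, digits, or underscores.
TOKEN_RE = re.compile(r"[A-Za-z0-9_]+")


def fingerprint(source: str) -> str:
    """Hash the sorted set of alphanumeric tokens found in a file."""
    tokens = sorted(set(TOKEN_RE.findall(source)))
    # Tokens never contain a newline, so joining on it is unambiguous.
    return hashlib.sha256("\n".join(tokens).encode("utf-8")).hexdigest()


def deduplicate(files: dict[str, str]) -> dict[str, str]:
    """Keep the first file seen for each distinct token-set fingerprint."""
    seen: set[str] = set()
    kept: dict[str, str] = {}
    for path, source in files.items():
        key = fingerprint(source)
        if key not in seen:
            seen.add(key)
            kept[path] = source
    return kept


# Hypothetical inputs: a.py and b.py differ only in spacing and line order.
files = {
    "a.py": "x = 1\ny = x + 1\n",
    "b.py": "y  =  x+1\nx=1\n",
    "c.py": "z = 2\n",
}
print(list(deduplicate(files)))
```

A scheme like this ignores whitespace, punctuation, and ordering, so it catches reformatted copies but can also merge files that are genuinely different yet share a vocabulary.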
Caveats
- HumanEval results show base GPT-Neo outperforming all CodeClippy variants; several models score literally zero
- A known dataset bug means file extensions used for language filtering may be wrong, with unknown impact on training data quality
- README contains multiple TODOs and no clear guidance on which model or training path to follow
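To see why the filename bug matters, consider a minimal sketch of extension-based language filtering. The extension map and the file records are hypothetical, and the project's real filter may work differently. The point is only that such a filter trusts the recorded filename, so a wrong name silently drops or admits a file.

```python
from pathlib import PurePosixPath

# Hypothetical extension-to-language map; a real one would be much larger.
EXT_TO_LANG = {".py": "Python", ".js": "JavaScript", ".java": "Java"}


def language_of(filename: str):
    """Guess a language from the file extension; None if unrecognized."""
    return EXT_TO_LANG.get(PurePosixPath(filename).suffix.lower())


def filter_by_language(records, wanted):
    """Keep records whose recorded filename maps to a wanted language."""
    return [r for r in records if language_of(r["file_name"]) in wanted]


# Hypothetical records: the second holds Python code under a wrong filename.
records = [
    {"file_name": "src/utils.py", "text": "def add(a, b): return a + b"},
    {"file_name": "src/app.js", "text": "def sub(a, b): return a - b"},
    {"file_name": "README", "text": "Project notes"},
]

kept = filter_by_language(records, {"Python"})
print([r["file_name"] for r in kept])
```

Here the mislabeled Python file is excluded from a Python-only subset, and it would be included in a JavaScript-only one. Neither error is visible without inspecting file contents.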
Verdict
Worth following if you’re researching open-source code generation or want to see how a community project documents its stumbles in real time. Skip if you need a working Copilot replacement today.