VHellendoorn/Code-LMs

PolyCoder: a code model that admits its own limitations

A 2.7B-parameter code generator trained on 12 languages, with unusually honest documentation about what it can't do.

What it does

PolyCoder is a family of GPT-2-style language models (160M to 2.7B parameters) trained exclusively on source code across 12 programming languages. You can run it via HuggingFace, a Docker image, or a fork of the GPT-NeoX toolkit. It autocompletes code from a prompt: feed it "def binarySearch(arr, left, right, x):\n mid = (left +" and it will suggest continuations.
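As a rough illustration of the HuggingFace route, here is a minimal sketch using the transformers library. The hub identifier, the four-space indentation in the prompt, and the helper name complete are assumptions made for this example; none of them is stated on this page. The sketch relies on the library's default decoding settings.

```python
# Sketch only: MODEL_ID is an assumed hub identifier, not taken from this page.
MODEL_ID = "NinedayWang/PolyCoder-160M"

# The prompt from the text, written with a real newline.
# The four-space indent is an assumption about the intended formatting.
PROMPT = "def binarySearch(arr, left, right, x):\n    mid = (left +"


def complete(prompt, model_id=MODEL_ID, max_new_tokens=32):
    """Return the prompt plus one continuation generated by the model."""
    # Imported lazily so this file loads even without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    # Encode the prompt as a batch of one sequence of token ids.
    input_ids = tokenizer.encode(prompt, return_tensors="pt")

    # Generate up to max_new_tokens tokens after the prompt.
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0])


# Usage (downloads the checkpoint on first call):
#     print(complete(PROMPT))
```

Capping max_new_tokens keeps the continuation short, which may also help with the run-on generation described under Caveats.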

The interesting bit

The README spends as much space on caveats as on features. The authors state openly that the model was not trained to solve programming problems, may do poorly on HumanEval, and learned natural language only from code comments, whereas Codex built on a model that had also seen ordinary prose. This transparency is almost as rare as the 249GB deduplicated training corpus they published alongside it.

Key highlights

  • Three model sizes (160M, 405M, 2.7B) with checkpoints on Zenodo and HuggingFace Hub
  • Trained on 249GB of code from 24.1M files across C, C++, C#, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, TypeScript
  • Full data provenance: SHA-256 hashes for every file, enabling contamination checks against future test sets
  • Docker image (5.4GB base) and forked GPT-NeoX toolchain for reproduction
  • Whitespace-aware: the model cares deeply about tabs and newlines because it saw raw, unprocessed files
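The contamination check that the per-file hashes make possible could look something like the sketch below. It assumes the hashes are hex-encoded SHA-256 digests of raw file bytes; the function names and the toy data are hypothetical, and how the published hash list is laid out is not specified on this page.

```python
import hashlib


def sha256_hex(content: bytes) -> str:
    """Hex-encoded SHA-256 digest of raw file bytes."""
    return hashlib.sha256(content).hexdigest()


def find_contaminated(test_files, training_hashes):
    """Return the names of test files whose exact bytes are in the training set.

    test_files: dict mapping a file name to its raw bytes.
    training_hashes: set of hex digests (assumed format, see lead-in).
    """
    return sorted(
        name
        for name, content in test_files.items()
        if sha256_hex(content) in training_hashes
    )


# Toy data standing in for a real hash list.
training_hashes = {sha256_hex(b"print('hi')\n")}
benchmark = {
    "seen.py": b"print('hi')\n",   # byte-identical to a training file
    "unseen.py": b"print('hi')",   # differs by one trailing newline
}
print(find_contaminated(benchmark, training_hashes))
```

Hash matching only catches byte-identical copies. A file that differs by a single character, as unseen.py does here, passes the check, so a clean result is a lower bound on overlap and not proof of its absence.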

Caveats

  • The 2.7B model may not have converged: training was stopped at 150K steps instead of the default 320K due to resource constraints
  • It tends to generate random new files once it decides the current one has ended, possibly because end-of-document tokens were mishandled in the training data
  • Requires careful indentation in prompts; a single wrong tab can send predictions off the rails
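Because indentation matters this much, it may be worth linting a prompt before sending it to the model. The helper below is a hypothetical sketch, not part of the repository: it flags lines that mix tabs and spaces, and space-indented lines whose indent is not a multiple of an assumed width of four.

```python
def indentation_issues(prompt: str, indent_width: int = 4):
    """Return (line_number, message) pairs for suspicious indentation."""
    issues = []
    for number, line in enumerate(prompt.split("\n"), start=1):
        stripped = line.lstrip(" \t")
        if not stripped:
            continue  # blank or whitespace-only line, nothing to check
        leading = line[: len(line) - len(stripped)]
        if " " in leading and "\t" in leading:
            issues.append((number, "mixes tabs and spaces"))
        elif "\t" not in leading and len(leading) % indent_width != 0:
            issues.append(
                (number, f"indent of {len(leading)} spaces is not a multiple of {indent_width}")
            )
    return issues


clean = "def f(x):\n    return x"
mixed = "def f(x):\n \treturn x"
print(indentation_issues(clean))
print(indentation_issues(mixed))
```

The check is deliberately conservative: it cannot tell whether tabs or spaces suit a given language's conventions, only that a prompt is internally inconsistent.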

Verdict

Researchers studying code model training data, reproducibility, or multilingual code generation should grab this. Developers looking for a Copilot replacement should look elsewhere—the authors will tell you so themselves.
