Is Code-LMs open source?

Yes — VHellendoorn/Code-LMs is open source, released under the MIT license.

What language is Code-LMs written in?

VHellendoorn/Code-LMs is primarily written in Python.

How popular is Code-LMs?

VHellendoorn/Code-LMs has 1.8k stars on GitHub.

Where can I find Code-LMs?

VHellendoorn/Code-LMs is on GitHub at https://github.com/VHellendoorn/Code-LMs.


VHellendoorn/Code-LMs

PolyCoder: a code model that admits its own limitations

A 2.7B-parameter code generator trained on 12 languages, with unusually honest documentation about what it can't do.

★ 1.8k stars · Python · Language Models · Coding Assistants

What it does

PolyCoder is a family of GPT-2-style language models (160M to 2.7B parameters) trained exclusively on source code across 12 programming languages. You can run it via HuggingFace, a Docker image, or a fork of the GPT-NeoX toolkit. It autocompletes code from a prompt: feed it "def binarySearch(arr, left, right, x):\n mid = (left +" (where \n stands for a line break) and it will suggest continuations.
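
A minimal sketch of the HuggingFace route, assuming the transformers and torch packages are installed and that the 2.7B checkpoint is published on the Hub as NinedayWang/PolyCoder-2.7B. That identifier is not stated on this page, so verify it against the repository README. The four-space indent in the prompt and the greedy decoding settings are illustrative choices, not the authors' recommended configuration.

```python
# Checkpoint name is an assumption; confirm it in the repository README.
MODEL_ID = "NinedayWang/PolyCoder-2.7B"

# The prompt from the description above, with a real newline and indentation.
PROMPT = "def binarySearch(arr, left, right, x):\n    mid = (left +"


def complete(prompt: str, max_new_tokens: int = 32) -> str:
    """Return the prompt plus a greedy continuation from the model."""
    # Imported lazily so this module loads without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(
        **inputs, max_new_tokens=max_new_tokens, do_sample=False
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```

Calling complete(PROMPT) downloads the checkpoint on first use, which takes several gigabytes of disk space for the 2.7B model.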

The interesting bit

The README spends as much space on caveats as on features. The authors openly state that the model was not trained to solve programming problems, that it may not perform well on benchmarks such as HumanEval, and that it learned natural language only from code comments rather than from prose, as Codex did. This transparency is almost as rare as the 249GB deduplicated training corpus they published alongside it.

Key highlights

Three model sizes (160M, 405M, 2.7B) with checkpoints on Zenodo and HuggingFace Hub
Trained on 249GB of code from 24.1M files across C, C++, C#, Go, Java, JavaScript, PHP, Python, Ruby, Rust, Scala, TypeScript
Full data provenance: SHA-256 hashes for every file, enabling contamination checks against future test sets
Docker image (5.4GB base) and forked GPT-NeoX toolchain for reproduction
Whitespace-aware: the model is sensitive to tabs and newlines because it was trained on raw, unprocessed files
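
The contamination check mentioned in the highlights can be sketched roughly as follows. This assumes the published digests are SHA-256 over each file's raw bytes and that they have already been loaded into a Python set; the layout of the published list is not described here, so loading it is left out. The names sha256_of_file and find_overlap are hypothetical helpers, not part of the repository.

```python
import hashlib
from pathlib import Path
from typing import Iterable, List, Set


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex SHA-256 digest of a file's raw bytes, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_overlap(test_files: Iterable[Path], training_hashes: Set[str]) -> List[Path]:
    """Return the test files whose digest appears among the training hashes."""
    return [p for p in test_files if sha256_of_file(p) in training_hashes]
```

A check like this only catches byte-identical files; a copy with changed whitespace or a renamed variable produces a different digest and would go unnoticed.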

Caveats

The 2.7B model may not have converged: training was stopped at 150K steps instead of the default 320K due to resource constraints
It tends to generate random new files once it predicts that the current file has ended, possibly because end-of-document tokens were mishandled in the training data
Requires careful indentation in prompts; a single wrong tab can send predictions off the rails
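
Two of these caveats can be worked around on the caller's side. The sketch below is a heuristic under stated assumptions, not something the repository provides: has_mixed_indentation flags a prompt whose leading whitespace uses both tabs and spaces, and truncate_at_dedent cuts a completion at the first non-blank line that returns to column 0, which assumes the prompt ended inside an indented Python function body. Both function names are hypothetical.

```python
def has_mixed_indentation(prompt: str) -> bool:
    """True if leading whitespace across the prompt uses both tabs and spaces."""
    seen = set()
    for line in prompt.split("\n"):
        indent = line[: len(line) - len(line.lstrip(" \t"))]
        seen.update(indent)
    return {" ", "\t"} <= seen


def truncate_at_dedent(completion: str) -> str:
    """Keep text up to the first non-blank line that starts at column 0.

    A line back at column 0 is treated as the start of unrelated output,
    such as a new file the model began after finishing the function.
    """
    kept = []
    for index, line in enumerate(completion.split("\n")):
        # Index 0 continues the prompt's last line, so it is always kept.
        if index > 0 and line.strip() and not line[0].isspace():
            break
        kept.append(line)
    return "\n".join(kept).rstrip()
```

The dedent rule suits single-function completions; for other languages or for top-level code a different stop condition would be needed.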

Verdict

Researchers studying code model training data, reproducibility, or multilingual code generation should grab this. Developers looking for a Copilot replacement should look elsewhere—the authors will tell you so themselves.
