Microsoft's six-model code-AI family, ready to pip install
A monorepo of pretrained transformers for code understanding, generation, review, and even execution traces.

What it does
This repository bundles six related models (CodeBERT, GraphCodeBERT, UniXcoder, CodeReviewer, CodeExecutor, and LongCoder) into a single lineage. Each is a pretrained transformer for code, and the checkpoints load through Hugging Face’s standard transformers API the same way RoBERTa does, so the practical install is pip install torch transformers rather than a package built from the repo itself. The base CodeBERT model was trained on paired natural-language and code data across six languages (Python, Java, JavaScript, PHP, Ruby, Go).
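As a rough illustration of the loading path described above, here is a minimal sketch that pulls the base checkpoint and turns two strings into vectors. The helper names embed and cosine are hypothetical, and using the first-token hidden state as the embedding is an assumption made for brevity, not something the repository prescribes. Similarity scores from the raw, un-fine-tuned model should be treated as a smoke test rather than a search-quality signal.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(texts, model_name="microsoft/codebert-base"):
    """Return one vector per input string (first-token hidden state).

    Assumption: first-token pooling is used here only for illustration.
    """
    # Imported lazily so cosine() works without torch/transformers installed.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    batch = tokenizer(
        texts, padding=True, truncation=True, max_length=512, return_tensors="pt"
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)
    return hidden[:, 0, :].tolist()  # vector at the first token of each input


if __name__ == "__main__":
    # Downloads the checkpoint from the Hugging Face Hub on first run.
    query, code = embed(
        ["return the maximum value", "def max(a, b): return a if a > b else b"]
    )
    print(f"similarity: {cosine(query, code):.3f}")
```

Swapping model_name for another checkpoint in the family should work the same way for encoder-style models, though the other models may need task-specific input formatting documented in their subfolders.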
The interesting bit
The family keeps branching into weirder capabilities: GraphCodeBERT injects data-flow structure, CodeReviewer digests actual code-review conversations, and CodeExecutor tries to predict execution traces rather than just syntax. It’s less a single tool than a research group’s longitudinal study of what you can pretrain a transformer to understand about code.
Key highlights
- Drop-in Hugging Face compatibility: AutoModel.from_pretrained("microsoft/codebert-base")
- Two variants for different jobs: standard CodeBERT for embeddings, CodeBERT-MLM for masked-token prediction
- GraphCodeBERT adds data-flow edges for structure-aware tasks (clone detection, code translation)
- LongCoder targets long-range code completion with sparse attention
- Each model has its own subfolder with task-specific reproduction code
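To make the masked-token variant from the highlights concrete, the sketch below masks one token in a snippet and asks the MLM checkpoint for candidates through the transformers fill-mask pipeline. The helper names mask_first and predict_mask are hypothetical, the checkpoint id microsoft/codebert-base-mlm is the one published on the Hugging Face Hub, and the "<mask>" default assumes the RoBERTa-style tokenizer the model uses.

```python
def mask_first(code, target, mask_token="<mask>"):
    """Replace the first occurrence of `target` in `code` with the mask token."""
    if target not in code:
        raise ValueError(f"{target!r} not found in snippet")
    return code.replace(target, mask_token, 1)


def predict_mask(masked_code, model_name="microsoft/codebert-base-mlm", top_k=5):
    """Return (token, score) pairs for the single masked position."""
    # Imported lazily so mask_first() works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_name)
    predictions = fill_mask(masked_code, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


if __name__ == "__main__":
    # Downloads the checkpoint from the Hugging Face Hub on first run.
    snippet = mask_first("if (x is not None) and (x > 1)", "and")
    for token, score in predict_mask(snippet):
        print(f"{token!r}: {score:.3f}")
```

The standard codebert-base checkpoint is the better fit when you only need vectors; the MLM variant matters when you want the model to propose tokens.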
Caveats
- The README is mostly a directory of paper links; actual tutorials and downstream task code live in per-model subfolders
- Some subfolders use “will provide” phrasing, suggesting not everything is fully documented yet
Verdict
Worth bookmarking if you’re doing empirical research on code intelligence or need a solid pretrained embedding baseline. Skip it if you want a single polished product: this is a lab’s paper-reproduction archive, not a unified framework.