microsoft/CodeBERT

Microsoft's six-model code-AI family, loadable through Hugging Face transformers

A monorepo of pretrained transformers for code understanding, generation, review, and even execution-trace prediction.

Velocity (7d): +1.3 ★ / day · Trend: steady

What it does

This repository bundles six related models (CodeBERT, GraphCodeBERT, UniXcoder, CodeReviewer, CodeExecutor, and LongCoder) into a single monorepo. Each is a pretrained transformer for code; CodeBERT, for example, loads through Hugging Face's standard transformers API the same way RoBERTa does. The base CodeBERT model was trained on paired natural-language and code data across six languages (Python, Java, JavaScript, PHP, Ruby, Go).
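As a rough illustration of that loading path, here is a minimal sketch that pulls microsoft/codebert-base through the transformers Auto classes and compares two snippets by embedding similarity. The helper names (load, embed, cosine) are hypothetical, and using the first-token hidden state as the embedding is an assumption made for brevity, not something the repository prescribes. It needs torch and transformers installed, and the first call to load downloads the weights.

```python
import math

MODEL_NAME = "microsoft/codebert-base"  # checkpoint named in this write-up


def load(model_name: str = MODEL_NAME):
    """Load tokenizer and model (hypothetical helper; downloads on first use)."""
    # Imported lazily so the pure-Python helper below works without these packages.
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).eval()
    return tokenizer, model


def embed(text: str, tokenizer, model) -> list[float]:
    """Return the first-token (<s>) hidden state as a plain list of floats."""
    import torch

    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden[0, 0].tolist()


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Typical use would be `tok, mdl = load()` followed by `cosine(embed(snippet_a, tok, mdl), embed(snippet_b, tok, mdl))`. Without fine-tuning, treat the resulting scores as a baseline signal rather than a calibrated similarity measure.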

The interesting bit

The family keeps branching into weirder capabilities: GraphCodeBERT injects data-flow structure, CodeReviewer is pretrained on real code changes and their review comments, and CodeExecutor tries to predict execution traces rather than just syntax. It's less a single tool than a research group's longitudinal study of what you can pretrain a transformer to understand about code.
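To make "data-flow structure" concrete: the idea is to record, for each variable, which other variables its value comes from. The toy sketch below derives such edges from simple Python assignments using only the standard ast module. It is an illustration of the concept, not GraphCodeBERT's actual extraction pipeline, and the function name data_flow_edges is hypothetical.

```python
import ast


def data_flow_edges(source: str) -> list[tuple[str, str]]:
    """Return (source_var, target_var) pairs meaning 'target's value comes from source_var'.

    Toy version: only plain assignments are handled, and every name read on the
    right-hand side (including function names) counts as a source.
    """
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Assign):
            # Names read on the right-hand side, sorted for a stable order.
            reads = sorted(
                {n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)}
            )
            # Names written on the left-hand side.
            for target in node.targets:
                for t in ast.walk(target):
                    if isinstance(t, ast.Name):
                        edges.extend((r, t.id) for r in reads)
    return edges


print(data_flow_edges("x = a + b\ny = x * 2"))
# → [('a', 'x'), ('b', 'x'), ('x', 'y')]
```

A structure-aware model can use edges like these alongside the token sequence, so that `y` is tied to `a` and `b` through `x` even when the names themselves carry no hint.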

Key highlights

  • Drop-in Hugging Face compatibility: AutoModel.from_pretrained("microsoft/codebert-base")
  • Two variants for different jobs: standard CodeBERT for embeddings, CodeBERT-MLM for masked-token prediction
  • GraphCodeBERT adds data-flow edges for structure-aware tasks (clone detection, code translation)
  • LongCoder targets long-range code completion with sparse attention
  • Each model has its own subfolder with task-specific reproduction code
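The masked-token variant can be tried with the transformers fill-mask pipeline. The sketch below assumes the checkpoint name microsoft/codebert-base-mlm and the RoBERTa-style `<mask>` token; predict_mask and top_tokens are hypothetical helper names. It needs transformers plus a backend such as torch, and the first call downloads the weights.

```python
MLM_NAME = "microsoft/codebert-base-mlm"  # assumed checkpoint name for the MLM variant


def predict_mask(code: str, model_name: str = MLM_NAME) -> list[dict]:
    """Run fill-mask on a snippet containing exactly one <mask> token."""
    # Imported lazily so top_tokens below works without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_name)
    return fill(code)  # list of dicts with "score", "token", "token_str", "sequence"


def top_tokens(predictions: list[dict], k: int = 3) -> list[str]:
    """Return the k highest-scoring candidate tokens, whitespace-stripped."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [p["token_str"].strip() for p in ranked[:k]]
```

For example, `top_tokens(predict_mask("if (x is not None) <mask> (x > 1)"))` would list the model's leading guesses for the missing operator.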

Caveats

  • The README is mostly a directory of paper links; actual tutorials and downstream task code live in per-model subfolders
  • Some subfolders use “will provide” phrasing, suggesting not everything is fully documented yet

Verdict

Worth bookmarking if you're doing empirical research on code intelligence or need a solid pretrained embedding baseline. Skip it if you want a single polished product: this is a lab's paper-reproduction archive, not a unified framework.
