BERT, but make it do five jobs at once
A wrapper around Hugging Face transformers that tries to make multi-task learning as easy as single-task, with pluggable strategies for sampling, loss combination, and gradient surgery.

What it does
M3TL is a convenience layer over Hugging Face transformers for multi-modal, multi-task learning (MTL). It exposes four programmable modules (problem sampling, loss combination, gradient surgery, and post-transformer model architecture) that let you stack multiple NLP tasks, such as classification, NER, sequence tagging, and masked LM, on a single shared backbone. The pitch: write MTL models with roughly the effort of a single-task model.
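To make the "one shared backbone, many tasks" idea concrete, here is a minimal, library-free sketch. It does not use M3TL's API: `shared_encoder`, `task_heads`, `forward`, and `combine_losses` are hypothetical names, and the toy arithmetic only stands in for a transformer and its task-specific heads.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def shared_encoder(tokens: List[str]) -> Vector:
    """Stand-in for the shared backbone (assumption: a real setup would call a transformer)."""
    n = len(tokens)
    mean_len = sum(len(t) for t in tokens) / n if n else 0.0
    # Toy features: token count and mean token length.
    return [float(n), mean_len]


# One lightweight head per task; every head reads the same encoder output.
task_heads: Dict[str, Callable[[Vector], float]] = {
    "sentiment": lambda h: 0.5 * h[0] - 0.1 * h[1],
    "length_regression": lambda h: h[0] * h[1],
}


def forward(tokens: List[str]) -> Dict[str, float]:
    """Run the backbone once, then fan out to every task head."""
    hidden = shared_encoder(tokens)
    return {name: head(hidden) for name, head in task_heads.items()}


def combine_losses(
    losses: Dict[str, float], weights: Optional[Dict[str, float]] = None
) -> float:
    """Weighted sum of per-task losses; defaults to equal weights of 1.0."""
    if weights is None:
        weights = {name: 1.0 for name in losses}
    return sum(weights[name] * loss for name, loss in losses.items())
```

The design point is that the expensive encoder runs once per batch and only the small heads differ per task. A weighted sum is just one possible loss-combination strategy, chosen here because it is the simplest to read.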
The interesting bit
The project doesn’t just wire tasks together; it treats MTL’s gnarly coordination problems as first-class, swappable components. Gradient surgery in particular, meaning adjusting per-task gradients so they stop pulling the shared weights in opposite directions, is the kind of thing that usually lives in research code and never gets reused.
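The README, as summarized here, doesn't say which gradient surgery algorithm ships, so treat the following as an illustration rather than M3TL's implementation. One well-known approach is PCGrad-style projection: when two task gradients conflict (negative dot product), the conflicting component is projected away before the gradients are summed. The names `dot` and `pcgrad` are hypothetical, and plain lists stand in for flattened parameter gradients.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def pcgrad(task_grads: List[Vector]) -> Vector:
    """Sum per-task gradients after removing pairwise conflicts.

    Assumption: the published method visits the other tasks in random
    order; this sketch uses a fixed order so the result is deterministic.
    """
    projected = []
    for i, grad in enumerate(task_grads):
        g = list(grad)
        for j, other in enumerate(task_grads):
            if i == j:
                continue
            overlap = dot(g, other)
            norm_sq = dot(other, other)
            # A negative overlap means the two tasks pull in conflicting directions.
            if overlap < 0 and norm_sq > 0:
                # Drop the component of g that points against `other`.
                g = [gi - overlap / norm_sq * oi for gi, oi in zip(g, other)]
        projected.append(g)
    # The shared weights are updated with the sum of the adjusted gradients.
    return [sum(components) for components in zip(*projected)]
```

For example, `pcgrad([[1.0, 0.0], [-1.0, 1.0]])` returns `[0.5, 1.5]`, whereas a naive sum would give `[0.0, 1.0]` and erase the first task's progress along the first axis. Gradients that do not conflict pass through unchanged.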
Key highlights
- Built-in problem types: classification, multi-label, sequence labeling, masked LM, regression, contrastive learning, and more
- Pluggable strategies for which tasks to sample, how to combine losses, and how to avoid gradient conflicts
- Post-transformer model module is user-programmable
- Claims to include various “SOTA MTL algorithms”, though the README doesn’t enumerate them
- Multi-modal support extends beyond text
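As a rough picture of what a pluggable task-sampling strategy could look like, here is a sketch of size-proportional sampling with a temperature, a common choice in multi-task training. It is an assumption, not something the README describes: `sampling_probs` and `sample_task` are hypothetical names, and the dataset sizes are made up.

```python
import random
from typing import Dict


def sampling_probs(dataset_sizes: Dict[str, int], temperature: float = 1.0) -> Dict[str, float]:
    """Turn dataset sizes into task-sampling probabilities.

    temperature=1.0 samples proportionally to size; larger values
    flatten the distribution so small tasks are seen more often.
    """
    scaled = {name: size ** (1.0 / temperature) for name, size in dataset_sizes.items()}
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


def sample_task(probs: Dict[str, float], rng: random.Random) -> str:
    """Pick the task that supplies the next training batch."""
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


sizes = {"ner": 100, "cls": 900}  # made-up example sizes
proportional = sampling_probs(sizes)                # ner is drawn about 10% of the time
flattened = sampling_probs(sizes, temperature=2.0)  # ner is drawn about 25% of the time
next_task = sample_task(flattened, random.Random(0))
```

Making the strategy a small function like this is what would let it be swapped without touching the training loop, which is the kind of modularity the highlights describe.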
Caveats
- README is heavy on promises and light on implementation details; “tutorials” are referenced but not linked or summarized
- No benchmarks, citations, or comparisons against TencentNLP/PyText (the projects it criticizes as “naive”)
- “SOTA MTL algorithms” are asserted, not listed
Verdict
Worth a look if you’re already in the Hugging Face ecosystem and need to bolt multiple NLP tasks onto one model without hand-rolling the coordination logic. Skip if you need rigorous comparisons or documentation before trusting a training pipeline.