A Swiss Army knife for fine-tuning 1,000+ models without losing your mind
ModelScope's ms-swift wraps the entire LLM/Multimodal training pipeline—from LoRA to Megatron GRPO—into a single Python framework with Day-0 model support.

What it does
ms-swift is a training and deployment framework that covers the full lifecycle of large language and multimodal models: pre-training, supervised fine-tuning, RLHF (DPO, GRPO, KTO, etc.), inference, evaluation, quantization, and deployment. It supports 600+ text-only models and 400+ multimodal models including Qwen3, DeepSeek-R1, Llama4, and Qwen3-VL. The project also provides a Web UI for click-through training and integrates vLLM, SGLang, and LMDeploy for inference acceleration.
The interesting bit
The breadth is almost comical: it doesn’t just do LoRA and QLoRA; it also wraps Megatron-scale distributed training (TP, PP, CP, EP) and a whole zoo of GRPO-family algorithms (DAPO, GSPO, SAPO, CHORD, Reinforce++). The “Mcore-Bridge” abstraction tries to make Megatron training feel as simple as HuggingFace transformers, an ambitious bit of plumbing that could save weeks of config archaeology.
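To make the parallelism alphabet soup concrete, here is a rough sketch of the bookkeeping that Megatron-style configs imply: tensor (TP), pipeline (PP), and context (CP) parallel sizes multiply together, and whatever is left of the GPU count becomes data-parallel replicas. The function name is hypothetical and is not part of ms-swift. Expert parallelism (EP) is left out on purpose, since how it is carved out of the remaining ranks depends on the framework.

```python
def data_parallel_size(world_size: int, tp: int = 1, pp: int = 1, cp: int = 1) -> int:
    """Return how many data-parallel replicas remain after model parallelism.

    Illustrative helper only (hypothetical name, not an ms-swift API).
    Assumes the common convention world_size = tp * pp * cp * dp.
    """
    model_parallel = tp * pp * cp
    if world_size % model_parallel != 0:
        raise ValueError(
            f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
        )
    return world_size // model_parallel


if __name__ == "__main__":
    # 64 GPUs split as TP=4, PP=2, CP=2 leave 64 / 16 = 4 data-parallel replicas.
    print(data_parallel_size(64, tp=4, pp=2, cp=2))
```

Getting this divisibility wrong is a typical source of the “config archaeology” mentioned above, which is presumably part of what a bridge layer would aim to hide.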
Key highlights
- 600+ text models and 400+ multimodal models with Day-0 support
- Lightweight methods: LoRA, QLoRA, DoRA, ReFT, Adapter, LISA, and more
- Quantized training on BNB/AWQ/GPTQ models (the README claims 9 GB of VRAM for 7B models)
- Megatron parallelism for MoE models, plus Ulysses/Ring-Attention sequence parallelism for long contexts
- 150+ built-in datasets and support for mixed-modality training (text, image, video, audio)
- Full pipeline: training → evaluation (EvalScope) → quantization (GPTQ/AWQ/FP8) → deployment
- Paper published at AAAI 2025
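For a sense of why a single-digit-GB figure for a 7B model is at least plausible, a back-of-the-envelope sketch helps. It counts only the memory for the frozen base weights at different precisions, plus the number of trainable parameters a LoRA adapter of rank r adds to one weight matrix, r × (d_in + d_out). The helper names, the 4096 hidden size, and rank 8 are illustrative assumptions, not values taken from ms-swift. Activations, optimizer state, and framework overhead are ignored, so this is not a reproduction of the README's number.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4.0, "bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Memory for the raw weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in matrix: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


if __name__ == "__main__":
    for dtype in ("bf16", "int8", "int4"):
        print(f"7B weights in {dtype}: {weight_memory_gb(7e9, dtype):.1f} GB")

    # Hypothetical 4096 x 4096 projection with a rank-8 adapter.
    full = 4096 * 4096
    added = lora_params(4096, 4096, rank=8)
    print(f"LoRA adds {added} params vs {full} in the full matrix ({added / full:.2%})")
```

At 4 bits per parameter the weights of a 7B model come to about 3.5 GB, which leaves headroom under 9 GB for adapters, activations, and optimizer state on short sequences.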
Caveats
- The README is a feature list firehose; actual API ergonomics are hard to judge without using it
- “2026.03.03” release date in the News section appears to be a typo (likely meant 2025)
- With so many integrations, version compatibility across PyTorch, CUDA, Megatron, and various inference engines is a maintenance burden you’ll own
Verdict
Worth a look if you’re doing production fine-tuning across many model architectures and need one toolchain instead of five. Probably overkill if you just need to LoRA-tune a single Llama model on a single GPU; use Axolotl or Unsloth instead.