Unsloth for your MacBook: prototype locally, ship to CUDA
A compatibility shim that lets you run the same fine-tuning scripts on Apple Silicon and NVIDIA GPUs, changing only the imports.

What it does
mlx-tune wraps Apple’s MLX framework in an Unsloth-compatible API. You import FastLanguageModel and SFTTrainer from mlx_tune instead of unsloth and trl, and the rest of your script stays identical. It supports SFT, DPO, GRPO, vision-language models, TTS, STT, embeddings, and even OCR fine-tuning — all on Apple Silicon, using unified memory up to 512 GB.
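The import swap can be pushed into one small helper so the training script itself never changes between machines. The following is a minimal sketch, not something taken from the project: the helper names pick_backend and load_training_classes are hypothetical, and the only details drawn from the description above are the module names (mlx_tune on Apple Silicon, unsloth and trl on CUDA) and the two class names.

```python
import importlib
import platform


def pick_backend(system=None, machine=None):
    """Return which module each training class should be imported from.

    Hypothetical helper: arguments default to the current machine, but can
    be passed explicitly to check the logic without the hardware.
    """
    system = system or platform.system()
    machine = machine or platform.machine()
    if system == "Darwin" and machine == "arm64":
        # Apple Silicon: both classes come from mlx_tune.
        return {"FastLanguageModel": "mlx_tune", "SFTTrainer": "mlx_tune"}
    # Anything else: assume the usual CUDA stack (unsloth + trl).
    return {"FastLanguageModel": "unsloth", "SFTTrainer": "trl"}


def load_training_classes(backend=None):
    """Import and return (FastLanguageModel, SFTTrainer) for the backend.

    Requires the chosen packages to be installed; nothing is imported
    until this function is called.
    """
    backend = backend or pick_backend()
    model_mod = importlib.import_module(backend["FastLanguageModel"])
    trainer_mod = importlib.import_module(backend["SFTTrainer"])
    return model_mod.FastLanguageModel, trainer_mod.SFTTrainer
```

A script would then call load_training_classes() once at the top and use the returned classes as usual. Whether every keyword argument behaves identically on both backends is something to verify against each project's documentation.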
The interesting bit
The author calls this the “Context Switch” problem: Unsloth depends on Triton, which doesn’t run on Macs. Rather than reimplementing Unsloth or claiming speed parity, mlx-tune optimizes for code portability. Prototype on an M4 Mac, then move the exact same script to a CUDA cluster for production training. The v0.5.0 release adds real optimizations — GRPO roughly 10× faster via KV-cache reuse, gradient checkpointing wired into every trainer, and ORPO now fitting in 48 GB at 4096 context — but the core pitch remains workflow, not benchmarks.
Key highlights
- Unsloth-compatible API: swap imports, keep your training scripts
- Broad modality coverage: LLMs, vision (Gemma 4, Qwen3.5, LLaVA), audio (5 TTS and 7 STT models), embeddings, OCR, and MoE architectures
- Exports to HuggingFace format and GGUF for Ollama/llama.cpp
- convert() utility for HF → MLX model conversion
- Continual pretraining support with decoupled learning rates
Caveats
- GGUF export has limitations with 4-bit base models; the README points to workarounds
- Merging into a 4-bit base with save_method="merged_4bit" can round away weak LoRA deltas; a 16-bit merge is preferred
- Apple Silicon only; no Intel Mac or cross-platform support
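The 4-bit merge caveat comes down to rounding, which a toy example makes concrete. The sketch below assumes a plain uniform quantizer with 16 evenly spaced levels over [-1, 1]. That is a simplification chosen for illustration, not the scheme mlx-tune or MLX actually uses (real 4-bit formats are typically group-wise with per-group scales), and the function name quantize_4bit is hypothetical.

```python
def quantize_4bit(x, lo=-1.0, hi=1.0):
    """Round x to the nearest of 16 evenly spaced levels in [lo, hi].

    Toy uniform quantizer for illustration only.
    """
    levels = 16
    step = (hi - lo) / (levels - 1)          # spacing between levels, ~0.133
    idx = round((x - lo) / step)             # nearest level index
    idx = max(0, min(levels - 1, idx))       # clamp to the valid range
    return lo + idx * step


# A base weight as it would be stored after 4-bit quantization.
base = quantize_4bit(0.33)

weak_delta = 0.02    # small LoRA update, well under half a step
strong_delta = 0.10  # larger update, enough to cross to the next level

# Merging and re-quantizing to 4 bits: the weak update disappears.
merged_weak_4bit = quantize_4bit(base + weak_delta)
merged_strong_4bit = quantize_4bit(base + strong_delta)

# Merging at higher precision keeps the weak update intact.
merged_weak_16bit = base + weak_delta

print("weak delta survives 4-bit merge:  ", merged_weak_4bit != base)
print("strong delta survives 4-bit merge:", merged_strong_4bit != base)
print("weak delta survives 16-bit merge: ", merged_weak_16bit != base)
```

Any update smaller than about half the quantization step lands back on the original level, which is why merging at 16-bit is the safer default when the adapter's changes are small.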
Verdict
Grab this if you own a recent Mac and want to iterate on fine-tuning pipelines without burning cloud GPU credits. Skip it if you’re already on CUDA full-time — Unsloth itself is still the gold standard there, and mlx-tune doesn’t pretend otherwise.