← all repositories
lasgroup/SDPO

Teaching LLMs to learn from their own mistakes

SDPO exists because scalar pass/fail rewards waste the rich textual feedback—error logs, judge evaluations—that verifiable domains already produce.

984 stars Python Language ModelsML Frameworks
SDPO
Collecting fresh signals — velocity needs a few days of history.
collecting data…
star history

What it does

SDPO fine-tunes large language models with reinforcement learning in verifiable domains such as math and code. It augments standard on-policy optimization by distilling the model’s own high-reward trajectories—and any textual feedback from failures—back into the policy. The result is a denser training signal that uses error messages and judge evaluations without requiring an external teacher or explicit reward model.

The interesting bit

The policy acts as its own teacher: it conditions on feedback from failed attempts, predicts what it should have done, and distills those feedback-informed next-token probabilities back into itself. The same trick works at test time, where the model iteratively refines hard problems by treating its own best generations as in-context demonstrations.

Key highlights

  • Converts rich textual feedback—runtime errors, judge evaluations, stack traces—into dense supervision.
  • Outperforms GRPO on chemistry reasoning and LiveCodeBenchv6 when rich feedback is available, according to the README benchmarks.
  • Falls back to implicit self-distillation from high-reward rollouts when environment feedback is sparse.
  • Supports test-time self-distillation to bootstrap harder coding problems without additional training.
  • Built as an extension to the verl RL framework, with Docker images pre-configured for NVIDIA GH200 clusters.

Caveats

  • Blackwell GPU support (RTX 50 series, B100/B200) is explicitly flagged as not fully tested.
  • The README is dominated by cluster deployment, Docker builds, and reproduction scripts; this is an infrastructure-heavy research codebase, not a lightweight library.
  • Results shown focus on specific models and benchmarks (Olmo3-7B-Instruct, chemistry, LiveCodeBench); broader generalization is not demonstrated.

Verdict

A solid bet if you are already running GPU clusters for LLM reinforcement learning and want to exploit detailed environment feedback instead of collapsing it to a scalar reward. Everyone else should admire the idea from a distance.

Frequently asked

What is lasgroup/SDPO?
SDPO exists because scalar pass/fail rewards waste the rich textual feedback—error logs, judge evaluations—that verifiable domains already produce.
Is SDPO open source?
Yes — lasgroup/SDPO is open source, released under the Apache-2.0 license.
What language is SDPO written in?
lasgroup/SDPO is primarily written in Python.
How popular is SDPO?
lasgroup/SDPO has 984 stars on GitHub.
Where can I find SDPO?
lasgroup/SDPO is on GitHub at https://github.com/lasgroup/SDPO.

heatdrop uses Google Analytics to see which pages get read — nothing else. Your call. How we handle data.