Yes — lasgroup/SDPO is open source, released under the Apache-2.0 license.

What language is SDPO written in?

lasgroup/SDPO is primarily written in Python.

lasgroup/SDPO has 984 stars on GitHub.

Where can I find SDPO?

lasgroup/SDPO is on GitHub at https://github.com/lasgroup/SDPO.

lasgroup/SDPO

Teaching LLMs to learn from their own mistakes

SDPO exists because scalar pass/fail rewards waste the rich textual feedback—error logs, judge evaluations—that verifiable domains already produce.

★984 stars Python Language Models ML Frameworks

View on GitHub ↗ Homepage ↗

Collecting fresh signals — velocity needs a few days of history.

collecting data…

star history

What it does

SDPO fine-tunes large language models with reinforcement learning in verifiable domains such as math and code. It augments standard on-policy optimization by distilling the model’s own high-reward trajectories—and any textual feedback from failures—back into the policy. The result is a denser training signal that uses error messages and judge evaluations without requiring an external teacher or explicit reward model.

The interesting bit

The policy acts as its own teacher: it conditions on feedback from failed attempts, predicts what it should have done, and distills those feedback-informed next-token probabilities back into itself. The same trick works at test time, where the model iteratively refines hard problems by treating its own best generations as in-context demonstrations.

Key highlights

Converts rich textual feedback—runtime errors, judge evaluations, stack traces—into dense supervision.
Outperforms GRPO on chemistry reasoning and LiveCodeBenchv6 when rich feedback is available, according to the README benchmarks.
Falls back to implicit self-distillation from high-reward rollouts when environment feedback is sparse.
Supports test-time self-distillation to bootstrap harder coding problems without additional training.
Built as an extension to the verl RL framework, with Docker images pre-configured for NVIDIA GH200 clusters.

Caveats

Blackwell GPU support (RTX 50 series, B100/B200) is explicitly flagged as not fully tested.
The README is dominated by cluster deployment, Docker builds, and reproduction scripts; this is an infrastructure-heavy research codebase, not a lightweight library.
Results shown focus on specific models and benchmarks (Olmo3-7B-Instruct, chemistry, LiveCodeBench); broader generalization is not demonstrated.

Verdict

A solid bet if you are already running GPU clusters for LLM reinforcement learning and want to exploit detailed environment feedback instead of collapsing it to a scalar reward. Everyone else should admire the idea from a distance.

Frequently asked

What is lasgroup/SDPO?: SDPO exists because scalar pass/fail rewards waste the rich textual feedback—error logs, judge evaluations—that verifiable domains already produce.
Is SDPO open source?: Yes — lasgroup/SDPO is open source, released under the Apache-2.0 license.
What language is SDPO written in?: lasgroup/SDPO is primarily written in Python.
How popular is SDPO?: lasgroup/SDPO has 984 stars on GitHub.
Where can I find SDPO?: lasgroup/SDPO is on GitHub at https://github.com/lasgroup/SDPO.