microsoft/SkillOpt

Training prompts like neural nets — without touching a single weight

Microsoft's SkillOpt treats a markdown skill document as the trainable parameter of a frozen LLM agent, complete with epochs, batching, and validation gates.

5.3k stars · Python · Agents · LLMOps · Eval
Velocity (7d): +172 ★/day · Trend: steady

What it does

SkillOpt optimizes natural-language “skills” for LLM agents using the same machinery as weight-space training (epochs, minibatches, learning-rate budgets, and held-out validation), but everything happens in text. A separate optimizer model proposes bounded add/delete/replace edits to a single skill document; only edits that strictly improve validation scores survive. The result is a compact best_skill.md (300–2,000 tokens) that runs against the unchanged target model with zero extra inference-time calls.
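
The accept-only-if-better loop described above can be sketched in a few lines. This is a minimal illustration, not SkillOpt's actual code: the `Edit` class, `apply_edit`, `gated_step`, and the toy scoring function are all hypothetical names, and the skill is modeled as a plain list of lines.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Edit:
    """One bounded text edit (hypothetical structure, for illustration only)."""
    op: str                       # "add", "delete", or "replace"
    target: Optional[str] = None  # existing line to delete or replace
    text: Optional[str] = None    # new line for add or replace


def apply_edit(skill: List[str], edit: Edit) -> List[str]:
    """Return a new list of skill lines with the edit applied."""
    if edit.op not in ("add", "delete", "replace"):
        raise ValueError(f"unknown op: {edit.op}")
    if edit.op == "add":
        return skill + [edit.text]
    if edit.target not in skill:
        return list(skill)  # edit does not apply; skill is unchanged
    i = skill.index(edit.target)
    replacement = [edit.text] if edit.op == "replace" else []
    return skill[:i] + replacement + skill[i + 1:]


def gated_step(
    skill: List[str],
    edits: List[Edit],
    validate: Callable[[List[str]], int],
) -> Tuple[List[str], int]:
    """Keep each edit only if it strictly improves the validation score."""
    best_score = validate(skill)
    for edit in edits:
        candidate = apply_edit(skill, edit)
        score = validate(candidate)
        if score > best_score:  # strict inequality: ties are rejected
            skill, best_score = candidate, score
    return skill, best_score


def toy_validate(skill: List[str]) -> int:
    # Stand-in for a held-out evaluation: reward lines that mention
    # "verify" and penalize length, so shorter useful skills win.
    return 10 * sum("verify" in line for line in skill) - len(skill)


proposed = [
    Edit("add", text="Always verify sources."),
    Edit("add", text="Be polite."),
    Edit("delete", target="Answer briefly."),
]
final_skill, final_score = gated_step(["Answer briefly."], proposed, toy_validate)
print(final_skill, final_score)
```

In this toy run the second edit is rejected because it lowers the score, while the first and third are kept. In a real setup the validation function would run the frozen agent on held-out tasks, which is where nearly all of the cost lies.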

The interesting bit

The discipline is the product. Most agent skills are hand-crafted or one-shot generated; SkillOpt makes skill improvement reproducible and measurable, borrowing stability tricks from deep learning (rejected-edit buffers, cosine-decayed textual learning rates, epoch-wise slow/meta updates) that keep optimization from drifting. The paper reports best-or-tied-best results across all 52 evaluated (model, benchmark, harness) cells.
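
Two of the stability tricks named above have simple analogues in code. The sketch below assumes that the "textual learning rate" is a cap on how many edits may be applied per step, decayed on a cosine schedule, and that the rejected-edit buffer is a bounded memory used to skip edits that were already tried and rejected. SkillOpt's real definitions may differ, and every name here is hypothetical.

```python
import math
from collections import deque
from typing import Deque


def cosine_edit_budget(step: int, total_steps: int,
                       max_edits: int, min_edits: int = 1) -> int:
    """Cosine-decay the per-step edit cap from max_edits down to min_edits."""
    progress = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    scale = 0.5 * (1.0 + math.cos(math.pi * progress))  # 1.0 at start, 0.0 at end
    return min_edits + round((max_edits - min_edits) * scale)


class RejectedBuffer:
    """Bounded memory of recently rejected edits (oldest entries are evicted)."""

    def __init__(self, capacity: int) -> None:
        self._items: Deque[str] = deque(maxlen=capacity)

    def add(self, edit_text: str) -> None:
        self._items.append(edit_text)

    def __contains__(self, edit_text: str) -> bool:
        return edit_text in self._items


# Edit cap for each of five steps: large early, small late.
budgets = [cosine_edit_budget(s, 5, max_edits=9) for s in range(5)]
print(budgets)

# With capacity 2, the oldest rejection ("edit A") is forgotten.
rejected = RejectedBuffer(capacity=2)
for edit_text in ["edit A", "edit B", "edit C"]:
    rejected.add(edit_text)

proposals = ["edit A", "edit C", "edit D"]
fresh = [p for p in proposals if p not in rejected]
print(fresh)
```

The design intuition mirrors weight-space training: allow large changes early while the skill is far from good, then shrink the step size so late edits refine rather than rewrite.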

Key highlights

  • Supports six benchmarks (SearchQA, ALFWorld, DocVQA, LiveMathematicianBench, SpreadsheetBench, OfficeQA) and multiple backends (OpenAI/Azure, Claude, Qwen via vLLM, MiniMax)
  • GPT-5.5 skills raise average accuracy over the no-skill baseline by 23.5 points (direct chat), 24.8 (Codex CLI), and 19.1 (Claude Code)
  • Optimized skills transfer across model scales and between execution harnesses without re-optimization
  • Training auto-resumes from the last completed step and outputs a full provenance trail (patches, evals, slow-update logs)
  • Pretrained ckpt/ artifacts provided for paper reproduction; PyPI installable (pip install skillopt)

Caveats

  • Most benchmark datasets are not included; you bring your own splits in a specific directory format (only the SearchQA split is currently bundled)
  • The main branch defaults to the post-submission force-accept slow-update mode; reproducing the paper requires setting slow_update_gate_with_selection: true
  • An Azure OpenAI endpoint is effectively required for most setups; env var naming is idiosyncratic (AZURE_OPENAI_* is reused even for plain OpenAI endpoints)

Verdict

Worth a look if you’re building agent pipelines and tired of prompt engineering by vibe check. Skip it if you need end-to-end data included or aren’t prepared to manage API credentials across multiple backend formats.
