Why 27,000 Developers Want AI Agents to Write Specs First

Contributing Editor

Addy Osmani's agent-skills repo encodes the full software lifecycle into structured Markdown workflows, forcing coding agents past their default tendency to ship the shortest path.

addyosmani/agent-skills

The Shortcut Problem

AI coding agents have a default mode: find the shortest path to a green checkmark. Ask for a feature and you receive an implementation, often without a spec, without tests that actually fail first, and without the quiet diligence that separates prototype code from production software. Addy Osmani, whose new repository recently crossed 27,000 stars, diagnoses this as the same failure mode every senior engineer learns to avoid. The senior version of any task includes invisible work—surfacing assumptions, sizing changes for human review, leaving evidence that the result is correct—that agents routinely skip because their reward signal points at “task complete,” not “task complete and the design doc exists.”

The timing is not accidental. The industry is currently oscillating between “vibe coding,” where natural language substitutes for engineering judgment, and a more skeptical view that treats AI as a collaborator rather than a replacement. One developer’s account of moving from copy-pasting into ChatGPT to structured AI-assisted coding notes that LLMs excel at syntax but still fall short on strategic planning, debugging, and sensible repository structure. Meanwhile, an Anthropic randomized controlled trial found that developers using AI assistance scored 17 percent lower on post-task mastery quizzes than those who coded by hand—the equivalent of nearly two letter grades—with the largest gaps appearing in debugging questions. When agents write code humans do not fully understand, the absence of process becomes a liability fast.

Process, Not Prose

Osmani’s response is not another set of linter rules or a lengthy style guide. It is a pack of 23 Markdown files—“skills”—that encode workflows rather than reference material. The distinction matters. A 2,000-word essay on testing best practices sits in the context window and gets ignored under pressure; a workflow that commands “write the failing test first, run it, watch it fail, write the minimum code to pass, watch it pass, refactor” gives the agent something to execute and the human something to verify. As Osmani puts it, process over prose, workflows over reference, steps with exit criteria over essays without them.
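To make the contrast concrete, here is a minimal Python sketch of the red and green steps in that quoted workflow. The `slugify` function, its stub, and the helper names are hypothetical and do not come from the repository; the point is only that the same test is observed failing before it is observed passing.

```python
import unittest


def slugify_stub(title: str) -> str:
    """Red phase: a placeholder that does not implement the behavior yet."""
    raise NotImplementedError


def slugify(title: str) -> str:
    """Green phase: the minimum code that makes the test pass."""
    return "-".join(title.lower().split())


def make_test(func):
    """Build the same test case against whichever implementation is supplied."""

    class SlugifyTest(unittest.TestCase):
        def test_spaces_become_hyphens(self):
            self.assertEqual(func("Hello Agent Skills"), "hello-agent-skills")

    return SlugifyTest


def run(func) -> bool:
    """Run the test against func and report whether it passed."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(make_test(func))
    result = unittest.TestResult()
    suite.run(result)
    return result.wasSuccessful()
```

Under these assumptions, `run(slugify_stub)` reports a failure and `run(slugify)` reports a pass, which is the kind of evidence a reviewer can check rather than take on trust.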

The format itself borrows from an open standard originally developed by Anthropic. Each skill is a directory containing a SKILL.md file with YAML frontmatter, loaded through progressive disclosure. At startup, the agent sees only the skill’s name and description—roughly a hundred tokens. When a task matches the description, the full instructions load. Additional scripts, references, or templates sit on the filesystem and enter context only when explicitly called. This keeps the token footprint small while allowing deep expertise to remain available.
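A rough sketch of that progressive disclosure, written in Python for illustration only. The sample SKILL.md text is invented, the function names are hypothetical, and the frontmatter parser handles only flat `key: value` lines; a real loader would use a proper YAML parser and read files from disk.

```python
from dataclasses import dataclass

# Hypothetical SKILL.md content. The name/description fields follow the
# format described above; the wording is invented for this example.
SKILL_MD = """---
name: test-driven-development
description: Use when adding or changing behavior that needs tests.
---
1. Write the failing test first.
2. Run it and watch it fail.
"""


@dataclass
class SkillIndexEntry:
    """The small summary an agent would keep in context at startup."""
    name: str
    description: str


def split_frontmatter(text: str) -> tuple[dict[str, str], str]:
    """Split a SKILL.md string into flat frontmatter fields and the body."""
    _, header, body = text.split("---\n", 2)
    meta = {}
    for line in header.strip().splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, body


def index_entry(text: str) -> SkillIndexEntry:
    """Startup step: keep only the name and description."""
    meta, _ = split_frontmatter(text)
    return SkillIndexEntry(meta["name"], meta["description"])


def load_body(text: str) -> str:
    """Later step: load the full instructions once a task matches."""
    return split_frontmatter(text)[1]
```

In this sketch, `index_entry(SKILL_MD)` is all that sits in context until a task matches the description, at which point `load_body(SKILL_MD)` brings in the numbered steps.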

The most distinctive design choice, however, is the anti-rationalization table. Every skill includes a pre-written list of excuses an agent—or a tired engineer—might use to skip a step, paired with a rebuttal. “This task is too simple to need a spec” meets the response that acceptance criteria still apply; five lines is fine, zero is not. “I’ll write tests later” is countered with the observation that later is the load-bearing word. The tables exist because LLMs are excellent at rationalization; they will generate plausible paragraphs explaining why this particular change does not need review. The skills treat that tendency as a bug and patch it explicitly.
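In the repository these tables are Markdown that the model reads, not code. Purely to show the shape of the data, the two pairs quoted above could be modeled as a lookup; the dictionary, the `rebut` helper, and the normalization rule are assumptions made for this sketch.

```python
from typing import Optional

# Excuse -> rebuttal pairs paraphrased from the two examples quoted above.
ANTI_RATIONALIZATION = {
    "this task is too simple to need a spec":
        "Acceptance criteria still apply; five lines is fine, zero is not.",
    "i'll write tests later":
        "'Later' is the load-bearing word.",
}


def rebut(excuse: str) -> Optional[str]:
    """Return the pre-written rebuttal for a known excuse, or None."""
    key = excuse.strip().lower().rstrip(".")
    return ANTI_RATIONALIZATION.get(key)
```

The design point is that each rebuttal is written in advance, so the response to a familiar excuse does not depend on the agent talking itself out of it in the moment.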

The Full Lifecycle in Seven Commands

The twenty-two lifecycle skills plus one meta-skill map onto a classic software development lifecycle: Define, Plan, Build, Verify, Review, Ship. Seven slash commands serve as entry points—/spec, /plan, /build, /test, /review, /code-simplify, and /ship—each activating the relevant workflows automatically. A complex feature might trigger eleven skills in sequence; a small bug fix might use three. The meta-skill routes incoming work to the appropriate subset, scaling the ceremony to the scope rather than imposing a one-size-fits-all waterfall.

The Define phase includes an interview-me skill that extracts what the user actually wants by asking one question at a time until the agent's confidence in its understanding reaches roughly 95 percent, and a spec-driven-development skill that mandates a product requirements document (PRD) before any code. The Build phase includes source-driven-development, which grounds every framework decision in official documentation, and frontend-ui-engineering, which enforces WCAG 2.1 AA accessibility. The Verify phase pairs test-driven-development with browser-testing-with-devtools, the latter using Chrome DevTools MCP for live runtime data rather than static assertions alone.
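The idea of commands fanning out to skills, with smaller tasks touching fewer of them, can be sketched as a simple routing table. The command and skill names below appear in this article, but which skills each command activates is an assumption made for illustration, and the `route` function is hypothetical.

```python
# Command and skill names are taken from the article; the mapping between
# them is assumed for this sketch and may not match the repository.
COMMAND_SKILLS = {
    "/spec": ["interview-me", "spec-driven-development"],
    "/build": ["source-driven-development", "frontend-ui-engineering"],
    "/test": ["test-driven-development", "browser-testing-with-devtools"],
}


def route(commands: list[str]) -> list[str]:
    """Return the ordered, de-duplicated skills for a sequence of commands."""
    selected: list[str] = []
    for command in commands:
        for skill in COMMAND_SKILLS.get(command, []):
            if skill not in selected:
                selected.append(skill)
    return selected
```

With this table, a small fix routed through `/test` alone picks up two skills, while a feature that passes through `/spec`, `/build`, and `/test` picks up six.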

The content is opinionated in a specific way: it encodes Google’s engineering culture. Hyrum’s Law appears in API design; the Beyoncé Rule and test pyramid shape the testing skill; Chesterton’s Fence guards code simplification; trunk-based development and atomic commits govern git workflow; Shift Left and feature flags appear in CI/CD. These are not abstract principles dropped into a README. They are embedded as step-by-step instructions with verification gates—tests passing, build output, runtime data—so that “seems right” is never sufficient.
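One way to picture a verification gate is as a named check that must return evidence before a step is allowed to finish. The `Gate` class and `run_gates` function below are hypothetical, and the lambdas stand in for real checks such as running a test suite or inspecting build output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Gate:
    """A named exit criterion backed by a check that returns True or False."""
    name: str
    check: Callable[[], bool]


def run_gates(gates: list[Gate]) -> list[str]:
    """Return the names of failed gates; an empty list means the step may exit."""
    return [gate.name for gate in gates if not gate.check()]


# Placeholder checks for illustration; real gates would inspect actual results.
example_gates = [
    Gate("tests pass", lambda: True),
    Gate("build output exists", lambda: False),
]
failed = run_gates(example_gates)
```

Here `failed` names the one gate that did not produce evidence, so the step cannot be declared done on the strength of "seems right".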

There is even a skill for doubt. “Doubt-driven development” runs an adversarial review of every non-trivial decision in flight, using a CLAIM → EXTRACT → DOUBT → RECONCILE → STOP loop, with optional cross-model escalation. It is the kind of paranoia that does not come naturally to an optimizer trained to minimize token count and maximize task completion.
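The stage order of that loop can be modeled as a tiny state machine. This sketch covers only the sequence named above for a single pass; what the agent actually does inside each stage, and when the loop repeats or escalates to another model, is not described here and is left out.

```python
from enum import Enum


class Stage(Enum):
    CLAIM = 1
    EXTRACT = 2
    DOUBT = 3
    RECONCILE = 4
    STOP = 5


def next_stage(stage: Stage) -> Stage:
    """Advance one step through the sequence; STOP is terminal."""
    if stage is Stage.STOP:
        return Stage.STOP
    return Stage(stage.value + 1)


def trace(start: Stage = Stage.CLAIM) -> list[str]:
    """List the stage names visited in one pass, starting from start."""
    stages = [start]
    while stages[-1] is not Stage.STOP:
        stages.append(next_stage(stages[-1]))
    return [stage.name for stage in stages]
```

The explicit STOP stage matters in this reading: an adversarial review with no terminal state would never hand control back to the task.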

Glue Code with a Pedigree

A fair criticism, hinted at in community discussion, is that this is fundamentally conditional context loading—sophisticated prompting dressed up as a framework. The repository contains no compiled binaries, no runtime, no enforcement layer. It is, in a sense, glue code: curated Markdown files that rely entirely on the underlying agent to read, comprehend, and obey. If the model decides to ignore the workflow, the skill cannot stop it.

That criticism is accurate but incomplete. The same could be said of any engineering playbook or runbook; their value lies in curation and structure, not in executable force. What separates this repository from a random “AI rules” collection is the rigor of the exit criteria and the specificity of the anti-rationalization tables. Still, the limitation is real. These skills are lane markings painted on the road, not concrete barriers. A model that misreads or ignores them can drive straight through.

Every IDE, For Now

One reason the repository has spread quickly is its portability. Because the underlying format is an open standard, the skills install into Claude Code via the plugin marketplace, drop into Cursor’s .cursor/rules/, load as native skills in the Gemini CLI, and adapt for GitHub Copilot, Windsurf, OpenCode, and Kiro. In a landscape where developers routinely juggle six to ten tools, and where organizations are desperate to consolidate rather than fragment their toolchain further, a filesystem-based, version-controlled skill pack travels well.

This cross-product reuse aligns with a broader trend noted by GitLab: agentic AI is shifting from isolated autocomplete to orchestration layers that coordinate across the lifecycle. Rather than adding another siloed tool, skills act as portable procedural knowledge that any compatible agent can load on demand.

The Deeper Worry

Agent Skills addresses the agent’s behavior, but it does not fully resolve the human side of the equation. The Anthropic study on skill atrophy suggests that developers who lean heavily on AI assistance become less engaged with their work and offload their thinking, particularly in debugging scenarios. Better agent workflows may produce more reliable diffs, yet they do not guarantee that the human reviewing those diffs understands what they are looking at.

The repository’s philosophy implicitly acknowledges this tension. By forcing agents to write specs, tests, and architecture decision records, it leaves behind artifacts that humans can read. In an era where determining the origin of a specific piece of code is becoming more difficult and compliance teams are asking harder questions about AI-generated changes, those artifacts double as audit trails. The skills are as much a communication protocol between agent and engineer as they are instructions for the agent.

Outlook

Where this project goes depends on whether agents learn to internalize these workflows or whether they will always need external scaffolding. If future models reliably generate their own specs and adversarial reviews, curated Markdown skills may look quaint. The repository also faces a structural tension: because the skills are plain Markdown, they compete with a growing ecosystem of agent instructions, Copilot personas, and IDE-specific rule files. Their longevity depends on whether the open Agent Skills standard continues to be adopted across tools, or whether each platform retreats into its own proprietary format.

For now, though, the skills represent a pragmatic admission that current agents behave like talented juniors who need senior oversight encoded into their environment. The open question is whether the industry will adopt that oversight as a default, or continue to reward the shortest path.