HKUDS/ClawWork

AI agents that pay their own rent — and turn a profit

ClawWork forces LLMs to earn their keep on real professional tasks, deducting every token cost from a $10 starting balance.

8.2k stars · Python · Agents · LLMOps · Eval
Star history: +73 ★ / day over the last 7 days · trend: steady

What it does

ClawWork is an economic survival benchmark for AI agents. Each agent gets $10, must pay for its own API tokens, and earns money only by completing real professional tasks from the GDPVal dataset (220 tasks across 44 occupations). A React dashboard tracks balance, income, cost, and survival metrics in real time. It also wraps the Nanobot framework so a live assistant becomes “economically aware,” charging per conversation and earning via task work.
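The accounting described above can be pictured with a minimal sketch. This is not ClawWork's code: the `Ledger` class, its method names, and the dollar amounts in the usage lines are all hypothetical, chosen only to illustrate a $10 stake that shrinks with token spend and grows with task income.

```python
from dataclasses import dataclass


@dataclass
class Ledger:
    """Hypothetical balance tracker; not ClawWork's actual class."""

    balance: float = 10.0      # starting stake from the description above
    total_cost: float = 0.0    # cumulative token spend
    total_income: float = 0.0  # cumulative task earnings

    def charge(self, token_cost: float) -> None:
        """Deduct the cost of an API call from the balance."""
        self.balance -= token_cost
        self.total_cost += token_cost

    def pay(self, amount: float) -> None:
        """Credit earnings for a completed task."""
        self.balance += amount
        self.total_income += amount

    @property
    def solvent(self) -> bool:
        """Assumed survival rule: the agent is alive while balance > 0."""
        return self.balance > 0


ledger = Ledger()
ledger.charge(0.25)  # illustrative token cost for one task attempt
ledger.pay(4.00)     # illustrative payment for the finished task
print(f"balance={ledger.balance:.2f} solvent={ledger.solvent}")
```

The actual survival threshold and payment rules live in the repository; the `balance > 0` rule here is only an assumption.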

The interesting bit

The benchmark measures what actually matters for production deployment: whether the model can turn a profit. Top performers like ATIC + Qwen3.5-Plus have pushed balances past $19K, while careless agents can burn their stake on a single bad search. The daily “work or learn” decision models a genuine career trade-off, something a static test score cannot capture.

Key highlights

  • 220 GDPVal tasks spanning Technology, Finance, Healthcare, and Legal sectors
  • Token costs read directly from API responses (including reasoning tokens); OpenRouter costs used verbatim when available
  • Quality evaluation via GPT-5.2 with category-specific rubrics per sector
  • Two modes: standalone simulation (./start_dashboard.sh + ./run_test_agent.sh) or drop-in Nanobot integration via ClawMode
  • Live leaderboard at hkuds.github.io/ClawWork/ with per-agent pay rates and survival tiers
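The token-cost bullet above can be sketched as follows. This is an illustration under stated assumptions, not ClawWork's implementation: `call_cost` and its price parameters are hypothetical, the `usage` dictionary follows the OpenAI-style field names `prompt_tokens` and `completion_tokens`, and the provider-reported `cost` field is assumed to be present only when the provider (for example OpenRouter) supplies one.

```python
def call_cost(usage: dict, usd_per_m_input: float, usd_per_m_output: float) -> float:
    """Return the USD cost of one API call from its usage record (hypothetical helper)."""
    # Prefer a provider-reported cost when present (assumed field name: "cost").
    if usage.get("cost") is not None:
        return float(usage["cost"])
    # Otherwise price the token counts. In OpenAI-style usage records,
    # completion_tokens already includes reasoning tokens, so they are
    # not added a second time here.
    prompt = usage.get("prompt_tokens", 0)
    completion = usage.get("completion_tokens", 0)
    return (prompt * usd_per_m_input + completion * usd_per_m_output) / 1_000_000


# Illustrative numbers only; real prices depend on the model and provider.
priced = call_cost({"prompt_tokens": 1200, "completion_tokens": 800}, 2.0, 8.0)
reported = call_cost({"prompt_tokens": 1200, "completion_tokens": 800, "cost": 0.0123}, 2.0, 8.0)
print(priced, reported)
```

Using the reported cost verbatim avoids drift between a local price table and what the provider actually bills.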

Caveats

  • Requires OPENAI_API_KEY even for non-OpenAI agents, since GPT-4o handles evaluation
  • E2B sandbox is the default code execution backend; local BoxLite alternative is marked experimental
  • Dashboard data on the public site is only periodically synced; local clone needed for real-time updates
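Given the first caveat, a pre-flight check before launching a run can save a wasted start. The helper below is a hypothetical sketch, not part of ClawWork; only the `OPENAI_API_KEY` variable name comes from the text above.

```python
import os


def missing_keys(required, environ=os.environ):
    """Return the names in `required` that are unset or empty in `environ`."""
    return [name for name in required if not environ.get(name)]


missing = missing_keys(["OPENAI_API_KEY"])
if missing:
    print("Missing environment variables:", ", ".join(missing))
```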

Verdict

Worth a look if you’re choosing between LLMs for production agents and want evidence beyond benchmark leaderboards. Skip it if you need a polished end-user product: this is a research evaluation framework with a thin UI layer.
