NVlabs/ToolOrchestra

An 8B model that coordinates GPT-5—and beats it on benchmarks

ToolOrchestra is an RL framework that trains small models to orchestrate calls to specialist tools and larger LLMs, chasing better results with smaller bills.

★ 748 stars · Python · Agents · LLMOps · Eval · Language Models

What it does

ToolOrchestra is a reinforcement-learning training pipeline that produces compact “orchestrator” models—specifically its 8B-parameter flagship—which solve multi-step tasks by delegating work. Instead of answering directly, the model reasons across multiple turns, deciding when to invoke web search, code interpreters, or even larger LLMs like GPT-5 and Claude Opus 4.1. The framework jointly optimizes for task success, efficiency, and human preference via end-to-end RL, and it ships with an automated synthetic-data pipeline to generate training environments and tool-call scenarios at scale.
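The multi-turn delegation described above can be pictured as a loop in which a policy either calls a tool or commits to an answer. The sketch below is illustrative only: `run_orchestrator`, `toy_policy`, `TOOLS`, and the tool names are hypothetical and are not taken from the ToolOrchestra codebase. The rule-based `toy_policy` stands in for the trained 8B model, and the tools are stubs that return placeholder strings.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# Hypothetical tool signature: takes a query string, returns a text result.
Tool = Callable[[str], str]


@dataclass
class Step:
    tool: str
    query: str
    result: str


@dataclass
class Trajectory:
    task: str
    steps: List[Step] = field(default_factory=list)
    answer: str = ""


# A policy maps the trajectory so far to (action, payload).
# The action is either a tool name or the literal string "answer".
Policy = Callable[[Trajectory], Tuple[str, str]]


def run_orchestrator(task: str, policy: Policy, tools: Dict[str, Tool],
                     max_turns: int = 8) -> Trajectory:
    """Run a multi-turn loop until the policy answers or turns run out."""
    traj = Trajectory(task=task)
    for _ in range(max_turns):
        action, payload = policy(traj)
        if action == "answer":
            traj.answer = payload
            break
        if action not in tools:
            # Record the failure so the policy can react on the next turn.
            traj.steps.append(Step(action, payload, f"error: unknown tool {action!r}"))
            continue
        traj.steps.append(Step(action, payload, tools[action](payload)))
    return traj


def toy_policy(traj: Trajectory) -> Tuple[str, str]:
    """Stand-in for the trained orchestrator: search, delegate, then answer."""
    if not traj.steps:
        return "search", traj.task
    if len(traj.steps) == 1:
        return "large_llm", f"Summarize: {traj.steps[-1].result}"
    return "answer", traj.steps[-1].result


# Stub tools; a real setup would call a search API or a hosted model here.
TOOLS: Dict[str, Tool] = {
    "search": lambda q: f"[search results for: {q}]",
    "large_llm": lambda q: f"[large-model reply to: {q}]",
}

if __name__ == "__main__":
    result = run_orchestrator("capital of France", toy_policy, TOOLS)
    for step in result.steps:
        print(step.tool, "->", step.result)
    print("answer:", result.answer)
```

In a trained system the policy would be the model itself, and the recorded trajectory is what a reinforcement-learning reward would be computed over.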

The interesting bit

The central bet is that routing is cheaper than reasoning: per the README, the 8B orchestrator beats GPT-5 on HLE (with a claimed 2.5× efficiency gain) and on τ²-Bench and FRAMES (at roughly 30% of the cost), and it ranks #1 on the GAIA leaderboard. That turns the usual scale race on its head by making the smallest model the project manager and the largest models the subcontractors.
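To make the cost argument concrete, here is a small back-of-the-envelope sketch. Every number in it is invented for illustration: the per-call prices (in arbitrary integer cost units) and the call counts are assumptions, not measurements from the README or the paper, and `episode_cost` is a hypothetical helper.

```python
from typing import Dict


def episode_cost(calls: Dict[str, int], price_per_call: Dict[str, int]) -> int:
    """Total cost of one episode: sum of (calls made) x (price per call)."""
    return sum(price_per_call[tool] * n for tool, n in calls.items())


# Hypothetical prices in arbitrary cost units; NOT real pricing.
PRICES = {"orchestrator_8b": 1, "search": 2, "large_llm": 50}

# Baseline: the large model handles all six turns itself.
direct = episode_cost({"large_llm": 6}, PRICES)

# Routed: the small orchestrator runs every turn, searches three times,
# and escalates to the large model only once.
routed = episode_cost({"orchestrator_8b": 6, "search": 3, "large_llm": 1}, PRICES)

if __name__ == "__main__":
    print(f"direct={direct} routed={routed} ratio={routed / direct:.2f}")
```

The point is structural rather than numerical: the routed episode stays cheap only as long as escalations to the expensive model are rare, which is why an efficiency term in the training reward matters.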

Key highlights

Orchestrator-8B scores 37.1% on HLE versus GPT-5’s 35.1%, with the README claiming 2.5× efficiency gains.
On τ²-Bench and FRAMES, the 8B model surpasses GPT-5 while using roughly 30% of the cost, according to the README’s figures.
Orchestrator-8B ranks #1 on the GAIA benchmark leaderboard. As of early December 2025, the project also reported that its ToolScale dataset had reached #1 on Hugging Face downloads and that its model had reached #3 among all models.
Supports a heterogeneous toolset: basic tools (search, code), specialist LLMs (math, coding), and generalist LLMs (GPT-5, Llama-Nemotron-Ultra-253B, Claude Opus 4.1).
End-to-end RL with composite rewards—outcome, efficiency, and preference—rather than simple supervised fine-tuning.
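One way to picture a composite reward of the kind listed above is a weighted sum of the three terms. This is a minimal sketch under stated assumptions: the README summary names only the terms (outcome, efficiency, preference), so the linear combination, the default weights, the budget-based efficiency score, and the names `RewardWeights` and `composite_reward` are all hypothetical rather than the project's actual formulation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    # Hypothetical weights; the source does not publish these values.
    outcome: float = 1.0
    efficiency: float = 0.3
    preference: float = 0.2


def composite_reward(solved: bool, cost: float, cost_budget: float,
                     preference_score: float,
                     w: RewardWeights = RewardWeights()) -> float:
    """Combine task outcome, cost efficiency, and user preference into one scalar."""
    if cost_budget <= 0:
        raise ValueError("cost_budget must be positive")
    outcome = 1.0 if solved else 0.0
    # Assumed efficiency term: 1.0 at zero cost, falling linearly to 0.0
    # at the budget and clamped there for anything more expensive.
    efficiency = max(0.0, 1.0 - cost / cost_budget)
    # Assumed preference term: a score already scaled to [0, 1], clamped for safety.
    preference = min(1.0, max(0.0, preference_score))
    return w.outcome * outcome + w.efficiency * efficiency + w.preference * preference


if __name__ == "__main__":
    cheap = composite_reward(True, cost=0.2, cost_budget=1.0, preference_score=0.5)
    pricey = composite_reward(True, cost=0.9, cost_budget=1.0, preference_score=0.5)
    print(cheap > pricey)  # two correct answers, the cheaper one earns more
```

With a reward of this shape, two trajectories that both solve the task are ranked by how cheaply they got there, which is the behavior supervised fine-tuning on correct answers alone would not encourage.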

Caveats

The README is heavy on benchmark claims but light on architectural detail; it is unclear how the orchestrator represents tool state or handles failure recovery beyond the high-level diagram.
Running the system appears to require multiple Conda environments, an Enroot container for τ²-Bench evaluation, and API keys for Tavily, Weights & Biases, and NVIDIA NGC—suggesting the current setup is tuned for internal cluster infrastructure rather than a generic workstation.
Extending the tool set or swapping LLM backends means editing scattered Python files and JSON configs by hand; no plugin API is visible in the sources.
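Given the setup requirements listed above, a quick preflight check can show whether a machine is even a candidate for running the system. The sketch below is not part of the repository. The environment variable names (`TAVILY_API_KEY`, `WANDB_API_KEY`, `NGC_API_KEY`) are assumptions based on common conventions for those services, so check the repository's setup instructions for the names it actually reads. `preflight` is a hypothetical helper.

```python
import os
import shutil
from typing import Callable, List, Mapping, Optional

# Assumed variable names; the repository may expect different ones.
REQUIRED_ENV = ("TAVILY_API_KEY", "WANDB_API_KEY", "NGC_API_KEY")
# Executables implied by the Conda and Enroot requirements.
REQUIRED_BINARIES = ("conda", "enroot")


def preflight(env: Optional[Mapping[str, str]] = None,
              which: Callable[[str], Optional[str]] = shutil.which) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for name in REQUIRED_ENV:
        if not env.get(name):
            problems.append(f"missing environment variable: {name}")
    for binary in REQUIRED_BINARIES:
        if which(binary) is None:
            problems.append(f"executable not found on PATH: {binary}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print(problem)
```

Passing `env` and `which` as parameters keeps the check testable without touching the real environment.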

Verdict

Worth a look if you are researching agentic routing, cost-efficient inference, or RL-based tool use at scale. Skip it if you need a drop-in, single-environment orchestrator that runs locally without API keys or cluster access.

Frequently asked

What is NVlabs/ToolOrchestra? ToolOrchestra is an RL framework that trains small models to orchestrate calls to specialist tools and larger LLMs, chasing better results with smaller bills.
Is ToolOrchestra open source? Yes, NVlabs/ToolOrchestra is open source, released under the Apache-2.0 license.
What language is ToolOrchestra written in? NVlabs/ToolOrchestra is primarily written in Python.
How popular is ToolOrchestra? NVlabs/ToolOrchestra has 748 stars on GitHub.
Where can I find ToolOrchestra? NVlabs/ToolOrchestra is on GitHub at https://github.com/NVlabs/ToolOrchestra.