Alibaba's 30B research agent runs on 3.3B active params
A sparse Mixture-of-Experts model trained end-to-end for multi-step web research, not just chat.

What it does
Tongyi DeepResearch is a 30.5B-parameter Mixture-of-Experts model that activates only 3.3B parameters per token. It is built specifically for long-horizon information-seeking: the model plans searches, reads web pages, parses uploaded files, and reasons across multiple steps to answer complex questions. It supports two inference modes: a standard ReAct loop for evaluation, and a heavier “IterResearch” mode that scales compute at test time for harder tasks.
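To make the ReAct mode concrete, here is a minimal sketch of a think/act/observe loop. It is not the project's inference code: the `Action: tool[input]` and `Final Answer:` text format, the `react_loop` function, and the scripted stand-in model are all assumptions chosen for illustration.

```python
import re
from typing import Callable, Dict, List, Optional


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 8,
) -> Optional[str]:
    """Alternate model replies and tool observations until a final answer appears."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        reply = llm(transcript)
        transcript += reply + "\n"

        # Hypothetical convention: the model ends with "Final Answer: ...".
        final = re.search(r"Final Answer:\s*(.+)", reply, re.DOTALL)
        if final:
            return final.group(1).strip()

        # Hypothetical convention: tool calls look like "Action: name[input]".
        action = re.search(r"Action:\s*(\w+)\[(.*?)\]", reply, re.DOTALL)
        if action is None:
            transcript += "Observation: no valid action found.\n"
            continue

        name, arg = action.group(1), action.group(2)
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool {name!r}"
        transcript += f"Observation: {observation}\n"
    return None  # step budget exhausted without a final answer


def make_scripted_llm(replies: List[str]) -> Callable[[str], str]:
    """Stand-in for a real model: returns canned replies in order."""
    it = iter(replies)
    return lambda prompt: next(it)


fake_llm = make_scripted_llm([
    "Thought: I should look this up.\nAction: search[capital of France]",
    "Thought: The observation answers it.\nFinal Answer: Paris",
])
fake_tools = {"search": lambda query: "Paris is the capital of France."}

answer = react_loop(fake_llm, fake_tools, "What is the capital of France?")
print(answer)
```

In a real deployment the stand-in model would be replaced by a call to the served model and the `search` entry by an actual search tool; the loop structure is the part that carries over.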
The interesting bit
Most open models are trained to chat; this one is trained to search. The team built a fully automated synthetic data pipeline for agentic pre-training, then ran large-scale continual pre-training and end-to-end reinforcement learning with a customized on-policy GRPO variant. According to the README, the result leads several agentic search benchmarks, which is a different target from the chat and language-modeling leaderboards most open models chase.
Key highlights
- Sparse MoE architecture: 30.5B total params, 3.3B active per token, 128K context window
- Trained with automated synthetic data generation, continual pre-training on agentic interactions, and token-level on-policy RL with leave-one-out advantage estimation
- Supports ReAct and IterResearch inference paradigms; the latter uses test-time scaling for maximum performance
- Available via HuggingFace, ModelScope, OpenRouter API, and Alibaba’s Bailian cloud service
- Evaluation scripts and inference code provided; requires Python 3.10 and multiple API keys (Serper, Jina, OpenAI-compatible, Dashscope)
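The leave-one-out advantage estimation mentioned in the highlights can be sketched as follows. This is a generic illustration of the idea, not the team's training code: the function names, and the choice to copy one advantage onto every token of a rollout, are assumptions. Each rollout for a question is scored against the mean reward of the other rollouts for that same question.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of rollout i = its reward minus the mean reward of the other rollouts."""
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two rollouts per question")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


def broadcast_to_tokens(
    advantages: List[float], token_counts: List[int]
) -> List[List[float]]:
    """Assumed token-level form: every token in a rollout shares that rollout's advantage."""
    return [[a] * n for a, n in zip(advantages, token_counts)]


# Four rollouts for one question, with a binary correctness reward.
rewards = [1.0, 0.0, 0.0, 1.0]
adv = leave_one_out_advantages(rewards)
per_token = broadcast_to_tokens(adv, token_counts=[3, 2, 4, 1])
print(adv)
```

Because each baseline excludes the rollout being scored, a rollout's own reward does not bias its baseline, and the advantages within a group sum to zero.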
Caveats
- Online demos are explicitly marked “for quick exploration only” and may fail intermittently due to model latency and tool QPS limits; local deployment or Bailian is recommended for stability
- Setup is involved: you need API keys for search, page reading, summarization, file parsing, and optionally a Python sandbox
- The README claims “state-of-the-art performance” on several benchmarks but does not provide absolute scores or comparison tables in the excerpt shown
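Given the setup caveat above, a small preflight check can save a failed run. The sketch below reports which services have no credential set. The environment variable names are hypothetical placeholders, not the names the project expects; substitute the ones from its configuration.

```python
import os
from typing import Dict, List, Mapping

# Hypothetical variable names, one per service listed in the caveats.
REQUIRED: Dict[str, str] = {
    "search": "SEARCH_API_KEY",
    "page reading": "READER_API_KEY",
    "summarization": "SUMMARY_API_KEY",
    "file parsing": "FILE_PARSER_API_KEY",
}
# The Python sandbox is optional, so it is reported separately.
OPTIONAL: Dict[str, str] = {"python sandbox": "SANDBOX_ENDPOINT"}


def missing_services(
    env: Mapping[str, str], wanted: Mapping[str, str] = REQUIRED
) -> List[str]:
    """Return the services whose variable is unset or empty in env."""
    return [service for service, var in wanted.items() if not env.get(var)]


missing = missing_services(os.environ)
skipped = missing_services(os.environ, OPTIONAL)
print("missing required:", ", ".join(missing) or "none")
print("optional not configured:", ", ".join(skipped) or "none")
```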
Verdict
Worth a look if you are building autonomous research agents or studying RL training for tool use. Skip it if you want a drop-in chat replacement or lack the API budget and patience to wire up several external services.