How Sakana trained a 7B model to orchestrate GPT-5, Claude Sonnet 4 and Gemini 2.5 Pro


Every LangChain pipeline your team hardcodes starts breaking the moment the query distribution shifts — and it always shifts. That bottleneck is what Sakana AI set out to eliminate.

Researchers at Sakana AI have introduced the “RL Conductor,” a small language model trained via reinforcement learning to automatically orchestrate a diverse pool of worker LLMs. The Conductor dynamically analyzes each input, divides the labor among workers, and coordinates their communication.

This automated coordination achieves state-of-the-art results on difficult reasoning and coding benchmarks, outperforming individual frontier models like GPT-5 and Claude Sonnet 4 as well as expensive human-designed multi-agent pipelines. It achieves this performance at a fraction of the cost and with fewer API calls than competitors. RL Conductor is the backbone of Fugu, Sakana AI’s commercial multi-agent orchestration service.

The limitations of manual agentic frameworks

Large language models have strong latent capabilities, but tapping them to their fullest remains a major challenge. In practice, extracting peak performance relies heavily on manually designed agentic workflows, which serve as critical components in commercial AI products.

However, these frameworks fall short because they are inherently rigid and constrained. In comments to VentureBeat, Yujin Tang, co-author of the paper, explained the exact breaking point of current systems: “While using frameworks with hard-coded pipelines like LangChain and Mixture-of-Agents can work well for specific use cases … In production, an inherent bottleneck arises when targeting domains with large user bases with very heterogeneous demands.” 

Tang noted that achieving “real-world generalization in such heterogeneous applications inherently necessitates going beyond human-hardcoded designs.”

Another bottleneck for building robust agentic systems is that no single model is optimal for all tasks. Different models are fine-tuned to specialize in distinct domains. One model might excel at scientific reasoning, while another is superior at code generation, mathematical logic, or high-level planning. 

Because models have these varying characteristics and complementary skills, manually predicting and hard-coding the ideal combination of models for every query is practically impossible. An optimal agentic framework should be able to analyze a problem and delegate subtasks to the most suitable expert in the pool.

Conducting an orchestra of agents

The RL Conductor is designed to overcome the limitations of rigid, human-designed frameworks. As the name implies, it conducts an orchestra of agents by dividing challenging problems, delegating targeted subtasks, and designing communication topologies for a set of worker LLMs. 

Instead of relying on fixed code or static routing, the Conductor orchestrates these models by generating a customized workflow. For each step in the workflow, the model generates a natural language instruction for a specific aspect of the task, assigns an agent to carry it out, and defines an “access list” that dictates which past subtasks and responses from other agents are included in that agent’s context.

By defining everything in natural language, the Conductor builds flexible workflows tailored to each input. It can construct simple sequential chains, parallel tree structures, or even recursive loops depending on the problem’s demands. 
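The step structure described above can be sketched in code. This is an illustrative schema, not Sakana's published format: the field names, model identifiers, and helper function are assumptions based on the article's description of instructions, agent assignments, and access lists.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the paper describes each step in natural language,
# so these field names are illustrative, not Sakana's actual format.
@dataclass
class WorkflowStep:
    instruction: str   # natural-language subtask for the worker
    agent: str         # which worker LLM carries it out
    access_list: list[int] = field(default_factory=list)  # prior steps visible in context

# A simple two-step sequential chain: plan with one model, implement with another.
workflow = [
    WorkflowStep("Outline an approach to the coding problem.", "gemini-2.5-pro"),
    WorkflowStep("Write the final code based on the plan.", "gpt-5", access_list=[0]),
]

def context_for(step: WorkflowStep, transcripts: list[str]) -> str:
    # Only the subtasks named in the access list enter this agent's context,
    # which is what lets the Conductor shape chains, trees, or loops.
    return "\n".join(transcripts[i] for i in step.access_list)
```

Because the access list can point at any earlier step, the same primitive expresses sequential chains, parallel branches that never see each other, or loops that revisit earlier outputs.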

RL Conductor (source: Sakana AI)

Importantly, the model learns these strategies not by human design but through reinforcement learning (RL) and reward maximization. During training, the model is given a task, a pool of workers, and a reward signal based on whether its answer and output format are correct.

Through a simple trial-and-error RL algorithm, the model organically discovers which combinations of instructions and communication structures yield the highest reward. As a result, it automatically adopts advanced orchestration strategies such as targeted prompt engineering, iterative refinement, and meta-prompt optimization. 
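The reward signal described above can be sketched as a simple scoring function. The article only says the signal depends on answer correctness and output format; the `<answer>` tag convention and the partial-credit weights here are assumptions for illustration.

```python
import re

# Illustrative reward shaping: the article says the RL signal combines
# answer correctness and output-format compliance. The tag convention
# and the 1.0 / 0.1 / 0.0 weights are assumptions, not Sakana's values.
def reward(response: str, gold_answer: str) -> float:
    m = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
    format_ok = m is not None                       # followed the required format?
    correct = format_ok and m.group(1).strip() == gold_answer
    if correct:
        return 1.0      # right answer in the right format
    if format_ok:
        return 0.1      # partial credit for well-formed output
    return 0.0          # malformed output earns nothing
```

During training, the Conductor's orchestration choices (instructions, agent assignments, access lists) are sampled, executed against the worker pool, and scored this way; trajectories with higher reward are reinforced, which is all the "trial and error" the method needs.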

The model learns to dynamically adjust its strategies and leverage the distinct strengths of its worker agents without any human developer having to hard-code the process.

Conductor in action

To test RL Conductor in action, the researchers fine-tuned the 7-billion-parameter Qwen2.5-7B using the framework. During training, the Conductor was tasked with designing agentic workflows of up to five steps. It was given access to a worker pool containing seven different models: three closed-source giants (Gemini 2.5 Pro, Claude Sonnet 4, and GPT-5) and four open-source models (including DeepSeek-R1-Distill-Qwen-32B, Gemma3-27B, and Qwen3-32B).

The team evaluated the Conductor across a variety of highly challenging benchmarks, comparing it against individual frontier models acting alone, self-reflection agents prompted iteratively to improve their own answers, and state-of-the-art multi-agent routing frameworks like MASRouter, Mixture-of-Agents (MoA), RouterDC, and Smoothie. The small 7B Conductor set new state-of-the-art marks across the board. It achieved an average score of 77.27% across all tasks, hitting 93.3% on the AIME25 math benchmark, 87.5% on GPQA-Diamond, and 83.93% on LiveCodeBench, according to the researchers.

Remarkably, it achieved these marks while remaining highly efficient. While baselines like MoA burned through 11,203 tokens per question, the Conductor used an average of just 1,820 tokens and took an average of only three steps per workflow.


RL Conductor outperforms other baselines on key industry benchmarks (source: arXiv)

A closer look at the experimental details shows exactly why the framework is so effective. The Conductor automatically learned to measure task difficulty. For simple factual recall questions, it often solved the problem in a single step or used a basic two-agent setup. However, for complex coding problems, it built extensive workflows involving up to four agents with dedicated planning, implementation, and verification phases.

The Conductor also learned that frontier models have different strengths. To achieve record scores on coding benchmarks, the Conductor frequently assigned Gemini 2.5 Pro and Claude Sonnet 4 to act as high-level planners, and only brought in GPT-5 at the very end to write the final optimized code. In a particularly clever display of adaptability, the Conductor would sometimes completely abdicate its own role, handing the entire planning process over to Gemini 2.5 Pro and allowing it to dictate the subtasks for the rest of the pool.

Beyond math and coding benchmarks, Sakana AI is already putting the underlying architecture to work in front-office utility. “We have been using our Fugu models based on the Conductor technology internally for various practical enterprise applications: software development, deep research, strategy development, and even visual tasks like slide generation,” Tang said.

Bringing orchestration to the enterprise: Sakana Fugu

While the 7B model described in the research paper was an exploratory blueprint and is not publicly available, Sakana AI has productized the Conductor framework into its flagship commercial AI product, Sakana Fugu. Now in its beta phase, Fugu serves as a multi-agent orchestration system accessible through a standard OpenAI-compatible API.

Tang noted Fugu targets “the large market of industries where AI adoption has yet to bring large productivity gains due to the generalization limitations of current hard-coded pipelines, such as finance and defense.”

For enterprise developers, this allows seamless integration into existing applications without the headache of managing multiple API keys or manually routing tasks across different vendors. Behind the API interface, Fugu automates complex collaboration topologies and role assignments across a pool of models. To support varying business needs, Sakana released two variants: Fugu Mini, built for low-latency operations, and Fugu Ultra, designed for maximum performance on demanding workloads.
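Because Fugu exposes a standard OpenAI-compatible API, integration amounts to pointing an existing chat-completions client at a new base URL. The endpoint URL and model identifiers below are placeholders: Sakana has not published Fugu's public URL or model names, so everything here is a hedged sketch built on the standard chat-completions request shape.

```python
import json
from urllib import request

# Placeholder endpoint: Sakana has not published Fugu's actual base URL.
FUGU_BASE_URL = "https://fugu.example.com/v1"

def chat_request(prompt: str, model: str = "fugu-mini") -> request.Request:
    # Standard OpenAI-style chat-completions payload. Because the API is
    # OpenAI-compatible, existing client code needs only a new base URL
    # and key -- no per-vendor routing logic on the caller's side.
    body = json.dumps({
        "model": model,  # hypothetical id; "fugu-ultra" for heavy workloads
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return request.Request(
        f"{FUGU_BASE_URL}/chat/completions",
        data=body,
        headers={
            "Authorization": "Bearer YOUR_API_KEY",  # placeholder credential
            "Content-Type": "application/json",
        },
    )

req = chat_request("Draft a risk summary for the Q3 report.")
# urllib.request.urlopen(req) would send the call once real credentials exist.
```

The orchestration itself (workflow design, agent selection, access lists) happens server-side behind this single endpoint, which is the point: the client sees one model name, not seven.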

Addressing governance concerns around autonomous agents spinning up invisible workflows, Tang pointed out that the interpretability risks are functionally similar to the hidden reasoning traces of current top-tier closed APIs, and the system is managed with established guardrails to minimize hallucinations. 

For enterprise architects weighing when to deploy RL-orchestration versus traditional routing, the decision often comes down to engineering resources. “We believe the absolute sweet spot comes whenever users and their teams feel they are spending a disproportionate amount of time guiding their underlying agents,” Tang said. However, he cautioned that the framework isn’t necessary for everything, noting that “it’s hard to beat the economic proposition of a local model running directly on the user’s machine for simple queries.”

As the diversity of specialized open- and closed-source AI models continues to grow, static hardcoded pipelines will inevitably become obsolete. Looking ahead, this dynamic orchestration will likely extend beyond text and code environments. “There is indeed a large potential to fill this gap with cross-modal Conductor frameworks becoming the foundation for more autonomous, self-coordinating physical AI systems,” Tang said.

