New Research Advances Multi-Turn Tool-Calling Agents and Reveals Gaps in AI Behavioral Consistency

Researchers have published the first application of Multi-Turn Group Relative Policy Optimization (MT-GRPO) for training tool-calling agents on realistic customer service tasks, according to arxiv.org. The study introduces “Iterative Reward Calibration,” a methodology for designing per-turn rewards, after discovering that naively designed dense rewards degraded performance by up to 14 percentage points.
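
For readers unfamiliar with the "group relative" part of the name, the sketch below shows the standard GRPO-style normalization, in which each sampled rollout's reward is scored against the mean and standard deviation of its own group. It is an illustration only: the function name, the epsilon term, the use of the population standard deviation, and the sample rewards are all assumptions, and the study's per-turn reward design is not reproduced here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Hypothetical helper: normalize rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation (eps avoids division by zero when all
    rewards in the group are identical).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Made-up rewards for four rollouts of the same task (not data from the study).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Rollouts that beat their group's average receive a positive advantage and the rest a negative one, so no separate learned value model is needed to produce a baseline.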

Applied to the Tau-Bench airline benchmark, the approach improved Qwen3.5-4B from 63.8% to 66.7% and Qwen3-30B-A3B from 58.0% to 69.5%, according to arxiv.org. Notably, the trained 4B model exceeded GPT-4.1 (49.4%) and GPT-4o (42.8%) despite being 50 times smaller, while the 30.5B MoE model approached Claude Sonnet 4.5 (70.0%). The researchers state these are “the first published RL training results on Tau-Bench.”

Separately, arxiv.org reports the introduction of CostBench, a benchmark revealing that leading models struggle with cost-aware planning. Even GPT-5 achieved an exact-match rate below 75% on the hardest tasks, and performance dropped by around 40% under dynamic conditions.

Another study on arxiv.org examined behavioral consistency across Claude Sonnet 4.5, GPT-5, and Llama-3.1-70B on SWE-bench. The research found that higher consistency aligned with higher accuracy across models, with Claude achieving the lowest variance (coefficient of variation, CV: 15.2%) and the highest accuracy (58%). However, 71% of Claude’s failures stemmed from “consistent wrong interpretation,” suggesting that “consistency amplifies outcomes rather than guaranteeing correctness,” according to arxiv.org.
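
The CV figure is a coefficient of variation: the standard deviation of a set of measurements divided by their mean, expressed as a percentage. A minimal sketch of the calculation follows. The function name and sample values are illustrative assumptions rather than data from the study, and the sketch uses the sample standard deviation, which may differ from the study's choice.

```python
from statistics import mean, stdev


def coefficient_of_variation(values):
    """Return the sample standard deviation over the mean, as a percentage."""
    return 100.0 * stdev(values) / mean(values)


# Made-up per-run measurements for one task, e.g. steps taken on repeated attempts.
runs = [10, 12, 8, 10]
cv = coefficient_of_variation(runs)
```

A lower CV means repeated runs on the same task behave more alike, which is the sense in which the study describes a model as consistent.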