New Benchmark Reveals Frontier AI Models Struggle with Multi-Level Instruction Conflicts

According to a new paper on arxiv.org, current frontier AI models perform poorly when handling instruction conflicts across multiple privilege levels in agent settings, achieving approximately 40% accuracy in complex scenarios.

The research introduces Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among arbitrarily many privilege levels. According to the paper, large language model agents receive instructions from multiple sources, including system messages, user prompts, and tool outputs, and each source carries a different level of trust and authority. The dominant instruction hierarchy (IH) paradigm assumes fewer than five privilege levels defined by rigid role labels, which the researchers argue is “inadequate for real-world agentic settings.”
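To make the idea of privilege-based conflict resolution concrete, here is a minimal Python sketch. It is an illustration only, not the method or data format used in the paper: the `Instruction` class, the `resolve` function, the integer privilege scale (higher number wins), and the example sources and values are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    source: str     # hypothetical label, e.g. "system", "user", "tool_output"
    privilege: int  # assumption of this sketch: a higher number means more authority
    key: str        # the aspect of behavior being constrained, e.g. "language"
    value: str      # the requested setting for that aspect


def resolve(instructions):
    """For each constrained key, keep the value from the highest-privilege instruction."""
    winners = {}
    for ins in instructions:
        current = winners.get(ins.key)
        if current is None or ins.privilege > current.privilege:
            winners[ins.key] = ins
    return {key: ins.value for key, ins in winners.items()}


# Two conflicts: "language" (tool output vs. user) and "format" (system vs. user).
instructions = [
    Instruction("tool_output", 1, "language", "French"),
    Instruction("user", 5, "language", "English"),
    Instruction("system", 9, "format", "JSON"),
    Instruction("user", 5, "format", "Markdown"),
]

resolved = resolve(instructions)
print(resolved)  # {'language': 'English', 'format': 'JSON'}
```

Because privilege is a plain integer here rather than a fixed role label, the same function works for any number of levels. In this toy version the conflicts are explicit and machine-readable, whereas a model must recognize and resolve them in natural-language instructions.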

To evaluate this challenge, the researchers created ManyIH-Bench, described as “the first benchmark for ManyIH.” According to the paper, the benchmark requires models to navigate up to 12 levels of conflicting instructions across 853 agentic tasks: 427 coding tasks and 426 instruction-following tasks. Its test cases span 46 real-world agents and are built from constraints that were developed by LLMs and verified by humans.

The experiments showed that even current frontier models struggle as the number of conflicting instructions and privilege levels grows. The work, published April 13, 2026, emphasizes “the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings,” according to the abstract.