Three new research papers published on arXiv address critical challenges in large language model (LLM) agent performance and reliability.
GeoAgentBench for Spatial Analysis
According to a paper posted on arXiv, researchers introduced GeoAgentBench (GABench), a dynamic evaluation benchmark for tool-augmented Geographic Information Systems (GIS) agents. The benchmark provides “a realistic execution sandbox integrating 117 atomic GIS tools, encompassing 53 typical spatial analysis tasks across 6 core GIS domains,” the paper states. The researchers also developed a “Plan-and-React” agent architecture that “mimics expert cognitive workflows by decoupling global orchestration from step-wise reactive execution.” In experiments with seven representative LLMs, this paradigm “significantly outperforms traditional frameworks,” according to the paper.
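To make the idea of separating a global plan from step-wise reactive execution more concrete, here is a minimal Python sketch. It is not the paper's implementation: the two toy tools, the fixed plan, and the `repair` function are hypothetical stand-ins for the GIS tools and LLM calls a real agent would use.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Tool = Callable[[State], State]


def buffer_tool(state: State) -> State:
    # Hypothetical tool: marks the layer as buffered.
    return {**state, "buffered": True}


def overlay_tool(state: State) -> State:
    # Hypothetical tool: fails unless a buffer step ran first.
    if not state.get("buffered"):
        raise ValueError("overlay requires a buffered layer")
    return {**state, "overlaid": True}


TOOLS: Dict[str, Tool] = {"buffer": buffer_tool, "overlay": overlay_tool}


def repair(step: str, error: Exception) -> List[str]:
    # Stand-in for the reactive component (an LLM call in practice):
    # given a failed step and its error, propose steps to run first.
    if step == "overlay":
        return ["buffer"]
    return []


def run_plan(plan: List[str], max_repairs: int = 3) -> Tuple[State, List[str]]:
    """Execute a global plan, reacting to failures one step at a time."""
    state: State = {}
    trace: List[str] = []
    queue = list(plan)
    repairs = 0
    while queue:
        step = queue.pop(0)
        try:
            state = TOOLS[step](state)
            trace.append(step)
        except ValueError as err:
            repairs += 1
            if repairs > max_repairs:
                raise
            # Insert the repair steps, then retry the failed step.
            queue = repair(step, err) + [step] + queue
    return state, trace


# The global plan is deliberately incomplete: it omits the buffer step.
final_state, executed = run_plan(["overlay"])
```

In this sketch the plan is produced once, while recovery decisions are made locally at the failing step, which is one plausible reading of the decoupling the paper describes.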
Cognitive Companion for Agent Monitoring
A separate arXiv paper introduced the Cognitive Companion, a parallel monitoring architecture designed to address reasoning degradation in LLM agents. According to the research, agents on multi-step tasks experience “reasoning degradation, looping, drift, stuck states, at rates up to 30% on hard tasks.” The LLM-based Companion “reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead” in experiments centered on Gemma 4 E4B, while a zero-overhead Probe-based Companion showed “a mean effect size of +0.471.”
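As a rough illustration of what monitoring an agent for looping can look like, the Python sketch below flags an action that recurs too often within a sliding window. This is a simple heuristic chosen for illustration, not the paper's LLM-based or Probe-based Companion; the `LoopMonitor` class, its thresholds, and the action names are all assumptions.

```python
from collections import Counter, deque


class LoopMonitor:
    """Flags an action that repeats too often in a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        # Only the most recent `window` actions are kept.
        self.recent = deque(maxlen=window)
        self.max_repeats = max_repeats

    def observe(self, action: str) -> bool:
        # Record the action, then report whether it now looks like a loop.
        self.recent.append(action)
        return Counter(self.recent)[action] >= self.max_repeats


monitor = LoopMonitor()
actions = ["search", "read", "search", "search", "write"]
flags = [monitor.observe(a) for a in actions]
```

A monitor like this runs alongside the agent rather than inside its reasoning, so a flag could trigger an intervention such as a nudge or a re-plan without changing the agent itself.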
Dynamic Memory Framework
A third paper on arXiv presents ReMe (Remember Me, Refine Me), a framework for agent memory systems. The paper reports that “Qwen3-8B equipped with ReMe outperforms larger, memoryless Qwen3-14B,” suggesting memory systems provide “a computation-efficient pathway for lifelong learning.”