Three New Benchmarks Test AI Models on Spatial Reasoning, Embedded Systems, and Surgical Tool Detection

Researchers release benchmarks evaluating AI models on compositional spatial reasoning, embedded systems development, and surgical tool detection.

New Evaluation Frameworks Target Specialized AI Capabilities

Three separate research papers published on arXiv introduce benchmarks for evaluating AI models in distinct technical domains.

TangramPuzzle addresses what researchers describe as a largely unexplored area: the ability of Multimodal Large Language Models (MLLMs) to perform “precise compositional spatial reasoning,” according to arXiv:2601.16520v1. While MLLMs have demonstrated “remarkable progress in visual recognition and semantic understanding,” the paper notes that existing benchmarks have not adequately tested these spatial reasoning capabilities.
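
To make the notion of precise compositional spatial reasoning concrete, the sketch below checks whether a set of proposed tangram-style piece placements tiles a target silhouette exactly, with no gaps and no overlaps. It is a hypothetical illustration in Python using the shapely geometry library; the function name, input format, and tolerance are assumptions, not the TangramPuzzle benchmark's actual task format or scoring code.

    # Hypothetical illustration of checking a compositional spatial-reasoning
    # answer: do the placed pieces exactly tile the target silhouette?
    # Not the TangramPuzzle benchmark's actual evaluation code.
    from shapely.geometry import Polygon
    from shapely.ops import unary_union

    def placements_tile_target(piece_vertices, target_vertices, tol=1e-6):
        """Return True if the placed pieces cover the target exactly,
        with no gaps and no overlaps (up to a small area tolerance)."""
        pieces = [Polygon(v) for v in piece_vertices]
        target = Polygon(target_vertices)

        # No two pieces may overlap by more than the tolerance.
        for i in range(len(pieces)):
            for j in range(i + 1, len(pieces)):
                if pieces[i].intersection(pieces[j]).area > tol:
                    return False

        # The union of all pieces must match the target silhouette.
        union = unary_union(pieces)
        return union.symmetric_difference(target).area <= tol

    # Example: two unit right triangles composing a unit square.
    square = [(0, 0), (1, 0), (1, 1), (0, 1)]
    tri_a = [(0, 0), (1, 0), (0, 1)]
    tri_b = [(1, 0), (1, 1), (0, 1)]
    print(placements_tile_target([tri_a, tri_b], square))  # True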

EmbedAgent introduces a paradigm “designed to simulate real-world roles in embedded [system development],” according to arXiv:2506.11003v3. The researchers note that while Large Language Models have “shown promise in various tasks,” few benchmarks currently “assess their capabilities in embedded system development.”
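
The sketch below illustrates what a role-based task of this kind might look like when posed to a language model: a structured task description is rendered into a role-conditioned prompt for, say, a firmware engineer. The EmbeddedTask class, prompt wording, and call_llm stub are hypothetical stand-ins, not EmbedAgent's actual interface.

    # Hypothetical sketch of a role-based embedded-development task for an
    # LLM agent. The names and prompt format are illustrative only.
    from dataclasses import dataclass

    @dataclass
    class EmbeddedTask:
        role: str          # e.g. "firmware engineer", "hardware verifier"
        board: str         # target microcontroller / dev board
        requirement: str   # natural-language spec to implement

    def build_prompt(task: EmbeddedTask) -> str:
        return (
            f"You are acting as a {task.role} for the {task.board} board.\n"
            f"Requirement: {task.requirement}\n"
            "Produce complete, compilable firmware source code."
        )

    def call_llm(prompt: str) -> str:
        # Stand-in for a real model call; replace with your provider's client.
        raise NotImplementedError

    task = EmbeddedTask(
        role="firmware engineer",
        board="generic ARM Cortex-M dev board",
        requirement="Toggle an LED on GPIO pin 13 once per second using a hardware timer.",
    )
    print(build_prompt(task))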

A third benchmark evaluates Large Vision-Language Models for surgical tool detection, addressing what researchers identify as a limitation in current AI systems for surgery. According to arXiv:2601.16895v1, AI is emerging “as a transformative force in supporting surgical guidance and decision-making,” yet “the unimodal nature of most current AI systems limits their ability” to fully support that guidance.
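
Tool detection outputs are commonly scored by comparing predicted bounding boxes against annotated ones using intersection-over-union (IoU). The snippet below shows that standard calculation as a generic illustration; it is not the evaluation protocol described in the paper, and the box coordinates are made up.

    # Generic illustration of IoU-based detection scoring; not the paper's
    # evaluation protocol. Boxes are (x1, y1, x2, y2) pixel coordinates.
    def iou(box_a, box_b):
        """Return intersection-over-union of two boxes, in [0, 1]."""
        x1 = max(box_a[0], box_b[0])
        y1 = max(box_a[1], box_b[1])
        x2 = min(box_a[2], box_b[2])
        y2 = min(box_a[3], box_b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
        area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
        union = area_a + area_b - inter
        return inter / union if union > 0 else 0.0

    # A predicted tool box vs. an annotated ground-truth box (made-up values).
    pred = (120, 80, 220, 180)
    gt = (130, 90, 230, 190)
    print(f"IoU = {iou(pred, gt):.2f}")  # often counted as a hit if IoU >= 0.5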

All three papers appear as cross-listed submissions in arXiv’s artificial intelligence section.