PinpointQA Dataset Introduced to Test AI Models on Small Object Localization in Indoor Videos

Researchers release PinpointQA, a benchmark with 10,094 QA pairs testing multimodal AI models on finding and describing small objects in indoor videos.

According to a paper posted on arxiv.org, researchers have introduced PinpointQA, described as “the first dataset and benchmark for small object-centric spatial understanding in indoor videos.” The dataset was published on April 13, 2026.

PinpointQA targets what the researchers identify as “a significant challenge for multimodal large language models (MLLMs)”: localizing a target object in video precisely enough for practical applications such as object search and assistive technologies. According to the abstract, “no existing benchmark directly evaluates whether a model can localize a target object in video and express its position with sufficient precision for downstream use.”

The dataset comprises 1,024 scenes and 10,094 question-answer pairs built from ScanNet++ and ScanNet200, according to arxiv.org. It organizes evaluations into four progressively challenging tasks: Target Presence Verification (TPV), Nearest Reference Identification (NRI), Fine-Grained Spatial Description (FSD), and Structured Spatial Prediction (SSP).
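
To make the four-task structure concrete, here is a minimal sketch of how results could be tallied per task. It is an illustration only: the record layout (a "task" label plus a boolean "correct" flag) and the function name per_task_accuracy are hypothetical, not PinpointQA's actual format or scoring code, and reducing the free-form FSD and SSP tasks to a yes/no outcome is a simplifying assumption.

```python
from collections import defaultdict

# Task abbreviations in the order the article lists them.
TASKS = ("TPV", "NRI", "FSD", "SSP")


def per_task_accuracy(records):
    """Return {task: fraction correct} over hypothetical result records.

    Each record is assumed to be a dict with keys "task" (one of TASKS)
    and "correct" (bool). This schema is illustrative, not PinpointQA's.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for rec in records:
        totals[rec["task"]] += 1
        hits[rec["task"]] += int(rec["correct"])
    # Report only tasks that actually appear, in the listed order.
    return {t: hits[t] / totals[t] for t in TASKS if totals[t]}


# Toy, made-up outcomes purely to exercise the function.
toy = [
    {"task": "TPV", "correct": True},
    {"task": "TPV", "correct": True},
    {"task": "NRI", "correct": True},
    {"task": "NRI", "correct": False},
    {"task": "SSP", "correct": False},
]
print(per_task_accuracy(toy))
```

Reporting one score per task, rather than a single pooled number, is what would let a gap between the earlier and later tasks show up at all.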

According to the researchers, experiments on representative MLLMs revealed “a consistent capability gap along the progressive chain, with SSP remaining particularly difficult.” However, supervised fine-tuning on PinpointQA yielded “substantial gains, especially on the harder tasks,” which the researchers present as evidence that the dataset serves as both a diagnostic benchmark and an effective training resource. The dataset and project page are available at rainchowz.github.io/PinpointQA.