SPARROW Model Advances Pixel-Level Video Understanding with Spatial and Temporal Tracking

New research introduces SPARROW, a model that improves spatial precision and temporal consistency in video analysis through target-specific tracked features.

According to arxiv.org, researchers have introduced SPARROW, a pixel-grounded video multimodal large language model (MLLM) designed to improve spatial accuracy and temporal stability in video understanding. The paper has been accepted at CVPR 2026.

According to the paper, existing video MLLMs often struggle with spatial drift, identity switches, and unstable initialization when tracking objects across frames. SPARROW addresses these limitations through two key components: Target-Specific Tracked Features (TSF), which provide temporally aligned reference cues during training, and a dual-prompt design that decodes both box ([BOX]) and segmentation ([SEG]) tokens to combine geometric priors with semantic grounding.
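To illustrate the dual-prompt idea, here is a minimal sketch of a decoding head in which the LLM emits dedicated [BOX] and [SEG] token embeddings per frame: one branch regresses a box as a geometric prior, the other projects a prompt embedding for a SAM2-style mask decoder. The module name, dimensions, and box parameterization are assumptions for illustration, not SPARROW's actual implementation.

```python
# Hypothetical sketch of a dual-prompt decoding head; names and shapes
# are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn

class DualPromptHead(nn.Module):
    def __init__(self, llm_dim: int = 4096, mask_dim: int = 256):
        super().__init__()
        # [BOX] branch: regress a normalized box (cx, cy, w, h) as a geometric prior.
        self.box_mlp = nn.Sequential(
            nn.Linear(llm_dim, llm_dim // 4), nn.GELU(),
            nn.Linear(llm_dim // 4, 4), nn.Sigmoid(),
        )
        # [SEG] branch: project the token into a prompt embedding for a
        # SAM2-style mask decoder (semantic grounding).
        self.seg_proj = nn.Linear(llm_dim, mask_dim)

    def forward(self, box_tok: torch.Tensor, seg_tok: torch.Tensor):
        box = self.box_mlp(box_tok)          # (T, 4) per-frame boxes
        seg_prompt = self.seg_proj(seg_tok)  # (T, mask_dim) mask prompts
        return box, seg_prompt

# Usage: hidden states gathered at the [BOX]/[SEG] token positions of each frame.
T, D = 8, 4096  # 8 frames, hypothetical LLM hidden size
head = DualPromptHead(D)
boxes, seg_prompts = head(torch.randn(T, D), torch.randn(T, D))
print(boxes.shape, seg_prompts.shape)  # torch.Size([8, 4]) torch.Size([8, 256])
```

In this reading, the two branches are complementary: the box output constrains where the target is geometrically, while the segmentation prompt carries the semantic grounding that a mask decoder refines into pixels.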

The work is accompanied by a curated dataset of 30,646 videos and 45,231 question-answer pairs, according to arxiv.org. The model operates end-to-end without external detectors, relying instead on a class-agnostic SAM2-based proposer.
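One plausible way to realize temporally aligned reference cues of the TSF kind is masked average pooling over per-frame feature maps, using the masks produced by a class-agnostic tracker or proposer. The function below is a hypothetical sketch under that assumption; it is not drawn from the paper's code.

```python
# Illustrative sketch of target-specific tracked features via masked pooling.
# Assumes per-frame feature maps and per-frame binary target masks are given
# (e.g., from a class-agnostic proposer); names here are hypothetical.
import torch

def tracked_features(feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """feats: (T, C, H, W) frame features; masks: (T, H, W) binary target masks.
    Returns (T, C) per-frame target embeddings, temporally aligned by the track."""
    m = masks.unsqueeze(1).float()              # (T, 1, H, W)
    area = m.sum(dim=(2, 3)).clamp(min=1.0)     # avoid divide-by-zero on empty frames
    return (feats * m).sum(dim=(2, 3)) / area   # masked average pool -> (T, C)

T, C, H, W = 8, 256, 32, 32
feats = torch.randn(T, C, H, W)
masks = torch.rand(T, H, W) > 0.5
cues = tracked_features(feats, masks)
print(cues.shape)  # torch.Size([8, 256])
```

Pooling along the track in this way gives one embedding per frame for the same target, which is the kind of stable per-target reference signal the paper credits with reducing spatial drift and identity switches.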

According to the paper, when integrated into three open-source video MLLMs (UniPixel, GLUS, and VideoGLaMM), SPARROW demonstrated improvements across six benchmarks, including gains of up to +8.9 J&F on referring video object segmentation (RVOS), +5 mIoU on visual grounding, and +5.4 CLAIR on grounded conversation generation (GCG). The researchers state that these results demonstrate substantial improvements in referential stability, spatial precision, and temporal coherence.