New Research Addresses Safety and Spatial Understanding in Vision-Language Models

Three new papers on arXiv tackle vision-language model challenges in autonomous driving safety, fine-grained spatial grounding, and the sensitivity of safety decisions to semantic cues; a fourth listing covers a Korean-language enterprise model.

Multiple research teams have published work addressing critical limitations in vision-language models (VLMs), with papers appearing on arXiv on March 20, 2026.

According to arxiv.org, researchers introduced VLM-AutoDrive, a framework for adapting pretrained VLMs to detect safety-critical driving events. The paper states that off-the-shelf VLMs like NVIDIA’s Cosmos-Reason1 7B “exhibit near-zero Collision recall in zero-shot settings,” but fine-tuning with VLM-AutoDrive improved “Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%” when evaluated on real-world Nexar dashcam videos.
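For readers unfamiliar with the reported metrics, the sketch below shows how a per-class "Collision" F1 and overall accuracy of the kind quoted above are typically computed from per-clip predictions. The class names and example data are illustrative only and are not taken from the VLM-AutoDrive paper.

```python
# Illustrative computation of the metrics cited above (collision F1 and
# overall accuracy). Labels and predictions are made up for the example;
# the actual VLM-AutoDrive evaluation protocol is described in the paper.

def collision_f1(y_true, y_pred, positive="collision"):
    """F1 score for the single 'collision' class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def accuracy(y_true, y_pred):
    """Fraction of clips whose predicted event class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Hypothetical per-clip labels over three event classes.
truth = ["collision", "near_miss", "normal", "collision", "normal", "near_miss"]
preds = ["collision", "normal",    "normal", "near_miss", "normal", "near_miss"]

print("Collision F1:", round(collision_f1(truth, preds), 2))
print("Accuracy:   ", round(accuracy(truth, preds), 2))
```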

In separate work, researchers presented Perceptio, which addresses VLMs’ struggles with “fine grained spatial grounding,” according to arxiv.org. The system integrates semantic segmentation and depth tokens directly into autoregressive sequences, achieving state-of-the-art performance with gains of “+0.8/+1.4/+1.1 cIoU on RefCOCO/+/g” and improving “HardBLINK spatial understanding accuracy by 10.3%.”
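As a rough intuition for what integrating dense-perception tokens into an autoregressive sequence can look like, the sketch below quantizes per-patch depth and segmentation maps into reserved vocabulary ranges and splices them between the visual and text tokens. The token names, quantization scheme, and sequence layout are assumptions for illustration; they are not taken from the Perceptio paper.

```python
# Hypothetical sketch of interleaving dense-perception tokens into an
# autoregressive sequence; all IDs and layout choices are illustrative.
import numpy as np

def quantize_to_tokens(dense_map, num_bins, offset):
    """Bucket a dense per-patch map (e.g. depth or segment IDs) into
    discrete token IDs in a reserved vocabulary range starting at `offset`."""
    lo, hi = dense_map.min(), dense_map.max()
    bins = np.clip((dense_map - lo) / (hi - lo + 1e-8) * num_bins, 0, num_bins - 1)
    return (bins.astype(int) + offset).ravel().tolist()

# Toy per-patch maps for a 4x4 patch grid.
depth_map = np.random.rand(4, 4)             # relative depth per patch
seg_map = np.random.randint(0, 5, (4, 4))    # segment ID per patch

BOS, SEG_START, DEPTH_START, TEXT_START = 0, 1, 2, 3
image_patch_tokens = list(range(1000, 1016))  # placeholder visual tokens
seg_tokens = quantize_to_tokens(seg_map, num_bins=64, offset=2000)
depth_tokens = quantize_to_tokens(depth_map, num_bins=128, offset=3000)
text_tokens = [4001, 4002, 4003]              # placeholder answer tokens

# One flat sequence a decoder can attend over autoregressively.
sequence = ([BOS] + image_patch_tokens
            + [SEG_START] + seg_tokens
            + [DEPTH_START] + depth_tokens
            + [TEXT_START] + text_tokens)
print(len(sequence), sequence[:8])
```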

Another team introduced SAVeS, a benchmark examining how semantic cues affect VLM safety decisions. According to arxiv.org, experiments showed “safety decisions are highly sensitive to semantic cues, indicating reliance on learned visual-linguistic associations rather than grounded visual understanding,” potentially exposing vulnerabilities in multimodal safety systems.
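The finding suggests a simple style of probe: show a model the same scene while swapping only the semantic cue in the prompt, then check whether the safety verdict changes. The sketch below illustrates that idea; the prompt template, cue pairs, and model interface are hypothetical stand-ins, not the SAVeS protocol.

```python
# Hypothetical probe for semantic-cue sensitivity. The VLM call is a
# placeholder; SAVeS's actual prompts and evaluation are defined in the paper.

def ask_vlm(image_path: str, prompt: str) -> str:
    """Placeholder for an actual VLM call (local model or API)."""
    raise NotImplementedError("wire up a VLM of choice here")

def probe_semantic_sensitivity(image_path: str, cue_a: str, cue_b: str) -> bool:
    """Return True if swapping only the semantic cue flips the safety decision,
    which would suggest reliance on visual-linguistic priors over the pixels."""
    template = "The object ahead is a {cue}. Is it safe to proceed? Answer yes or no."
    verdict_a = ask_vlm(image_path, template.format(cue=cue_a)).strip().lower()
    verdict_b = ask_vlm(image_path, template.format(cue=cue_b)).strip().lower()
    return verdict_a != verdict_b

# Example usage (path and cues are illustrative):
# flipped = probe_semantic_sensitivity("scene_042.jpg", "cardboard box", "child")
# print("Decision flipped by cue alone:", flipped)
```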

Additionally, a paper on Mi:dm K 2.5 Pro, a 32B-parameter language model focused on Korean-language enterprise applications, appeared on arXiv the same day, though this work appears distinct from the vision-focused research.