Three arXiv Papers Examine Limitations in Vision-Language Models

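Three new papers posted to arXiv highlight distinct weaknesses in current vision-language AI systems: long-horizon robotic manipulation, causal reasoning, and sensitivity to artifacts in inpainted images.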

According to the first paper (arXiv:2602.20659v1), current vision-language-action (VLA) models “struggle with long-horizon manipulation under partial observability.” The paper notes that most existing approaches “remain observation-driven, relying on short context windows or repeated queries to vision-language models (VLMs).”

A second paper (arXiv:2602.20878v1) examines causal reasoning capabilities in Large Vision-Language Models (LVLMs). The research finds that while these models “achieve strong performance on visual question answering benchmarks,” they “often rely on spurious correlations rather than genuine causal reasoning.” The paper states that “existing evaluations primarily assess the correctness of the answers” without examining underlying reasoning quality.

The third paper (arXiv:2602.20520v1) turns to image quality. According to the abstract, the researchers “study how visual artifacts introduced by diffusion-based inpainting affect language generation in vision-language models” using “a two-stage diagnostic setup in which masked image regions are reconstructed and then provided to captioning models.”
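
For readers who want a concrete picture of what a two-stage setup of this kind might look like, here is a minimal toy sketch in Python. It is not the paper's code. Every name in it (`mask_region`, `run_diagnostic`, `mean_fill_inpaint`, `toy_caption`, `token_overlap`) is hypothetical, the "inpainter" is a simple mean fill standing in for a diffusion model, and the "captioner" is a rule standing in for a vision-language model. The overlap score is an illustrative choice, not a metric taken from the paper.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image, values 0-255
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), end-exclusive


def mask_region(image: Image, box: Box, fill: int = 0) -> Image:
    """Return a copy of the image with the boxed region blanked out."""
    top, left, bottom, right = box
    return [
        [fill if top <= r < bottom and left <= c < right else v
         for c, v in enumerate(row)]
        for r, row in enumerate(image)
    ]


def run_diagnostic(
    image: Image,
    box: Box,
    inpaint: Callable[[Image, Box], Image],
    caption: Callable[[Image], str],
) -> Tuple[str, str]:
    """Stage 1: mask and reconstruct. Stage 2: caption original and result."""
    reconstructed = inpaint(mask_region(image, box), box)
    return caption(image), caption(reconstructed)


def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of word sets; 1.0 means identical vocabularies."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0


# --- Stand-ins (assumptions): swap in real models to run an actual study ---

def mean_fill_inpaint(masked: Image, box: Box) -> Image:
    """Fill the box with the mean of the pixels outside it."""
    top, left, bottom, right = box
    outside = [v for r, row in enumerate(masked) for c, v in enumerate(row)
               if not (top <= r < bottom and left <= c < right)]
    fill = sum(outside) // len(outside) if outside else 0
    return mask_region(masked, box, fill)


def toy_caption(image: Image) -> str:
    """Rule-based captioner: reports a bright object if any pixel is bright."""
    if max(max(row) for row in image) > 200:
        return "a bright object on a dark background"
    return "a dark background"
```

In this toy case, masking a bright square and filling it from its dark surroundings erases the object, so the caption for the reconstructed image no longer mentions it. A real study would replace both stand-ins with actual models and compare captions with an established text-similarity metric.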

All three papers appear as new submissions to arXiv’s AI category (cs.AI), with the inpainting paper listed as a cross-list.