Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries

Minghe Shen1, Zhuo Zhi1, Chonghan Liu2,
Shuo Xing3, Zhengzhong Tu3, Che Liu4
1University College London, 2University of California, Los Angeles, 3Texas A&M University,
4Imperial College London

Corresponding to: che.liu21@imperial.ac.uk

Introduction

While Vision-Language Models (VLMs) post-trained with Reinforcement Learning (RL) show impressive general reasoning, their evaluation is often confined to language-dominant tasks (e.g., math). This raises a critical question: can RL post-training truly extend the inherent capability boundary of a base VLM, particularly for visual-centric spatial tasks where it initially fails? To investigate this, we introduce Ariadne, a framework that uses synthetic mazes for multi-step spatial reasoning, where task difficulty (e.g., path length, number of turns) is precisely controlled. We leverage this controllable environment to train VLMs with Reinforcement Learning with Verified Rewards (RLVR) in a difficulty-aware curriculum. Surprisingly, after RLVR training, the VLM achieves over 50% accuracy on a problem set where the base model scored 0%, demonstrating that our approach expands the model's initial capability boundary. To assess real-world viability, we evaluate out-of-distribution (OOD) generalization on practical benchmarks. Despite training only on synthetic maze samples, Ariadne achieves significant zero-shot improvements, averaging 16% on MapBench (e.g., museum navigation) and 24% on ReasonMap (subway transfer tasks). These results confirm that our method not only broadens the model's fundamental limits but also enhances its generalization to real-world spatial reasoning. We acknowledge that our study is limited to the post-training phase, given the opacity of pre-training data, and hope our research motivates further work on specialized, capability-extending alignment.
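
To make the training signal concrete, below is a minimal sketch of one way a verified maze reward could be implemented for RLVR: a binary check that a predicted move sequence actually reaches the goal. The grid encoding (1 = wall), the move vocabulary, and the function names are our own illustrative assumptions, not Ariadne's exact interface.

	# Minimal sketch of a verified maze reward (assumed encoding: 0 = free cell, 1 = wall).
	# Names (simulate, verified_reward) are illustrative, not Ariadne's actual API.
	DELTAS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

	def simulate(maze, start, moves):
	    """Apply a move sequence on a grid maze; return the final cell, or None on an illegal move."""
	    r, c = start
	    for m in moves:
	        if m not in DELTAS:
	            return None                      # unparseable move token
	        dr, dc = DELTAS[m]
	        r, c = r + dr, c + dc
	        if not (0 <= r < len(maze) and 0 <= c < len(maze[0])) or maze[r][c] == 1:
	            return None                      # stepped out of bounds or into a wall
	    return (r, c)

	def verified_reward(maze, start, goal, predicted_moves):
	    """Binary verified reward: 1.0 iff the predicted path ends exactly on the goal cell."""
	    return 1.0 if simulate(maze, start, predicted_moves) == goal else 0.0

	# Example: a 3x3 maze solved in four moves earns reward 1.0.
	maze = [[0, 0, 1],
	        [1, 0, 0],
	        [1, 0, 0]]
	print(verified_reward(maze, (0, 0), (2, 2), ["right", "down", "down", "right"]))  # 1.0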

Ariadne

RLVR Enables VLMs to Break Reasoning Boundaries

We show the reward trajectory during GRPO training on AlphaMaze, along with quantitative evaluations on the test set. Across all evaluated movement-step ranges, Ariadne holds a consistent advantage over its base model, Qwen2.5-VL-7B-Instruct. For the base model, we perform eight rollouts to confirm stability and observe that performance drops to zero once the path requires three movement steps or three turns, which we define as the model's reasoning boundary. After RLVR training, Ariadne's success rate rises to over 50% on 3-step cases and over 10% on 3-turn cases, and the collapse point shifts from 3 to 5 steps, indicating that RLVR effectively extends the model's reasoning boundary to more complex path configurations.
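
As a rough illustration of how such a collapse point could be measured, the sketch below estimates per-difficulty success rates over repeated rollouts and reports the first movement-step bucket whose accuracy hits zero. The model.predict_moves call and the cases_by_steps layout are hypothetical placeholders, and simulate is the maze checker sketched above.

	def success_rate(model, cases, rollouts=8):
	    """Average solve rate over (maze, start, goal) cases, with several rollouts per case."""
	    solved = 0
	    for maze, start, goal in cases:
	        for _ in range(rollouts):
	            moves = model.predict_moves(maze, start, goal)  # hypothetical inference call
	            if simulate(maze, start, moves) == goal:
	                solved += 1
	    return solved / (len(cases) * rollouts)

	def reasoning_boundary(model, cases_by_steps, rollouts=8):
	    """Smallest movement-step count whose success rate collapses to zero (None if it never does)."""
	    for steps in sorted(cases_by_steps):             # cases_by_steps: {num_steps: [cases]}
	        if success_rate(model, cases_by_steps[steps], rollouts) == 0.0:
	            return steps
	    return None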


RLVR Extends VLM Reasoning to Real-world Tasks

Ariadne achieves strong performance on MapBench and ReasonMap. The model generalizes well across diverse spatial layouts, from unstructured outdoor networks to structured indoor grids, and demonstrates clear gains in long-horizon, multi-turn reasoning. These improvements are most apparent under high path complexity and extended reasoning chains, where instruction-tuned baselines degrade notably. Together, the results suggest that GRPO training on synthetic maze data effectively enhances spatial-visual reasoning, improving the model’s capability to handle complex real-world tasks.


Explore More

For more details on our approach, analysis, and insights, please see our paper!

Reference

If you find our work useful, please consider citing it:

			@article{ariadne,
			  title   = {Ariadne: A Controllable Framework for Probing and Extending VLM Reasoning Boundaries},
			  author  = {Minghe Shen and Zhuo Zhi and Chonghan Liu and Shuo Xing and Zhengzhong Tu and Che Liu},
			  journal = {arXiv preprint arXiv:2511.00710},
			  year    = {2025}
			}