DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models

1Bosch Research North America
2Bosch Center for Artificial Intelligence (BCAI)
3Texas A&M University
Demo Image

Figure 1. Visualization of query distributions across decoder layers under visual prompts. We highlight queries associated with negative prompts in red and those aligned with positive prompts in other colors. GRQO shows progressively focused and semantically aligned query behavior, while SFT suffers from more scattered and ambiguous query activation.

Abstract

The recent explosive interest in the reasoning capabilities of large language models, such as DeepSeek-R1, has demonstrated remarkable success through reinforcement learning-based fine-tuning frameworks, exemplified by methods like Group Relative Policy Optimization (GRPO). However, such reasoning abilities remain underexplored and notably absent in vision foundation models, including representation models like the DINO series. In this work, we propose DINO-R1, the first attempt to incentivize visual in-context reasoning capabilities of vision foundation models using reinforcement learning. Specifically, DINO-R1 introduces Group Relative Query Optimization (GRQO), a novel reinforcement-style training strategy explicitly designed for query-based representation models, which computes query-level rewards based on group-normalized alignment quality. We also apply KL-regularization to the objectness distribution to reduce training instability. This joint optimization enables dense and expressive supervision across queries while mitigating overfitting and distributional drift. Building upon Grounding-DINO, we train a series of DINO-R1 family models that integrate a visual prompt encoder and a visual-guided query selection mechanism. Extensive experiments on COCO, LVIS, and ODinW demonstrate that DINO-R1 significantly outperforms supervised fine-tuning baselines, achieving strong generalization in both open-vocabulary and closed-set visual prompting scenarios.
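To make the group-relative reward idea concrete, here is a minimal sketch of group-normalized, query-level rewards. It assumes each decoder query already has a scalar alignment-quality score against the ground truth; the function name `group_relative_query_rewards` and the tensor layout are illustrative assumptions, not the paper's released code.

```python
# Minimal sketch of group-normalized, query-level rewards (illustrative only).
import torch


def group_relative_query_rewards(alignment_quality: torch.Tensor,
                                 eps: float = 1e-6) -> torch.Tensor:
    """Normalize per-query rewards within each group (e.g., all queries of one image).

    alignment_quality: (num_groups, num_queries) raw per-query scores,
    e.g., matching quality against the ground-truth boxes.
    Returns rewards with zero mean and unit variance inside every group,
    mirroring the group-relative advantage used in GRPO.
    """
    mean = alignment_quality.mean(dim=-1, keepdim=True)
    std = alignment_quality.std(dim=-1, keepdim=True)
    return (alignment_quality - mean) / (std + eps)
```

Under this scheme, queries that align better than their group average receive positive rewards and are reinforced more strongly, which is what yields denser supervision across queries compared with the sparse one-to-one matched losses of supervised fine-tuning.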

Introduction

Demo Image

Figure 2. SFT vs. GRQO. SFT leads to limited and homogeneous supervision signals, while GRQO produces richer and more diverse learning signals, encouraging queries to be more expressive.

Inspired by recent breakthroughs in RL-based training frameworks for large reasoning models (LRMs), which efficaciously exploit large-scale noisy training data, we aim to similarly unlock reasoning capabilities within pure vision models, i.e., vision foundation models (VFMs). Yet, naïve application of language-based RL methods such as GRPO to vision presents non-trivial challenges. First, GRPO assumes the model behaves as a probabilistic generator, explicitly sampling diverse outputs from learned distributions for each input. In contrast, vision models typically produce deterministic structured predictions, making it nontrivial to optimize over a sampled output space. Second, the KL-regularization that GRPO uses to stabilize training by constraining token-level output distributions in language models cannot be easily translated to structured visual predictions, due to fundamental differences between the language and vision formulations. To this end, we present a novel vision-centric RL method called Group Relative Query Optimization (GRQO), designed to incentivize reasoning capabilities in VFMs, notably the DINO family.
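As a rough illustration of how a KL-style constraint can be carried over to query-based detectors, the sketch below regularizes the objectness distribution of the current model toward that of a reference (e.g., frozen or EMA) model. Treating the objectness logits over the query set as a categorical distribution, as well as the function name, are assumptions made for illustration and may differ from the paper's exact formulation.

```python
# Minimal sketch of KL-regularizing an objectness distribution toward a
# reference model (illustrative; the exact distribution used in DINO-R1 may differ).
import torch
import torch.nn.functional as F


def objectness_kl_regularizer(obj_logits: torch.Tensor,
                              ref_obj_logits: torch.Tensor) -> torch.Tensor:
    """KL( current || reference ) over the objectness distribution.

    obj_logits, ref_obj_logits: (batch, num_queries) objectness logits from the
    current model and a reference (e.g., frozen/EMA) model; the logits over the
    query set are treated here as a categorical distribution.
    """
    log_p = F.log_softmax(obj_logits, dim=-1)       # current model
    log_q = F.log_softmax(ref_obj_logits, dim=-1)   # reference model
    # F.kl_div(input=log_q, target=log_p, log_target=True) = sum p * (log p - log q)
    return F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")
```

Such a term keeps objectness predictions from drifting too far from the reference during reward-driven updates, playing a role analogous to the token-level KL penalty in GRPO.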

Method

Demo Image

Figure 3. SFT vs. GRQO. SFT leads to limited and homogeneous supervision signals, while GRQO produces richer and more diverse learning signals, encouraging queries to be more expressive.

Demo Image

Experiments

Demo Image
Demo Image
Demo Image
Demo Image
Demo Image

BibTeX

@article{pan2025dino,
  title={DINO-R1: Incentivizing Reasoning Capability in Vision Foundation Models},
  author={Pan, Chenbin and He, Wenbin and Tu, Zhengzhong and Ren, Liu},
  journal={arXiv preprint arXiv:2505.24025},
  year={2025}
}