¹University of Amsterdam | ²SAI, Shanghai Jiao Tong University | ³Xiaohongshu Inc.
Code [GitHub] | Paper [arXiv] | Cite [BibTeX]
Video Large Language Models (VideoLLMs) have recently demonstrated remarkable progress in general video understanding.
However, existing models primarily focus on high-level comprehension and are limited to text-only responses, restricting their flexibility for object-centric, multi-round interactions.
In this paper, we make three contributions: (i) we introduce VideoInfer, a manually curated, object-centric video question-answering dataset that requires semantic understanding, temporal reasoning, and multi-step inference; (ii) we propose RGA3, a Video LLM that accepts arbitrary visual prompts and produces both textual answers and segmentation masks; and (iii) we conduct comprehensive experiments on VideoInfer and other existing benchmarks across video question answering and referring object segmentation.
The results on 12 benchmarks spanning 6 tasks show that our proposed model consistently outperforms baseline models in both video question answering and segmentation, underscoring its robustness in multimodal, object-centric video and image understanding.
Overview of the proposed RGA3 architecture. (a) The Spatial-Temporal Overlay Module (STOM) processes arbitrary visual prompts (e.g., scribble, ellipse, arrow) at any timestamp and propagates them to all frames via CoTracker3, enabling interactive object-centric reasoning and continual visual attention. (b) A visual encoder extracts video representations from the frames overlaid by STOM. (c) A Large Language Model (LLM) takes the concatenated sequence of visual and text tokens as input and generates responses. (d) To support reasoning-based video object segmentation, a SAM2 decoder generates segmentation masks when prompted with a [SEG] token, extending RGA3's capabilities beyond text-only responses.
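The caption above describes a four-stage pipeline, which can be read as the following minimal PyTorch-style sketch. All module interfaces here (stom, visual_encoder, llm, sam2_decoder, and the way hidden states are returned) are hypothetical placeholders used to illustrate the data flow, not the released RGA3 implementation.

```python
import torch
import torch.nn as nn

class RGA3Sketch(nn.Module):
    """Illustrative forward pass for the RGA3 pipeline; interfaces are assumptions."""

    def __init__(self, stom, visual_encoder, llm, sam2_decoder, seg_token_id):
        super().__init__()
        self.stom = stom                      # (a) Spatial-Temporal Overlay Module
        self.visual_encoder = visual_encoder  # (b) extracts video representations
        self.llm = llm                        # (c) LLM over concatenated visual + text tokens
        self.sam2_decoder = sam2_decoder      # (d) mask decoder prompted by [SEG]
        self.seg_token_id = seg_token_id

    def forward(self, frames, visual_prompt, prompt_timestamp, text_embeds):
        # (a) Overlay the visual prompt (scribble, ellipse, arrow, ...) on the prompted
        # frame and propagate it to all frames via point tracking (CoTracker3 in the paper).
        overlaid_frames = self.stom(frames, visual_prompt, prompt_timestamp)

        # (b) Encode the overlaid frames into visual tokens.
        visual_tokens = self.visual_encoder(overlaid_frames)

        # (c) Concatenate visual and text tokens and generate a response.
        # (Assumed to return token ids and their hidden states.)
        inputs = torch.cat([visual_tokens, text_embeds], dim=1)
        output_ids, hidden_states = self.llm(inputs)

        # (d) If the response contains a [SEG] token, use its hidden state to
        # prompt the SAM2 decoder for segmentation masks.
        masks = None
        seg_positions = (output_ids == self.seg_token_id).nonzero(as_tuple=True)
        if seg_positions[0].numel() > 0:
            seg_embeddings = hidden_states[seg_positions]
            masks = self.sam2_decoder(frames, seg_embeddings)
        return output_ids, masks
```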
VideoInfer is a manually curated, object-centric video question-answering dataset, designed to challenge models with questions requiring semantic understanding, temporal reasoning, and multi-step inference over video content.
Compared to existing object-level video question-answering datasets, which are often generated through automated pipelines, VideoInfer serves as a more rigorous benchmark for evaluating the reasoning capabilities of advanced Video LLMs.
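For concreteness, one object-centric sample in such a dataset could be organized roughly as below. The field names and example values are purely illustrative assumptions, not the actual VideoInfer schema.

```python
# Hypothetical structure of a single object-centric QA sample (illustrative only).
sample = {
    "video": "path/to/video.mp4",
    "visual_prompt": {                # refers to the queried object in the video
        "type": "scribble",           # e.g., scribble, ellipse, arrow
        "timestamp": 2.0,             # time at which the prompt is drawn
        "points": [[120, 84], [131, 90], [145, 97]],
    },
    "question": "After the highlighted person puts down the box, what do they pick up next?",
    "answer": "A water bottle.",      # free-form answer requiring temporal, multi-step reasoning
}
```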
We conducted extensive comparisons between RGA3 and state-of-the-art methods across a variety of referring QA and object segmentation benchmarks at both the image and video level.
Overall, through extensive evaluations on VideoInfer as well as 11 existing benchmarks (the full list is presented in the paper), we demonstrate the superior performance of RGA3 on both referring object-centric question answering and segmentation tasks.
Based on a template by Phillip Isola and Richard Zhang.