Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos


Qirui Chen1
Shangzhe Di1
Weidi Xie1,2

1Shanghai Jiao Tong University 2Shanghai AI Laboratory


Code [GitHub]

Paper [arXiv]

Cite [BibTeX]


Abstract

This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. This task requires not only answering visual questions, but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to mine multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction tuning. To monitor progress on this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, through meticulous manual verification and refinement. Our experiments reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed GeLM, that leverages the world-knowledge reasoning capabilities of multi-modal large language models (LLMs) while incorporating a grounding module that retrieves temporal evidence in the video with flexible grounding tokens. Once trained on our constructed visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, establishing a new baseline for this challenging task. Furthermore, when trained on third-person-view videos, the same architecture achieves state-of-the-art performance on the existing single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.



Problem Background


In the existing literature, question-answering tasks for egocentric videos typically assume that each question relates to a single time interval, which limits their applicability to complex questions that require reasoning across multiple time spans in a video. To bridge this gap, this paper introduces the problem of Multi-Hop Video Question-Answering (MH-VidQA). As illustrated in the figure, this task requires a model both to answer questions that involve visual information from multiple time intervals and to localize these time spans as evidence within long egocentric videos.
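
For concreteness, a multi-hop sample pairs one question with an answer and several scattered supporting time intervals. The structure below is illustrative only; the field names are hypothetical and not the dataset's actual schema.

```python
# Illustrative structure of a multi-hop VidQA sample (field names are
# hypothetical, not the dataset's actual schema).
sample = {
    "video_id": "egocentric_clip_0001",  # a long-form egocentric video
    "question": "Where did I put the scissors after using them in the kitchen?",
    "answer": "You placed the scissors in the drawer next to the sink.",
    # Multiple scattered time intervals (in seconds) serve as visual evidence:
    "evidence_spans": [
        [128.0, 141.5],  # hop 1: using the scissors in the kitchen
        [305.2, 312.8],  # hop 2: putting them into the drawer
    ],
}
```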




MultiHop-EgoQA: Data Curation Pipeline


Illustration of our data curation pipeline. To collect large-scale Multi-Hop VidQA data, we have developed an automated pipeline. We begin by using action scene graphs to identify potential multi-hop reasoning questions based on the syntax trees of annotated narrations. Next, we use GPT-4o to generate data samples that include questions, answers, and relevant time spans. Finally, we perform manual validation and refinement to create the new benchmark.
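
To illustrate the GPT-4o generation step, the sketch below assumes the narrations are timestamped strings and that linked events have already been mined from the action scene graph; the prompt wording and function names are our own illustration, not the authors' exact implementation.

```python
# A minimal sketch of the GPT-4o generation step in the curation pipeline.
# Prompt wording, names, and the input format are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

def generate_multihop_sample(narrations: str, linked_events: list[str]) -> str:
    prompt = (
        "Given these timestamped narrations from an egocentric video:\n"
        f"{narrations}\n"
        f"and the linked events: {linked_events}\n"
        "Write a question whose answer requires reasoning over multiple time "
        "intervals, together with the answer and the supporting time spans."
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # Generated samples are subsequently validated and refined manually.
    return response.choices[0].message.content
```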




GeLM: A Baseline Method for MH-VidQA

Existing models for video question answering typically provide answers without supporting temporal evidence, or are restricted to identifying a single time interval. Here, we propose a novel architecture, termed GeLM: Grounding Scattered Evidence with Large Language Model for Multi-Hop Video Question-Answering. As depicted in the figure, our model primarily comprises a multi-modal large language model and a grounding module, with special grounding tokens indicating the time spans of the enclosed key information in the response. To ground the time spans that support the answer, we design an evidence grounding module that processes a variable number of grounding queries and predicts the corresponding temporal proposals in the video.
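
To give a sense of how such a grounding module can be wired up, the sketch below projects the LLM hidden states at grounding-token positions into queries, cross-attends them to frame-level video features, and regresses one normalized time span per query. The dimensions and the span parameterization are assumptions for illustration, not the released GeLM implementation.

```python
# A simplified sketch of an evidence grounding head (dimensions and the span
# parameterization are assumptions, not the released GeLM code).
import torch
import torch.nn as nn

class GroundingModule(nn.Module):
    def __init__(self, hidden_dim: int = 4096, proj_dim: int = 512):
        super().__init__()
        # Project LLM hidden states at grounding-token positions into queries.
        self.query_proj = nn.Linear(hidden_dim, proj_dim)
        # Cross-attend grounding queries to frame-level video features.
        self.cross_attn = nn.MultiheadAttention(proj_dim, num_heads=8, batch_first=True)
        # Predict a normalized (start, end) span for each grounding query.
        self.span_head = nn.Sequential(
            nn.Linear(proj_dim, proj_dim), nn.ReLU(),
            nn.Linear(proj_dim, 2), nn.Sigmoid(),
        )

    def forward(self, grounding_states, video_feats):
        # grounding_states: (B, num_grounding_tokens, hidden_dim), variable per response
        # video_feats:      (B, num_frames, proj_dim)
        queries = self.query_proj(grounding_states)
        attended, _ = self.cross_attn(queries, video_feats, video_feats)
        return self.span_head(attended)  # (B, num_grounding_tokens, 2), spans in [0, 1]
```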




Results

We evaluate several recent multi-modal models on MultiHop-EgoQA, probing their multi-hop reasoning and temporal grounding abilities.
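
To make the grounding evaluation concrete, one way to score multiple scattered intervals is an IoU-style overlap between the union of predicted spans and the union of ground-truth spans; the snippet below is a minimal illustrative sketch and may differ from the benchmark's exact protocol.

```python
# Illustrative multi-interval grounding score: IoU between the union of
# predicted spans and the union of ground-truth spans (assumed metric; the
# benchmark's exact protocol may differ).
def merge(spans):
    """Merge overlapping [start, end] spans and return the merged list."""
    merged = []
    for s, e in sorted(spans):
        if merged and s <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return merged

def union_iou(pred_spans, gt_spans):
    """IoU of the union of predicted spans against the union of ground-truth spans."""
    pred, gt = merge(pred_spans), merge(gt_spans)
    inter = sum(
        max(0.0, min(pe, ge) - max(ps, gs)) for ps, pe in pred for gs, ge in gt
    )
    union = sum(e - s for s, e in pred) + sum(e - s for s, e in gt) - inter
    return inter / union if union > 0 else 0.0

# Example: two predicted spans vs. two ground-truth spans (seconds).
print(union_iou([[10, 20], [50, 60]], [[12, 22], [48, 58]]))  # ~0.67
```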

From the experiments presented in the table below, we draw the following observations:

  1. Both the proprietary model and open-source multi-modal LLMs lag significantly behind human performance, underscoring the current limitations of multi-modal systems in multi-hop reasoning and grounding.
  2. Reasoning and grounding abilities are disentangled in existing visual systems. For instance, LLaVA-NeXT-Video cannot handle requests involving temporal grounding, but can still answer some of the questions that do not require it.
  3. Instruction tuning with single-hop data does not guarantee strong multi-hop grounding. For example, although TimeChat and VTimeLLM have been fine-tuned with temporally aware instructions and multi-turn conversations, their ability to ground multiple intervals for a single query remains limited.
  4. Dense captions do help temporal grounding, but errors may cascade. Although per-second captioning provides explicit temporal information for grounding, errors introduced during captioning are difficult to correct in subsequent stages.

Overall, we establish a new baseline method for this benchmark that outperforms current systems and explicitly supports grounding of scattered evidence.





Acknowledgements

Based on a template by Phillip Isola and Richard Zhang.