---
datasets:
- SurplusDeficit/MultiHop-EgoQA
---

# Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

## GeLM Model

We propose a novel architecture, termed <b><u>GeLM</u></b>, for *MH-VidQA* (multi-hop video question answering). GeLM leverages the world-knowledge reasoning capabilities of multi-modal large language models (LLMs) while incorporating a grounding module that retrieves temporal evidence from the video through flexible grounding tokens.

<div align="center">
   <img src="./assets/architecture_v3.jpeg" style="width: 80%;">
</div>
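
The grounding mechanism can be pictured as a small regression head attached to the language model: wherever the LLM emits a grounding token, the head decodes that token's hidden state into a temporal span. The sketch below is illustrative only and not the released GeLM code; `GroundingHead`, `GROUNDING_TOKEN_ID`, and the (start, end) span parameterization are assumptions made for the example.

```python
import torch
import torch.nn as nn


class GroundingHead(nn.Module):
    """Hypothetical head: maps LLM hidden states at grounding-token
    positions to normalized (start, end) timestamps in [0, 1]."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.span_proj = nn.Sequential(
            nn.Linear(hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 2),  # one (start, end) pair per grounding token
        )

    def forward(
        self,
        hidden_states: torch.Tensor,  # (batch, seq, hidden)
        token_ids: torch.Tensor,      # (batch, seq)
        grounding_token_id: int,
    ) -> torch.Tensor:
        # Select hidden states at positions where the model emitted
        # a grounding token.
        mask = token_ids == grounding_token_id
        span_states = hidden_states[mask]  # (num_grounding_tokens, hidden)
        # Sigmoid keeps each span as a fraction of the video duration.
        return torch.sigmoid(self.span_proj(span_states))  # (num_grounding_tokens, 2)


# Toy usage with made-up shapes and token ids: scale the predicted
# spans by the video duration (in seconds) to get evidence intervals.
hidden = torch.randn(1, 16, 4096)
ids = torch.randint(0, 32000, (1, 16))
GROUNDING_TOKEN_ID = 32001  # hypothetical special-token id
ids[0, 5] = GROUNDING_TOKEN_ID
spans = GroundingHead(4096)(hidden, ids, GROUNDING_TOKEN_ID)
print(spans.shape)  # torch.Size([1, 2])
```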