Towards Better Video Understanding through Language Guidance




Citable link (URI): http://hdl.handle.net/10900/172025
http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1720251
Document type: Dissertation
Date of publication: 2025-11-10
Language: English
Faculty: 7 Mathematisch-Naturwissenschaftliche Fakultät
Department: Computer Science
Advisor: Akata, Zeynep (Prof. Dr.)
Date of oral examination: 2025-06-23
DDC classification: 004 - Computer science
Keywords: Deep Learning, Machine Learning, Computer Vision
Other keywords:
Video Understanding
Video Retrieval
Multi-Modal Learning
License: http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en

Abstract:

Video understanding is a crucial area of computer vision, with applications ranging from autonomous driving and robotics to multimedia interaction. Despite significant progress in image analysis, video understanding remains a complex problem due to the temporal nature of videos, which requires models to analyse both individual frames and their relationships over time. This work explores how various forms of language, ranging from class labels to natural language instructions, can be leveraged to overcome these challenges and to improve model capabilities and generalisation. Through novel settings, benchmarks, and frameworks, it shows how integrating language with visual information can address key challenges in video understanding.

First, we explore audio-visual video classification in low-data regimes to address the limitations of traditional supervised learning. In audio-visual generalised zero-shot learning, class labels represented as pre-trained word embeddings serve as a semantic bridge, enabling models to classify unseen video classes by aligning audio-visual features with textual representations in a shared embedding space. Our Temporal and Cross-Attention Framework (TCaF) improves this alignment, and consequently generalisation, by better modelling temporal relationships and cross-modal interactions. Next, this setting is extended to audio-visual generalised few-shot learning, where models must learn to classify new video classes from only a few labelled examples. In addition to protocols and benchmarks, we propose AV-Diff, which uses class-label text representations to guide a diffusion model that generates synthetic training samples, thereby enhancing generalisation to novel video classes.

Beyond classification, this thesis explores fine-grained action understanding through video-adverb retrieval. This task extends traditional action recognition by incorporating adverbs that provide richer information about how actions are performed. By learning compositional embeddings that combine actions and adverbs, our proposed model achieves a more nuanced understanding of video content.

Finally, this thesis tackles composed video retrieval (CVR), a task in which a natural language instruction modifies a reference video query to retrieve semantically altered videos. Solving this task requires compositional reasoning to interpret both the video content and the transformative effect of the textual instruction. We propose the egocentric evaluation benchmark EgoCVR, which tests the fine-grained temporal video understanding capabilities of vision-language models. Furthermore, we present TFR-CVR, a modular and training-free framework that achieves improved temporal reasoning by strategically utilising the reasoning abilities of large language models.

By integrating language at different levels – from class labels to fine-grained action modifications and natural language instructions – the work presented here pushes beyond traditional video classification towards more robust, flexible, and fine-grained video understanding.
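To make the zero-shot classification setting more concrete, the following is a minimal sketch of aligning audio-visual features with class-label word embeddings in a shared embedding space. All names (e.g. AudioVisualTextAligner), feature dimensions, and the simple averaging fusion are illustrative assumptions; this is not the TCaF architecture described in the thesis, which uses temporal and cross-attention instead.

```python
# Minimal sketch of audio-visual zero-shot classification via a shared embedding
# space. Dimensions, averaging fusion, and all names are illustrative assumptions;
# TCaF replaces the fusion step with temporal and cross-attention.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AudioVisualTextAligner(nn.Module):
    def __init__(self, audio_dim=128, visual_dim=512, text_dim=300, embed_dim=256):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, embed_dim)    # audio branch
        self.visual_proj = nn.Linear(visual_dim, embed_dim)  # visual branch
        self.text_proj = nn.Linear(text_dim, embed_dim)      # class word-embedding branch

    def forward(self, audio_feats, visual_feats, class_text_embeds):
        # Fuse the modalities by averaging their projections (a simple stand-in
        # for cross-modal attention) and normalise for cosine similarity.
        av = 0.5 * (self.audio_proj(audio_feats) + self.visual_proj(visual_feats))
        av = F.normalize(av, dim=-1)
        txt = F.normalize(self.text_proj(class_text_embeds), dim=-1)
        # Similarity of each clip to every class; unseen classes are handled
        # simply by supplying their pre-trained word embeddings.
        return av @ txt.t()


model = AudioVisualTextAligner()
audio = torch.randn(4, 128)    # pre-extracted audio features for 4 clips
visual = torch.randn(4, 512)   # pre-extracted visual features for 4 clips
labels = torch.randn(10, 300)  # word embeddings for 10 (seen + unseen) class labels
predicted_class = model(audio, visual, labels).argmax(dim=-1)
```

In the same spirit, the sketch below shows one possible way to score composed action-adverb embeddings against a video embedding for video-adverb retrieval. The MLP composition, vocabulary sizes, and dimensions are again assumptions for illustration, not the model proposed in the thesis.

```python
# Illustrative sketch of compositional action-adverb scoring for video-adverb
# retrieval; the composition function and all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ActionAdverbComposer(nn.Module):
    def __init__(self, num_actions, num_adverbs, video_dim=512, embed_dim=256):
        super().__init__()
        self.action_emb = nn.Embedding(num_actions, embed_dim)
        self.adverb_emb = nn.Embedding(num_adverbs, embed_dim)
        self.compose = nn.Sequential(                      # compose (action, adverb) pairs
            nn.Linear(2 * embed_dim, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )
        self.video_proj = nn.Linear(video_dim, embed_dim)

    def score(self, video_feats, action_ids, adverb_ids):
        pair = torch.cat([self.action_emb(action_ids), self.adverb_emb(adverb_ids)], dim=-1)
        comp = F.normalize(self.compose(pair), dim=-1)
        vid = F.normalize(self.video_proj(video_feats), dim=-1)
        return (vid * comp).sum(dim=-1)                    # cosine similarity per pair


composer = ActionAdverbComposer(num_actions=50, num_adverbs=20)
video = torch.randn(1, 512)                                # one video embedding
adverb_ids = torch.arange(20)                              # rank every adverb ...
action_ids = torch.full((20,), 7)                          # ... for a fixed, hypothetical action
best_adverb = composer.score(video.expand(20, -1), action_ids, adverb_ids).argmax()
```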
