Towards Better Video Understanding through Language Guidance

dc.contributor.advisor Akata, Zeynep (Prof. Dr.)
dc.contributor.author Hummel, Thomas
dc.date.accessioned 2025-11-10T14:26:24Z
dc.date.available 2025-11-10T14:26:24Z
dc.date.issued 2025-11-10
dc.identifier.uri http://hdl.handle.net/10900/172025
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1720251 de_DE
dc.description.abstract Video understanding is a crucial area of computer vision, with applications ranging from autonomous driving and robotics to multimedia interaction. Despite significant progress in image analysis, video understanding remains a complex problem due to the temporal nature of videos, which requires models to analyse both individual frames and their relationships over time. This work explores how various forms of language, ranging from class labels to natural language instructions, can be leveraged to overcome these challenges and to improve model capabilities and generalisation. Through novel settings, benchmarks, and frameworks, it shows how integrating language with visual information can address key challenges in video understanding.

First, we explore audio-visual video classification in low-data regimes to address the limitations of traditional supervised learning. In audio-visual generalised zero-shot learning, class labels represented as pre-trained word embeddings serve as a semantic bridge, enabling models to classify unseen video classes by aligning audio-visual features with textual representations in a shared embedding space (see the illustrative sketch after this record). Our Temporal and Cross-Attention Framework (TCaF) improves this alignment, and consequently generalisation, by better modelling temporal relationships and cross-modal interactions. Next, this setting is extended to audio-visual generalised few-shot learning, where models must learn to classify new video classes from only a few labelled examples. In addition to protocols and benchmarks, we propose AV-Diff, which uses class-label text representations to guide a diffusion model in generating synthetic training samples, thereby enhancing generalisation to novel video classes.

Beyond classification, this thesis explores fine-grained action understanding through video-adverb retrieval. This task extends traditional action recognition by incorporating adverbs, which provide richer information about how actions are performed. By learning compositional embeddings that combine actions and adverbs, our proposed model achieves a more nuanced understanding of video content.

Finally, this thesis tackles composed video retrieval (CVR), a task in which a natural language instruction modifies a reference video query to retrieve semantically altered videos. To solve this task, a model requires compositional reasoning to interpret both the video content and the transformative effect of the textual instruction. We propose the egocentric evaluation benchmark EgoCVR, which tests the fine-grained temporal video understanding capabilities of vision-language models. Furthermore, we present TFR-CVR, a modular and training-free framework that achieves improved temporal reasoning by strategically utilising the reasoning abilities of large language models.

By integrating language at different levels, from class labels to fine-grained action modifications and natural language instructions, the work presented here pushes beyond traditional video classification towards more robust, flexible, and fine-grained video understanding. en
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights ubt-podno de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en en
dc.subject.classification Deep Learning, Maschinelles Lernen, Maschinelles Sehen de_DE
dc.subject.ddc 004 de_DE
dc.subject.other Video Understanding en
dc.subject.other Video Retrieval en
dc.subject.other Multi-Modal Learning en
dc.title Towards Better Video Understanding through Language Guidance en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2025-06-23
utue.publikation.fachbereich Informatik de_DE
utue.publikation.fakultaet 7 Mathematisch-Naturwissenschaftliche Fakultät de_DE
utue.publikation.noppn yes de_DE
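
The abstract's zero-shot classification via a shared embedding space can be made concrete with a minimal, hypothetical sketch: an audio-visual feature is compared against class-label text embeddings by cosine similarity, and the nearest label wins. This is a generic illustration of the idea, not the thesis's TCaF model; the function names, dimensions, and toy random embeddings are assumptions for demonstration only.

# Minimal, hypothetical sketch (not the thesis's TCaF model): zero-shot
# classification by aligning an audio-visual feature with class-label text
# embeddings in a shared space. All dimensions and the toy random
# embeddings below are illustrative assumptions.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_classify(av_feature, label_embeddings, class_names):
    # Pick the class whose text embedding is closest (cosine similarity) to
    # the audio-visual feature; both are assumed to live in the same space.
    sims = l2_normalize(label_embeddings) @ l2_normalize(av_feature)
    return class_names[int(np.argmax(sims))], sims

rng = np.random.default_rng(0)
dim = 300  # e.g. word2vec-sized label embeddings (assumption)
class_names = ["playing violin", "dog barking", "chopping wood"]
label_embeddings = rng.normal(size=(len(class_names), dim))

# Fake a fused audio-visual feature near "dog barking" to show that the
# nearest-label decision can recover a class purely from its text embedding.
av_feature = label_embeddings[1] + 0.1 * rng.normal(size=dim)

prediction, sims = zero_shot_classify(av_feature, label_embeddings, class_names)
print(prediction, np.round(sims, 3))

In the generalised zero-shot setting described in the abstract, the candidate list mixes seen and unseen classes and the same nearest-neighbour decision applies; the thesis's contributions concern how the audio-visual features and their alignment with the label embeddings are learned.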
