Towards Better Video Understanding through Language Guidance

dc.contributor.advisor Akata, Zeynep (Prof. Dr.)
dc.contributor.author Hummel, Thomas
dc.date.accessioned 2025-11-10T14:26:24Z
dc.date.available 2025-11-10T14:26:24Z
dc.date.issued 2025-11-10
dc.identifier.uri http://hdl.handle.net/10900/172025
dc.identifier.uri http://nbn-resolving.org/urn:nbn:de:bsz:21-dspace-1720251 de_DE
dc.description.abstract Video understanding is a crucial area of computer vision, with applications ranging from autonomous driving and robotics to multimedia interaction. Despite significant progress in image analysis, video understanding remains a complex problem due to the temporal nature of videos, which requires models to analyse both individual frames and their relationships over time. This work explores how various forms of language, ranging from class labels to natural language instructions, can be leveraged to overcome these challenges and to improve model capabilities and generalisation. Through novel settings, benchmarks, and frameworks, it shows how integrating language with visual information can address key challenges in video understanding.

First, we explore audio-visual video classification in low-data regimes to address the limitations of traditional supervised learning. In audio-visual generalised zero-shot learning, class labels represented as pre-trained word embeddings serve as a semantic bridge, enabling models to classify unseen video classes by aligning audio-visual features with textual representations in a shared embedding space (see the illustrative sketch after this record). Our Temporal and Cross-Attention Framework (TCaF) improves this alignment, and consequently generalisation, by better modelling temporal relationships and cross-modal interactions. Next, this setting is extended to audio-visual generalised few-shot learning, where models must learn to classify new video classes from only a few labelled examples. In addition to protocols and benchmarks, we propose AV-Diff, which uses class-label text representations to guide a diffusion model in generating synthetic training samples, thereby enhancing generalisation to novel video classes.

Beyond classification, this thesis explores fine-grained action understanding through video-adverb retrieval. This task extends traditional action recognition by incorporating adverbs, which provide richer information about how actions are performed. By learning compositional embeddings that combine actions and adverbs, our proposed model achieves a more nuanced understanding of video content.

Finally, this thesis tackles composed video retrieval (CVR), a task in which a natural language instruction modifies a reference video query to retrieve semantically altered videos. To solve this task, a model requires compositional reasoning to interpret both the video content and the transformative effect of the textual instruction. We propose the egocentric evaluation benchmark EgoCVR, which tests the fine-grained temporal video understanding capabilities of vision-language models. Furthermore, we present TFR-CVR, a modular and training-free framework that achieves improved temporal reasoning by strategically utilising the reasoning abilities of large language models.

By integrating language at different levels, from class labels to fine-grained action modifications and natural language instructions, the work presented here pushes beyond traditional video classification towards more robust, flexible, and fine-grained video understanding. en
dc.language.iso en de_DE
dc.publisher Universität Tübingen de_DE
dc.rights ubt-podno de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=de de_DE
dc.rights.uri http://tobias-lib.uni-tuebingen.de/doku/lic_ohne_pod.php?la=en en
dc.subject.classification Deep Learning, Maschinelles Lernen, Maschinelles Sehen de_DE
dc.subject.ddc 004 de_DE
dc.subject.other Video Understanding en
dc.subject.other Video Retrieval en
dc.subject.other Multi-Modal Learning en
dc.title Towards Better Video Understanding through Language Guidance en
dc.type PhDThesis de_DE
dcterms.dateAccepted 2025-06-23
utue.publikation.fachbereich Informatik de_DE
utue.publikation.fakultaet 7 Mathematisch-Naturwissenschaftliche Fakultät de_DE
utue.publikation.noppn yes de_DE
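
The abstract's zero-shot classification via a shared embedding space can be made concrete with a minimal, hypothetical sketch: an audio-visual feature is compared against class-label text embeddings by cosine similarity, and the nearest label wins. This is a generic illustration of the idea, not the thesis's TCaF model; the function names, dimensions, and toy random embeddings are assumptions for demonstration only.

# Minimal, hypothetical sketch (not the thesis's TCaF model): zero-shot
# classification by aligning an audio-visual feature with class-label text
# embeddings in a shared space. All dimensions and the toy random
# embeddings below are illustrative assumptions.
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def zero_shot_classify(av_feature, label_embeddings, class_names):
    # Pick the class whose text embedding is closest (cosine similarity) to
    # the audio-visual feature; both are assumed to live in the same space.
    sims = l2_normalize(label_embeddings) @ l2_normalize(av_feature)
    return class_names[int(np.argmax(sims))], sims

rng = np.random.default_rng(0)
dim = 300  # e.g. word2vec-sized label embeddings (assumption)
class_names = ["playing violin", "dog barking", "chopping wood"]
label_embeddings = rng.normal(size=(len(class_names), dim))

# Fake a fused audio-visual feature near "dog barking" to show that the
# nearest-label decision can recover a class purely from its text embedding.
av_feature = label_embeddings[1] + 0.1 * rng.normal(size=dim)

prediction, sims = zero_shot_classify(av_feature, label_embeddings, class_names)
print(prediction, np.round(sims, 3))

In the generalised zero-shot setting described in the abstract, the candidate list mixes seen and unseen classes and the same nearest-neighbour decision applies; the thesis's contributions concern how the audio-visual features and their alignment with the label embeddings are learned.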
