Abstract:
Recently, unimodal models have achieved strong performance on many tasks. However, a single modality may not provide sufficient information in complex situations. Humans rely on multimodal input, such as vision and hearing, to act in the real world. Similarly, this thesis proposes systems that use multimodal input for video classification and visual-language learning. Yet multimodal models require large amounts of high-quality paired data, which is costly and time-consuming to collect, whereas humans need very few training samples, even for the most complex tasks. Motivated by this gap, this thesis addresses the problem of data-efficient multimodal learning.
First, this thesis studies audio-visual video classification in generalized zero- and few-shot learning settings. It introduces new training and evaluation protocols, dataset splits, and baselines. Fusing the audio and visual modalities with transformers yields higher performance than prior work. Furthermore, standard full attention does not give the best results, so new attention patterns are developed, and new loss functions prove essential for improving performance in both settings. Moreover, few-shot performance is further improved by using a diffusion model to generate synthetic audio-visual features for the novel classes.
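To make the fusion idea concrete, the sketch below shows one simple way to combine pre-extracted audio and visual features with a transformer encoder before classification. It is only an illustrative assumption: the module names, dimensions, full-attention pattern, and pooling choice are not the architecture proposed in the thesis.

```python
# Minimal sketch (assumed, not the thesis architecture): transformer fusion of
# pre-extracted audio and visual features for video classification.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_layers=2, num_classes=50):
        super().__init__()
        # Learnable embeddings marking which modality each feature came from.
        self.audio_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.visual_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T_a, dim), visual_feats: (B, T_v, dim)
        a = audio_feats + self.audio_token
        v = visual_feats + self.visual_token
        tokens = torch.cat([a, v], dim=1)    # joint audio-visual sequence
        fused = self.encoder(tokens)         # attention across both modalities
        pooled = fused.mean(dim=1)           # average-pool the fused sequence
        return self.classifier(pooled)

model = AudioVisualFusion()
logits = model(torch.randn(4, 10, 512), torch.randn(4, 10, 512))
print(logits.shape)  # torch.Size([4, 50])
```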
The second task is video-adverb retrieval, studied both when abundant training data is available and in the zero-shot learning scenario. Text embeddings are improved with a residual gating mechanism and a new training objective. New zero-shot splits are also introduced to enable a more comprehensive evaluation.
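The following sketch illustrates what a residual gating block over text embeddings can look like in general. The layer shapes and the sigmoid/tanh gating are assumptions for illustration, not the mechanism or training objective proposed in the thesis.

```python
# Minimal sketch (assumed, not the thesis method): a residual gating block that
# refines a text embedding, e.g. of a verb-adverb phrase, before retrieval.
import torch
import torch.nn as nn

class ResidualGating(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.update = nn.Linear(dim, dim)   # candidate update of the embedding
        self.gate = nn.Linear(dim, dim)     # per-dimension gate in [0, 1]

    def forward(self, text_emb):
        gate = torch.sigmoid(self.gate(text_emb))
        update = torch.tanh(self.update(text_emb))
        # Residual connection: keep the original embedding, add a gated update.
        return text_emb + gate * update

gating = ResidualGating()
refined = gating(torch.randn(8, 512))   # batch of 8 text embeddings
print(refined.shape)                    # torch.Size([8, 512])
```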
Finally, this thesis turns to visual-language learning with multimodal large language models (MLLMs). It studies whether MLLMs can adapt their communication to a conversation partner on the fly, using only a few interactions. This work provides a general framework for testing this ability across multiple agents, yielding insights into their strengths and weaknesses. The results show that the ability to adapt communication to partners with different comprehension abilities is already present in current MLLMs.
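As a rough illustration of such a testing framework, the sketch below runs a few rounds of a referential game in which a speaker describes targets, a listener guesses, and per-round accuracy tracks whether the speaker adapts to its partner. The agent interfaces and stub classes are hypothetical placeholders, not the framework or MLLM APIs used in the thesis.

```python
# Minimal sketch (hypothetical interfaces): a few-interaction loop probing
# whether a speaker adapts its descriptions to a listener's feedback.
# Real agents would wrap MLLMs; here they are simple stubs.
from dataclasses import dataclass
import random

@dataclass
class Interaction:
    description: str
    correct: bool

class StubSpeaker:
    def describe(self, target, history):
        # A real speaker would condition an MLLM prompt on past failures.
        return f"an image showing {target}"

class StubListener:
    def guess(self, description, candidates):
        # A real listener would pick the candidate best matching the description.
        return random.choice(candidates)

def adaptation_loop(speaker, listener, targets, candidates, num_rounds=5):
    history, accuracy_per_round = [], []
    for _ in range(num_rounds):
        correct = 0
        for target in targets:
            description = speaker.describe(target, history)
            guess = listener.guess(description, candidates)
            success = guess == target
            history.append(Interaction(description, success))
            correct += success
        accuracy_per_round.append(correct / len(targets))
    # Rising accuracy across rounds would indicate on-the-fly adaptation.
    return accuracy_per_round

print(adaptation_loop(StubSpeaker(), StubListener(),
                      targets=["a dog", "a cat"],
                      candidates=["a dog", "a cat", "a bird"]))
```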