Abstract:
Recently, unimodal models have achieved strong performance on many tasks. However, a single modality may not provide sufficient information in complex situations. Humans rely on multimodal input, such as vision and hearing, to act in the real world. Similarly, this thesis proposes systems that use multimodal input for video classification and visual-language learning. Yet multimodal models require large amounts of high-quality paired data, which is costly and time-consuming to collect, whereas humans need very few training samples, even for the most complex tasks. Motivated by this gap, this thesis addresses the problem of data-efficient multimodal learning.
First, this thesis studies audio-visual video classification in generalized zero- and few-shot learning settings. It introduces new training and evaluation protocols, dataset splits, and baselines. Fusing the audio and visual modalities with transformers yields higher performance than prior work. Furthermore, standard full attention does not give the best results, so new attention patterns are developed, and new loss functions prove essential for improving performance in both settings. Moreover, few-shot performance is further improved by using a diffusion model to generate synthetic audio-visual features for the novel classes.
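To make the fusion idea concrete, the sketch below shows one simple way to combine pre-extracted audio and visual features with a transformer encoder before classification. It is only an illustrative assumption: the module names, dimensions, full-attention pattern, and pooling choice are not the architecture proposed in the thesis.

```python
# Minimal sketch (assumed, not the thesis architecture): transformer fusion of
# pre-extracted audio and visual features for video classification.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, dim=512, num_heads=8, num_layers=2, num_classes=50):
        super().__init__()
        # Learnable embeddings marking which modality each feature came from.
        self.audio_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.visual_token = nn.Parameter(torch.zeros(1, 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio_feats, visual_feats):
        # audio_feats: (B, T_a, dim), visual_feats: (B, T_v, dim)
        a = audio_feats + self.audio_token
        v = visual_feats + self.visual_token
        tokens = torch.cat([a, v], dim=1)    # joint audio-visual sequence
        fused = self.encoder(tokens)         # attention across both modalities
        pooled = fused.mean(dim=1)           # average-pool the fused sequence
        return self.classifier(pooled)

model = AudioVisualFusion()
logits = model(torch.randn(4, 10, 512), torch.randn(4, 10, 512))
print(logits.shape)  # torch.Size([4, 50])
```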
The second task is video-adverb retrieval, studied both when abundant training data is available and in the zero-shot learning scenario. Text embeddings are improved with a residual gating mechanism and a new training objective. New zero-shot splits are also introduced to enable a more comprehensive evaluation.
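The following sketch illustrates what a residual gating block over text embeddings can look like in general. The layer shapes and the sigmoid/tanh gating are assumptions for illustration, not the mechanism or training objective proposed in the thesis.

```python
# Minimal sketch (assumed, not the thesis method): a residual gating block that
# refines a text embedding, e.g. of a verb-adverb phrase, before retrieval.
import torch
import torch.nn as nn

class ResidualGating(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.update = nn.Linear(dim, dim)   # candidate update of the embedding
        self.gate = nn.Linear(dim, dim)     # per-dimension gate in [0, 1]

    def forward(self, text_emb):
        gate = torch.sigmoid(self.gate(text_emb))
        update = torch.tanh(self.update(text_emb))
        # Residual connection: keep the original embedding, add a gated update.
        return text_emb + gate * update

gating = ResidualGating()
refined = gating(torch.randn(8, 512))   # batch of 8 text embeddings
print(refined.shape)                    # torch.Size([8, 512])
```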
Finally, this thesis turns to visual-language learning with multimodal large language models (MLLMs). It studies whether MLLMs can adapt their communication to a conversation partner on the fly, using only a few interactions. This work provides a general framework for testing this ability across multiple agents, yielding insights into their strengths and weaknesses. The results show that the ability to adapt communication to partners with different comprehension abilities is already present in current MLLMs.
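As a rough illustration of such a testing framework, the sketch below runs a few rounds of a referential game in which a speaker describes targets, a listener guesses, and per-round accuracy tracks whether the speaker adapts to its partner. The agent interfaces and stub classes are hypothetical placeholders, not the framework or MLLM APIs used in the thesis.

```python
# Minimal sketch (hypothetical interfaces): a few-interaction loop probing
# whether a speaker adapts its descriptions to a listener's feedback.
# Real agents would wrap MLLMs; here they are simple stubs.
from dataclasses import dataclass
import random

@dataclass
class Interaction:
    description: str
    correct: bool

class StubSpeaker:
    def describe(self, target, history):
        # A real speaker would condition an MLLM prompt on past failures.
        return f"an image showing {target}"

class StubListener:
    def guess(self, description, candidates):
        # A real listener would pick the candidate best matching the description.
        return random.choice(candidates)

def adaptation_loop(speaker, listener, targets, candidates, num_rounds=5):
    history, accuracy_per_round = [], []
    for _ in range(num_rounds):
        correct = 0
        for target in targets:
            description = speaker.describe(target, history)
            guess = listener.guess(description, candidates)
            success = guess == target
            history.append(Interaction(description, success))
            correct += success
        accuracy_per_round.append(correct / len(targets))
    # Rising accuracy across rounds would indicate on-the-fly adaptation.
    return accuracy_per_round

print(adaptation_loop(StubSpeaker(), StubListener(),
                      targets=["a dog", "a cat"],
                      candidates=["a dog", "a cat", "a bird"]))
```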