Abstract:
The human visual system performs the impressive task of converting light arriving at the retina into a useful representation that allows us to make sense of the visual environment. We can navigate easily in the three-dimensional world and recognize objects and their properties, even if they appear from different angles and under different lighting conditions. Artificial systems can also perform well on a variety of complex visual tasks. While they may not be as robust and versatile as their biological counterpart, they have surprising capabilities that are rapidly improving. Studying the two types of systems can help us understand what computations enable the transformation of low-level sensory data into an abstract representation. To this end, this dissertation follows three different pathways.
First, we analyze aspects of human perception. The focus is on the perception in the peripheral visual field and the relation to texture perception. Our work builds on a texture model that is based on the features of a deep neural network. We start by expanding the model to the temporal domain to capture dynamic textures such as flames or water. Next, we use psychophysical methods to investigate quantitatively whether humans can distinguish natural textures from samples that were generated by a texture model. Finally, we study images that cover the entire visual field and test whether matching the local summary statistics can produce metameric images independent of the image content.
Second, we compare the visual perception of humans and machines. We conduct three case studies that focus on the capabilities of artificial neural networks and the potential occurrence of biological phenomena in machine vision. We find that comparative studies are not always straightforward and propose a checklist on how to improve the robustness of the conclusions that we draw from such studies.
Third, we address a fundamental discrepancy between human and machine vision. One major strength of biological vision is its robustness to changes in the appearance of image content. For example, for unusual scenarios, such as a cow on a beach, the recognition performance of humans remains high. This ability is lacking in many artificial systems. We discuss on a conceptual level how to robustly disentangle attributes that are correlated during training, and test this on a number of datasets.