Abstract:
In this thesis, our primary objective is to reduce the computational footprint of state-of-the-art stereo deep neural network (DNN) methods while maintaining their performance. Classical stereo methods used in computer vision applications, such as robotics and autonomous driving, often require complex tuning and struggle to perform well in real-world scenarios. On the other hand, recent end-to-end DNN methods have shown superior performance but come with high computational requirements, making them unsuitable for real-time applications.
To achieve our objective, we pursue two complementary paths. First, we optimize the individual components of state-of-the-art deep neural networks through a detailed empirical evaluation, which identifies the bottlenecks in state-of-the-art stereo methods. Our findings reveal that the computational load stems primarily from the three-dimensional (3D) convolutions used in performance-oriented end-to-end stereo methods. Taking inspiration from the success of MobileNet blocks for two-dimensional (2D) convolutions, we propose a set of separable convolutions in 3D space. We thoroughly investigate the impact of making convolutions separable along different dimensions and demonstrate significant reductions in computational load without sacrificing performance; in fact, we observe performance improvements. Building on these conclusions, we design a family of networks based on 2D and 3D separable convolutions.
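To make the scale of the saving concrete, the parameter counts of a standard 3D convolution and a depthwise-separable variant (in the spirit of MobileNet blocks) can be compared with simple arithmetic. The channel and kernel sizes below are illustrative and not taken from the thesis:

```python
def full_conv3d_params(c_in, c_out, k):
    """Parameters of a standard 3D convolution with a k x k x k kernel
    (bias terms omitted for simplicity)."""
    return c_in * c_out * k * k * k

def separable_conv3d_params(c_in, c_out, k):
    """Depthwise k x k x k convolution (one filter per input channel)
    followed by a 1 x 1 x 1 pointwise convolution, as in a
    MobileNet-style separable block."""
    depthwise = c_in * k * k * k
    pointwise = c_in * c_out
    return depthwise + pointwise

# Hypothetical layer: 32 input channels, 32 output channels, 3x3x3 kernel.
c_in, c_out, k = 32, 32, 3
full = full_conv3d_params(c_in, c_out, k)      # 32 * 32 * 27 = 27648
sep = separable_conv3d_params(c_in, c_out, k)  # 32 * 27 + 32 * 32 = 1888
print(full, sep, round(full / sep, 1))         # roughly a 14x reduction
```

The same factorization idea extends to splitting the kernel along individual spatial or disparity dimensions, which is the design space the thesis explores.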
Furthermore, we explore the design of a leaner backbone for real-time stereo networks. We introduce a two-branch architecture that explicitly captures pixel-level and semantic-level information from the input images. This design yields a lean backbone with reduced computational load, albeit at a slight cost in accuracy. To recover the lost performance, we propose learned attention weights derived from the cost volume, combined with a LogL1 loss for stereo matching.
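A minimal sketch of a LogL1-style loss for disparity regression, assuming the common form log(1 + |error|); the exact formulation and any weighting used in the thesis may differ:

```python
import numpy as np

def logl1_loss(pred_disp, gt_disp):
    """LogL1-style loss: log(1 + |pred - gt|), averaged over pixels.
    Compared with plain L1, the logarithm compresses large residuals,
    so a few badly matched pixels dominate the gradient less.
    This is an illustrative variant, not necessarily the thesis's exact loss."""
    return np.mean(np.log1p(np.abs(pred_disp - gt_disp)))

# Hypothetical disparity maps for a 2x2 image patch.
pred = np.array([[10.0, 12.0], [8.0, 9.0]])
gt = np.array([[10.0, 11.0], [8.0, 10.0]])
print(logl1_loss(pred, gt))
```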
In addition to optimizing individual components and modules, we investigate knowledge distillation for designing leaner and faster stereo networks. Leveraging insights from stereo methods and general knowledge distillation techniques, we introduce a novel knowledge distillation pipeline. Through a systematic study of design choices, we develop a leaner and faster stereo network with competitive performance. We emphasize the importance of carefully selecting distillation points and loss functions when distilling stereo networks, as both have a significant impact on performance. The trained student networks not only rival performance-oriented methods but also give results comparable to speed-oriented stereo methods.
Overall, our thesis contributes to the development of computationally efficient and high-performing stereo vision systems. By addressing the computational challenges of state-of-the-art stereo methods and leveraging knowledge distillation techniques, we facilitate the adoption of these methods for real-world systems and applications. We firmly believe that the findings and methodologies presented in this thesis advance the field of stereo vision and pave the way for more practical and effective depth estimation solutions.