
Publication

Multi-Task Learning for Visual Scene Understanding

Book - Dissertation

Despite the recent progress in deep learning, most approaches still adopt a silo-like solution, training a separate neural network for each individual task. Yet, many real-world problems require solving a multitude of tasks concurrently. For example, an autonomous vehicle should be able to detect all objects in the scene, localize them, estimate their distance and trajectory, etc., in order to navigate safely in its surroundings. Similarly, an image recognition system for commerce applications should be able to tag products, retrieve similar items, come up with personalized recommendations, etc., to provide customers with the best possible service. These types of problems have motivated researchers to build multi-task learning models that can jointly tackle multiple tasks. The key idea of multi-task learning is to learn multiple tasks in parallel while sharing the learned representations.

Multi-task networks bring several practical advantages over the single-task case, where each individual task is solved separately by its own network. First, due to the sharing of layers, the resulting memory footprint is substantially reduced. Second, because they avoid repeatedly computing the features in the shared layers, once for every task, they achieve higher inference speeds. Third, they have the potential to improve performance when the associated tasks share complementary information or act as a regularizer for one another.

We face two important challenges when building multi-task learning models. First, we need to design neural network architectures that can handle multiple tasks. Second, we need to develop novel training recipes for jointly learning multiple tasks. In particular, since we optimize multiple objectives in parallel, one or more tasks could start to dominate the weight updates, hindering the model from learning the other tasks. In this manuscript, we delve into both problems within the context of visual scene understanding.

We propose two new types of models to address the architectural problem. First, we explore branched multi-task networks, where the deeper layers of the neural network gradually grow more task-specific. We introduce a principled method to automate the construction of such branched multi-task networks. The construction process groups together the tasks that can be solved with a similar set of features, while making a trade-off between task similarity and network complexity. In this way, our method produces models that yield a better trade-off between performance and the amount of computational resources. Second, we propose a novel neural network architecture for jointly tackling multiple dense prediction tasks. The key idea is to improve the predictions for each task by distilling useful information from the predictions of other tasks at multiple scales. The motivation for including multiple scales is based upon the observation that tasks with high affinity at a certain scale are not guaranteed to retain this behavior at other scales, and vice versa. Extensive experiments across two popular benchmarks for dense labeling show that, unlike prior work, our model delivers on the full potential of multi-task learning, that is, a smaller memory footprint, a reduced number of calculations, and better performance with respect to single-task learning.
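For intuition, the sketch below illustrates the shared-representation setup and the weighted joint objective described above. It is a minimal PyTorch example: the encoder, the task heads and the loss weights are illustrative assumptions and do not correspond to the specific branched or multi-scale architectures proposed in this thesis.

import torch.nn as nn
import torch.nn.functional as F

class SharedMultiTaskNet(nn.Module):
    """Toy multi-task network: one shared encoder feeds several task-specific heads."""
    def __init__(self, num_seg_classes=21):
        super().__init__()
        # Shared layers: computed once per image and reused by every task.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        # Task-specific heads: one per task.
        self.seg_head = nn.Conv2d(64, num_seg_classes, 1)  # semantic segmentation
        self.depth_head = nn.Conv2d(64, 1, 1)               # monocular depth

    def forward(self, x):
        feats = self.encoder(x)  # shared representation
        return {"seg": self.seg_head(feats), "depth": self.depth_head(feats)}

def multi_task_loss(outputs, targets, w_seg=1.0, w_depth=1.0):
    # The joint objective is a weighted sum of per-task losses. If one task's
    # loss (or its gradients) is much larger than the others, it dominates the
    # weight updates, which is exactly the balancing problem discussed above.
    seg_loss = F.cross_entropy(outputs["seg"], targets["seg"])
    depth_loss = F.l1_loss(outputs["depth"], targets["depth"])
    return w_seg * seg_loss + w_depth * depth_loss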
Further, we also consider the multi-task learning optimization problem. We first analyze several existing techniques for balancing the learning of tasks. Surprisingly, we uncover several serious discrepancies between works. We hypothesize that this can be attributed to the lack of standardized benchmarks for multi-task learning, and to the fact that different benchmarks benefit from different strategies. Based upon this result, we then isolate the most promising elements and propose a set of heuristics to balance the tasks. The heuristics are of a more practical nature than existing approaches. Further, applying the heuristics results in more robust performance across different benchmarks.

In the final chapter, we consider the problem of scene understanding from an alternative point of view. Many of the models described in the literature benefit from supervised pretraining. In this case, the model is first pretrained on a larger annotated dataset like ImageNet, before being transferred to the tasks of interest. This allows the model to perform well, even on datasets with only a modest amount of labeled examples. Unfortunately, supervised pretraining relies on an annotated dataset itself, which limits its applicability. To address this problem, researchers have started to explore self-supervised learning methods. We choose to revisit recent popular works based upon contrastive learning. First, we show that an existing method like MoCo obtains robust results across different datasets, including scene-centric, long-tailed and domain-specific data. Second, we improve the learned representations by imposing additional invariances. This result directly benefits many downstream tasks like semantic segmentation, detection, etc. Finally, we show that the improvements obtained via self-supervised learning translate to multi-task learning networks too.
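To make the contrastive pretraining concrete: MoCo-style methods pull two augmented views of the same image together while pushing them away from a queue of negatives, using an InfoNCE objective. Below is a minimal sketch of that loss in PyTorch; the tensor names and the temperature value are illustrative assumptions and not the exact implementation used in this thesis.

import torch
import torch.nn.functional as F

def info_nce_loss(q, k, queue, temperature=0.07):
    """InfoNCE loss as used in MoCo-style contrastive learning.

    q:     (N, D) query features from the online encoder
    k:     (N, D) key features of the same images from the momentum encoder
    queue: (K, D) memory bank of negative key features
    """
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1)
    queue = F.normalize(queue, dim=1)

    # Positive logits: similarity between each query and its own key.
    l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(1)  # (N, 1)
    # Negative logits: similarity between each query and all queued keys.
    l_neg = torch.einsum("nd,kd->nk", q, queue)          # (N, K)

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)  # the positive key is class 0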
In summary, this manuscript presents several important contributions for improving multi-task learning models for visual scene understanding. The innovations focus on improving the neural network architecture, the optimization process and the pretraining aspect. All methods are thoroughly tested across a variety of benchmarks. The code is made publicly available: https://github.com/SimonVandenhende.

Year of publication: 2022
Accessibility: Open