
Project

Overcoming Complexity of Visual Perception by Efficient Pose Estimation and Self-Supervised Representation Learning


Machine vision is a tremendously complex task, considering the extent of semantics an image can convey and the structure of its content. It is infeasible to instruct a machine to understand visual sensory data by defining a set of rules and heuristics. Thanks to deep neural networks, however, data-centric approaches have led to many breakthroughs that make machine vision possible in a wide range of real-world environments. Nevertheless, current top-performing visual perception systems offer unfavorable speed-accuracy trade-offs, and state-of-the-art models require large amounts of labeled data for training. This thesis investigates visual perception from these two fundamental viewpoints. First, we tackle the problem of object pose estimation, narrowing the gap between speed and accuracy by designing a novel framework that better handles the diverse and complex content of images. Second, we study multiple problems in self-supervised representation learning, where our goal is to make perception models learn from the semantic invariants of visual data and depend less on manually annotated training data to make sense of visual content.


Object pose estimation is a significant building block of many high-level visual perception systems. It is used, for example, in smartphone authentication, visual search, home assistant robots, and self-driving cars. It is an inherently difficult task, as a given image may contain an arbitrary number of objects from an unknown number of categories, arranged in complex structures and compositions. An efficient and accurate solution for pose estimation would address the core complexity of a wide range of visual perception systems. However, the most accurate pose estimation models are computationally demanding and relatively slow at runtime. Hence, we propose a fast and highly accurate pose estimation framework based on mixture density estimation. The mixture regression formulation acts as a prior that lets the model cope with the challenging factors obstructing efficient pose estimation without explicitly defining them.
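
To make the mixture-regression idea concrete, the sketch below shows a generic mixture density head for pose regression. It is only an illustration, not the thesis' actual architecture: the class and function names, the 6-dimensional pose parameterization, the isotropic Gaussian components, and the default sizes are all assumptions for the example. The head predicts several candidate poses per image together with mixing weights and spreads, so ambiguous inputs can be covered by multiple hypotheses instead of a single regression target.

# Minimal mixture-density head for pose regression (illustrative sketch only).
# Assumed: a backbone producing feat_dim features; a pose_dim pose vector;
# num_components isotropic Gaussian mixture components.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixturePoseHead(nn.Module):
    def __init__(self, feat_dim=512, pose_dim=6, num_components=4):
        super().__init__()
        self.K, self.P = num_components, pose_dim
        self.pi = nn.Linear(feat_dim, num_components)             # mixing weights
        self.mu = nn.Linear(feat_dim, num_components * pose_dim)  # pose means
        self.log_sigma = nn.Linear(feat_dim, num_components)      # per-component spread

    def forward(self, feats):
        log_pi = F.log_softmax(self.pi(feats), dim=-1)            # (B, K)
        mu = self.mu(feats).view(-1, self.K, self.P)              # (B, K, P)
        log_sigma = self.log_sigma(feats)                         # (B, K)
        return log_pi, mu, log_sigma

def mdn_nll(log_pi, mu, log_sigma, target):
    # Negative log-likelihood of the target pose under the Gaussian mixture
    # (up to an additive constant independent of the parameters).
    diff = target.unsqueeze(1) - mu                               # (B, K, P)
    sq = (diff ** 2).sum(-1) / (2.0 * torch.exp(2.0 * log_sigma)) # (B, K)
    log_prob = log_pi - sq - mu.size(-1) * log_sigma              # (B, K)
    return -torch.logsumexp(log_prob, dim=-1).mean()

At test time, one would typically report the mean of the highest-weight component as the pose estimate, while the remaining components absorb ambiguity such as symmetric or occluded objects.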


Self-supervised learning is about designing learning systems that do not need manually annotated data. We have studied three self-supervised learning problems in close connection to representation learning, which concerns mapping raw images to tractable lower-dimensional or sparse numerical vectors (also called features) that facilitate downstream higher-level inference. First, we propose a self-supervised framework for image classification that learns to classify images from an unlabeled image dataset. Our novel framework uses a mixture-of-embeddings module to impose semantic disentanglement in the representation space. Second, we propose a new formulation for self-supervised representation learning. Our solution models representation learning as a learning-to-rank problem and empirically improves the quality of the learned representations, measured by the accuracy of downstream tasks that rely on self-supervised representation encoders for feature extraction. Third, we introduce a self-supervised noise-contrastive approach for learning to generate representations of unseen images that are expected to be similar to a set of images we do have access to. We use a representation generator trained in this manner to augment training data in the feature space for few-shot learning tasks that lack sufficient training samples.
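
For readers unfamiliar with noise-contrastive training, the sketch below shows a generic InfoNCE-style objective of the family these methods build on. It is a minimal illustration under assumed names (info_nce, temperature), not the thesis' specific mixture-of-embeddings, learning-to-rank, or generator losses: two augmented views of the same image form a positive pair, and all other images in the batch act as negatives.

# Generic noise-contrastive (InfoNCE-style) objective; a sketch of the
# family of losses used in self-supervised representation learning.
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, temperature=0.1):
    # anchor, positive: (B, D) embeddings of two views of the same images;
    # every other sample in the batch serves as a negative.
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    logits = anchor @ positive.t() / temperature      # (B, B) similarity matrix
    labels = torch.arange(anchor.size(0), device=anchor.device)
    return F.cross_entropy(logits, labels)            # diagonal = positive pairs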


Throughout this thesis, our approach has been to illustrate that encoding a better inductive bias into the design and training of models consistently leads to more accurate and more efficient visual perception models. Indeed, human intuition about how visual content should be interpreted plays a central role in building more performant models. We provide principled formulations and analyses of our solutions and share extensive experimental results to showcase their advantages.

Date: 29 Jun 2018 → 27 Jan 2022
Keywords: Artificial intelligence, Computer vision, Deep learning, Machine learning
Disciplines: Nanotechnology, Design theories and methods
Project type: PhD project