
Project

Spatially Adaptive Neural Networks for Computer Vision

Over the past decade, computer vision has seen remarkable progress due to the emergence of data-driven deep learning approaches. Convolutional neural networks (CNNs) extract relevant features automatically by training on annotated data. As research advances, architectures grow more complex, with more trainable parameters and more computations. Executing these models requires powerful hardware, which limits their applicability in low-power and real-time applications. As a result, research on efficient deep learning has gained momentum.

Most convolutional neural network architectures use static inference graphs; they apply the same operations on every image. However, one could argue that not every image is equally complex, with the high-capacity models only required for the most difficult images. Based on that idea, dynamic neural networks adapt the operations to the input image. This way, the average number of computational operations to process an image can be reduced, resulting in faster processing times and lower power consumption. Additionally, dynamic models can expand the parameter space while keeping the computational cost under control, leading to superior representations.

In this PhD, we focus on spatially adaptive neural networks, where operations are adapted in the spatial dimensions. New computer vision tasks use high-resolution images or video, with a wider range of content than ever before. Standard CNNs apply the same operations to every image and every pixel. In contrast, spatially adaptive neural networks apply more calculations to the most essential areas of an image, making better use of the available computing resources. We propose three new methods to achieve this.

Our first method, 'dynamic convolutions', executes convolutional layers sparsely by only updating representations for important image regions. For each residual block, a small gating branch predicts which spatial positions should be evaluated. These discrete gating decisions are trained end-to-end using the Gumbel-Softmax trick. We demonstrate the method on image classification datasets such as CIFAR and ImageNet, as well as on human pose estimation. The latter task is inherently spatially sparse, and processing speed increases by 60 percent with no loss in accuracy.
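To illustrate the idea (this is not the thesis code), a minimal PyTorch-style sketch of a residual block with a spatial gating branch trained via the Gumbel-Softmax trick could look as follows; the layer names and the gumbel_tau parameter are illustrative assumptions, and the masked positions are still computed densely here, whereas the actual method skips them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedResidualBlock(nn.Module):
    def __init__(self, channels, gumbel_tau=1.0):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        # Lightweight gating branch: per-position logits for skip / execute.
        self.gate = nn.Conv2d(channels, 2, 1)
        self.tau = gumbel_tau

    def forward(self, x):
        logits = self.gate(x)                                   # (N, 2, H, W)
        if self.training:
            # Discrete decisions with straight-through Gumbel-Softmax gradients.
            mask = F.gumbel_softmax(logits, tau=self.tau, hard=True, dim=1)[:, 1:2]
        else:
            mask = (logits[:, 1:2] > logits[:, 0:1]).float()    # hard decisions
        residual = self.conv2(F.relu(self.conv1(x)))
        # Only selected positions contribute to the residual update.
        return x + mask * residual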

Our second method is designed for dense prediction tasks such as semantic segmentation, where every pixel requires a prediction. We propose dual-resolution networks, where the image is split into block-sized regions and the processing resolution of each region is adapted based on its complexity. Simple regions, such as plain sky and road areas, are processed with fewer computations, while complex and important regions receive fine-grained representations. In addition, the method reduces memory consumption by storing features at a lower resolution. The policy that selects high-resolution regions is trained using reinforcement learning. We integrate the method into the SwiftNet backbone for semantic segmentation on the Cityscapes dataset. The number of operations is reduced by 60 percent and the inference speed is increased by 50 percent, with only a 0.3 percent reduction in the intersection-over-union accuracy metric.
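A rough sketch of the per-region resolution choice, again assuming a PyTorch setup: the block size, the policy callable and the backbone features module are illustrative placeholders, the reinforcement-learning training of the policy is omitted, and the backbone is assumed to have a fixed stride so that low-resolution features can be upsampled to match.

import torch
import torch.nn.functional as F

def dual_resolution_forward(image, policy, features, block=128):
    """Process each block-sized region at full or at half resolution."""
    _, _, H, W = image.shape
    rows = []
    for y in range(0, H, block):
        row = []
        for x in range(0, W, block):
            patch = image[:, :, y:y + block, x:x + block]
            if policy(patch):                       # policy says: high resolution
                feat = features(patch)
            else:                                   # low resolution: cheaper compute
                small = F.interpolate(patch, scale_factor=0.5,
                                      mode='bilinear', align_corners=False)
                feat = F.interpolate(features(small), scale_factor=2,
                                     mode='bilinear', align_corners=False)
            row.append(feat)
        rows.append(torch.cat(row, dim=3))
    return torch.cat(rows, dim=2)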

The last method shifts the focus from image processing to video processing. Due to the availability of large-scale annotated datasets, most convolutional neural networks are trained on image datasets. They are then deployed on video streams by processing them frame by frame. However, as visual content tends to change slowly between frames, the temporal dimension introduces opportunities to re-use features and further reduce the number of computations. Our BlockCopy method takes standard networks designed for images and runs them efficiently on video. A lightweight policy network determines the important regions in a frame, and operations are applied only on these selected regions, using custom block-sparse convolutions. Non-selected regions simply re-use the features of the preceding frame.
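Conceptually, the frame-by-frame reuse can be sketched as follows (a simplified illustration, not the actual implementation): the policy_net, backbone and block_size names are assumed, the backbone is assumed to keep the input resolution, and the recomputation here is dense for clarity, whereas the real method uses block-sparse convolutions so that only the selected blocks are touched.

import torch

def process_video(frames, policy_net, backbone, block_size=128):
    prev_feat = None
    outputs = []
    for frame in frames:
        if prev_feat is None:
            feat = backbone(frame)                  # first frame: full computation
        else:
            # Per-block execution decisions in {0, 1}, shape (N, 1, H/bs, W/bs).
            grid = policy_net(frame, prev_feat)
            # Expand the block grid to a per-pixel mask.
            mask = grid.repeat_interleave(block_size, dim=2) \
                       .repeat_interleave(block_size, dim=3)
            new_feat = backbone(frame)              # dense here; block-sparse in practice
            # Selected blocks get fresh features, the rest copy the previous frame.
            feat = mask * new_feat + (1 - mask) * prev_feat
        prev_feat = feat
        outputs.append(feat)
    return outputs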

One noteworthy feature is that the lightweight policy is trained in an online fashion at deployment time, without requiring annotated video ground-truth data. The predictions of the main network guide the selections of the policy. Our universal framework is demonstrated on dense prediction tasks such as pedestrian detection, instance segmentation and semantic segmentation, using both state-of-the-art (Center-and-Scale Predictor, MGAN, SwiftNet) and standard baseline networks (Mask-RCNN, DeepLabV3+).
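One possible way to picture this online training signal (a heavily hedged sketch, not the thesis implementation): blocks whose predictions change noticeably between frames are rewarded for being executed, using a REINFORCE-style update. All names (grid_logits, grid_sample, optimizer) are illustrative assumptions.

import torch
import torch.nn.functional as F

def online_policy_step(optimizer, grid_logits, grid_sample, prev_pred, new_pred):
    # Reward: per-block magnitude of change in the main network's predictions.
    change = (new_pred - prev_pred).abs().mean(dim=1, keepdim=True)
    block_reward = F.adaptive_avg_pool2d(change, grid_sample.shape[-2:])
    # Policy-gradient update on the sampled per-block execution decisions.
    log_prob = torch.distributions.Bernoulli(logits=grid_logits).log_prob(grid_sample)
    loss = -(block_reward.detach() * log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()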

Dynamic neural networks could play a role in edge computing, continual learning, and the deployment of large-scale multi-task foundation models.

Date: 1 Oct 2018 → 26 Jan 2023
Keywords: computer vision, deep learning, hand pose estimation, machine learning
Disciplines: Multimedia processing, Signal processing
Project type: PhD project