
Publication

Large Scale Video Understanding

Book - Dissertation

In this dissertation, we address the challenge of video understanding and propose several promising methods for its solution. In the context of video analysis, the recognition of human actions, the detection of objects present, and the classification of the observed scene are all important tasks. Typical applications of such semantic video analysis are more effective video retrieval, summarization, and recommendation tools. With these goals in mind, we present novel deep neural network-based methods, apply transfer learning and self-supervised learning, extract motion information, and obtain holistic visual representations.

In recent years, Convolutional Neural Networks (CNNs) have proven to be reliable approaches for a wide range of visual understanding tasks. To recognize human actions in video clips, we process them with novel 3D CNNs, which handle spatial and temporal information simultaneously. The proposed CNN architectures derive appearance and motion features jointly, enabling fast and accurate video understanding. Additionally, it is important to encode features from different parts of a video in order to create an overall representation. To that end, we propose Temporal Linear Encoding (TLE), Spatio-Temporal Channel Correlation (STC), Temporal 3D Networks (T3D), and a Temporal Transition Layer (TTL) as possible solutions. All of these methods have one thing in common: they allow for end-to-end learning, which makes them more applicable in settings with minimal supervision.

An essential part of understanding videos is the extraction of meaningful motion patterns. As mentioned, 3D CNNs collect temporal information across the frames of a video. We introduce the Dynamic Action and Motion Network (DynamoNet), which learns action and motion patterns concurrently in one deep neural network. This combined analysis yields better action interpretation, and it does so without the need for optical flow, a form of rather strong supervision that is often used in alternative video recognition methods. Instead, motion patterns are learned by predicting the next frame as a self-supervised task while concurrently learning action recognition.

Arguably the most impactful contribution of this thesis is the introduction of holistic video understanding, which goes well beyond mere action recognition. Most datasets in use today, however, restrict their annotations to such action and activity labels. Therefore, as part of this work, we have produced the large-scale 'Holistic Video Understanding' (HVU) dataset. It comprises almost 700K videos, with annotations for objects, scenes, actions, attributes, and more. A novel 2D/3D CNN architecture called 'Holistic Appearance and Temporal Network' (HATNet) is proposed to handle holistic video understanding.
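To make the idea of joint spatial and temporal processing concrete, the following is a minimal sketch of a 3D CNN in PyTorch. It is an illustrative toy model, not one of the dissertation's architectures: every 3D kernel convolves over frames, height, and width at once, so appearance and motion cues are captured together. All layer sizes are assumptions chosen for brevity.

```python
import torch
import torch.nn as nn

class Tiny3DCNN(nn.Module):
    """Toy 3D CNN: each kernel spans time, height, and width at once."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            # kernel_size=3 means a (3, 3, 3) kernel over (frames, H, W)
            nn.Conv3d(3, 32, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),              # global spatio-temporal pool
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, clip):
        # clip: (batch, channels, frames, height, width)
        return self.classifier(self.features(clip).flatten(1))

# Usage: a batch of two 16-frame RGB clips at 112x112 resolution.
logits = Tiny3DCNN()(torch.randn(2, 3, 16, 112, 112))
print(logits.shape)  # torch.Size([2, 10])
```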
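The next sketch illustrates the aggregation idea behind Temporal Linear Encoding: features from several video segments are combined into one video-level representation that trains end to end. The element-wise product aggregation follows the published TLE formulation; the full method adds a bilinear encoding step, which is replaced here by a plain linear classifier, and the feature dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TLEHead(nn.Module):
    """Aggregates per-segment features into one clip-level prediction."""
    def __init__(self, feat_dim=64, num_classes=10):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, segment_feats):
        # segment_feats: (batch, segments, feat_dim), one vector per segment
        aggregated = torch.prod(segment_feats, dim=1)  # element-wise product
        return self.fc(aggregated)

# Usage: three segments per video, 64-dim features each.
head = TLEHead()
print(head(torch.randn(2, 3, 64)).shape)  # torch.Size([2, 10])
```

Because the aggregation and the classifier are both differentiable, gradients flow back into whatever backbone produced the segment features, which is what makes the encoding end-to-end trainable.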
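Finally, a hedged sketch of the DynamoNet training idea: a single network is optimized jointly for action classification and for predicting future frames, so motion is learned self-supervised rather than from optical flow. The modules and the loss weighting below are illustrative assumptions, not the exact DynamoNet design.

```python
import torch
import torch.nn as nn

class JointModel(nn.Module):
    """One shared encoder feeds both an action head and a frame decoder."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(inplace=True)
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, num_classes)
        )
        # Decoder maps features back to pixel space for frame prediction.
        self.decoder = nn.Conv3d(16, 3, kernel_size=3, padding=1)

    def forward(self, clip):
        h = self.encoder(clip)
        return self.classifier(h), self.decoder(h)

model = JointModel()
clip = torch.randn(2, 3, 8, 64, 64)    # input frames t .. t+7
target = torch.randn(2, 3, 8, 64, 64)  # frames t+1 .. t+8 (shifted by one)
labels = torch.randint(0, 10, (2,))
logits, predicted = model(clip)
# Joint objective: supervised action loss + self-supervised prediction loss
# (the 0.5 weight is an arbitrary illustrative choice).
loss = nn.functional.cross_entropy(logits, labels) \
     + 0.5 * nn.functional.mse_loss(predicted, target)
loss.backward()
```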
Year of publication: 2022
Accessibility: Open