Hardware-aware, context-scalable processing for embedded visual navigation
Parallelization is a key technique for achieving the high computation rates required by modern computing applications. For data-dominated applications in particular, executing multiple operations in parallel is a necessity to meet performance requirements. How this parallelization is implemented, however, affects both efficiency and flexibility: sharing resources between parallel stages improves energy efficiency, but also limits flexibility, as the shared resource enforces specific computation patterns. The goal of this dissertation is therefore to characterize the performance-flexibility-efficiency trade-offs of different parallelization techniques and to build on these insights to develop a new architecture that handles data-dominated applications flexibly, efficiently, and in real time.
The two main parallelization techniques investigated are data-level and task-level parallelism. Data-level parallelism executes the same operation on multiple data elements, sharing a single instruction between parallel processing units. In programmable processor architectures, this approach amortizes instruction fetching and decoding across lanes (SIMD, Single Instruction Multiple Data). This work explores the effects of scaling up data-level parallelism and demonstrates its limitations. Most notably, higher degrees of parallelization do not necessarily yield a more energy-efficient platform, because the processing units require high memory bandwidth. This memory-bandwidth cost of wide parallel data lanes, combined with the limits imposed by Amdahl's law, constrains parallelism in SIMD architectures.
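The diminishing returns imposed by Amdahl's law can be illustrated with a short sketch. The serial fraction used here (10%) is purely illustrative, not a figure from the dissertation:

```python
def amdahl_speedup(parallel_fraction: float, lanes: int) -> float:
    """Speedup of a workload where `parallel_fraction` of the work
    scales across `lanes` parallel SIMD lanes and the remainder
    stays serial (Amdahl's law)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / lanes)

# Illustrative 90%-parallel workload: speedup saturates far below
# the lane count, so widening the datapath (and paying for the
# extra memory bandwidth it demands) yields diminishing returns.
for lanes in (1, 4, 16, 64, 256):
    print(f"{lanes:4d} lanes -> {amdahl_speedup(0.9, lanes):.2f}x")
# Speedup never exceeds 1 / (1 - 0.9) = 10x, however many lanes.
```

Even with an unbounded number of lanes, the speedup is capped at the reciprocal of the serial fraction, which is why the dissertation treats wide SIMD scaling as a limited strategy.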
Task-level parallelism distributes instructions over parallel processing units and shares the data between them (MISD, Multiple Instruction Single Data). To relieve memory-bandwidth pressure, data elements are pushed through a deep pipeline before being committed back to memory, similar to the systolic-array operating principle. As a result, this technique remains energy efficient even at higher degrees of parallelization. The price paid for this effective parallelization is flexibility: data sharing puts significant constraints on the interconnections between processing units and requires considerable application-mapping effort.
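The pipelined operating principle described above can be sketched in a few lines: each data element streams through a chain of distinct operations before a single commit to memory, so intermediate values never touch the memory system. The stage functions here are illustrative placeholders, not the dissertation's actual kernels:

```python
from typing import Callable, Iterable

def pipeline(stages: list[Callable[[int], int]],
             data: Iterable[int]) -> list[int]:
    """Push each element through every stage (MISD-style task
    parallelism) before committing the result to memory."""
    out = []
    for x in data:           # one element streams through all stages
        for stage in stages:
            x = stage(x)     # value stays "in flight" between units
        out.append(x)        # single memory commit per element
    return out

# Example: three chained operations on a small input stream.
result = pipeline([lambda x: x + 1,   # stage 1
                   lambda x: x * 2,   # stage 2
                   lambda x: x - 3],  # stage 3
                  range(4))
print(result)  # [-1, 1, 3, 5]
```

In hardware, each stage would be a separate processing unit and all stages operate concurrently on different elements; the sketch only captures the dataflow ordering, which is what keeps intermediate results out of memory.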
To facilitate this MISD operating principle, this work introduces three major innovations in the domain of Coarse-Grained Reconfigurable Arrays (CGRAs). First, it eliminates the overhead of excessive reconfigurability. Second, it distributes memory buffers across the architecture, replacing the inefficient central memory of traditional designs. Third, it interconnects the processing units with a network structure optimized for the application domain. In addition, this work presents a series of techniques that address challenges throughout the architectural development cycle, from conceptualization and visualization to physical implementation.
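As a minimal sketch of the second and third ideas above, consider a grid of processing elements where each element owns a small local buffer and links only to its immediate neighbors, rather than sharing a central memory and a global crossbar. All names, the mesh topology, and the buffer size are illustrative assumptions, not the dissertation's actual design:

```python
class PE:
    """Illustrative processing element with a distributed local
    buffer and a fixed, limited set of interconnect links."""
    def __init__(self, row: int, col: int, buf_words: int = 32):
        self.row, self.col = row, col
        self.buffer = [0] * buf_words   # local memory instead of a central one
        self.neighbors: list["PE"] = [] # sparse, fixed interconnect

def build_mesh(rows: int, cols: int) -> list[list[PE]]:
    """Connect each PE to its 4-neighborhood (assumed topology)."""
    grid = [[PE(r, c) for c in range(cols)] for r in range(rows)]
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    grid[r][c].neighbors.append(grid[nr][nc])
    return grid

mesh = build_mesh(4, 4)
# Corner PEs have 2 links, edge PEs 3, interior PEs 4 -- far fewer
# wires than a full crossbar between all 16 units.
print(len(mesh[0][0].neighbors), len(mesh[1][1].neighbors))  # 2 4
```

The point of the sketch is the scaling: link count per unit stays constant as the array grows, whereas a central memory or crossbar grows with the number of units, which is where the efficiency gap comes from.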
Together, these innovations have led to two silicon implementations in recent technology nodes (22 nm FDX and 28 nm CMOS). They achieve peak performances of 103.2 GOPS and 295.9 GOPS at efficiencies of 554.0 GOPS/W and 1163.1 GOPS/W, respectively, marking up to a 42× performance improvement and a 4× efficiency improvement over an optimized data-parallel design.