Institute for Communication Technologies and Embedded Systems

Efficiency and scalability exploration of an application-specific instruction-set processor for deep convolutional neural networks

Authors:
Bytyn, A.
Ph.D. Dissertation
School:
RWTH Aachen University
Address:
Institute for Integrated Signal Processing Systems
Date:
Dec. 2020
DOI:
10.18154/RWTH-2021-02253
hsb:
RWTH-2021-02253
Language:
English
Abstract:
With the increasing adoption of artificial neural networks (ANNs) for tasks such as image classification and object detection, the need for efficient acceleration of these workloads has grown accordingly. Convolutional neural networks (CNNs) in particular have reached both the mainstream consumer market, as evidenced, for example, by the CNN-based image enhancement filters found in modern smartphone cameras, and more traditional branches such as the automotive industry, which aims to use CNNs for pedestrian detection, traffic sign recognition, and similar tasks. Currently available off-the-shelf processing systems, however, do not offer both the flexibility and the energy efficiency required in energy- and power-constrained environments. This thesis provides methods and a proof-of-concept hardware implementation that help bridge this efficiency-flexibility gap by jointly investigating algorithm, software, and hardware optimizations. Several algorithmic techniques that increase the overall processing efficiency are quantitatively investigated, including different methods for the quantization of computations and for the compression of neural network parameters (pruning). To ensure that the results are representative, CNNs using regular convolutional layers as well as more advanced residual blocks and depthwise-separable convolutions are explored. Based on this in-depth analysis, an application-specific instruction-set processor (ASIP) was designed from scratch that is capable of exploiting several of these efficiency-enhancing techniques. Its arithmetic units were designed according to the aforementioned quantization analysis in order to minimize their dynamic power consumption, employing subword-parallel multipliers with reduced word width that can also exploit the sparsity found in CNN computations. Numerous application-specific sub-modules were incorporated to ensure a high utilization of the arithmetic units. Furthermore, the ASIP's design space, which results from choosing different degrees of data-level parallelism (DLP), is thoroughly investigated based on logic-synthesis results obtained in a modern 28 nm CMOS technology. Several differently scaled versions of the processor were implemented down to the placed-and-routed chip layout and subsequently simulated and analyzed. Highly optimized software kernels, which exploit multiple degrees of parallelism and maximize local data reuse to minimize redundant off-chip transfers, were implemented for the proposed ASIP. The problem of determining a suitable dataflow pattern, i.e. deciding which portions of the CNN to keep in local memory and when and where to load new data, is modeled analytically via an integer non-linear programming (INLP) formulation. This INLP model is used for a system-level exploration of relevant parameters such as on-chip memory size and off-chip bandwidth. Finally, post-layout netlist simulations are combined with the optimized software kernels to execute several benchmark CNNs. The resulting performance and energy efficiency are on the same level as hardwired state-of-the-art accelerators, while the ASIP's software programmability offers vastly increased flexibility.
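Note: The abstract mentions quantization of computations and magnitude-based parameter compression (pruning) without giving details. The following minimal Python sketch illustrates the general idea of symmetric fixed-point quantization and magnitude pruning of a CNN weight tensor; it is not taken from the thesis, and all function and parameter names are illustrative assumptions.

```python
# Illustrative sketch only: symmetric fixed-point quantization and magnitude
# pruning of CNN weights. Not the scheme used in the dissertation.
import numpy as np

def quantize_symmetric(weights: np.ndarray, bits: int = 8):
    """Map float weights to signed integers in [-(2^(bits-1)-1), 2^(bits-1)-1]."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.max(np.abs(weights))
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(weights / scale), -qmax, qmax).astype(np.int32)
    return q, scale  # dequantize via q * scale

def prune_by_magnitude(weights: np.ndarray, sparsity: float = 0.5):
    """Zero out the smallest-magnitude weights until the target sparsity is reached."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) > threshold
    return weights * mask, mask

# Example: prune and quantize one convolutional weight tensor (64x32x3x3).
w = np.random.randn(64, 32, 3, 3).astype(np.float32)
w_pruned, mask = prune_by_magnitude(w, sparsity=0.6)
w_q, s = quantize_symmetric(w_pruned, bits=8)
print(f"achieved sparsity: {1.0 - mask.mean():.2f}, scale: {s:.5f}")
```

Reduced word width shrinks the multipliers' dynamic power, and the zeros introduced by pruning are exactly the sparsity that sparsity-aware arithmetic units can skip.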
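Note: The dataflow problem referred to in the abstract (which CNN portions to keep in local memory, and when to load new data) is modeled in the thesis as an integer non-linear program. As a simplified stand-in, the sketch below enumerates output-tile sizes for a single convolutional layer and picks the one that minimizes modeled off-chip traffic under an on-chip buffer budget; the cost terms and names are assumptions, not the thesis' actual INLP formulation.

```python
# Illustrative tiling exploration for one conv layer (stride 1, square kernel K).
# Brute-force stand-in for an INLP-based dataflow optimization.
from itertools import product

def explore_tiling(H, W, C_in, C_out, K, buffer_bytes, bytes_per_elem=1):
    """Return (traffic, tile_h, tile_w, tile_cout) minimizing modeled DRAM traffic."""
    best = None
    for th, tw, tc in product(range(1, H + 1), range(1, W + 1), range(1, C_out + 1)):
        # On-chip footprint: input tile with halo, weight tile, output tile.
        in_tile = (th + K - 1) * (tw + K - 1) * C_in
        w_tile = K * K * C_in * tc
        out_tile = th * tw * tc
        if (in_tile + w_tile + out_tile) * bytes_per_elem > buffer_bytes:
            continue  # does not fit into the on-chip buffer
        n_tiles = -(-H // th) * -(-W // tw) * -(-C_out // tc)  # ceil divisions
        # Inputs and weights are re-fetched per tile; outputs are written once.
        traffic = n_tiles * (in_tile + w_tile) + H * W * C_out
        if best is None or traffic < best[0]:
            best = (traffic, th, tw, tc)
    return best

# Example: 56x56x64 -> 56x56x64 layer, 3x3 kernel, 64 KiB on-chip buffer.
print(explore_tiling(H=56, W=56, C_in=64, C_out=64, K=3, buffer_bytes=64 * 1024))
```

The same kind of cost model, swept over on-chip memory size and off-chip bandwidth, is what enables the system-level exploration described in the abstract.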