- Book Title:
The compute demand for deep learning workloads is well known and is a prime motivator for powerful parallel computing platforms such as GPUs or dedicated hardware accelerators. The massive inherent parallelism of these workloads enables us to extract more performance by simply provisioning more compute hardware for a given task. This strategy can be directly exploited to build higher-performing hardware for DNN workloads, by incorporating as many parallel compute units as possible in a single system. This strategy is referred to as scaling up. Alternatively, it's feasible to arrange multiple hardware systems to work on a single problem, and in some cases, a cheaper alternative to exploit the given parallelism, or in other words, scaling out. As DNN based solutions become increasingly prevalent, so does the demand for computation, making the scaling choice (scale-up vs scale-out) critical. To study this design-space, this work makes two major contributions. (i) We describe a cycle-accurate simulator called SCALE-SIM for DNN inference on systolic arrays, which we use to model both scale-up and scale-out systems, modeling on-chip memory access, runtime, and DRAM bandwidth requirements for a given workload. (ii) We also present an analytical model to estimate the optimal scale-up vs scale-out ratio given hardware constraints (e.g, TOPS and DRAM bandwidth) for a given workload. We observe that a judicious choice of scaling can lead to performance improvements as high as 50 per layer, within the available DRAM bandwidth. This work demonstrates and analyzes the trade-off space for performance, DRAM bandwidth, and energy, and identifies sweet spots for various workloads and hardware configurations.