Deep learning, with deep neural networks as its most representative models, has been a primary driving force behind the recent rapid development of high-performance computing systems. Hardware researchers are proposing a large number of specialized chip architectures and corresponding scheduling schemes, while software developers are optimizing deep learning frameworks to maximize the utilization of both existing CPU/GPU platforms and new hardware such as the TPU. Both communities attempt to efficiently map neural network computations onto the underlying hardware, but they approach the problem from different system levels. In this article, we compare these approaches and explore opportunities to handle the shared challenges in a more unified way.
Dataflow techniques in domain-specific hardware accelerators
As workloads with abundant parallelism and locality, neural networks are a good fit for spatial architectures, which use many simple processing elements (PEs) with multiple levels of local buffers. Beyond efficient circuit-level implementations, the key architectural questions are how to schedule the computations onto the many PEs to fully exploit parallelism, and how to transfer data between off-chip memories and on-chip buffers to serve such highly parallel processing efficiently.
The word dataflow is used to describe such scheduling schemes. More specifically, dataflow covers not only how data are distributed to the PEs with broadcast or systolic transfers, but also how data in off-chip memories are blocked into smaller tiles that fit in on-chip buffers, as well as how data are partitioned and communicated in architectures with multiple individual processing engines.
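To make the blocking idea concrete, the sketch below tiles a matrix multiplication so that each tile plays the role of a data block staged from off-chip memory into an on-chip buffer. It is an illustrative sketch under our own naming (the `tile` parameter and function name are not from any particular accelerator):

```python
import numpy as np

def tiled_matmul(A, B, tile=4):
    """Blocked matrix multiply. Each (tile x tile) slice stands in for a
    data block staged from off-chip memory into an on-chip buffer."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    # Outer loops walk over tiles (the off-chip <-> on-chip transfers);
    # the inner product is the work a PE array would perform per tile.
    for i0 in range(0, M, tile):
        for j0 in range(0, N, tile):
            for k0 in range(0, K, tile):
                a = A[i0:i0 + tile, k0:k0 + tile]       # "load" a tile of A
                b = B[k0:k0 + tile, j0:j0 + tile]       # "load" a tile of B
                C[i0:i0 + tile, j0:j0 + tile] += a @ b  # accumulate the C tile
    return C
```

Which loops sit outermost determines which tiles are reloaded how often, which is exactly the blocking decision a dataflow must make.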
As many neural network layers, including 2D convolutions and matrix multiplications, are naturally expressed as nested loops, different dataflows can be described by various loop transformations and analyzed with classical compiler techniques such as the polyhedral model. Alternatively, we can focus on the spatial data distribution and highlight its stationary characteristics to reflect the data movement. MAESTRO extends this idea into a systematic, data-centric dataflow model that captures both spatial and temporal resource mapping. If the accelerator scales up its resources to multiple individual engines, each containing a spatial PE array, either on the same chip or across multiple chips as in Simba, there is an additional level of coarse-grained parallelism to exploit. One can partition the computations within a layer, as in TETRIS and HyPar, or pipeline the execution of multiple layers, as in ScaleDeep and TANGRAM.
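As a minimal illustration of how loop order encodes stationarity, the two loop nests below compute the same matrix product but keep different data "stationary" in the innermost loop. The naming follows common usage in the accelerator literature; the code is a conceptual sketch of the scheduling idea, not of any hardware:

```python
def matmul_output_stationary(A, B):
    """Output-stationary: the partial sum for each C[i][j] stays in a
    local accumulator while A and B stream past; C is written once."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for i in range(M):
        for j in range(N):
            acc = 0                      # stays "in the PE"
            for k in range(K):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc
    return C

def matmul_weight_stationary(A, B):
    """Weight-stationary: each B[k][j] is pinned locally while the
    activations stream past; partial sums move to memory instead."""
    M, K, N = len(A), len(A[0]), len(B[0])
    C = [[0] * N for _ in range(M)]
    for k in range(K):
        for j in range(N):
            w = B[k][j]                  # stays "in the PE"
            for i in range(M):
                C[i][j] += A[i][k] * w
    return C
```

The two nests are just loop interchanges of each other, which is why loop transformations and data-centric stationarity are two views of the same design space.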
Given the huge design space, identifying an optimized dataflow for a specific workload on a given platform is nontrivial. This is particularly true for data blocking between off-chip memories and on-chip buffers, as recent work demonstrates that these blocking choices matter more than the data transfers between PEs. Due to the non-convexity of the problem, many models and tools, such as Timeloop, TETRIS, and Interstellar, rely on random or exhaustive search to find optimized dataflow schedules.
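The shape of such a search can be sketched in a few lines: enumerate legal tile sizes, discard those that overflow the buffer, and keep the one minimizing a traffic estimate. The cost model below is our own first-order simplification, not the one used by Timeloop or Interstellar:

```python
import itertools

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

def search_blocking(M, N, K, buf_words):
    """Exhaustively search tile sizes for C[M,N] = A[M,K] @ B[K,N],
    minimizing a first-order off-chip traffic model under an on-chip
    buffer budget (in words). Illustrative cost model only."""
    best = None
    for tm, tn, tk in itertools.product(divisors(M), divisors(N), divisors(K)):
        footprint = tm * tk + tk * tn + tm * tn   # one tile each of A, B, C
        if footprint > buf_words:
            continue
        # Each A element is re-read once per column tile, each B element
        # once per row tile, and C once per reduction tile (simplified).
        traffic = M * K * (N // tn) + K * N * (M // tm) + M * N * (K // tk)
        if best is None or traffic < best[0]:
            best = (traffic, (tm, tn, tk))
    return best
```

Even this toy version has a combinatorial space (all divisor triples), which is why real tools resort to sampling or pruning rather than closed-form optimization.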
System-level optimizations in software frameworks
On the other hand, the systems community, which develops the widely used deep learning frameworks targeting CPU/GPU platforms, applies various system-level optimizations to improve resource utilization and hence performance. These frameworks typically treat neural networks as computation graphs, with the operators (i.e., the layers) as nodes and the dependencies as edges, and schedule the graphs onto platforms with many CPU/GPU devices.
Operator placement, scheduling, and pipelining. As the operators in a network run across multiple CPU/GPU devices in a system, an operator-to-device placement question naturally arises. Machine learning methods, specifically reinforcement learning, can effectively produce static placement decisions. Furthermore, to maximize runtime parallelism, overlapping computation with communication without breaking data dependencies, and aggressively pipelining operators across both forward and backward propagation, can each significantly improve performance.
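A toy greedy list scheduler conveys the shape of the placement problem: repeatedly take a ready operator, assign it to the earliest-available device, and respect dependencies. Everything here (function name, graph encoding, cost model) is our own illustration; real systems also model communication and memory:

```python
def list_schedule(ops, deps, cost, n_devices=2):
    """Toy greedy list scheduler for a computation graph.
    ops: operator names; deps: op -> list of predecessor ops;
    cost: op -> runtime estimate."""
    indeg = {o: len(deps.get(o, [])) for o in ops}
    ready = [o for o in ops if indeg[o] == 0]
    device_free = [0.0] * n_devices   # when each device next becomes idle
    finish, placement = {}, {}
    while ready:
        # Pick the ready op whose inputs were produced earliest.
        op = min(ready, key=lambda o: max((finish[p] for p in deps.get(o, [])),
                                          default=0.0))
        ready.remove(op)
        dev = min(range(n_devices), key=lambda d: device_free[d])
        start = max(device_free[dev],
                    max((finish[p] for p in deps.get(op, [])), default=0.0))
        finish[op] = start + cost[op]
        device_free[dev] = finish[op]
        placement[op] = dev
        # Release successors whose predecessors have all finished.
        for o in ops:
            if op in deps.get(o, []):
                indeg[o] -= 1
                if indeg[o] == 0:
                    ready.append(o)
    return placement, finish
```

The reinforcement-learning placers mentioned above search over exactly this kind of operator-to-device assignment, but with learned rather than greedy decisions.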
Operator fusion and subgraph substitution. In addition to dynamic scheduling decisions, framework compilers can do more with the target computation graphs. TensorFlow XLA fuses small operators together to increase hardware utilization and reduce intermediate data accesses. TASO goes further and automatically generates beneficial subgraph substitutions. Through such compiler-level graph transformations, a functionally equivalent graph with improved performance can be used by the framework at runtime, without end users reimplementing or fine-tuning their models for every specific platform.
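To give a flavor of subgraph substitution, the sketch below applies one distributivity rewrite, (A @ C) + (B @ C) → (A + B) @ C, which replaces two matrix multiplications with one. The tuple-based expression encoding is purely our own; TASO represents and verifies such rules quite differently:

```python
import numpy as np

def apply_substitution(expr):
    """Apply one rewrite: (A @ C) + (B @ C) -> (A + B) @ C, turning two
    matrix multiplications into one. Expressions are nested tuples
    ('add', x, y) / ('matmul', x, y) with arrays at the leaves."""
    if (isinstance(expr, tuple) and expr[0] == 'add'
            and isinstance(expr[1], tuple) and expr[1][0] == 'matmul'
            and isinstance(expr[2], tuple) and expr[2][0] == 'matmul'
            and expr[1][2] is expr[2][2]):        # shared right operand C
        A, B, C = expr[1][1], expr[2][1], expr[1][2]
        return ('matmul', ('add', A, B), C)
    return expr

def evaluate(expr):
    """Evaluate a tuple expression, to check functional equivalence."""
    if not isinstance(expr, tuple):
        return expr
    op, x, y = expr
    x, y = evaluate(x), evaluate(y)
    return x + y if op == 'add' else x @ y
```

The substituted graph computes the same result with strictly less work, which is the property a framework compiler must verify before committing to a rewrite.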
Memory management. Another challenge on GPUs is that the limited capacity of high-speed device memory is insufficient for many large models, such as BERT and graph neural networks. Proposals such as vDNN, AutoTM, and Capuchin leverage the large host CPU memory as backup storage; others aggregate the memories of multiple GPUs into a shared pool. There is a large design space of smart data-movement and recomputation policies.
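One simple point in that design space is Belady-style eviction, sketched below under the assumption that each tensor's next-use time is known from the static graph. The dictionary encoding and function name are our own; systems like Capuchin additionally weigh swap bandwidth against recomputation cost:

```python
def evict_for_budget(resident, budget):
    """Belady-style offload: resident maps tensor name -> (size, step of
    next use). Offload the tensors reused farthest in the future until
    the total size fits the device-memory budget."""
    order = sorted(resident, key=lambda t: resident[t][1], reverse=True)
    total = sum(size for size, _ in resident.values())
    evicted = []
    for t in order:
        if total <= budget:
            break
        total -= resident[t][0]   # move this tensor to host memory
        evicted.append(t)
    return evicted
```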
Optimized operator code generation. At the lowest level of individual operators, highly optimized, hand-tuned libraries such as NVIDIA cuDNN for GPUs and Intel MKL for CPUs already exist. However, new operators from continuous algorithm evolution, and new hardware from agile domain-specific design, both motivate automatic generation of optimized operator implementations for specific hardware. TVM is a seminal work in this direction that compiles optimized tensor code using learning-based approaches.
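At its core, such code generation is a search over schedule parameters guided by measurement. The toy autotuner below times a few candidate tile sizes for a matrix multiply and keeps the fastest; it is a stand-in for TVM's far richer, learning-guided search (all names here are ours):

```python
import random
import timeit

def matmul_tiled(A, B, tile):
    """Square matrix multiply with a configurable tile size."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, tile):
        for k0 in range(0, n, tile):
            for j0 in range(0, n, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for k in range(k0, min(k0 + tile, n)):
                        a = A[i][k]
                        for j in range(j0, min(j0 + tile, n)):
                            C[i][j] += a * B[k][j]
    return C

def autotune(n=32, candidates=(2, 4, 8, 16, 32)):
    """Time each candidate tile size on the target machine and keep the
    fastest: measurement-driven schedule search in miniature."""
    A = [[random.random() for _ in range(n)] for _ in range(n)]
    B = [[random.random() for _ in range(n)] for _ in range(n)]
    timings = {t: timeit.timeit(lambda t=t: matmul_tiled(A, B, t), number=3)
               for t in candidates}
    return min(timings, key=timings.get)
```

Real autotuners replace the exhaustive timing loop with cost models and learned search policies, precisely because measuring every candidate does not scale.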
A potential unification
By comparison, for specialized hardware accelerators, dataflow is typically discussed for a single layer (a convolution or matrix multiplication) on a single accelerator chip. In contrast, software frameworks usually target entire neural networks, potentially with tens or hundreds of operators, executing on many CPU/GPU devices in a single machine or even across multiple machines.
Despite the different scopes, there are substantial similarities between the technical approaches they leverage. For example, both the dataflow exploration in hardware and the operator code generation in TVM rely on loop transformation primitives such as those implemented in the Halide language. The memory management for GPUs shares similar goals with the off-chip data blocking in accelerators, i.e., selecting the best subset of data to cache in the fast, near memories and reducing accesses to the slow, far memories. And the coarse-grained dataflows on accelerators for intra-layer and inter-layer parallelism can be viewed as miniature versions of the operator placement, fusion, and pipelining optimizations on distributed GPU platforms.
Such significant similarities inspire us to extend the idea of dataflow to larger scales, in terms of both the computing systems and the target workloads. Larger systems involve multiple processing units per chip, multiple chips per machine, and eventually multiple distributed machines; larger workloads cover not only a single operator but also a subgraph of multiple connected operators, and even the entire neural network. We can then use a generalized and unified dataflow scheme to describe the overall mapping, including how to hierarchically parallelize the workloads on different units/chips/machines, and how to transfer data within and across operators with various blocking, overlapping, pipelining, or other strategies.
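What might such a unified description look like? As one possible sketch, a hierarchical mapping could record, at each system level, which problem dimension is partitioned spatially and how the rest are tiled temporally. The record below is entirely our own illustration; no existing tool uses this exact form:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Mapping:
    """A toy record for a hierarchical, unified dataflow description."""
    level: str          # e.g. 'machine', 'chip', 'pe-array'
    spatial_dim: str    # dimension parallelized across units at this level
    tiles: dict         # temporal tiling factors chosen at this level
    child: Optional["Mapping"] = None  # the next level down

    def flat(self):
        """Walk the hierarchy top-down into a flat list of levels."""
        node, out = self, []
        while node is not None:
            out.append((node.level, node.spatial_dim, node.tiles))
            node = node.child
        return out
```

A single object of this shape could describe operator placement across machines at the top, layer pipelining across chips in the middle, and PE-level stationarity at the bottom.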
Beyond being conceptually interesting, we hope such a unification can also suggest how to co-design efficient hardware-software systems for deep learning. One promising direction for specialized accelerators is that, as their compute capabilities continue to scale up (e.g., TANGRAM and Simba), hardware dataflows will start to borrow ideas from the software frameworks, including but not limited to efficient operator placement, fusion, and pipelining, in order to fully utilize the abundant resources as more and more layers execute concurrently.
Another perspective, from the software frameworks, is that as various types of new hardware like the TPU are widely deployed in systems, the frameworks must be aware of the underlying hardware dataflow to make good operator fusion and pipelining decisions. It already takes minutes to hours to explore graph transformations with cost models built upon real hardware measurements; multiplied by the large number of hardware dataflow choices, the time would become prohibitively long. An end-to-end analytical model that captures the overall dataflow could help here, deriving optimized choices both accurately and quickly.
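At its simplest, such an analytical model could estimate end-to-end latency from per-stage figures without any measurement. The first-order formula below (our own illustration) captures one piece of that picture for a layer pipeline:

```python
def pipeline_time(stage_times, n_batches):
    """First-order analytical model of a layer pipeline: the pipeline
    fills once (sum of stage times), then steady-state throughput is
    limited by the slowest stage. Deliberately simple; it ignores
    communication and load imbalance within stages."""
    return sum(stage_times) + (n_batches - 1) * max(stage_times)
```

Composing such closed-form pieces across operators and hardware levels, instead of measuring each candidate, is what would make a joint dataflow/graph-transformation search tractable.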
In summary, we believe that dataflow optimizations in specialized hardware and graph transformations in software frameworks share many common philosophies and have the potential to be handled in a more unified way. A thorough study would be beyond the scope of this article, but we hope this new perspective can inspire more thoughts and new ideas from both communities.
About the author: Mingyu Gao is an Assistant Professor at the Institute for Interdisciplinary Information Sciences (IIIS) at Tsinghua University. His research interests lie in the fields of computer architecture and systems, including efficient memory architectures and scalable acceleration for data processing.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.