The past few years have seen an unprecedented increase in the number of systems targeting machine learning (ML) applications, and deep learning in particular (Jeff Dean has compiled a telling graph on the exponentially increasing number of ML papers). From hardware accelerators to high-level programming models, ML applications have monopolized conference proceedings and reshaped both hardware and software design.
This post takes a different view of the relationship between machine learning and systems. Specifically, instead of only building systems for ML, we examine whether system architects can also leverage ML itself to better build and manage complex computing systems.
Why Does Using ML for Systems Make Sense?
Architecture, like several other systems areas, has traditionally relied on empirical approaches, and for good reasons. Well-tuned empirical approaches work well in the common case, are ideally not overly complex, and can keep overheads low for small-scale systems. They also help architects extract useful insights about system trade-offs that guide future design optimizations. That being said, the last few years have ushered in three key trends that make automated, data-driven approaches to system design and management more attractive than before.
Three Motivating Factors…
1. Large-scale (or highly-complex) systems have become more prominent.
Before cloud computing became ubiquitous, the main large-scale distributed systems belonged to HPC infrastructure: highly specialized equipment operated by expert engineers for expert users. As cloud computing rose to prominence, the number and size of warehouse-scale infrastructures grew, bringing high power consumption with them, and as public clouds gained traction, large numbers of non-expert programmers gained access to production environments. Beyond optimizing power consumption, cloud services must abide by strict quality-of-service (QoS) requirements in terms of throughput and, more importantly, tail latency, e.g., 99th percentile request latency. Meeting such goals is especially challenging in systems with hundreds of thousands of machines that are prone to hardware and software failures, resource contention, human errors, and security attacks. Relying on empirical (or even manual) management at such scale is cumbersome, expensive, and error-prone, which drives the need for automated solutions that abstract much of the system’s complexity away from the user.
Apart from datacenters, a similar argument applies to traditional processing systems. The main processor has become increasingly complex, as the end of Moore’s Law pushes architects to squeeze out every bit of available performance, and as heterogeneity becomes more prevalent, systems have become increasingly diverse as well. This adds extra complexity to problems like resource management, concurrency, and memory request scheduling.
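To make the tail-latency metric mentioned above concrete, here is a minimal Python sketch that contrasts the mean with the 99th percentile of a set of request latencies. The latency values are made up for illustration, and the nearest-rank definition of a percentile is only one of several conventions in use; monitoring systems may interpolate differently.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds:
# 98 fast requests and 2 slow ones.
latencies_ms = [10.0] * 98 + [500.0] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms:.1f} ms, p50={p50_ms:.1f} ms, p99={p99_ms:.1f} ms")
```

In this toy example the mean and median look healthy, while the 99th percentile exposes the slow requests, which is why QoS targets are often stated in terms of the tail rather than the average.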
2. Advances in tracing/monitoring systems have increased the availability of large datasets.
In the past, many attempts to apply ML to systems problems were hindered by the lack of datasets of high quality and sufficient size. As large-scale systems gained popularity, the ability to collect detailed tracing/monitoring information about their behavior also improved. The same applies to all levels of the system stack, as better monitoring tools now make it possible to collect high-quality, detailed tracing information about complex systems and applications (an ACM Queue article by Richard Sites provides great motivation on the value of monitoring infrastructures for modern systems).
3. The recent surge in ML model research has improved their quality and practicality.
The need for high-quality datasets goes hand in hand with the need for high-quality mining algorithms. The past few years have brought vast improvements in the accuracy and practicality of ML models, making them capable of handling the scale and complexity of modern systems. While the emphasis of late has been on deep learning, simpler techniques like classification and clustering, which can be applied to many systems problems, have also seen rapid advances.
… And One Concern
A common criticism of using ML in systems is that it acts as a black box, making it hard to extract useful insights about a system. While there are certainly ML algorithms whose results are not easily interpretable, and designing models whose behavior can be better understood is critical, one must remember that ML approaches are best suited for systems whose scale and complexity already make extracting insights manually or empirically a daunting challenge. In those cases, ML can be instrumental in filtering out noisy patterns, allowing users to draw important insights more easily.
For example, consider a large-scale datacenter experiencing violations of its QoS requirements. To address this, several site reliability engineers (SREs) would have to be involved to identify the source of poor performance as it’s happening. While they can leverage monitoring systems and their collective experience to zero in on the root cause of the problem, this is a slow process on the critical path that becomes increasingly difficult as the system’s scale and application diversity increase. Leveraging ML instead could quickly eliminate unlikely sources of disruption and provide a list of the most probable causes behind unpredictable performance, allowing the system to quickly take action and recover. Furthermore, by identifying problematic patterns that repeat over time, engineers can appropriately refactor their applications, network protocols, scheduling policies, etc. to avoid unpredictable performance in the future. An example of a system that moves in that direction is Seer, a performance debugging system we recently presented at ASPLOS’19 that leverages ML to find patterns that cause performance disruptions in cloud infrastructures, and corrects them over time.
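As a rough illustration of what "eliminating unlikely sources" can look like in its simplest form, the Python sketch below ranks monitored metrics by how strongly they correlate with observed QoS violations. This is a deliberately simple statistical stand-in, not the approach Seer uses; the metric names and trace values are hypothetical, and correlation alone does not establish a root cause.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def rank_suspects(metrics, violations):
    """Sort metric names by |correlation| with the violation indicator,
    strongest first."""
    scores = {name: abs(pearson(series, violations))
              for name, series in metrics.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical traces, one sample per monitoring interval.
# violations[i] is 1 if the QoS target was missed in interval i.
violations = [0, 0, 1, 1, 0, 1, 0, 0]
metrics = {
    "cache_miss_rate":     [0.1, 0.2, 0.9, 0.8, 0.1, 0.9, 0.2, 0.1],
    "disk_queue_depth":    [3, 3, 3, 3, 3, 3, 3, 3],
    "network_retransmits": [5, 1, 4, 2, 5, 1, 3, 4],
}

ranking = rank_suspects(metrics, violations)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Even this crude ranking pushes the constant metric to the bottom of the list, so an engineer would look at it last; a learned model aims to do the same kind of filtering across far more signals than a person can inspect by hand.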
Examples of ML Improving System Design and Management
There have already been several successful examples where ML has provided higher quality and/or more practical solutions than previous empirical approaches. The list below is not exhaustive, and is only meant to provide some examples from diverse systems areas where ML-driven approaches can be applied.
Hardware design & optimization
Recent work from Google and Stanford showed that deep learning can be used effectively to learn memory access patterns and design more sophisticated prefetchers. The authors show that an ML-driven approach achieves both better precision and better recall than traditional approaches. Similar techniques have also been applied to memory request scheduling, page placement, and cache replacement policies.
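For readers less familiar with how prefetcher quality is scored, the following Python sketch evaluates a conventional stride prefetcher on a tiny, made-up address trace and reports precision (the fraction of issued prefetches that were useful) and recall (the fraction of demand accesses that a prefetch covered). It is a baseline of the kind learned prefetchers are compared against, not the learned model from the work above, and the trace and the confidence rule are assumptions made for illustration.

```python
def evaluate_stride_prefetcher(trace):
    """Replay an address trace through a simple stride prefetcher.

    A prefetch for the next access is issued only when the last two
    strides match. Returns (precision, recall).
    """
    issued = 0   # prefetches the predictor chose to issue
    hits = 0     # issued prefetches that matched the next access
    demand = 0   # accesses the predictor had a chance to cover
    for i in range(3, len(trace)):
        demand += 1
        older_stride = trace[i - 2] - trace[i - 3]
        newer_stride = trace[i - 1] - trace[i - 2]
        if older_stride == newer_stride:
            issued += 1
            if trace[i - 1] + newer_stride == trace[i]:
                hits += 1
    precision = hits / issued if issued else 0.0
    recall = hits / demand if demand else 0.0
    return precision, recall

# Hypothetical trace: two strided runs separated by irregular jumps.
trace = [0, 8, 16, 24, 32, 100, 108, 116, 124, 7]
precision, recall = evaluate_stride_prefetcher(trace)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

The stride rule is accurate inside regular runs but issues nothing useful around the irregular jumps, which is the kind of pattern where a model that learns from longer access histories can help.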
Cloud management
Along the same lines, several datacenter management systems, like Resource Central from Microsoft and Quasar, showed that automating resource allocation in large clusters improves both performance predictability and resource efficiency. These two systems rely on classification and online recommenders respectively, but similar systems using other ML techniques have also been proposed. Note that such systems apply not only to private infrastructures but to public clouds as well, where the number of resource options is overwhelming for expert and non-expert users alike.
Compiler optimizations
Leveraging ML to guide compiler optimizations, a research theme that first appeared in the mid-90s, has made a decisive comeback with approaches like superoptimization and automated program synthesis. Zheng Wang and Michael O’Boyle recently wrote a great reference paper on the state of the art in ML-driven compiler optimizations and potential future directions, highlighting how far the field has come and what the main challenges are moving forward.
In fact, ISCA is hosting the Machine Learning for Systems Workshop next week, with a great program on ML-driven solutions for diverse systems problems.
What Does This Mean for Computer Architects?
So, should you use ML in your next architecture-related conference submission? It depends. As with most applications of ML, it is not a panacea, and it almost always comes with overheads for both training and inference. If your target system is relatively simple and easy to understand, then an ML-driven approach is likely overkill. There is, however, an increasing number of use cases where ML can provide practical solutions to problems where previous empirical approaches fall short, allowing engineers to extract insights on how to better design and manage their systems. Given the increasing complexity, diversity, and scale of computer systems today, brushing up on your ML background is unlikely to go amiss!
About the Author: Christina Delimitrou is an Assistant Professor in Electrical and Computer Engineering at Cornell University.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.