How many times have you tried setting up a virtual machine, container pods or serverless functions in the cloud and wondered how many resources (cores, memory etc.) to configure? In practice, users request more resources than they actually end up using. This has been a known problem for cloud providers for many years now (analysis from Alibaba and Microsoft Azure), and it leads to low resource and cost efficiency. To address this problem, cloud resource management systems try to dynamically scale the amount of resources given to user applications (Autopilot, Madu) to match their actual resource usage, while setting overcommitment policies (Take it to the limit) that guarantee a user will always receive the requested amount.
Why is it still a challenging problem to solve?
Cloud resource management systems make use of forecasting models to predict the user load, that is, how many resources are used over time. This is not a trivial task, since load patterns can vary from stable and periodic all the way to dynamic and even random. The more accurate the predictions, the more efficient and timely the decisions that can be made to dynamically manage the resources across cloud users.
Machine learning to the rescue!
This sounds like a perfect use case for Machine Learning (ML). Indeed, recent works explore the effectiveness of using ML-based methods, such as Long Short-Term Memory networks (InfAdapter), to predict future resource utilization. The results are promising and show significantly improved prediction accuracy. Yet, training and inference overheads can be substantial, raising concerns about the practical use of ML. At the same time, using ML in the cloud brings a need for interpretability and security.
Industry thinks twice before using machine learning.
The practical and trustworthy integration of ML in cloud resource management systems is hard to achieve. Thus, production-level systems still use simple formulas, statistical methods and empirical heuristics for forecasting resource usage (Facebook’s Prophet, Microsoft’s Seagull), autoscaling resources (Google’s Autopilot) and setting overcommitment policies (Google’s Take it to the limit). This is not to say that these systems do not perform well. In reality, such simple techniques are still very effective when managing cloud resources, at least for the most part. Why is this the case?
It is all about the data persistence!
The answer lies in the data itself. Recent analysis (Christofidi et al.) across public datasets from various cloud providers (Google, Microsoft Azure, Alibaba, Bitbrains) shows that cloud resource usage exhibits high data persistence, meaning that the load changes very little in short time windows (e.g., every 5 minutes). This data persistence is higher for physical and virtual machines, which are usually highly loaded or exhibit stable, periodic and diurnal usage patterns (Microsoft Azure, Alibaba), than for application-level resource usage, which exhibits more dynamic behaviors (Google). The high persistence of cloud resource usage data over time allows simple forecasting methods to deliver predictions that are accurate enough for resource management systems to be effective.
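To make the idea of persistence concrete, here is a minimal Python sketch. It is not the methodology of the cited analysis: it assumes lag-1 autocorrelation as a rough proxy for persistence, and the two utilization series (`stable`, `bursty`) are made-up values used only for illustration. It compares how well the simplest possible forecast, "the next value equals the last observed value", does on each series.

```python
from statistics import mean


def persistence_forecast(series):
    """Naive forecast: predict each point as the previous observation."""
    return series[:-1]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between paired actual and predicted values."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))


def lag1_autocorrelation(series):
    """Lag-1 autocorrelation, used here as an assumed proxy for persistence."""
    mu = mean(series)
    denominator = sum((x - mu) ** 2 for x in series)
    if denominator == 0:
        return 1.0  # a constant series never changes, so treat it as fully persistent
    numerator = sum(
        (series[i] - mu) * (series[i - 1] - mu) for i in range(1, len(series))
    )
    return numerator / denominator


# Hypothetical CPU utilization samples (percent), e.g. one every 5 minutes.
stable = [40, 41, 41, 42, 43, 43, 44, 45]
bursty = [10, 90, 20, 80, 15, 85, 25, 75]

for name, series in (("stable", stable), ("bursty", bursty)):
    error = mean_absolute_error(series[1:], persistence_forecast(series))
    print(name, "naive MAE:", round(error, 2),
          "lag-1 autocorrelation:", round(lag1_autocorrelation(series), 2))
```

On the slowly changing series the naive forecast is off by less than one percentage point on average, while on the bursty series it misses by a wide margin. That gap is the intuition behind why simple methods work well on highly persistent data.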
Is there still room for machine learning?
Absolutely! Machine learning is needed to deal with the more dynamic data patterns and behaviors that are not predictable with simple approaches. Augmenting current resource management systems with machine learning can enable higher prediction accuracy and result in higher resource efficiency and application performance. The new challenge that arises is to identify which patterns need machine learning to be accurately predicted. For this, we need methods and tools to deeply understand the data, classify patterns and identify irregular behaviors that will trigger the need for machine learning-based predictions. System components for monitoring, analyzing and storing the data will need to be enhanced with support for metadata and potentially specialized monitoring counters. Last but not least, the challenges of practical, interpretable and trustworthy use of ML are still very much relevant.
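One way to picture the "when, not whether" question is a small gate that inspects a recent window of usage and decides which kind of forecaster to invoke. The Python sketch below is only an illustration under stated assumptions: the function names, the mean relative change metric and the 5% threshold are all hypothetical, and a real system would need a far richer pattern classifier than this.

```python
def mean_relative_change(window):
    """Average relative change between consecutive samples in the window."""
    changes = [
        abs(current - previous) / previous
        for previous, current in zip(window, window[1:])
        if previous != 0  # skip pairs where relative change is undefined
    ]
    return sum(changes) / len(changes) if changes else 0.0


def choose_forecaster(window, threshold=0.05):
    """Pick 'simple' for persistent load, 'ml' for dynamic load.

    The 5% default threshold is an arbitrary, hypothetical value.
    """
    if mean_relative_change(window) <= threshold:
        return "simple"
    return "ml"


# Hypothetical utilization windows (percent).
print(choose_forecaster([50, 51, 50, 52, 51]))  # slowly changing load
print(choose_forecaster([10, 60, 5, 80, 20]))   # highly dynamic load
```

The design choice here is to keep the cheap, interpretable path as the default and pay the training and inference cost of ML only for windows that the simple check flags as irregular.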
TL;DR: I urge you to think not whether, but when to use machine learning to manage cloud resources.
About the author: Thaleia Dimitra Doudali is an Assistant Professor at IMDEA Software Institute in Madrid, Spain, working at the intersection of machine learning and computer systems. She received her Ph.D. from Georgia Tech, USA, advised by Ada Gavrilovska.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.