It was more than 10 years ago that I first started studying machine learning (ML). Serendipitously, I ended up leveraging ML techniques developed for biological sequence analysis to optimize storage systems. More recently, I have focused on using ML techniques to optimize and manage datacenter and cloud-hosted systems, which is the topic of this post. The increasing potential of ML in the context of systems is reflected in the rise of multiple workshops that focus on ML for systems today: MLArchSys, MLforSys, EuroMLSys.
So, in this post, I will not try to convince you of the usefulness of ML for computer systems. Instead, I will focus on what happens next: the challenges to anticipate when setting up a problem to leverage ML, and some of the lessons and insights I learned along the way. This post presents a list of guidelines to enable research in this area.
A “Simple” Recipe
Let me start with what appears to be a “simple” recipe:
- Step 1 – Formulate the learning problem. To get a mathematically tractable learning formulation, we require domain knowledge: first, to understand the limitations of existing mechanisms for the problem, and second, to understand the metrics of success. Note that model accuracy may not always be a good indicator of success. This is followed by the training data-collection phase, after which we build a model using that data.
- Step 2 – Implement and deploy the learned models in the corresponding systems.
- Step 3 – Evaluate the resulting system using the previously identified metrics of success.
However, in reality, many challenges complicate this recipe. Key challenges in most cases are the lack of curated datasets and the lack of an “ideal” vantage point from which to collect data. For example, cloud environments may not make hardware-level counters, such as cache hit rate, accessible to the predictors. We need to factor these limitations in when designing models for systems.
In this post, to demonstrate some of the challenges leading to the guidelines, I will use three problem instances, keeping cluster resource efficiency in mind: first, job scheduling in datacenter environments; second, right-sizing workloads in public cloud environments; and finally, harvesting allocated but unused resources in public clouds.
Guideline #1. Explore multiple domain-specific ways to formulate a problem
Let me make a case for this guideline using the job scheduling problem. Data-intensive parallel execution frameworks, such as MapReduce, divide a job into multiple smaller tasks that are executed simultaneously on multiple nodes in a cluster, and a job finishes only when all its tasks finish execution. While this parallelism accelerates the job’s completion, a natural consequence is that a job’s execution time is decided by its slowest-running tasks, commonly called stragglers. Stragglers are problematic, and despite mitigations such as addressing data skew and blacklisting faulty hardware and inherently slow nodes, stragglers continue to exist in production clusters. With Wrangler, we proposed a predictive scheduler that anticipates stragglers using a Support Vector Machine (SVM) based binary classifier and uses these predictions as hints to the scheduler to avoid them. Wrangler’s predictive approach achieves significant improvements in job completion times and resource efficiency.
Behind the scenes…
What I described above was the final version of the work. We went through multiple formulations before arriving there:
- We started with a question: “What causes stragglers?” Our idea was that identifying the causes would help us mitigate them. I spent a lot of time working with the features, trying various feature selection/subset selection methods to find correlations between features and straggling tasks. But across different nodes, the features that correlated strongly with straggler tasks were different. Moreover, such features varied on the same node over time. So, this way of formulating the problem did not lead to the desired solution.
- I then decided to predict the execution time of tasks, so we could use this knowledge to schedule them better. I set this up as an instance of regression, but as many others have noted, predicting task execution times is a hard problem, given the performance variability observed in heterogeneous and dynamically changing execution environments.
- I then simplified the problem further, to predicting only whether a task will take longer than expected, making this an instance of binary classification.
- However, knowing task-level features may not always be feasible. Instead, we predicted straggler-causing situations on the underlying nodes, and that is what worked.
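To make the final formulation concrete, here is a minimal sketch of how node-level training data might be labeled for such a binary classifier. The labeling rule (a task is a straggler if its runtime exceeds a multiple beta of the median runtime of its job's tasks) and all names are illustrative assumptions, not Wrangler's exact definitions.

```python
# Minimal sketch: turning node-level counters into a binary
# classification dataset for straggler prediction. The labeling rule
# and all names here are illustrative assumptions.
from statistics import median

def label_stragglers(task_runtimes, beta=1.3):
    """Label a task 1 (straggler) if its runtime exceeds beta times
    the median runtime of its job's tasks, else 0."""
    med = median(task_runtimes)
    return [1 if t > beta * med else 0 for t in task_runtimes]

def build_dataset(tasks):
    """Pair the node-level counter snapshot captured at each task's
    launch with the task's straggler label."""
    labels = label_stragglers([t["runtime"] for t in tasks])
    X = [t["node_counters"] for t in tasks]  # e.g., CPU, memory, I/O load
    return X, labels

tasks = [
    {"runtime": 10, "node_counters": [0.2, 0.3]},
    {"runtime": 11, "node_counters": [0.3, 0.2]},
    {"runtime": 30, "node_counters": [0.9, 0.8]},  # likely straggler
]
X, y = build_dataset(tasks)
print(y)  # -> [0, 0, 1]
```

The resulting (X, y) pairs could then train any off-the-shelf binary classifier, such as an SVM.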
So, explore multiple domain-specific ways to formulate a learning problem.
Guideline #2. Guard the system from modeling errors
Mispredictions or modeling errors can negatively impact the system’s state. Initial versions of the binary classifier used in Wrangler actually elongated job completion times! Only after attaching confidence measures to the predictions, and letting only confident predictions influence the scheduling decisions, did I get the improvements in job completion times and resource efficiency. So, develop ways to guard the system against such modeling errors.
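One simple way to realize such a guard is to act only on confident predictions and otherwise fall back to the default scheduler. In the sketch below, the score stands in for a classifier's signed distance from its decision boundary; the threshold tau and all names are illustrative assumptions.

```python
# Sketch of guarding a scheduler from modeling errors by acting only
# on confident predictions. "score" stands in for an SVM's signed
# distance from the decision boundary; tau is an assumed threshold.

def gated_hint(score, tau=1.0):
    """Return a scheduling hint only when the classifier is confident.
    score > tau  -> node likely to cause stragglers: avoid it.
    score < -tau -> node looks safe: schedule normally.
    otherwise    -> no hint; fall back to the default scheduler."""
    if score > tau:
        return "avoid"
    if score < -tau:
        return "safe"
    return "no_hint"

print([gated_hint(s) for s in (2.5, -1.7, 0.3)])
# -> ['avoid', 'safe', 'no_hint']
```

The key design choice is that a low-confidence prediction degrades gracefully to the system's existing behavior rather than forcing a possibly wrong action.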
Guideline #3. Utilize domain-specific correlation structures between learning tasks
I will build on the ongoing predictive job scheduler example to present this guideline. As noted earlier, the lack of curated datasets is one of the challenges in using ML for systems. To build Wrangler’s straggler-predictor models, I had to run a sufficient number of tasks to form a training dataset. Due to the inherent heterogeneity across servers and workloads observed in real-world datacenters, a single model shared across servers and workloads resulted in poor accuracy. So, Wrangler built models customized to each server-workload combination. However, this resulted in a long data-collection time (e.g., 4 hours of data collection for a small cluster of 20 servers). This presents a practical challenge to the adoption of ML models in production.
To address this challenge, we proposed a new SVM-based formulation that shares data across different straggler predictors in a compute cluster. This idea of sharing data across similar modeling tasks falls under the ambit of Multi-Task Learning (MTL). Intuitively, our new formulations provide efficient and accurate ways to share data across modeling tasks depending on their similarities. We defined a measure of similarity based on the domain knowledge that servers in a cluster are likely to have similar configurations in terms of their resources (e.g., availability of GPUs, SSDs). So, for instance, if a server with a GPU doesn’t have enough training data, we should borrow from another server that also has a GPU, instead of falling back to a global model that shares data uniformly across all the servers. Our MTL formulations reduced training data-collection time and improved the generalizability of the models due to the sharing; this improvement in model accuracy translated into further improvements in job completion times and resource efficiency. So, to improve the adoptability of models in real-world systems, look for ways to optimize data-collection times using domain knowledge. Keep in mind that combining data in domain-agnostic ways may hurt the efficacy of the model.
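The borrowing intuition can be sketched in a few lines. This is a simplified stand-in for the MTL formulation, not our actual optimization problem: similarity here is just hardware-tag overlap, and all names and numbers are illustrative.

```python
# Sketch of borrowing training data across similar servers, where
# similarity is hardware-configuration overlap (e.g., GPU, SSD tags).
# A simplified stand-in for the MTL formulation described above.

def similarity(tags_a, tags_b):
    """Jaccard similarity between two servers' hardware tag sets."""
    union = tags_a | tags_b
    return len(tags_a & tags_b) / len(union) if union else 0.0

def gather_training_data(target, peers, min_samples=100):
    """Use the target server's own data; if insufficient, borrow from
    the most similar peers first, until we have enough samples."""
    data = list(target["data"])
    ranked = sorted(peers, reverse=True,
                    key=lambda s: similarity(target["tags"], s["tags"]))
    for peer in ranked:
        if len(data) >= min_samples:
            break
        data.extend(peer["data"])
    return data

target = {"tags": {"gpu", "ssd"}, "data": [1] * 40}   # data-poor server
peers = [
    {"tags": {"gpu", "ssd"}, "data": [2] * 80},   # similar: borrowed first
    {"tags": {"hdd"},        "data": [3] * 500},  # dissimilar: not needed
]
pooled = gather_training_data(target, peers)
print(len(pooled))  # -> 120
```

Note how the dissimilar server's plentiful data is never pulled in, which is exactly the behavior a domain-agnostic global pool would get wrong.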
Guideline #4. For cost-efficiency and generalizability, decouple learning of different systemic aspects
For this guideline, I will use the example of right-sizing workloads in public cloud environments. Cloud users want their applications to meet certain performance/cost criteria. To meet the diverse requirements of applications, cloud providers offer various virtual machine (VM) configurations, commonly called VM types or instance types. However, the huge number of VM types (AWS alone offers more than 500 today) creates a challenging problem of choice for users. Since the correlation between VM types and the cost/performance of a user application is unclear, users end up making suboptimal or overprovisioned choices, causing severe underutilization of resources. PARIS produces a cost-performance trade-off map, customized to a user application, for the VM types across different public cloud providers. This arbitrage of competing offers enables users to make informed decisions depending on their constraints. A promising but expensive way to produce cost-performance estimates across VM types for an application is profiling, which lets us learn the relationship between the observed application performance and VM types.
We made an observation: VM types change much less frequently than new user applications are introduced to the cloud. So, PARIS performs the costly operation of learning the relationships among VM types in a one-time, extensive offline profiling phase using benchmark workloads, and performs the frequent operation of learning about user applications in a lightweight fingerprinting phase that captures the characteristics of the application. By reducing the frequency of the costly operation and the cost of the frequent operation, PARIS achieves both accuracy and cost-efficiency. So, model distinct systemic aspects separately and combine the outcomes to improve the cost-efficiency of the solution.
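The decoupling can be sketched as follows, in the spirit of PARIS but not as its actual model: the offline table, the fingerprint vectors, the nearest-neighbor matching, and all VM-type names and numbers are illustrative assumptions.

```python
# Sketch of decoupling a one-time, expensive offline profiling phase
# from a cheap per-application fingerprinting phase. All numbers,
# VM-type names, and the matching rule are illustrative assumptions.

# Offline phase (run once per VM type): extensively profile benchmark
# workloads; store each benchmark's fingerprint and measured runtimes.
OFFLINE_PROFILES = {
    "cpu_bound": {"fingerprint": (0.9, 0.1),  # (cpu_util, io_util)
                  "runtime": {"small": 100.0, "large": 30.0}},
    "io_bound":  {"fingerprint": (0.2, 0.9),
                  "runtime": {"small": 100.0, "large": 85.0}},
}

def predict_runtimes(app_fingerprint, runtime_on_small):
    """Online phase (run per application): match the app's lightweight
    fingerprint to the nearest benchmark, then scale that benchmark's
    runtimes by the app's single measured run on the 'small' VM type."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(OFFLINE_PROFILES.values(),
               key=lambda p: dist(p["fingerprint"], app_fingerprint))
    scale = runtime_on_small / best["runtime"]["small"]
    return {vm: t * scale for vm, t in best["runtime"].items()}

# A CPU-heavy user app, measured once on the "small" VM type:
est = predict_runtimes(app_fingerprint=(0.85, 0.15), runtime_on_small=50.0)
print(est)  # -> {'small': 50.0, 'large': 15.0}
```

The expensive profiling cost is paid once per VM type, while each new application only pays for one cheap fingerprinting run.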
Guideline #5: Build with synergy between learning and system optimizations
This guideline recommends co-design of ML models and systemic optimizations. Say we build an optimal model in isolation for a given problem. When placed in a real system, this model may not even get enough resources to run, given the dynamicity of today’s cloud-hosted systems. Systemic optimizations are needed to ensure enough resources are available to the model at the desired time. The model itself may need to be optimized to fit within the available resources while remaining accurate enough. Lastly, the system should be able to execute the action produced by the model’s prediction within the desired time.
To make a case for this guideline, let me use SmartHarvest as an example. SmartHarvest harvests allocated but unused resources without impacting the performance of customer workloads. Given the lack of application-level metrics and the dynamicity of resource requirements, doing so requires accurate and fine-grained (~ms) predictions of the resources required by user VMs.
These constraints informed our modeling choices: we designed an online learning mechanism that uses a limited set of features to keep feature-computation time low. On the systems side, to harvest the slack resources based on the model’s predictions, we needed to quickly reassign idle cores from the customer VM to a VM running best-effort workloads. To realize this, we designed a fast core-reallocation mechanism as a systemic optimization.
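The co-design can be sketched with a deliberately lightweight predictor paired with a harvesting step. The predictor here (maximum usage over a recent window, plus a safety buffer) and every name in it are illustrative assumptions, not SmartHarvest's actual algorithm.

```python
# Sketch of the co-design: a cheap online predictor of a VM's
# near-term core demand, paired with a harvesting step that hands
# the predicted slack to best-effort work. All names and the
# recent-max-plus-buffer predictor are illustrative assumptions.
from collections import deque

class CoreDemandPredictor:
    def __init__(self, window=8, buffer_cores=1):
        self.history = deque(maxlen=window)  # recent usage samples
        self.buffer = buffer_cores           # guards against mispredictions

    def observe(self, cores_used):
        self.history.append(cores_used)

    def predict(self):
        """Predicted demand: max usage over the recent window, plus a
        small buffer held in reserve to absorb sudden spikes."""
        return (max(self.history) if self.history else 0) + self.buffer

def harvest(total_cores, predictor):
    """Systems side: reserve the predicted demand for the customer VM
    and return the idle cores safe to reassign to best-effort VMs."""
    reserved = min(predictor.predict(), total_cores)
    return total_cores - reserved

pred = CoreDemandPredictor(window=4, buffer_cores=1)
for used in [2, 3, 2, 2]:     # fine-grained per-interval usage samples
    pred.observe(used)
print(harvest(8, pred))  # -> 4 cores can be harvested right now
```

Because the predictor is cheap to evaluate, it can run at the millisecond granularity the harvesting loop needs; the fast core-reallocation mechanism then makes acting on each prediction equally cheap.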
Such co-design of models and systemic optimizations is key to effective ML for systems solutions.
About the Author: Neeraja J. Yadwadkar is an incoming Assistant Professor in the ECE department at the University of Texas at Austin, and is currently working at VMware Research. Most of her research straddles the boundaries of computer systems and ML.
Disclaimer: These posts are written by individual contributors to share their thoughts on the Computer Architecture Today blog for the benefit of the community. Any views or opinions represented in this blog are personal, belong solely to the blog author and do not represent those of ACM SIGARCH or its parent organization, ACM.