for computer scientists. On the other hand, HPC users are implicitly encouraged to overestimate their predictions of memory, CPUs, and time, so that they avoid severe consequences and their jobs are not killed due to insufficient resources. Overestimating job resources negatively impacts the performance of the scheduler by wasting infrastructure resources, lowering throughput, and leading to longer user response times.

1.1 Slurm Workload Manager
There are different varieties of job schedulers, such as SGE (Sun Grid Engine) [14], the Maui Cluster Scheduler [2], TORQUE (Tera-scale Open-source Resource and Queue manager) [6], and PBS (Portable Batch System) [4]. Slurm (Simple Linux Utility for Resource Management) is one of the most popular among them [22]. Slurm is open source, fault tolerant, secure, highly configurable, highly scalable, and supports most Unix variants. Slurm serves as both a workload manager and a job scheduler, which makes it more convenient to use. The resource manager role allocates resources such as nodes, sockets, cores, hyper-threads, memory, interconnect, and other generic resources within the HPC environment, while the scheduler role manages the queue of jobs, including different scheduling algorithms such as fair-share scheduling, preemption, gang scheduling, advanced reservation, etc. [5].

1.2 Slurm Simulator
To test our module, we implemented a machine learning module and tested it using the Slurm simulator developed by the Center for Computational Research, SUNY University at Buffalo. The Slurm simulator is available on GitHub [3]. It was developed to help administrators choose the best Slurm configuration without impacting the existing production system. We used this Slurm simulator because it is implemented as a modification of the actual Slurm code that disables some unnecessary functionalities without affecting the functionality of the real Slurm, and it can simulate up to 17 days of work in an hour [19]. Hence, we can test our models accurately and quickly.
Slurm is a vital component of supercomputers, but using it is hard, and this leads to inefficiencies. Hence, we are trying to use supervised machine learning to address these inefficiencies. This entails first defining inference tasks: regression-based estimation of the probability of a job being killed given its runtime parameters and a user's historical track record to date; a classification-based prediction of the outcome of the current run, computed by estimating the odds of specific outcomes (or log odds, in the case of logistic regression); and finally an expected utility based on a probability distribution over outcomes. While the first two use cases are purely predictive and solvable by supervised or semi-supervised inductive learning, the third presents an opportunity for sequential problem solving, towards reinforcement learning-based automation (learning to act).
We are focused on developing a predictive analytics capability for Slurm so it can predict the needed amount of memory resources and the required running time for each particular submitted job (regression). We hope to improve the efficiency of Slurm and of HPC systems themselves by increasing system throughput, increasing system utilization, decreasing turnaround time, and decreasing the average job waiting time. To do so, we train models with different machine learning algorithms, described in Section 3. In Section 4 we present the results of our experiments, and we conclude in Section 5.

2 RELATED WORK
The primary research conducted in this field has focused on predicting how long jobs wait in the queue. In addition, previous research has either predicted the memory usage of jobs or predicted the execution time of jobs running on the cluster. The central point and novel contribution of our study is to predict the resources needed to accomplish the jobs submitted to the cluster, and to determine which is more harmful to the HPC system: overestimating the memory or the time of jobs running on the cluster.
Matsunaga and Fortes [18] introduced an extended machine learning tree algorithm called Predicting Query Runtime 2 (PQR2). This method is a modified implementation of an existing classification tree algorithm (PQR). PQR2 focused on two bioinformatics applications, BLAST and RAxML. Their method increased the accuracy of predicting job execution time, memory, and space usage, and decreased the average percentage error for those applications.
Warren Smith [20] introduced a machine learning method with a lower prediction error rate, based on instance-based learning (IBL) techniques, to predict job execution times, queue wait times, and file transfer times.
Kumar and Vadhiyar [16] developed a prediction system called Predicting Quick Starters (PQStar) for identifying and predicting quick-starter jobs (jobs whose waiting time is under one hour). PQStar's prediction is
Improving HPC System Performance ... Learning PEARC ’19, July 28-August 1, 2019, Chicago, IL, USA
based on jobs' request size and estimated run time, and on queue and processor occupancy states.
García [12] studied and found that automatically collecting real performance data from running jobs (specifically the memory bandwidth usage of applications), combining it with scheduling data extracted from hardware counters during job execution, and reusing it for future scheduling purposes can improve HPC scheduling performance, reduce the amount of wasted resources, and decrease the number of jobs killed for reaching their execution time limit.
Gaussier et al. [13] found that applying a more limited machine learning approach to HPC log data to predict job running times is an effective method for helping and improving scheduling algorithms, and it reduced the average bounded slowdown.
Other works have focused on predicting power consumption for scientific applications and maximizing performance using machine learning techniques [9] [10].

3 IMPLEMENTATION
In this section, we explain the workflow of our model, the machine learning algorithms used in our model, the data and experimental testbeds used, and the features used for our machine learning modeling.

3.1 Workflow Model
The workflow of our model, depicted in Figure 1, is as follows. 1) The user submits their job, including the requested amount of memory and the requested time limit for the proposed job. 2) The submitted job is passed through our machine learning model to predict the amount of memory and the amount of time the job needs to run. 3) Our model updates the amount of memory resources and the amount of time required for the submitted job. 4) The user is notified about the changes to their job. 5) Finally, the updated job is scheduled to run on the cluster.

3.2 Data Preparation and Feature Analysis
To train our machine learning model, we used fourteen million instances covering approximately eight years of log history, between 2009 and 2017, from our local HPC cluster, "Beocat." Each instance in the log file has forty-five features; as described in Table 1, we chose eight features from each of the fourteen million instances used to train the model. Beocat is a no-cost educational system and the most significant cluster in the state of Kansas. It is located at Kansas State University and operated by the Computer Science department [1]. We chose our features based on the features required by the Slurm simulator and on the features most relevant to the prediction tasks.

3.3 Machine Learning Algorithms
The framework first developed in [8] admits the use of a predictive model for job submission outcomes. Learning a predictive distribution facilitates decision support tasks, such as whether to migrate to a compute cloud or continue within an HPC cluster environment, and the framing of these tasks as a potential use case of sequential problem solving and reinforcement learning. Towards this end, several discriminative models from the scikit-learn machine learning library [7] [17] were trained to implement predictive functionality in our experiments. Data preparation steps included data cleaning, by validating the data model for logged data, and applying transformations to normalize the data, reduce redundancies, and otherwise standardize the coalesced data model. For the baseline predictive task, we specified a classification target: specifically, learning the concept of a job that is more likely than not to be killed given historical and runtime variables. This admits the use of a logistic regression (logit) model, support vector machines, or k-nearest neighbors, whereas for the planned expected utility estimation task, estimating the actual probability of a job being
killed is a general regression task [15] that admits linear, distance-weighted, or support vector regression, as well as probit and generative models.
For the regression task, we used several supervised models, including linear regression, LassoLarsIC (L1 regularization), ridge regression (L2 regularization), ElasticNetCV (a tuned L1/L2 ratio), and a decision tree regressor. For the linear discriminants and their use on this task, we refer the interested reader to [8]. Using these flexible representations admits a balance of generalization quality (via overfitting control) and explainability.

4 RESULTS AND DISCUSSION
In this section, we describe, discuss, and evaluate the results of our machine learning algorithms and the strategy used for our experiments, presenting results and graphs consisting of quantitative metrics.

4.1 Machine Learning Techniques
There are various machine learning algorithms available, and it is difficult to decide which supervised machine learning algorithm provides the best results for our module. Hence, we implemented our model using five supervised machine learning algorithms and trained them on our 14 million instances to predict the required time and memory. The coefficients of determination of the machine learning algorithms are shown in Table 2 and Table 3, respectively. Based on our results, we chose the DecisionTreeRegressor algorithm for our model, since it has the largest R-squared value and therefore fits the data best.
The legend for Table 2 and Table 3 is as follows:
• LR: Linear Regression
• LLIC: LassoLarsIC Regression
• ENCV: ElasticNetCV Regression
• RG: Ridge Regression
• DTR: Decision Tree Regression

Table 2: Wall Clock Time Limit Prediction Algorithms Results
Model   R^2 (%)   Time (Second)
LR      0.0677    0.30
LLIC    0.0677    0.44
ENCV    0.0677    4.32
RG      0.0677    0.18
DTR     0.611     7.53

Table 3: Memory Required Prediction Algorithms Results
Model   R^2 (%)   Time (Second)
LR      0.174     0.39
LLIC    0.174     0.46
ENCV    0.174     4.98
RG      0.174     0.12
DTR     0.638     8.28

4.2 Evaluating Our Model
In this subsection, we show results and evaluate our model. To do so, we test our model using two testbeds, Testbed-1 and Testbed-2. Each testbed is evaluated based on three metrics:
• Submission and Execution Time
• System Utilization
• Backfill-Sched Performance
Submission and Execution Time shows the difference between the job submission time and the execution time (when the job is submitted, when it starts, and the duration of the run). The job submission time is the time stamp recording when the job was submitted, while the execution time is calculated as the difference between the start and end execution times. System Utilization measures how efficiently the system utilizes its resources, while Backfill-Sched Performance shows how well the backfill-sched algorithm helps the main scheduler fit more jobs into the cluster to increase resource utilization.
We used the Slurm Simulator to examine each metric above by comparing the results of the following:
• Running each testbed using user requested memory and run time.
• Running each testbed using the actual memory usage and duration.
• Running each testbed using predicted memory and run time.
4.2.1 Testbed-1. Testbed-1 contains larger jobs (jobs requesting at least 4GB of memory and four cores per task) and includes a set of one thousand jobs. Figure 2 shows the submission and execution time metric based on the job_id, start time, and execution time (Requested vs. Actual vs. Predicted) for the jobs included in Testbed-1. The graph shows that it takes around five days to complete the execution of all of the jobs using the user requested
memory and time, while it takes only around ten hours to complete the runs using the actual and predicted time and memory for the jobs. Based on these results, our model predicted the values for the required time and memory accurately.

Table 4: Average Waiting and Turnaround Time (Requested vs Actual vs Predicted) For Jobs in Testbed-1
            Avg Wait Time (Hour)   Avg TA Time (Hour)
Requested   45.37                  46.29
Actual      3.90                   4.82
Predicted   4.00                   4.94

Figure 3 shows that using our module helped the HPC system achieve higher utilization compared to the utilization of the HPC system that used unmodified user-requested resources. Figure 4 indicates that the backfill-sched algorithm achieved more efficiency on the testbed that used our module compared to the ones that did not. We measure backfill-sched performance by measuring the density of job scheduling attempts over time.
These results were achieved because using our model in most cases reduces the amount of resources required by user-submitted jobs. Hence, the HPC system has more available resources to fit more jobs into the system. Thus, the backfill scheduler becomes less needed and the overall system more efficient by using these available resources.
Table 4 provides the calculated average waiting time and average turnaround time for the jobs in Testbed-1 for the requested, actual, and predicted runs. Using our model significantly reduced the average waiting time from 45.37 hours to 3.9 hours and the average turnaround time from 46.29 hours to 4.94 hours. Both the predicted average waiting time and turnaround time are almost exactly the same as the actual average waiting time and turnaround time for the jobs in Testbed-1.
4.2.2 Testbed-2. Testbed-2 contains smaller jobs (jobs requesting less than 4GB of memory and four cores per task) and includes a set of ten thousand jobs. While the results were less impressive than for Testbed-1, Figures 5 and 6 show that our predicted model achieved better utilization and better backfilling performance. Moreover, Table 5 shows that our predicted model incrementally reduced the average waiting and turnaround times from 0.08 to 0.06 hours and from 3.90 to 3.54 hours, respectively.
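To make the model-selection step of Section 4.1 concrete, the sketch below trains the five scikit-learn regressors named in the table legend and ranks them by held-out R-squared, the same criterion used to select the decision tree regressor. The feature and target arrays here are synthetic placeholders, not the Beocat log data, so the scores will not match Tables 2 and 3.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoLarsIC, ElasticNetCV, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for job-log features (e.g., requested memory,
# requested time, user history); NOT the actual Beocat training data.
X, y = make_regression(n_samples=5000, n_features=8, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The five supervised regressors from Section 4.1, under the legend's names.
models = {
    "LR": LinearRegression(),
    "LLIC": LassoLarsIC(criterion="aic"),
    "ENCV": ElasticNetCV(),
    "RG": Ridge(),
    "DTR": DecisionTreeRegressor(max_depth=10, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data

# Selection criterion: keep the model with the largest R^2.
best = max(scores, key=scores.get)
```

On the real job-log features, the same loop reproduces the paper's choice of the regressor with the most significant R-squared value; hyperparameters such as `max_depth` are illustrative assumptions, not values reported in the paper.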
Figure 2: Jobs Submission and Running Time (Requested vs Actual vs Predicted) for Jobs in Testbed-1. Note the dramatic improvement in the Y-axis range between graphs.
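The average waiting and turnaround times reported in Table 4 can be recomputed from any job log that records submission, start, and end times. A minimal sketch, using hypothetical timestamps in hours rather than the actual Testbed-1 log:

```python
# Hypothetical job records as (submit, start, end) times in hours.
# These values are illustrative only, not taken from the Testbed-1 log.
jobs = [
    (0.0, 1.5, 4.0),
    (0.5, 2.0, 3.5),
    (1.0, 1.25, 6.0),
]

# Waiting time = start - submit; turnaround time = end - submit.
wait_times = [start - submit for submit, start, end in jobs]
turnaround_times = [end - submit for submit, start, end in jobs]

avg_wait = sum(wait_times) / len(jobs)              # ~1.08 hours here
avg_turnaround = sum(turnaround_times) / len(jobs)  # 4.0 hours here
```

The same two averages, computed over each of the requested, actual, and predicted simulation runs, yield the three rows of Table 4.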
Figure 7: Jobs Submission and Running Time (Predicted Time Required vs Memory)
Figure 8: Utilization (Requested vs Actual vs Required Time Predicted vs Memory Predicted vs Required Time and Memory Predicted)
Figure 9: Backfill-Sched Performance (Requested vs Actual vs Required Time Predicted vs Memory Predicted vs Required Time and Memory Predicted)
learning algorithms. Our model helps to reduce the computational time, increase utilization of the HPC system, decrease the average waiting time, and decrease the average turnaround time for the submitted jobs. As a result, our analysis indicates that our model
helps maximize efficiency, increase the capability, and decrease the power consumption of the cluster.

6 FUTURE WORK
Our future work will include continuing to improve our model by applying additional machine learning algorithms, testing our module on a real HPC system, and including a bigger testbed to achieve more accurate results.
In addition, our machine learning approach will incorporate both classification (logit and other discriminative models) and regression (probit estimation and other maximum likelihood estimation) into a decision support system. This system will provide a test bed for personalized recommendations based on the task outlined in this work, of learning a predictive distribution, and potentially for the future use of reinforcement learning using historical data and/or a simulative model based on this distribution. This in turn will enable experimental evaluation of the pros, cons, and effectiveness of off-loading large jobs to a cloud service. As a use case of data science, it will also facilitate exploration of variables that are exogenous to a single job, such as a user's history of job submission and rates of success or failure by mode (memory vs. CPU). This can also potentially provide insights into the effectiveness of training and the skill acquisition curve of a user as related to self-efficacy (as indicated on surveys) and as discovered automatically by clustering of users.

ACKNOWLEDGMENTS
We greatly appreciate the HPC staff at Kansas State University, including Adam Tygart and Kyle Hutson, for their help and technical support. We also thank the authors of the Slurm simulator at SUNY University at Buffalo for releasing their work. This research was supported by NSF awards CHE-1726332, ACI-1440548, CNS-1429316, NIH award P20GM113109, and Kansas State University.

REFERENCES
[1] [n. d.]. Beocat. [Link] (Accessed on 03/13/2019).
[2] [n. d.]. Documentation Index. [Link] (Accessed on 02/11/2019).
[3] [n. d.]. GitHub - ubccr-slurm-simulator/slurm_simulator: Slurm Simulator: Slurm Modification to Enable its Simulation. [Link] (Accessed on 01/03/2019).
[4] [n. d.]. PBS Professional Open Source Project. [Link] (Accessed on 02/03/2019).
[5] [n. d.]. Slurm Workload Manager - Documentation. [Link] (Accessed on 01/07/2019).
[6] [n. d.]. TORQUE Resource Manager. [Link] (Accessed on 02/02/2019).
[7] 2019. Getting Started with Scikit-learn for Machine Learning. In Python® Machine Learning. John Wiley & Sons, Inc., 93–117. [Link]
[8] Dan Andresen, William Hsu, Huichen Yang, and Adedolapo Okanlawon. 2018. Machine Learning for Predictive Analytics of Compute Cluster Jobs. CoRR abs/1806.01116 (2018). arXiv:1806.01116 [Link]
[9] Josep Ll. Berral, Íñigo Goiri, Ramón Nou, Ferran Julià, Jordi Guitart, Ricard Gavaldà, and Jordi Torres. 2010. Towards energy-aware scheduling in data centers using machine learning. In Proceedings of the 1st International Conference on Energy-Efficient Computing and Networking - e-Energy '10. ACM Press. [Link]
[10] Bruce Bugbee, Caleb Phillips, Hilary Egan, Ryan Elmore, Kenny Gruchalla, and Avi Purkayastha. 2017. Prediction and characterization of application power use in a high-performance computing environment. Statistical Analysis and Data Mining: The ASA Data Science Journal 10, 3 (Feb. 2017), 155–165. [Link]
[11] N.R. Council, D.E.L. Studies, D.E.P. Sciences, and C.P.I.H.E.C.I.F.S. Engineering. 2008. The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. National Academies Press. [Link]
[12] Carlos Fenoy García. 2014. Improving HPC applications scheduling with predictions based on automatically collected historical data. [Link]
[13] Eric Gaussier, David Glesser, Valentin Reis, and Denis Trystram. 2015. Improving backfilling by using machine learning to predict running times. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15. ACM Press. [Link]
[14] W. Gentzsch. [n. d.]. Sun Grid Engine: towards creating a compute power grid. In Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE Comput. Soc. [Link]
[15] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. 2007. Machine learning: a review of classification and combining techniques. [Link]
[16] Rajath Kumar and Sathish Vadhiyar. 2013. Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs. In Job Scheduling Strategies for Parallel Processing, Walfredo Cirne, Narayan Desai, Eitan Frachtenberg, and Uwe Schwiegelshohn (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 196–215.
[17] L. Massaron and A. Boschetti. 2016. Regression Analysis with Python. Packt Publishing. [Link]
[18] Andréa Matsunaga and José A.B. Fortes. 2010. On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE. [Link]
[19] Nikolay A. Simakov, Martins D. Innus, Matthew D. Jones, Robert L. DeLeon, Joseph P. White, Steven M. Gallo, Abani K. Patra, and