for computer scientists. On the other hand, HPC users are implicitly encouraged to overestimate their predictions of memory, CPUs, and time, so that they avoid severe consequences and their jobs are not killed due to insufficient resources. Overestimating job resources negatively impacts the performance of the scheduler by wasting infrastructure resources, lowering throughput, and leading to longer user response times.

1.1 Slurm Workload Manager
There are different varieties of job schedulers, such as SGE (Sun Grid Engine) [14], the Maui Cluster Scheduler [2], TORQUE (Tera-scale Open-source Resource and Queue manager) [6], and PBS (Portable Batch System) [4]. Slurm (Simple Linux Utility for Resource Management) is one of the most popular among them [22]. Slurm is open source, fault tolerant, secure, highly configurable, highly scalable, and supports most Unix variants. Slurm serves as both a workload manager and a job scheduler, which makes it more convenient to use. The resource manager role allocates resources such as nodes, sockets, cores, hyper-threads, memory, interconnect, and other generic resources within the HPC environment, while the scheduler role manages the queue of jobs, including different scheduling algorithms such as fair-share scheduling, preemption, gang scheduling, advanced reservation, etc. [5].

1.2 Slurm Simulator
To test our module, we implemented a machine learning module and tested it using the Slurm simulator developed by the Center for Computational Research, SUNY University at Buffalo. The Slurm simulator is available on GitHub [3]. It was developed to help administrators choose the best Slurm configuration without impacting the existing production system. We used this Slurm simulator because it is implemented as a modification of the actual Slurm code that disables some unnecessary functionalities without affecting the functionality of the real Slurm, and it can simulate up to 17 days of work in an hour [19]. Hence, we can test our models accurately and quickly.
Slurm is a vital component of supercomputers, but using it is hard, and this leads to inefficiencies. Hence, we are trying to use supervised machine learning to address these inefficiencies. This entails first defining inference tasks: regression-based estimation of the probability of a job being killed given its runtime parameters and a user's historical track record to date; a classification-based prediction of the outcome of the current run, computed by estimating the odds of specific outcomes (or log odds, in the case of logistic regression); and finally an expected utility based on a probability distribution over outcomes. While the first two use cases are purely predictive and solvable by supervised or semi-supervised inductive learning, the third presents an opportunity for sequential problem solving, towards reinforcement learning-based automation (learning to act).
We are focused on developing a predictive analytics capability for Slurm so it can predict the needed amount of memory resources and the required running time for each particular submitted job (regression). We hope to improve the efficiency of Slurm and of HPC systems themselves by increasing system throughput, increasing system utilization, decreasing turnaround time, and decreasing the average job waiting time. To do so, we train models with different machine learning algorithms, described in Section 3. In Section 4 we present the results of our experiments, and we conclude in Section 5.

2 RELATED WORK
The primary research conducted in this field has focused on predicting how long jobs wait in the queue. In addition, previous research has either predicted the memory usage of jobs or predicted the execution time of jobs running on the cluster. The central point and novel contribution of our study is to predict the resources needed to accomplish the jobs submitted to the cluster, and to determine which is more harmful to the HPC system: overestimating the memory or the time of jobs running on the cluster.
Matsunaga and Fortes [18] introduced an extended machine learning tree algorithm called Predicting Query Runtime 2 (PQR2). This method is a modified implementation of an existing classification tree algorithm (PQR). PQR2 focused on two bioinformatics applications, BLAST and RAxML. Their method increased the accuracy of predicting job execution time, memory, and space usage, and decreased the average percentage error for those applications.
Warren Smith [20] introduced a machine learning method with a lower prediction error rate, based on instance-based learning (IBL) techniques, to predict job execution times, queue wait times, and file transfer times.
Kumar and Vadhiyar [16] developed a prediction system called Predicting Quick Starters (PQStar) for identifying and predicting quick-starter jobs (jobs whose waiting time is under one hour). PQStar's prediction is
Improving HPC System Performance ... Learning PEARC ’19, July 28-August 1, 2019, Chicago, IL, USA
based on jobs' request size and estimated run time, and on queue and processor occupancy states.
García [12] studied and found that automatically collecting real performance data from running jobs (specifically the memory bandwidth usage of applications), combining it with scheduling data extracted from hardware counters during job execution, and reusing it for future scheduling purposes can improve HPC scheduling performance, reduce the amount of wasted resources, and decrease the number of jobs killed for reaching their execution time limit.
Gaussier et al. [13] found that applying a more limited machine learning approach to HPC log data to predict job running times is an effective method for helping and improving scheduling algorithms, and it reduced the average bounded slowdown.
Other works have focused on predicting power consumption for scientific applications and maximizing performance using machine learning techniques [9] [10].

3 IMPLEMENTATION
In this section, we explain the workflow of our model, the machine learning algorithms used in our model, the data and experimental testbeds used, and the features used for our machine learning modeling.

3.1 Workflow Model
The workflow of our model, depicted in Figure 1, is as follows. 1) The user submits their job, including the requested amount of memory and the requested time limit for the proposed job. 2) The submitted job is passed through our machine learning model to predict the amount of memory and the amount of time the job needs to run. 3) Our model updates the amount of memory resources and the amount of time required for the submitted job. 4) The user is notified about the changes to their job. 5) Finally, the updated job is scheduled to run on the cluster.

3.2 Data Preparation and Feature Analysis
To train our machine learning model, we used fourteen million instances covering approximately eight years of log history, between 2009 and 2017, from our local HPC cluster, "Beocat." Each instance in the log file has forty-five features; as described in Table 1, we chose eight features from each of the fourteen million instances used to train the model. Beocat is a no-cost educational system and the most significant cluster in the state of Kansas. It is located at Kansas State University and operated by the Computer Science department [1]. We chose our features based on the features required by the Slurm simulator and on the features most relevant to the prediction tasks.

3.3 Machine Learning Algorithms
The framework first developed in [8] admits the use of a predictive model for job submission outcomes. Learning a predictive distribution facilitates decision support tasks, such as whether to migrate to a compute cloud or continue within an HPC cluster environment, and the framing of these tasks as a potential use case of sequential problem solving and reinforcement learning. Towards this end, several discriminative models from the scikit-learn machine learning library [7] [17] were trained to implement predictive functionality in our experiments. Data preparation steps included data cleaning, by validating the data model for logged data, and applying transformations to normalize the data, reduce redundancies, and otherwise standardize the coalesced data model. For the baseline predictive task, we specified a classification target: specifically, learning the concept of a job that is more likely than not to be killed given historical and runtime variables. This admits the use of a logistic regression (logit) model, support vector machines, or k-nearest neighbors, whereas for the planned expected utility estimation task, estimating the actual probability of a job being
killed is a general regression task [15] that admits linear, distance-weighted, or support vector regression, as well as probit and generative models.
For the regression task, we used several supervised models, including linear regression, LassoLarsIC (L1 regularization), ridge regression (L2 regularization), ElasticNetCV (a tuned L1/L2 ratio), and a decision tree regressor. For the linear discriminants and their use on this task, we refer the interested reader to [8]. Using these flexible representations admits a balance of generalization quality (via overfitting control) and explainability.

4 RESULTS AND DISCUSSION
In this section, we describe, discuss, and evaluate the results of our machine learning algorithms and the strategy used for our experiments, presenting results and graphs consisting of quantitative metrics.

4.1 Machine Learning Techniques
There are various machine learning algorithms available, and it is difficult to decide which supervised machine learning algorithm provides the best results for our module. Hence, we implemented our model using five supervised machine learning algorithms and trained them on our 14 million instances to predict the required time and memory. The coefficients of determination of the machine learning algorithms are shown in Table 2 and Table 3, respectively. Based on our results, we chose the DecisionTreeRegressor algorithm for our model, since it has the largest R-squared value and therefore fits the data best.
The legend for Table 2 and Table 3 is as follows:
• LR: Linear Regression
• LLIC: LassoLarsIC Regression
• ENCV: ElasticNetCV Regression
• RG: Ridge Regression
• DTR: Decision Tree Regression

Table 2: Wall Clock Time Limit Prediction Algorithms Results
Model   R^2 (%)   Time (Second)
LR      0.0677    0.30
LLIC    0.0677    0.44
ENCV    0.0677    4.32
RG      0.0677    0.18
DTR     0.611     7.53

Table 3: Memory Required Prediction Algorithms Results
Model   R^2 (%)   Time (Second)
LR      0.174     0.39
LLIC    0.174     0.46
ENCV    0.174     4.98
RG      0.174     0.12
DTR     0.638     8.28

4.2 Evaluating Our Model
In this subsection, we show results and evaluate our model. To do so, we test our model using two testbeds, Testbed-1 and Testbed-2. Each testbed is evaluated based on three metrics:
• Submission and Execution Time
• System Utilization
• Backfill-Sched Performance
Submission and Execution Time shows the difference between the job submission time and the execution time (when the job is submitted, when it starts, and the duration of the run). The job submission time is the time stamp recording when the job was submitted, while the execution time is calculated as the difference between the start and end execution times. System Utilization measures how efficiently the system utilizes its resources, while Backfill-Sched Performance shows how well the backfill-sched algorithm helps the main scheduler fit more jobs into the cluster to increase resource utilization.
We used the Slurm Simulator to examine each metric above by comparing the results of the following:
• Running each testbed using user requested memory and run time.
• Running each testbed using the actual memory usage and duration.
• Running each testbed using predicted memory and run time.
4.2.1 Testbed-1. Testbed-1 contains larger jobs (jobs requesting at least 4GB of memory and four cores per task) and includes a set of one thousand jobs. Figure 2 shows the submission and execution time metric based on the job_id, start time, and execution time (Requested vs. Actual vs. Predicted) for the jobs included in Testbed-1. The graph shows that it takes around five days to complete the execution of all of the jobs using the user requested
memory and time, while it takes only around ten hours to complete the runs using the actual and predicted time and memory for the jobs. Based on these results, our model predicted the values for the required time and memory accurately.

Table 4: Average Waiting and Turnaround Time (Requested vs Actual vs Predicted) For Jobs in Testbed-1
            Avg Wait Time (Hour)   Avg TA Time (Hour)
Requested   45.37                  46.29
Actual      3.90                   4.82
Predicted   4.00                   4.94

Figure 3 shows that using our module helped the HPC system achieve higher utilization compared to the utilization of the HPC system that used unmodified user-requested resources. Figure 4 indicates that the backfill-sched algorithm achieved more efficiency on the testbed that used our module compared to the ones that did not. We measure backfill-sched performance by measuring the density of job scheduling attempts over time.
These results were achieved because using our model in most cases reduces the amount of resources required by user-submitted jobs. Hence, the HPC system has more available resources to fit more jobs into the system. Thus, the backfill scheduler becomes less needed and the overall system more efficient by using these available resources.
Table 4 provides the calculated average waiting time and average turnaround time for the jobs in Testbed-1 for the requested, actual, and predicted runs. Using our model significantly reduced the average waiting time from 45.37 hours to 3.9 hours and the average turnaround time from 46.29 hours to 4.94 hours. Both the predicted average waiting time and turnaround time are almost exactly the same as the actual average waiting time and turnaround time for the jobs in Testbed-1.
4.2.2 Testbed-2. Testbed-2 contains smaller jobs (jobs requesting less than 4GB of memory and four cores per task) and includes a set of ten thousand jobs. While the results were less impressive than for Testbed-1, Figures 5 and 6 show that our predicted model achieved better utilization and better backfilling performance. Moreover, Table 5 shows that our predicted model incrementally reduced the average waiting and turnaround times from 0.08 to 0.06 hours and from 3.90 to 3.54 hours, respectively.
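To make the model-selection step of Section 4.1 concrete, the sketch below trains the five scikit-learn regressors named in the table legend and ranks them by held-out R-squared, the same criterion used to select the decision tree regressor. The feature and target arrays here are synthetic placeholders, not the Beocat log data, so the scores will not match Tables 2 and 3.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression, LassoLarsIC, ElasticNetCV, Ridge
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for job-log features (e.g., requested memory,
# requested time, user history); NOT the actual Beocat training data.
X, y = make_regression(n_samples=5000, n_features=8, n_informative=8,
                       noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The five supervised regressors from Section 4.1, under the legend's names.
models = {
    "LR": LinearRegression(),
    "LLIC": LassoLarsIC(criterion="aic"),
    "ENCV": ElasticNetCV(),
    "RG": Ridge(),
    "DTR": DecisionTreeRegressor(max_depth=10, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on held-out data

# Selection criterion: keep the model with the largest R^2.
best = max(scores, key=scores.get)
```

On the real job-log features, the same loop reproduces the paper's choice of the regressor with the most significant R-squared value; hyperparameters such as `max_depth` are illustrative assumptions, not values reported in the paper.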
Figure 2: Jobs Submission and Running Time (Requested vs Actual vs Predicted) for Jobs in Testbed-1. Note the dramatic improvement in the Y-axis range between graphs.
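The average waiting and turnaround times reported in Table 4 can be recomputed from any job log that records submission, start, and end times. A minimal sketch, using hypothetical timestamps in hours rather than the actual Testbed-1 log:

```python
# Hypothetical job records as (submit, start, end) times in hours.
# These values are illustrative only, not taken from the Testbed-1 log.
jobs = [
    (0.0, 1.5, 4.0),
    (0.5, 2.0, 3.5),
    (1.0, 1.25, 6.0),
]

# Waiting time = start - submit; turnaround time = end - submit.
wait_times = [start - submit for submit, start, end in jobs]
turnaround_times = [end - submit for submit, start, end in jobs]

avg_wait = sum(wait_times) / len(jobs)              # ~1.08 hours here
avg_turnaround = sum(turnaround_times) / len(jobs)  # 4.0 hours here
```

The same two averages, computed over each of the requested, actual, and predicted simulation runs, yield the three rows of Table 4.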
Figure 7: Jobs Submission and Running Time (Predicted Time Required vs Memory)
Figure 8: Utilization (Requested vs Actual vs Required Time Predicted vs Memory Predicted vs Required Time and Memory Predicted)
Figure 9: Backfill-Sched Performance (Requested vs Actual vs Required Time Predicted vs Memory Predicted vs Required Time and Memory Predicted)
learning algorithms. Our model helps to reduce the computational time, increase utilization of the HPC system, decrease the average waiting time, and decrease the average turnaround time for the submitted jobs. As a result, our analysis indicates that our model
helps maximize efficiency, increase the capability, and decrease the power consumption of the cluster.

6 FUTURE WORK
Our future work will include continuing to improve our model by applying additional machine learning algorithms, testing our module on a real HPC system, and including a bigger testbed to achieve more accurate results.
In addition, our machine learning approach will incorporate both classification (logit and other discriminative models) and regression (probit estimation and other maximum likelihood estimation) into a decision support system. This system will provide a test bed for personalized recommendations based on the task outlined in this work, of learning a predictive distribution, and potentially for the future use of reinforcement learning using historical data and/or a simulative model based on this distribution. This in turn will enable experimental evaluation of the pros, cons, and effectiveness of off-loading large jobs to a cloud service. As a use case of data science, it will also facilitate exploration of variables that are exogenous to a single job, such as a user's history of job submission and rates of success or failure by mode (memory vs. CPU). This can also potentially provide insights into the effectiveness of training and the skill acquisition curve of a user as related to self-efficacy (as indicated on surveys) and as discovered automatically by clustering of users.

ACKNOWLEDGMENTS
We greatly appreciate the HPC staff at Kansas State University, including Adam Tygart and Kyle Hutson, for their help and technical support. We also thank the authors of the Slurm simulator at SUNY University at Buffalo for releasing their work. This research was supported by NSF awards CHE-1726332, ACI-1440548, CNS-1429316, NIH award P20GM113109, and Kansas State University.

REFERENCES
[1] [n. d.]. Beocat. [Link] (Accessed on 03/13/2019).
[2] [n. d.]. Documentation Index. [Link] (Accessed on 02/11/2019).
[3] [n. d.]. GitHub - ubccr-slurm-simulator/slurm_simulator: Slurm Simulator: Slurm Modification to Enable its Simulation. [Link] (Accessed on 01/03/2019).
[4] [n. d.]. PBS Professional Open Source Project. [Link] (Accessed on 02/03/2019).
[5] [n. d.]. Slurm Workload Manager - Documentation. [Link] (Accessed on 01/07/2019).
[6] [n. d.]. TORQUE Resource Manager. [Link] (Accessed on 02/02/2019).
[7] 2019. Getting Started with Scikit-learn for Machine Learning. In Python® Machine Learning. John Wiley & Sons, Inc., 93–117. [Link]
[8] Dan Andresen, William Hsu, Huichen Yang, and Adedolapo Okanlawon. 2018. Machine Learning for Predictive Analytics of Compute Cluster Jobs. CoRR abs/1806.01116 (2018). arXiv:1806.01116 [Link]
[9] Josep Ll. Berral, Íñigo Goiri, Ramón Nou, Ferran Julià, Jordi Guitart, Ricard Gavaldà, and Jordi Torres. 2010. Towards energy-aware scheduling in data centers using machine learning. In Proceedings of the 1st International Conference on Energy-Efficient Computing and Networking - e-Energy '10. ACM Press. [Link]
[10] Bruce Bugbee, Caleb Phillips, Hilary Egan, Ryan Elmore, Kenny Gruchalla, and Avi Purkayastha. 2017. Prediction and characterization of application power use in a high-performance computing environment. Statistical Analysis and Data Mining: The ASA Data Science Journal 10, 3 (Feb. 2017), 155–165. [Link]
[11] N.R. Council, D.E.L. Studies, D.E.P. Sciences, and C.P.I.H.E.C.I.F.S. Engineering. 2008. The Potential Impact of High-End Capability Computing on Four Illustrative Fields of Science and Engineering. National Academies Press. [Link]
[12] Carlos Fenoy García. 2014. Improving HPC applications scheduling with predictions based on automatically collected historical data. [Link]
[13] Eric Gaussier, David Glesser, Valentin Reis, and Denis Trystram. 2015. Improving backfilling by using machine learning to predict running times. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '15. ACM Press. [Link]
[14] W. Gentzsch. [n. d.]. Sun Grid Engine: towards creating a compute power grid. In Proceedings First IEEE/ACM International Symposium on Cluster Computing and the Grid. IEEE Comput. Soc. [Link]
[15] S. B. Kotsiantis, I. D. Zaharakis, and P. E. Pintelas. 2007. Machine learning: a review of classification and combining techniques. [Link]
[16] Rajath Kumar and Sathish Vadhiyar. 2013. Identifying Quick Starters: Towards an Integrated Framework for Efficient Predictions of Queue Waiting Times of Batch Parallel Jobs. In Job Scheduling Strategies for Parallel Processing, Walfredo Cirne, Narayan Desai, Eitan Frachtenberg, and Uwe Schwiegelshohn (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 196–215.
[17] L. Massaron and A. Boschetti. 2016. Regression Analysis with Python. Packt Publishing. [Link]
[18] Andréa Matsunaga and José A.B. Fortes. 2010. On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications. In 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing. IEEE. [Link]
[19] Nikolay A. Simakov, Martins D. Innus, Matthew D. Jones, Robert L. DeLeon, Joseph P. White, Steven M. Gallo, Abani K. Patra, and