Using the cluster

Running interactive jobs on cluster nodes
Submitting batch jobs from enigma2
Two steps to submitting an R BATCH job to the cluster
Specifying your job's memory needs
Checking your job's memory usage
Checking the status of your job
Job status via email
Using the express queue
Questions and comments

Other links:
A tutorial lecture on using the Cluster and Sun Grid Engine (SGE)
Troubleshooting

Using the cluster


The computing cluster for Hopkins Biostatistics can be accessed via the machine
enigma2.biostat.jhsph.edu. If you do not have an account on this machine,
please contact Jiong Yang (jyang at jhsph.edu) or Marvin Newhouse (marv at
jhu.edu) to have one created for you.

The cluster is the computational workhorse for the department, and all users are
encouraged to run jobs on it. As stated above, the machine enigma2 is the access host
for the cluster. You will not be logging directly into the compute nodes of the cluster;
rather, you will log on to enigma2 and then submit jobs to the cluster nodes.

As of 2010-08-20 the cluster has 376 64-bit cores with an aggregate of 1.78 TB of RAM.
However, we are continually upgrading the capacity of the cluster. The most up-to-date
configuration information is here.

The above configuration may change depending on maintenance needs, and not all
nodes are available for all types of jobs. For example, due to licensing restrictions, one
8GB node is reserved for SAS.

Everything related to job submission, scheduling, and execution on the cluster is under
the control of Sun's Grid Engine software (SGE). The Grid Engine project, sponsored by
Sun Microsystems, is an open source community effort to facilitate the adoption of
distributed computing solutions. Among other things, we use SGE to limit the total
number of jobs each user is allowed to run simultaneously on the cluster (currently 16,
but subject to change). However, you may submit more jobs than the limit, all of which
will be queued to run as your other jobs finish. When the cluster nodes are all at
maximum capacity, jobs waiting to run are scheduled according to the functional share
priority algorithm we have defined using SGE.

IMPORTANT
The logon machine (enigma2) is only used for login. Do NOT run jobs on enigma2!
This machine is not for doing any sort of computation. Rather, it is ONLY for text
editing and for submitting jobs to the cluster. Any long-running jobs found running on
enigma2 (R BATCH jobs, for example) will be KILLED WITHOUT NOTICE. You
will lose any data and/or computations associated with the running job.

(Back to top)

Running interactive jobs on cluster nodes


While you should not run long-running interactive jobs on enigma2, you can run such
interactive jobs on the cluster nodes. This is done via a special program called qrsh,
which essentially opens a remote shell on a cluster node. The easiest thing to do is
log in to enigma2 and type
qrsh
You will be logged into a "random" cluster node and get an interactive shell prompt,
just as if you had logged into enigma2. Now you can run whatever program you want; for
example, you can run R. However, you must remember to log out ('exit' or Ctrl-D).
Otherwise, you will be taking up a slot in the queue which will not be available to
others.

While you are logged into a cluster node via qrsh, if you run

qstat -u YOUR_USER_ID
you'll see something like the following:
job-ID  prior    name        user   state  submit/start at      queue        slots
-----------------------------------------------------------------------------------
 15194  1.53962  Pf_3D7      rpeng  r      08/15/2007 16:04:16  queue@node   1
 15299  2.00790  BootA10600  rpeng  r      08/17/2007 15:26:06  queue@node   1
 15290  2.35449  QRLOGIN     rpeng  r      08/17/2007 15:20:00  queue@node   1
The job labeled QRLOGIN is the interactive session (for more info see Checking the
status of your job).

You may also specify memory requirements or special queues on your qrsh command
just as you do on the qsub command (see below). For interactive work we strongly
encourage users to work on the cluster via qrsh (rather than use enigma2).
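
For example, to request an interactive session on a node that currently has at least
4 GB of memory free (using the same mem_free resource described under "Specifying your
job's memory needs" below), you might type

qrsh -l mem_free=4G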

NOTE: Do not run background or 'nohup' jobs while using qrsh. Sun Grid Engine
(SGE) must know about your job/session so that it can manage and account for cluster
resources. Additionally, SGE assumes one slot (corresponding to one CPU core) for
each qrsh session. If you still have running programs and no session appears for you in
qstat, then you have done something that is not appropriate for the way the HPSCC
Cluster is managed. If jobs are found running on cluster nodes with no associated
SGE entry, they will be killed.
NOTE: If you encounter an error while running a program interactively on a cluster
node and your program crashes, it might still appear in the cluster's queue. If you
did not quit your program normally, make sure to check the cluster queue (via
qstat, see below) to see whether your (interactive) job is still there. If it is, get
the job-ID and kill the job using qdel, as sketched below.
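
A minimal cleanup sequence might look like this (15290 here is just the QRLOGIN
job-ID from the example output above; yours will differ):

qstat -u YOUR_USER_ID
qdel 15290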

(Back to top)

Submitting batch jobs from enigma2


As we have indicated, the cluster uses Sun's Grid Engine (SGE) to control scheduling of
jobs. SGE comes with a number of command line tools to help you manage your jobs
on the cluster. Most of the time you can get away with just knowing a few commands.
The most immediately relevant ones are:

1. qsub: submit a batch job to the cluster
2. qstat: query the status of your jobs on the cluster (or look at all jobs running on the cluster)
3. qdel: delete a job from the cluster/queue.

Every job that you submit to the cluster from enigma2 must be wrapped in a shell
script. That is, you cannot just start a program from the command line (e.g. nice +19
R) like you could on your own machine or some other server. (Of course, if appropriate,
you can run an interactive job on the cluster as indicated above.) Not to worry, though;
wrapping your program in a shell script is not as difficult as it might sound! Below
are instructions for how to run an R batch job on the cluster.

(Back to top)

Two steps to submitting an R BATCH job to the cluster

Using the Sun Grid Engine to submit an R BATCH job to the cluster is very simple.

1. First (on enigma2), assuming you have an R program in a file named
mycommands.R, you need to create a new file that will invoke and run your R
program when it is submitted to the cluster. Let's call this new file batch.sh.
You should put this batch.sh file in the same directory as your mycommands.R
file.

To run an R BATCH job on the cluster using the mycommands.R file, your
batch.sh file need only contain this one line:

R CMD BATCH mycommands.R

The file may also contain other lines that specify SGE job options or commands to
run before or after the "R CMD BATCH ..." line (a fuller sketch appears at the end
of this section). The technical name for this file is a "shell script"; knowing this
might help you communicate with the system administrator.
2. Once you've written your short batch.sh file, you can submit it to the cluster via
the command

qsub -cwd batch.sh

The -cwd option tells SGE to execute the batch.sh script on the cluster from
the current working directory (otherwise, it will run from your home directory,
which is probably not what you want).

That's all you have to do! There are a few things to note:

• You do not have to put an & at the end of the line (don't worry if you don't know
what the & might be used for). qsub automatically sends your job to the cluster
and returns to your command line prompt so that you can do other things.
• After submitting your job with qsub, you can use the qstat command to see the
status of your job(s).
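
As a sketch, a slightly fuller batch.sh might embed SGE options directly in the script
using #$ lines (explained in the "Job status via email" section below); the memory
values here are only examples, and YOUR_EMAIL_ADDRESS is a placeholder:

#!/bin/bash
#$ -cwd
#$ -l mem_free=4G,h_vmem=6G
#$ -m e
#$ -M YOUR_EMAIL_ADDRESS
R CMD BATCH mycommands.R

With the options embedded this way, the job can be submitted with just qsub batch.sh.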

(Back to top)

Specifying your job's memory needs

When submitting your job(s), if you do not specify any memory requirements, SGE will
choose the cluster node(s) with the lowest CPU load WITHOUT REGARD TO
MEMORY AVAILABILITY (subject to other scheduling parameters which we have
defined).

It is, therefore, IMPORTANT to specify your expected memory requirements when
submitting cluster jobs. After calculating approximately how much memory your job
will need, you should add a memory resource requirement to your qsub (or qrsh)
command.

Use a command like:

qsub -cwd -l mem_free=[[memory needed]] batch.sh

where [[memory needed]] is the amount of memory your job will require, expressed in
megabytes (suffix M) or gigabytes (suffix G).

For example, if your job will require 4GB of memory, you should type

qsub -cwd -l mem_free=4000M batch.sh


... or
qsub -cwd -l mem_free=4G batch.sh

NOTES:
-l is a minus sign followed by a lower-case letter L.

See a more detailed explanation of what -l mem_free implies below in this section.

To see a summary of available nodes and their memory capacity and current load, use
the command qhostw.
After submitting your job with qsub, use the qu or qstat command to see which queue
(node) your job actually went to (see Checking the status of your job). In the output of
qu, the next-to-last column lists the queue name.

****************************
We now recommend that all users also use the h_vmem parameter to place a limit on the
amount of memory that their jobs might use, so that a runaway job does not crash the
node it is running on (with other users' jobs crashing as well).

Here are some notes explaining the use of mem_free and h_vmem, as well as h_fsize:

------------------------------------------------

mem_free
Request approximately what you think your job will need (or a little more) with the
mem_free request. This does not reserve memory for your job; it simply puts your job
on a node with that amount of memory currently available (see the example under h_vmem
below).
h_vmem
To keep a job from running away at the high end, use the h_vmem parameter to limit your
job's total memory use. (We are now encouraging all users to use this parameter to stop
a runaway job from crashing the node.)

Something like:

qsub -l mem_free=12G,h_vmem=16G batch.sh


... or similarly on a qrsh command.
h_fsize
Some users might also want to limit the size of files that can be created by their job (to
avoid the consequences of any bug in their program that might, under certain
conditions, cause a file to grow without bounds ... NOT A GOOD THING)

Something like:

qsub -l mem_free=12G,h_vmem=16G,h_fsize=1G batch.sh


... or similarly on a qrsh command would, additionally, limit the size of any file
created by the job to not more than 1 GB.

NOTE: No spaces in the comma delimited list of resources and limits.

(Back to top)

Checking your job's memory usage

While your qsub job is running, you can see its memory usage using the command
qstat -j NNNNN | grep vmem
where NNNNN is your specific cluster job number ... look at the "vmem" and
"maxvmem" entries.
To make it easier to monitor memory usage for your currently running jobs, we have
created the command qmem. If you have no jobs running on the cluster, qmem will print
nothing, but if you do, the results will look something like:

[enigma2]$ qmem
10506  rpeng  node=33  vmem=289.1M, maxvmem=294.3M  howMany10.sh
14257  rpeng  node=8   vmem=231.5M, maxvmem=238.0M  s.all.sh
16695  rpeng  node=25  vmem=  1.8G, maxvmem=  1.8G  mergedoc1.3.sh
17464  rpeng  node=15  vmem=272.9M, maxvmem=284.0M  simulateVariance.sh
17555  rpeng  node=12  vmem=   N/A, maxvmem=   N/A  QRLOGIN
17584  rpeng  node=6   vmem=315.1M, maxvmem=334.3M  calculateVaried-emp.genSampScheme.sh

To see your job's memory usage upon job completion, use email notification, which
works for aborted jobs as well. See the job status via email discussion for instructions
on how to use email notification.
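
For example, a submission that asks SGE to mail you when the job ends or aborts might
look like this (the 'a' in -m ea requests mail on abort; it is a standard SGE flag,
though not otherwise covered in this document, and YOUR_EMAIL_ADDRESS is a placeholder):

qsub -cwd -m ea -M YOUR_EMAIL_ADDRESS -l mem_free=4G batch.sh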

Note: qrsh sessions will not report memory usage using the above method. You will
simply see "N/A" in the entries for vmem and maxvmem.

(Back to top)

Checking the status of your job

After submitting your job you can use qstat to look at the status of your job.

By default, under our version of SGE, qstat with no arguments shows cluster jobs for
all users. To restrict the output to show only your jobs, use the -u USERID argument.
For example:

qstat -u rpeng
would only display active/pending jobs for user rpeng.

However, we have created the command qu to easily accomplish the same thing (view
only your jobs). If you have no jobs running on the cluster, qu will print nothing, but
if you do, the results will look something like:

[enigma2]$ qu
job-ID  prior    name        user   state  submit/start at      queue        slots
-----------------------------------------------------------------------------------
 15194  1.53962  Pf_3D7      rpeng  r      08/15/2007 16:04:16  queue@node   1
 15299  2.00790  BootA10600  rpeng  r      08/17/2007 15:26:06  queue@node   1
 15290  2.35449  QRLOGIN     rpeng  r      08/17/2007 15:20:00  queue@node   1
Under the state column you can see the status of your job. Some of the codes are:

• r: the job is running
• t: the job is being transferred to a cluster node
• qw: the job is queued (and not running yet)
• Eqw: an error occurred with the job

You can look at the manual page for qstat (type man qstat at the prompt) to get more
information on the state codes.

Another important thing to note is the job-ID for your job. You need to know this if
you ever want to make changes to your job. For example, to delete your job from the
cluster, you can run

qdel 40
where 40 is the job-ID obtained from running qstat.

(Back to top)

Job status via email

If you wish to be notified via email when your job's status changes, include options like
the following when submitting your jobs:
qsub -m e -M YOUR_EMAIL_ADDRESS your_job.sh
which means: send email to the given address(es) when the job ends.

If you want to automatically have such options (or others) always added to your job(s),
simply put them in a file named .sge_request in your home directory. You can also
have working-directory-specific .sge_request files (see the man page for sge_request
- man sge_request).

Lines like this in your .sge_request file:

-M YOUR_EMAIL_ADDRESS
-m e
will cause an email to be sent, when your job ends, for every cluster job that you start
(including, for what it's worth, a qrsh 'job').

You could use -m n on individual qsub job command lines to suppress email
notification for certain jobs.

Or better yet ... you might put only the -M YOUR_EMAIL_ADDRESS line in the
.sge_request file and simply use the -m e option on jobs for which you want email
notification.
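
For instance, with only this line in your ~/.sge_request file:

-M YOUR_EMAIL_ADDRESS

you could then request notification on a per-job basis with qsub -cwd -m e batch.sh.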

Note: You may also invoke the options shown above (and others) by including special
lines at the top of your job shell scripts. Lines beginning with #$ are interpreted as qsub
options for that job. For example, if the first few lines of your script look like the
following:

#!/bin/bash
#$ -M YOUR_EMAIL_ADDRESS
#$ -m e

The lines beginning with #$ would cause SGE to send email to YOUR_EMAIL_ADDRESS
when the job ends. Similarly,
#$ -m be
would cause an email to be sent when the job begins ('b') and ends ('e'). See the manual
page for qsub (type man qsub at a shell prompt) to get more information.

(Back to top)

Using the express queue (Not enabled as of 08-20-2010)


A special queue has been created (currently consisting of 2 slots on one node) for
"express" jobs. Use this express queue to avoid "traffic jams" on the rest of the cluster
when you need to run a relatively quick job, whether it be with qrsh or qsub. The
express queue can be selected by using the -l express option on your qrsh or qsub
command. Each job (or interactive session) run using the express queue is limited to 30
minutes of cpu time and 3 hours of clock time.
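
For example (once the express queue is enabled), an express submission or interactive
session might look like this, with batch.sh as in the earlier examples:

qsub -cwd -l express batch.sh
qrsh -l express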

The express queue node(s) are reserved for express jobs unless there are no more 2GB,
4GB, or 8GB slots available in the standard queues; in which case, the express queue
nodes may be used to satisfy standard queue requests.

Remember, to see a summary of available hosts and their current memory capacity and
load, use the command qhostw.

(Back to top)

Questions and/or comments

Please send any questions or comments about this document to BITSUPPORT
(bitsupport at jhsph.edu).

This document was last modified on 2010-Sep-03
