
Computing using CONDOR

Pradipta Ray

CONDOR
To avoid having to supervise a program run: automated submission, checkpointing, notification of errors, etc.
To parallelize the computation of multiple processes: scheduling, optimizing, and bookkeeping for parallel computing.
CONDOR: a toolkit of programs that lets you do these things; an interface to the cluster.

A few things to know about cluster computing in general


Progress is only as fast as computing at the bottleneck point: if the parallelization requires synchronization or assimilation at the end, then even processes that finish early will wait on the slowest processes.
Typically disk I/O is the bottleneck, specifically non-local disk I/O.

Streams

software.rc.fas.harvard.edu

Standard job
> foobar -a -n <foobar.in 1>foobar.out 2>foobar.err

executable (foobar), parameter list (-a -n), stdin (foobar.in), stdout (foobar.out), stderr (foobar.err)

.cmd file
A way to tell the condor system these things

Executable = foobar
Universe = vanilla
getenv = true
input = foobar.in
output = foobar.out
error = foobar.err
Log = foobar.log
arguments = "-a -n"
Queue

Provide the absolute path for all files if possible, unless the files are in the local directory; use which to figure out the path of an executable.
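For example (the path printed here is hypothetical; it depends on where foobar is installed):

> which foobar
/usr/local/bin/foobar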

What is Universe?
The mode in which Condor runs your job: vanilla is the default; standard allows checkpointing.
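A minimal sketch of a submit file that opts into checkpointing (in typical Condor setups, the executable must also be relinked with condor_compile before it can run in the standard universe):

Executable = foobar
Universe = standard
Queue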

Checkpointing
Consists of storing a snapshot of the current application state and, later on, using it to restart the execution in case of failure.
wikipedia

Good programming practice asks the programmer to also create checkpoints manually in the code; if you are using someone else's code, Condor's checkpointing option is useful.
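A minimal sketch of manual checkpointing, as a hypothetical shell script (the work loop and the file name checkpoint.txt are placeholders):

#!/bin/sh
# Resume from the last completed iteration if a checkpoint exists.
i=0
[ -f checkpoint.txt ] && i=$(cat checkpoint.txt)
while [ "$i" -lt 100 ]; do
    # ... one unit of work for iteration $i goes here ...
    i=$((i + 1))
    # Record progress so a restart can resume from this point.
    echo "$i" > checkpoint.txt
done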

What is getenv?
getenv = true copies all the environment variables that are set in the user's shell to the job running on condor

Many options to give


Some other useful ones:

nice_user = true
notify_user = [email protected]
notification = Error

If you use notification = Always, your mailbox will be flooded!

How to submit a job


> condor_submit foobar.cmd

As simple as that!

Tracking the system and the job


Check the job's log file. The following commands are also helpful:
condor_status lists available nodes and their status.
condor_q lists the job queue.
condor_rm deletes a job from the queue.
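For example (the cluster id 1234 is hypothetical; condor_submit prints the real one when the job is queued, and condor_q shows it):

> condor_status
> condor_q
> condor_rm 1234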

What if I have hundreds of jobs?


Use a script to generate multiple submit files?
A bad idea: unless you are careful, the chances of same-named files being overwritten, or of competing disk I/O, are high.
Condor also allows multiple job submission using a single submit file, and will optimize the scheduling process (examples follow).

Multiple submission
Start with single submission:

Universe = vanilla
Notification = Complete
Executable = /bin/echo
Arguments = "test job"
Requirements = CPU_Speed >= 1
Rank = CPU_Speed
Image_Size = 428 Meg
Priority = +20
GetEnv = True
Initialdir = /experiment/u/user
Input = /dev/null
Output = /experiment/u/user/myjob.out
Error = /experiment/u/user/myjob.err
Log = /experiment/u/user/myjob.log
Notify_user = [email protected]
+Experiment = "experiment"
+Job_Type = "cas"
Queue

Submitting multiple jobs


Universe = vanilla
#
# Common elements in file removed for brevity.
#
+Job_Type = "cas"
Queue 100

Problem with the common files: all 100 jobs now share the same input, output, error, and log files.

What if they need different parameters


Universe = vanilla
#
# Common elements in file removed for brevity.
#
Output = /experiment/u/user/myjob.out.$(Process)
Error = /experiment/u/user/myjob.err
Log = /experiment/u/user/myjob.log.$(Process)
Notify_user = [email protected]
+Experiment = "experiment"
+Job_Type = "cas"
Queue 100

The process number goes from 0 to N-1. Index the Input file as well, so that each program runs with different input (a sketch follows).
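A minimal sketch, reusing the paths from the example above (the per-process input files myjob.in.0 through myjob.in.99 are assumed to already exist):

Universe = vanilla
#
# Common elements removed for brevity.
#
Input = /experiment/u/user/myjob.in.$(Process)
Output = /experiment/u/user/myjob.out.$(Process)
Queue 100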

Mixed submission
Universe = vanilla
Notification = Complete
Executable = /bin/echo
Arguments = "test job"
Requirements = CPU_Speed >= 1
Rank = CPU_Speed
Image_Size = 428 Meg
Priority = +20
GetEnv = True
Initialdir = /experiment/u/user
Input = /dev/null
Output = /experiment/u/user/myjob.out.$(Process)
Error = /experiment/u/user/myjob.err
Log = /experiment/u/user/myjob.log.$(Process)
Notify_user = [email protected]
+Experiment = "experiment"
+Job_Type = "cas"
Queue 90

Arguments = "$(Process)"
Requirements = CPU_Speed == 2
Queue 9

Executable = myjob.sh
Arguments = 99
Requirements = CPU_Speed >= 3
Queue

Settings persist until overridden, so this one file submits 100 jobs in all: 90 echo jobs, 9 more with per-process arguments and a stricter CPU requirement, and one final job running myjob.sh.

If you are feeling lazy


> condor-exec myprog a1 a2 a3

Try not to do this, please! Write your own Condor submit script.

Stopping your job from getting killed


Program performs an illegal action: check the log and the file stderr was redirected to.
Requires more RAM than allowed: request_memory = 2*1024 tells Condor this job wants 2 GB (2048 MB) of RAM. If you leave out the request_memory line, the default is 1024 MB. Note that if you over-estimate, you limit the number of machines your job can run on, but if you under-estimate and the job outgrows its memory request, Condor may kill it.
Requires more hard disk space than available: typically only causes …
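A minimal sketch of where the memory request goes, reusing the foobar.cmd example from earlier (the 2 GB figure is just the slide's example value):

Executable = foobar
Universe = vanilla
request_memory = 2*1024
Queue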

Optimizing, and dos and don'ts for our cluster


Over to Zhenyu
