Computing using CONDOR
Pradipta Ray
CONDOR
To avoid supervising a program run: automated submission, checkpointing, notification of errors, etc.
To parallelize the computation of multiple processes: scheduling, optimizing and bookkeeping for parallel computing.
CONDOR: a toolkit of programs which lets you do these things; an interface to the cluster.
A few things to know about cluster computing in general
Progress is only as fast as computing at the bottleneck point: if the parallelization requires synchronization or assimilation at the end, then even if some processes finish early, they will wait on the slowest process. For example, if 99 of 100 jobs finish in one hour but one takes ten hours, the final assimilation step cannot start for ten hours.
Typically disk I/O is the bottleneck, specifically non-local disk I/O.
Streams
[Figure: diagram of a process's I/O streams; source: software.rc.fas.harvard.edu]
Standard job
> foobar -a -n <foobar.in 1>foobar.out 2>foobar.err
foobar: the executable
-a -n: the parameter list
<foobar.in: stdin (redirected from a file)
1>foobar.out: stdout (redirected to a file)
2>foobar.err: stderr (redirected to a file)
.cmd file
A way to tell the Condor system these things:
Executable = foobar
Universe = vanilla
getenv = true
input = foobar.in
output = foobar.out
error = foobar.err
Log = foobar.log
arguments = "-a -n"
Queue
Provide absolute paths for all files where possible, unless the files are in the local directory. Use which to figure out the path of an executable, as shown below.
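For instance, a quick check from the shell (the path shown is just an illustration; yours will differ):

> which foobar
/usr/local/bin/foobar

You would then write Executable = /usr/local/bin/foobar in the .cmd file.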
What is Universe?
The mode in which Condor runs your job. vanilla is the default; standard allows checkpointing.
Checkpointing
Consists of storing a snapshot of the current application state and, later on, using it to restart the execution in case of failure. (Wikipedia)
Good programming practice asks the programmer to also manually create checkpoints in the code. If you are using someone else's code, Condor's checkpointing option is useful; see the sketch below.
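A minimal sketch of using Condor's checkpointing (the source file myprog.c is hypothetical). The standard universe requires relinking your program with Condor's libraries via condor_compile:

> condor_compile gcc -o myprog myprog.c

and then requesting the standard universe in the submit file:

Executable = myprog
Universe = standard
Queue

Condor then checkpoints the job periodically and can restart it from the last checkpoint if its machine fails.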
What is getenv?
getenv = true copies all the environment variables that are set in the user's shell to the job running on Condor; the sketch below shows one way to verify this.
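A quick sanity check (not from the original slides): submit /usr/bin/env as the job, which prints its environment to stdout, and inspect the output file afterwards.

Executable = /usr/bin/env
Universe = vanilla
getenv = true
output = env.out
Queue

After the job runs, env.out contains the environment variables the job actually saw.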
Many options you can give
Some other useful ones:
nice_user = true
notify_user = [email protected]
notification = Error
If you use notification = Always, your mailbox will be flooded!
How to submit a job
> condor_submit foobar.cmd
As simple as that!
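On success, condor_submit prints a short confirmation along these lines (the cluster number 42 is just an illustration):

Submitting job(s).
1 job(s) submitted to cluster 42.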
Tracking the system and the job
Check its logfile. The following commands are also helpful (examples below):
condor_status: lists available nodes and their status
condor_q: lists the job queue
condor_rm: deletes a job from the queue
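A typical sequence (the job ID 42.0 is hypothetical; take the real cluster.process ID from the condor_q listing):

> condor_q
> condor_rm 42.0
> condor_rm -all

The first command lists your queued and running jobs, the second removes one specific job, and the third removes all of your jobs.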
What if I have hundreds of jobs?
Use a script to generate multiple submit files
Bad idea: unless you are careful, the chances of files with the same names being overwritten, or of heavy disk I/O competition, are high. Condor also allows multiple job submission from a single submit file, and will optimize the scheduling process.
Multiple submission
Start with single submission:
Universe = vanilla
Notification = Complete
Executable = /bin/echo
Arguments = "test job"
Requirements = CPU_Speed >= 1
Rank = CPU_Speed
Image_Size = 428 Meg
Priority = +20
GetEnv = True
Initialdir = /experiment/u/user
Input = /dev/null
Output = /experiment/u/user/myjob.out
Error = /experiment/u/user/myjob.err
Log = /experiment/u/user/myjob.log
Notify_user = [email protected]
+Experiment = "experiment"
+Job_Type = "cas"
Queue
Submitting multiple jobs
Universe = vanilla
#
# Common elements in file removed for brevity.
#
+Job_Type = "cas"
Queue 100

Problem with the common files: all 100 jobs now write to the same Output, Error and Log files, so they clobber each other.
What if they need different parameters?
Universe = vanilla
#
# Common elements in file removed for brevity.
#
Output = /experiment/u/user/myjob.out.$(Process)
Error = /experiment/u/user/myjob.err.$(Process)
Log = /experiment/u/user/myjob.log.$(Process)
Notify_user = [email protected]
+Experiment = "experiment"
+Job_Type = "cas"
Queue 100
The process number $(Process) goes from 0 to N-1. Index the Input file as well, so each job runs with a different input; see the line below.
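Following the same pattern (the .in.$(Process) naming is just an illustration, assuming you have created one input file per job):

Input = /experiment/u/user/myjob.in.$(Process)

With Queue 100 this picks up myjob.in.0 through myjob.in.99.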
Mixed submission
Universe = vanilla
Notification = Complete
Executable = /bin/echo
Arguments = "test job"
Requirements = CPU_Speed >= 1
Rank = CPU_Speed
Image_Size = 428 Meg
Priority = +20
GetEnv = True
Initialdir = /experiment/u/user
Input = /dev/null
Output = /experiment/u/user/myjob.out.$(Process)
Error = /experiment/u/user/myjob.err.$(Process)
Log = /experiment/u/user/myjob.log.$(Process)
Notify_user = [email protected]
+Experiment = "experiment"
+Job_Type = "cas"
Queue 90
Arguments = "$(Process)"
Requirements = CPU_Speed == 2
Queue 9
Executable = myjob.sh
Arguments = 99
Requirements = CPU_Speed >= 3
Queue

Settings changed between Queue statements apply to the jobs queued after them: the first 90 jobs run /bin/echo with the "test job" argument, the next 9 pass their process number as the argument and require CPU_Speed == 2, and the last job runs myjob.sh with argument 99 on a faster machine.
If you are feeling lazy
> condor-exec myprog a1 a2 a3
Try not to do this, please! Write your own Condor script.
Stopping your job from getting killed
Program performs an illegal action: check the log and the stderr-redirected file.
Requires more RAM than allowed: request_memory = 2*1024 tells Condor this job wants 2 GB (2048 MB) of RAM. If you leave out the request_memory line, the default is 1024 MB. Note that if you over-estimate, you limit the number of machines your job can run on, but if you under-estimate and the job outgrows its memory request, Condor may kill it. See the snippet below.
Requires more hard disk space than available: typically only causes …
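A minimal submit-file sketch requesting 2 GB of RAM, using the request_memory line quoted above (the executable name foobar is hypothetical):

Executable = foobar
Universe = vanilla
request_memory = 2*1024
Queue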
Optimizing, and dos and don'ts for our cluster
Over to Zhenyu