Condor Basics

High-throughput Computing (HTC)

HTC is the use of many computing resources over a long period of time to accomplish a computational task. High-throughput computing (HTC) differs from high-performance computing (HPC) in several ways.

HPC tasks are characterized by needing a large amount of computing power for short periods of time. HTC tasks also require large amounts of computing, but over much longer times (days and weeks). HPC environments are often measured in floating-point operations per second (FLOPS). The HTC field is more interested in how many jobs can be completed over a period of time than in how fast an individual job completes.

Condor

Condor is a system for managing compute-intensive jobs.

Condor universes (modes of operation)

Condor has two main modes of operation, called "universes".

Condor at Departement ITET

Rules to use Condor

The following rules protect the servers from heavy load, which can disturb other users. Condor is a powerful tool, so we ask you to be careful and to respect the rules. If we notice that Condor jobs are overloading a server, we can freeze and stop your jobs.

Condor Examples

How to display information about the state of condor

Show the state of all machines connected to the condor pool

The command condor_status shows the state of all machines connected to the condor-pool.

Explanation

Show the machines where my jobs are ready to run at the moment

If the command does not return a result, all machines are busy. You can still submit jobs when no machine is available: they are added to the queue and executed as soon as a machine becomes idle.

How to submit a job

Condor's actions are controlled by the submit description file. To submit a job, run condor_submit submit_file, where submit_file is the name of that file.

Example: Uname

This simple example runs the "uname -a" command, which outputs a single line with the name of the machine and the operating system version. To run this example, copy and paste the following into a file in your home directory called (for example) uname.condor, then run condor_submit uname.condor.

##
## Condor "uname -a" example
##      Filename: uname.condor
##
##################################################################
universe        = vanilla
# Program and arguments
#
executable      = /bin/uname
arguments       = -a
#  stdin, stdout, stderr, and log files
#       (note: These default to /dev/null if unspecified)
log             = Uname.log
output          = Uname.out
error           = Uname.err
#input          = /dev/null
queue 1

After the job finishes you should get an email detailing various statistics. Uname.err should be empty, Uname.out should contain a line about the host system that ran your job, and Uname.log should have some details about what Condor did with the job.
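The submit-and-check steps above could be scripted roughly as follows. This is only a sketch: it assumes uname.condor is in the current directory and that condor_wait is available in your installation. check_uname_results is a hypothetical helper written for this example, not a Condor command.

```shell
#!/bin/sh
# Sketch: submit uname.condor, wait for it, then check the result files.

# Hypothetical helper: succeeds if Uname.err exists and is empty
# and Uname.out exists and is non-empty.
check_uname_results() {
    [ -f Uname.err ] && [ ! -s Uname.err ] && [ -s Uname.out ]
}

# Only submit when Condor is installed and the submit file is present.
if command -v condor_submit >/dev/null 2>&1 && [ -f uname.condor ]; then
    condor_submit uname.condor
    # condor_wait blocks until the jobs recorded in the log have finished.
    condor_wait Uname.log
    if check_uname_results; then
        cat Uname.out
    else
        echo "Unexpected result; see Uname.err and Uname.log" >&2
    fi
fi
```

If the check fails, Uname.log is usually the first place to look.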

Example: Mandelbrot

Next is an example that uses Condor to create a sequence of images from the Mandelbrot set (the famous fractal) using MATLAB. The submit description file is called (for lack of a better name) mandelbrot.condor.

To run this example, make a sub-directory (under your home directory, not /scratch or /tmp!), cd into it and copy in the following four files (preserving the names): mandelbrot.condor, mandelbrot.sh, mandelbrot.m, and mandelplot.m. Make mandelbrot.sh executable (chmod +x mandelbrot.sh). Lastly, run condor_submit mandelbrot.condor from inside this directory. If you now run condor_q, you should see several jobs queued up.

Condor will send you email as the various jobs finish. If you want to check on their progress, you can use the condor_q command. The condor_status command shows the status of the Condor pool, such as how many machines are available and how many are in use. Since these jobs only take a few seconds to run, you can also use tail -f Mandel.log to watch the logfile as the jobs are processed.

You can define new variables in the submit description file. In the Mandelbrot example above, I create several variables which extract various image parameters from the Condor Process variable which, in this example, runs from 0 to 29. In this example, the variables I create are prefixed with "my_".
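To illustrate user-defined variables, the following sketch writes a small submit description file. It is not the actual mandelbrot.condor: the file name vars.condor, the my_index and my_name variables, the frame.* file names and the use of /bin/echo as executable are all assumptions made for this example.

```shell
#!/bin/sh
# Write a hypothetical submit file that derives names from $(Process).
# The heredoc delimiter is quoted so the shell leaves $(...) untouched.
cat > vars.condor <<'EOF'
universe   = vanilla
executable = /bin/echo
# User-defined variables, prefixed with my_ as in the text above
my_index   = $(Process)
my_name    = frame.$(my_index)
arguments  = $(my_name)
output     = $(my_name).out
error      = $(my_name).err
log        = frames.log
# 30 jobs: $(Process) runs from 0 to 29
queue 30
EOF
echo "wrote vars.condor; submit it with: condor_submit vars.condor"
```

Each of the 30 jobs would then get its own output and error file, named after its process number.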

For this example, you can view the results with something like gm display Mandel.{?,??}.jpg (depending on what shell you use).

How to remove (delete) a job from the queue

A job can be removed from the queue at any time by using the condor_rm command. If the job that is being removed is currently running, the job is killed without a checkpoint, and its queue entry is removed. A job can only be removed from the machine from which it was submitted.

Condor also lets each user assign priorities to their submitted jobs. These job priorities are local to each queue and range from -20 to +20, with higher values meaning better priority. The default priority of a job is 0, but it can be changed using the condor_prio command.
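A possible way to try both commands safely is sketched below. The job IDs 123.0 and 123.1 are made up (take real ones from condor_q), and the run helper and the DRY_RUN variable are hypothetical additions for this example: by default the script only prints what it would do.

```shell
#!/bin/sh
# Sketch: raise the priority of one job and remove another.
# Hypothetical job IDs in cluster.process form; replace with IDs from condor_q.
JOB_TO_PROMOTE="123.0"
JOB_TO_REMOVE="123.1"

# Print the command instead of running it unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

# Priorities range from -20 to +20; the default is 0.
run condor_prio -p 10 "$JOB_TO_PROMOTE"
# Removes the job from the queue; a running job is killed.
run condor_rm "$JOB_TO_REMOVE"
```

Run it with DRY_RUN=0 on the machine where the jobs were submitted to actually execute the commands.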

Why does my job not run

If Condor seems unwilling to start a job, you can use condor_q -analyze to see more detail. You can check the number of available machines using condor_status -avail. condor_rm is useful for canceling jobs.

Exceptions and Limitations

Condor cannot run ETH SEPP package programs directly. To see if a program is installed as a SEPP package, enter the command "which <prog_name>":

gfreudig@rista:~$ which matlab
/usr/sepp/bin/matlab
gfreudig@rista:~$ which gcc
/usr/bin/gcc
gfreudig@rista:~$ 

Programs within /usr/sepp/bin must be wrapped with shell scripts and cannot be called directly in the condor submit file.
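Such a wrapper can be very small. The sketch below creates one. The name wrapper.sh and the PROGRAM variable are assumptions for illustration, and the default path is simply the MATLAB location from the output above.

```shell
#!/bin/sh
# Create a hypothetical wrapper script, wrapper.sh, for a SEPP program.
# The quoted heredoc keeps $PROGRAM and "$@" literal in the generated file.
cat > wrapper.sh <<'EOF'
#!/bin/sh
# PROGRAM defaults to the MATLAB path shown above; override it for other tools.
PROGRAM="${PROGRAM:-/usr/sepp/bin/matlab}"
if [ ! -x "$PROGRAM" ]; then
    echo "wrapper: $PROGRAM is not executable on this machine" >&2
    exit 1
fi
# Hand over to the real program, passing the submit file's arguments through.
exec "$PROGRAM" "$@"
EOF
chmod +x wrapper.sh
```

In the submit description file, executable would then point to wrapper.sh, and the arguments line would carry the options for the wrapped program.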

The submit directory has to be accessible from all machines.



Services/Condor (last edited 2013-10-03 08:32:52 by bonaccos)