Introduction

At ITET the Condor Batch Queueing System has been used for a long time to run compute-intensive jobs. It uses the free resources on the tardis PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs can be killed by new interactive load or by a shutdown/restart of a PC.

The newly installed SUN Grid Engine is an alternative to Condor batch computing and is reserved for staff of the contributing institutes. It consists of a master host, where the scheduler resides, and 3 execution hosts, where the batch jobs are running. The execution hosts are powerful servers that reside in server rooms and are exclusively reserved for batch processing. Interactive logins are disabled.

The SUN Grid Engine (SGE)

SGE is an open source batch-queueing system, originally developed and supported by Sun Microsystems. The newest version is named Oracle Grid Engine and is no longer free, so we use the last free version from SUN. A future switch to an open source fork is to be expected. SGE is a robust batch scheduler that can handle large workloads across entire organizations. SGE is designed for the more traditional cluster environments and compute farms, while Condor is designed for cycle stealing. SGE has the better scheduling algorithms.

SGE Arton Grid

Hardware

At the moment the computing power of the new SGE-based Arton Grid is provided by the following compute servers:

Server    CPU                                 Frequency   Cores   Memory   Operating System
arton01   Dual Octa-Core Intel Xeon E5-2690   2.90 GHz    16      128 GB   Debian6 (64 bit)
arton02   Dual Octa-Core Intel Xeon E5-2690   2.90 GHz    16      128 GB   Debian6 (64 bit)
arton03   Dual Octa-Core Intel Xeon E5-2690   2.90 GHz    16      128 GB   Debian6 (64 bit)

The scheduler resides on the server zaan.

Using SGE

At a basic level, Sun Grid Engine (SGE) is very easy to use. The following sections describe the commands you need to submit simple jobs to the Grid Engine. The commands that will be most useful to you are as follows:

  • qsub - submit a job to the batch scheduler
  • qstat - examine the job queue
  • qhost - show the status of the execution hosts
  • qdel - delete a job from the queue

Setting the environment

The above commands only work if the environment variables for the Arton Grid are set. This is done by sourcing one of the following scripts:

> source /home/sgeadmin/ITETCELL/common/settings.sh      # bash shell
> source /home/sgeadmin/ITETCELL/common/settings.csh     # tcsh shell

After sourcing you have the following variables set:

SGE_CELL=ITETCELL
SGE_EXECD_PORT=6445
SGE_QMASTER_PORT=6444
SGE_ROOT=/usr/pack/sge6-2u6-bf
SGE_CLUSTER_NAME=d-itet

and the SGE directories are added to the PATH and MANPATH variables.
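
As a quick check that the environment is set, you can list the SGE variables in your current shell (just one of several ways to verify this):

> env | grep SGE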

If you're using bash you could define an alias for the sourcing command or put the sourcing command in your .bashrc file.

alias sge='. /home/sgeadmin/ITETCELL/common/settings.sh'
or
source /home/sgeadmin/ITETCELL/common/settings.sh

To submit jobs your computer must be configured as an allowed submit host in the SGE configuration. If you get an error message like

sgeisg1@faktotum:~$ qsub primes.sh
Unable to run job: denied: host "faktotum.ee.ethz.ch" is no submit host.
Exiting.

write an email to support@ee.ethz.ch.

qsub : Submitting a job

Please do not use qsub to submit a binary directly. The qsub command has the following syntax:

> qsub [options] job_script [job_script arguments]

Fixed options should be placed in the job-script on lines starting with #$; only temporary options should be given on the qsub command line.

Assume there is a C program primes.c which is compiled to an executable named primes with "gcc -o primes primes.c". A simple job-script primes.sh to run primes on the Arton Grid looks like this:

#
# primes.sh job-script for qsub
#
# Set shell, otherwise the default shell would be used
#$ -S /bin/sh
#
# Make sure that the .e (error) and .o (output) file arrive in the
# working directory
#$ -cwd
#
#Merge the standard out and standard error to one file
#$ -j y
#
#   Set mail address and send a mail on job's start and end
#$ -M <your mail-address>
#$ -m be
#
/bin/echo Running on host: `hostname`
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`
#
# binary to execute
./primes
echo finished at: `date`

Now submit the job:

sgeisg1@rista:~/sge$ qsub primes.sh
Your job 424 ("primes.sh") has been submitted

On success the scheduler shows you the job-ID of your submitted job.

When the job has finished, you find the output file of the job in the submit directory with a name of the form <job-script name>.o<job-ID> (for the job above: primes.sh.o424).

As in Condor, it is also possible to start an array job. The job above would run 10 times if you put the option #$ -t 1-10 in the job-script. The repeated execution only makes sense if the behaviour of the executed program changes with the array task number. The task number can be referenced through the variable SGE_TASK_ID. You can do some calculations with SGE_TASK_ID in the job-script and pass SGE_TASK_ID-dependent parameters, or SGE_TASK_ID itself, to the executable. A simple solution, where the called program uses different parameter sets according to the passed integer, would look like this:

.
#$ -t 1-10
# binary to execute
./<executable> $SGE_TASK_ID
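
If the executable expects real parameters rather than the plain task number, the job-script itself can map SGE_TASK_ID to a parameter set. The following is only a sketch; the executable and the parameter values are placeholders:

#$ -t 1-3
# choose a parameter set based on the task number (placeholder values)
case $SGE_TASK_ID in
  1) PARAMS="--size 100" ;;
  2) PARAMS="--size 200" ;;
  3) PARAMS="--size 400" ;;
esac
# binary to execute with the selected parameters
./<executable> $PARAMS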

The following table describes the most common options for qsub:

option                 description
-cwd                   execute the job from the current directory and not relative to your home directory
-e <stderr file>       path to the job's stderr output file (relative to the home directory, or to the current directory if the -cwd switch is used)
-hold_jid <job ids>    do not start the job until the specified jobs have finished successfully
-i <stdin file>        path to the job's stdin input file
-j y                   merge the job's stderr with its stdout
-m <b|e|a>             let Grid Engine send a mail on the job's status (b: begin, e: end, a: abort)
-M <mail-address>      mail address for job status mails
-N <jobname>           specify the job name; default is the name of the submitted script
-o <stdout file>       path to the job's stdout output file (relative to the home directory, or to the current directory if the -cwd switch is used)
-q <queue-name>        execute the job in the specified queue (not necessary for standard jobs)
-S <path to shell>     specify the shell Grid Engine should start your job with; default is /bin/zsh
-t <from-to:step>      submit an array job; the task number can be accessed in the job via the environment variable SGE_TASK_ID
-V                     pass the current shell environment to the job

A detailed explanation of all available options can be found on the qsub man page.
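
As an illustration, these options can also be given on the qsub command line instead of inside the job-script; the job name below is just an example, and 424 is the job-ID from the submission shown earlier:

> qsub -N primes2 -j y -m be -M <your mail-address> primes.sh
> qsub -hold_jid 424 primes.sh     # start only after job 424 has finished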

qstat : Examine the job queue

The qstat command informs you about the status of your submitted jobs:

sgeisg1@rista:~/sge$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    425 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:08:32 standard.q@arton01.ee.ethz.ch      1        
    426 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:11:02 standard.q@arton01.ee.ethz.ch      1        
    427 0.00000 aprimes_5. sgeisg1      qw    03/14/2013 16:11:06                                    1 1-5:1

The possible states of a job are:

  • r - running
  • qw - queue wait
  • Eqw - error setting the job running (for example if the specified output directory doesn't exist); see the qstat -j hint after this list
  • t - transfer to execution host ( only a short time )
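
If a job hangs in the Eqw state, qstat can display detailed information about it, including the reason for the error:

> qstat -j <job-ID>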

The output above shows that two of my jobs are running and one array job is waiting. The column queue shows the name of the queue and the execution host where the job is running. When the array job can be executed it is expanded and the output of qstat changes to

sgeisg1@rista:~/sge$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    425 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:08:32 standard.q@arton01.ee.ethz.ch      1        
    426 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:11:02 standard.q@arton01.ee.ethz.ch      1        
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 1
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 2
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 3
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 4
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 5

You see that the job-ID is the same for all jobs belonging to the array job; they are distinguished by the task-ID.

To show the jobs of all users enter the command:

> qstat -u "*"

qdel : Deleting jobs

With qdel you can remove your waiting and running jobs from the scheduler queue. qstat gives you an overview of your jobs with the associated job-IDs. A job can be deleted with

> qdel  <job-ID>

To operate on an array job you can use the following commands

> qdel <job-ID>        # all jobs (waiting or running) of the array job are deleted
> qdel <job-ID>.n      # the job with task-ID n is deleted
> qdel <job-ID>.n1-n2  # the jobs with task-ID in the range n1-n2 are deleted
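
For example, to delete only task 3 of the array job 427 shown in the qstat output above, while leaving the other tasks untouched:

> qdel 427.3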

qhost : Status of execution hosts

The execution host status can be obtained by using the qhost command. An example listing is shown below.

sgeisg1@rista:~/sge$ qhost
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
arton01                 lx24-amd64     16 10.10  125.9G  406.9M  125.0G     0.0
arton02                 lx24-amd64     16  8.90  125.9G  457.8M  125.0G     0.0
arton03                 lx24-amd64     16  0.12  125.9G  392.0M  125.0G     0.0

The LOAD value is identical to the second value of the load-average triple reported by the uptime command or by the top process monitor. If LOAD is higher than NCPU, more processes are able to run than there are available cores, and they will probably get CPU time values below 100%.

sgeisg1@arton01:~$ uptime
 16:28:37 up 13 days,  6:47,  1 user,  load average: 15.87, 10.10, 5.29

With qhost -q you get a more detailed status of the execution hosts.

sgeisg1@rista:~/sge$ qhost -q
HOSTNAME                ARCH         NCPU  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
-------------------------------------------------------------------------------
global                  -               -     -       -       -       -       -
arton01                 lx24-amd64     16  5.58  125.9G  409.5M  125.0G     0.0
   multicore.q          BP    0/0/4         
   standard.q           BP    0/16/16       
   standard_l.q         BP    0/0/4         
arton02                 lx24-amd64     16  5.68  125.9G  461.4M  125.0G     0.0
   multicore.q          BP    0/0/4         
   standard.q           BP    0/13/16       
   standard_l.q         BP    0/0/4         
arton03                 lx24-amd64     16  0.13  125.9G  401.1M  125.0G     0.0
   multicore.q          BP    0/0/4         
   standard.q           BP    0/0/16        

Now it's time to talk about the different queues seen in the output above.

Queue design

With its small number of identical execution hosts, each with 128 GB of memory, the Arton Grid starts with a simple queue design. The following table shows the characteristics of the 3 available queues:

Queue          wall clock time   total slots   fill order
standard.q     24h               48            arton01, arton02, arton03
multicore.q    24h               12            arton03, arton02, arton01
standard_l.q   96h               8             arton01, arton02

The parameters have the following meanings:

wall clock time : maximum time a job may stay in the running state
total slots : maximum number of jobs in the queue across all execution hosts
fill order : the order in which the scheduler fills up the queues

If you are dealing with normal sequential jobs and wall clock times <24h you should submit your jobs without specifying the execution queue.

If you have jobs with a wall clock time >24h and <96h you should place them in the standard_l.q with the option "-q standard_l.q".

The multicore.q only exists to have a queue with a different fill order than the standard.q. The jobs/slots in the queue standard.q are filled up in the sequence arton01 -> arton02 -> arton03, while in the queue multicore.q the fill order is arton03 -> arton02 -> arton01. With this strategy the multicore.q is the better place to run multithreaded jobs, which can use more cores on an execution host. Free slots in the standard.q concentrate on the right side of the execution host list arton01 -> arton02 -> arton03, so multithreaded jobs placed on arton03 first find the most free cores. It is therefore always good to submit a multithreaded job to the queue multicore.q with the option "-q multicore.q", as in the sketch below.
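
A minimal job-script sketch for such a multithreaded job; the executable name is a placeholder and the remaining options are taken from the primes.sh example above:

#
# multi.sh job-script for a multithreaded job (sketch)
#$ -S /bin/sh
#$ -cwd
#$ -j y
# run in the multicore queue
#$ -q multicore.q
./<multithreaded executable>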

Matlab on SGE

Mandelbrot sample array job

This sample array job is the SGE version of the mandelbrot example in the Condor service description. In contrast to the Condor version, the task-ID dependent parameter calculations to get different fractal images are done in the Matlab file "mandelplot.m". To run this sample job, download the 3 files mandelplot.m, mandelbrot.m and mandelbrot.sge to a directory under your home directory. To submit the job enter

> qsub mandelbrot.sge

With qstat you can track the execution of your job. When your job no longer appears in the qstat output, it has completed. You will then find 10 jpeg files and 10 job output files in the submit directory. The last line in the job-script file

/usr/sepp/bin/matlab -nojvm -nodisplay -nodesktop -nosplash -r "mandelplot($SGE_TASK_ID, 10,'mandel_$SGE_TASK_ID.jpg');exit"

shows how the array-job variable SGE_TASK_ID is used when calling Matlab to execute the mandelplot command. The task-ID itself and an output file name depending on the task-ID are passed to the mandelplot function.

References