Introduction

At ITET the Condor Batch Queueing System is used since long time for running compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs could be killed by new interactive load or by shutdown/restart of a PC.

The new Son of Grid Engine on the powerfull ITET arton compute servers is an alternative to the Condor batch computing system and reserved for staff of the contributing institutes. It consists of a master host, where the scheduler resides and the arton execution hosts, where the batch jobs are executed. The execution hosts are powerfull servers, which resides in server rooms and are exclusively reserved for batch processing. Interactive logins are disabled.

Son of Grid Engine (SGE)

SGE is an open source fork of the SUN Grid Engine batch-queuing system, originally developed and supported by Sun Microsystems. With the takeover of Sun Microsystems by Oracle SUN Grid Engine becomes Oracle Grid Engine and is no longer free. The SGE is a robust batch scheduler that can handle large workloads across entire organizations. SGE is designed for the more traditional cluster environment and compute farms, while condor is designed for cycle stealing. SGE has the better scheduling algorithms.

SGE Arton Grid

Hardware

At the moment the computing power of the SGE based Arton Grid is based on the following 11 compute servers (execution hosts) :

Server

CPU

Frequency

Cores

Memory

Operating System

arton01 - 03

Dual Octa-Core Intel Xeon E5-2690

2.90 GHz

16

128 GB

Debian 7 (64 bit)

arton04 - 08

Dual Deca-Core Intel Xeon E5-2690 v2

3.00 GHz

20

128 GB

Debian 7 (64 bit)

arton09 - 10

Dual Deca-Core Intel Xeon E5-2690 v2

3.00 GHz

20

256 GB

Debian 7 (64 bit)

arton11

Dual Deca-Core Intel Xeon E5-2690 v2

3.00 GHz

20

768 GB

Debian 7 (64 bit)


The local disks (/scratch) of arton09, arton10 and arton11 are fast SSD-disks (6 GBit/s) with a size of 720 GByte.

The SGE job scheduler runs on the linux server zaan.

Using SGE

At a basic level, Son of Grid Engine (SGE) is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs on the Grid Engine. The commands that will be most useful to you are as follows

Setting environment

The above commands are only working if the environment variables for the Arton Grid are set. This is done by sourcing one of the following scripts :

> source /home/sgeadmin/ITETCELL/common/settings.sh      # bash shell
> source /home/sgeadmin/ITETCELL/common/settings.csh     # tcsh shell

After sourcing you have the following variables set:

> env | grep SGE              # bash shell
> setenv | grep SGE           # tcsh shell
SGE_CELL=ITETCELL
SGE_EXECD_PORT=6445
SGE_QMASTER_PORT=6444
SGE_ROOT=/usr/pack/sonofge-8.1.6-fg
SGE_CLUSTER_NAME=d-itet

and the SGE directories are added to the PATH and MANPATH variables.

If you're using bash you could define an alias for the sourcing command or put the sourcing command in your .bashrc file.

alias sge='. /home/sgeadmin/ITETCELL/common/settings.sh'
or
source /home/sgeadmin/ITETCELL/common/settings.sh

To submit jobs your computer must be configured as an allowed submit host in the SGE configuration. If you get an error message like

sgeisg1@faktotum:~$ qsub primes.sh
Unable to run job: denied: host "faktotum.ee.ethz.ch" is no submit host.
Exiting.

write an email to support@ee.ethz.ch .

qsub : Submitting a job

Please do not use qsub to submit a binary directly. The qsub command has the following syntax:

> qsub [options] job_script [job_script arguments]

The job_script is a standard UNIX shell script. The fixed options for the Grid Engine Scheduler are placed in the job_script in lines starting with #$. The UNIX shell interpreter read this lines as comment lines and ignores them. Only temporary options should be placed outside the job_script. To test your job-script you can simply run it interactively.

Assume there is a c program primes.c which is compiled to an executable binary named primes with "gcc -o primes primes.c". The program runs 5 seconds and calculates prime numbers. The found prime numbers and a final summary report are written to standard output. A sample job_script primes.sh to perform a batch run of the binary primes on the Arton grid looks like this ( don't forget to change the mail address ! ):

#
#!/bin/bash
#
# primes.sh job-script for qsub
#
# Make sure that the .e (error) and .o (output) file arrive in the
# working directory
#$ -cwd
#
#Merge the standard out and standard error to one file
#$ -j y
#
#
#Show error message if job is not able to run with existing ressource configuration
#$ -w e
#
#   Set mail address and send a mail on job's start, end and abort
#$ -M <your mail-address>
#$ -m bea
#
/bin/echo Running on host: `hostname`
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`
#
# binary to execute
./primes
echo finished at: `date`

Now submit the job:

sgeisg1@rista:~/sge$ qsub primes.sh
Your job 424 ("primes.sh") has been submitted

On success the scheduler shows you the job-ID of your submitted job.

When the job has finished, you find the output file of the job in the submit directory with a name of <job-script name>.o<job-ID>

Like in condor its also possible to start an array job. The job above would run 10 times if you put the option #$ -t 1-10 in the job-script. The repeated execution makes only sense if something is changed in the executed program with the array task count number.The array count number can be referenced through the variable $SGE_TASK_ID. You can pass the value of $SGE_TASK_ID or some derived parameters to the executable. A simple solution to pass an $SGE_TASK_ID dependent input filename would look like this:

.
#$ -t 1-10
# binary to execute
<path-to-executable> data$SGE_TASK_ID.dat

Every run of the program in the array job with a different task-id produces a separate output file.

The following table describes the most common available options for qsub to be placed in the job-script in lines starting with #$ :

option

description

-cwd

execute the job from the current directory and not relative to your home directory

-e <stderr file>

path to the job's stderr output file (relative to home directory or to the current directory if the -cwd switch is used)

-hold_jid <job ids>

do not to start the job until the specified jobs have been finished successfully

-i <stdin file>

path to the job's stdin input file

-j y

merge the job's stderr with its stdout

-l resource=expression

ressource request ( for use with resource h_rt and h_vmem )

-m <b|e|a>

Let Grid Engine send a mail on job's status (b : begin,e : end,a : abort)

-M <mail-address>

mail address for job status mails

-N <jobname>

specifies the job name, default is the name of the submitted scripts

-o <stdout file>

path to the job's stdout output file (relative to home directory or to the current directory if the -cwd switch is used)

-q <queue-name>

execute the job in the specified queue (not necessary for standard jobs)

-S <path to shell>

specifies the shell Grid Engine should start your job with. Default is /bin/zsh

-t <from-to:step>

Submit an array job.The task within this array can be accessed in the job via the environment variable $SGE_TASK_ID.

-tc <max_running_tasks>

limits the number of concurrently running tasks of an array job

-w e

show error message and reject job with invalid requests

-V

inherit the current shell environment to the job

A detailed explanation of all available options is shown by the man-page of qsub.

qstat : Examine the job queue

With the command qstat you get informed about the status of your submitted jobs:

sgeisg1@rista:~/sge$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    425 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:08:32 standard.q@arton01.ee.ethz.ch      1        
    426 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:11:02 standard.q@arton01.ee.ethz.ch      1        
    427 0.00000 aprimes_5. sgeisg1      qw    03/14/2013 16:11:06                                    1 1-5:1

The possible states of a job are:

The output above says that two jobs of me are running and one array job is waiting. The column queue shows the name of the queue and the execution host, where the job is running. If the array job can be executed it is expanded and the output of qstat changes to

sgeisg1@rista:~/sge$ qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
    425 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:08:32 standard.q@arton01.ee.ethz.ch      1        
    426 0.55500 primes.sh  sgeisg1      r     03/14/2013 16:11:02 standard.q@arton01.ee.ethz.ch      1        
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 1
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 2
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 3
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 4
    427 0.55500 aprimes_5. sgeisg1      r     03/14/2013 16:11:17 standard.q@arton01.ee.ethz.ch      1 5

You see that the job-ID of all jobs belonging to the array job is equal, they are distinguished by by the task-ID.

To show the jobs of all users enter the command:

> qstat -u "*"

qdel : Deleting jobs

With qdel you can remove your waiting and running jobs from the scheduler queue. qstat gives you an overview of your jobs with the associated job-IDs . A job can be deleted with

> qdel  <job-ID>

To operate on an array job you can use the following commands

> qdel <job-ID>        # all jobs (waiting or running) of the array job are deleted
> qdel <job-ID>.n      # the job with task-ID n is deleted
> qdel <job-ID>.n1-n2  # the jobs with task-ID in the range n1-n2 are deleted

qhost : Status of execution hosts

The execution host status can be obtained by using the qhost command. An example listing is shown below.

gfreudig@rista:~$ qhost
gfreudig@rista:~/BTCH/Sge/jobs/stress$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
arton01                 lx-amd64       32    2   16   32  2.96  126.1G  806.5M  125.0G     0.0
arton02                 lx-amd64       32    2   16   32  3.20  126.1G    1.0G  125.0G     0.0
arton03                 lx-amd64       32    2   16   32  0.21  126.1G  634.0M  125.0G     0.0
arton04                 lx-amd64       40    2   20   40  0.08  126.1G  694.8M  125.0G     0.0
arton05                 lx-amd64       40    2   20   40  0.02  126.1G    1.6G  125.0G     0.0
arton06                 lx-amd64       40    2   20   40  3.24  126.1G    2.9G  125.0G     0.0
arton07                 lx-amd64       40    2   20   40  2.60  126.1G  588.2M  125.0G     0.0
arton08                 lx-amd64       40    2   20   40  0.17  126.1G  573.4M  125.0G     0.0
gfreudig@rista:~/BTCH/Sge/jobs/stress$ 

The LOAD value is identical to the second of the value triple reported by the uptime command or by the top process monitor. If LOAD is higher than NCPU more processes than available cores are able to run and will probably get CPU time values below 100%.

sgeisg1@arton01:~$ uptime
 16:28:37 up 13 days,  6:47,  1 user,  load average: 15.87, 10.10, 5.29

With qhost -q you get a more detailed status of the execution hosts with the actual slot allocation table of the queues.

gfreudig@rista:~/BTCH/Sge/jobs/stress$ qhost -q
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
arton01                 lx-amd64       32    2   16   32  2.96  126.1G  806.5M  125.0G     0.0
   standard.q           BP    0/5/20        
   long.q               B     0/0/4         
arton02                 lx-amd64       32    2   16   32  3.20  126.1G    1.0G  125.0G     0.0
   standard.q           BP    0/1/20        
   long.q               B     0/0/4         
arton03                 lx-amd64       32    2   16   32  0.20  126.1G  634.1M  125.0G     0.0
   standard.q           BP    0/0/20        
   long.q               B     0/0/4         
arton04                 lx-amd64       40    2   20   40  0.08  126.1G  694.8M  125.0G     0.0
   standard.q           BP    0/0/25        
   long.q               B     0/0/5         
arton05                 lx-amd64       40    2   20   40  0.02  126.1G    1.6G  125.0G     0.0
   standard.q           BP    0/0/25        
   long.q               B     0/0/5         
arton06                 lx-amd64       40    2   20   40  3.24  126.1G    2.9G  125.0G     0.0
   standard.q           BP    0/1/25        
   long.q               B     0/0/5         
arton07                 lx-amd64       40    2   20   40  2.60  126.1G  588.2M  125.0G     0.0
   standard.q           BP    0/4/25        
   long.q               B     0/0/5         
arton08                 lx-amd64       40    2   20   40  0.17  126.1G  573.4M  125.0G     0.0
   standard.q           BP    0/0/25        
   long.q               B     0/0/5         
gfreudig@rista:~/BTCH/Sge/jobs/stress$ 

Now its time to talk about the different job queues seen in the output above.

Queue design


The Arton Grid has 4 different job queues :

standard.q

Default queue for jobs with standard ressource requests

array.q

Default queue for array jobs with standard ressource requests instantiated only on arton01-arton08

long.q

Queue for jobs requiring more than 24h wallclock time ( max 96h )

hmem.q

Queue for jobs requiring more than 8Gbyte of virtual memory


The following table shows the configured ressource limits of wallclock time (h_rt) and virtual memory (h_vmem) of the queue instances on the different compute host groups :

Servers

standard.q

array.q

long.q

hmem.q

arton01 - 03

slots=16,h_rt=24h,h_vmem=8G

slots=16,h_rt=24h,h_vmem=8G

slots=8,h_rt=96h,h_vmem=8G

slots=4,h_rt=24h,h_vmem=64G

arton04 - 08

slots=20,h_rt=24h,h_vmem=8G

slots=20,h_rt=24h,h_vmem=8G

slots=10,h_rt=96h,h_vmem=8G

slots=5,h_rt=24h,h_vmem=64G

arton09 - 10

slots=20,h_rt=24h,h_vmem=16G

-

slots=10,h_rt=96h,h_vmem=16G

slots=10,h_rt=24h,h_vmem=128G

arton11

slots=20,h_rt=24h,h_vmem=48G

-

slots=8,h_rt=96h,h_vmem=128G

slots=10,h_rt=24h,h_vmem=768G



The slots parameter of a queue is the maximum number of jobs the queue accepts. The total number of busy slots (= running jobs) of all queue instances on an execution host can not exceed the number of physical cores on that host.

The wallclock time h_rt is the time difference between actual time and the start time of the job. If you specify no additional resource requests in your job-script or in the qsub command the job will be placed in the standard.q with a default of 1 slot and a virtual memory limit of 8 GByte. To run multithreaded (parallel) programs which could use more cores on an compute server see "Job to core binding".

Jobs who are reaching the wallclock time limit of an execution queue are killed by the grid engine scheduler and you will get a job aborted message.

To place a job in the long.q please enter a qsub with an resource request of h_rt > 24h :

> qsub -l h_rt=50:00:00 <job-script>

The h_rt value must be entered in the format <hours>:<minutes>:<seconds>.

To run a job who needs more than 8 GByte (.p.e. 50 GByte) virtual memory enter :

< qsub -l h_vmem=50G <job-script>

Please specify the requested ressource h_vmem not far away from the really needed value. The grid engine does not track your real use and subtracts your requested value from the total available virtual memory on the execution hosts and therefore the unused virtual memory space is not available to other jobs. To determine the needed virtual memory of your program run it interactively for a short time or have a look at the job summary in the "Job completed" mail of SGE.

To request more than the default for both ressources enter :

qsub -l v_mem=50G -l h_rt=50:00:00 <job-script>

/!\ The total number of available slots for your job is significantly lower with ressource requests over the default (h_rt=24:00:00, h_vmem=8G).

Multicore Jobs / Job to core binding

The modern linux kernels are able to bind a process and all its childs to a fixed number of cores. By default a job submitted to the grid engine is bound to 1 core. For the correct handling of multithreaded jobs the Grid Engine has a parallel environment (pe) named "multicore". To run jobs on multiple cores of an execution host the job must be submitted with a special pe option:

> qsub -pe multicore <n> <job_script>

n is the number of requested cores. This information allows the scheduler of the grid engine to allocate the correct number of slots for the job and also to bind the job to the requested number of cores. Two things you should avoid:

In the first case the multithreading occurs only inside one core (100% cpu), if you have 10 threads every thread could reach a maximum of 10% cpu time. The second case leads to an underused state of the computing capacity of the grid engine. If you request more than 16 cores the request is reduced to 16 cores.

/!\ The request of the parallel environment multicore is not possible for an array job.

Job input/output data storage

Temporary data storage of a job, which is only used while the job is running, should be placed in the /scratch directory of the execution host. The environment variables of the tools should be set accordingly. The Matlab MCR_ROOT_CACHE variable is set automatically by the sge scheduler.
The file system protection of the /scratch directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the /scratch directory from getting full and cleans it according to given policies. Therefore data you put in the /scratch directory of an execution host is not safe over time.

Small sized input and output data for the jobs is best placed in your home directory. It is available on every execution host through the /home automounter.

If you have problems with the quota limit in your home directory you could transfer data from your home or the /scratch directory of your submit host to the /scratch directories of the arton execution hosts and vice versa. To do this you are allowed to login interactively on arton01 with your personal account. All /scratch directories of the execution hosts are available on arton01 with the /scratch_net automount system. You can access the /scratch directory of arton<nn> under /scratch_net/arton<nn>. So you are able to transfer data between the /scratch_net directories and your home with normal linux file copy and to the scratch of your submission host with scp.

Please do not use the possible login on arton01 to run compute jobs interactively. Our procguard system will detect you. Other data storage concepts for the arton grid are possible and will be investigated, if the above solution proves not to be sufficient.

Matlab on SGE

Mandelbrot sample array job

This sample array job is the SGE version of the mandelbrot example in the condor service description. In contrast to the condor version the task-ID dependent parameter calculations to get different fractal images is done in the matlab file "mandelplot.m". To get run this sample job download the 3 files mandelplot.m, mandelbrot.m and mandelbrot.sge to a directory under your home. To submit the job enter

> qsub mandelbrot.sge

With qstat you can track the execution of your job. If the output of the running jobs by qstat disappears the job has completed. Now you find 10 jpeg-files and 10 job output-files in the submit directory. The last line in the job-script file

/usr/sepp/bin/matlab -nojvm -nodisplay -nodesktop -nosplash -r "mandelplot($SGE_TASK_ID, 10,'mandel_$SGE_TASK_ID.jpg');exit"

shows how the array job variable SGE_TASK_ID is used to call matlab for executing the command mandelplot. The task-ID itself and an output file name depending on the task-ID are passed to the mandelplot function.

References


CategoryBTCH

Services/SGE (last edited 2016-07-20 10:40:48 by gfreudig)