Introduction
At ITET the Condor Batch Queueing System has been used for a long time to run compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs may be killed by new interactive load or by a shutdown/restart of a PC.
The new Son of Grid Engine on the powerful ITET arton compute servers is an alternative to the Condor batch computing system and is reserved for staff of the contributing institutes. It consists of a master host, where the scheduler resides, and the arton execution hosts, where the batch jobs are executed. The execution hosts are powerful servers which reside in server rooms and are reserved exclusively for batch processing. Interactive logins are disabled.
Son of Grid Engine (SGE)
SGE is an open source fork of the SUN Grid Engine batch-queuing system, originally developed and supported by Sun Microsystems. With Oracle's takeover of Sun Microsystems, SUN Grid Engine became Oracle Grid Engine and is no longer free. SGE is a robust batch scheduler that can handle large workloads across entire organizations. SGE is designed for the more traditional cluster environment and compute farms, while Condor is designed for cycle stealing. SGE has the better scheduling algorithms.
SGE Arton Grid
Hardware
At the moment the computing power of the SGE-based Arton Grid is provided by the following 11 compute servers (execution hosts):
||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''Operating System'''||
||arton01 - 03||Dual Octa-Core Intel Xeon E5-2690||2.90 GHz||16||128 GB||Debian 7 (64 bit)||
||arton04 - 08||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||128 GB||Debian 7 (64 bit)||
||arton09 - 10||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||256 GB||Debian 7 (64 bit)||
||arton11||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||768 GB||Debian 7 (64 bit)||
The compute servers are controlled by the SGE job scheduler running on the Linux server zaan.
Using SGE
At a basic level, Son of Grid Engine (SGE) is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs on the Grid Engine. The commands that will be most useful to you are as follows
- qsub - submit a job to the batch scheduler
- qstat - examine the job queue
- qhost - show the status of the execution hosts
- qdel - delete a job from the queue
Setting environment
The above commands only work if the environment variables for the Arton Grid are set. This is done by sourcing one of the following scripts:
{{{
> source /home/sgeadmin/ITETCELL/common/settings.sh    # bash shell
> source /home/sgeadmin/ITETCELL/common/settings.csh   # tcsh shell
}}}
After sourcing you have the following variables set:
{{{
> env | grep SGE       # bash shell
> setenv | grep SGE    # tcsh shell
SGE_CELL=ITETCELL
SGE_EXECD_PORT=6445
SGE_QMASTER_PORT=6444
SGE_ROOT=/usr/pack/sonofge-8.1.6-fg
SGE_CLUSTER_NAME=d-itet
}}}
and the SGE directories are added to the PATH and MANPATH variables.
If you're using bash you could define an alias for the sourcing command or put the sourcing command in your .bashrc file.
{{{
alias sge='. /home/sgeadmin/ITETCELL/common/settings.sh'
}}}
or
{{{
source /home/sgeadmin/ITETCELL/common/settings.sh
}}}
To submit jobs your computer must be configured as an allowed submit host in the SGE configuration. If you get an error message like
{{{
sgeisg1@faktotum:~$ qsub primes.sh
Unable to run job: denied: host "faktotum.ee.ethz.ch" is no submit host.
Exiting.
}}}
write an email to support@ee.ethz.ch.
qsub : Submitting a job
Please do not use qsub to submit a binary directly. The qsub command has the following syntax:
> qsub [options] job_script [job_script arguments]
The job_script is a standard UNIX shell script. The fixed options for the Grid Engine scheduler are placed in the job_script in lines starting with #$. The UNIX shell interpreter reads these lines as comments and ignores them. Only temporary options should be placed outside the job_script. To test your job-script you can simply run it interactively.
Assume there is a C program primes.c which is compiled to an executable named primes with "gcc -o primes primes.c". The program runs 5 seconds and calculates prime numbers. The prime numbers found and a final summary report are written to standard output. A sample job_script primes.sh to perform a batch run of primes on the Arton grid looks like this (don't forget to change the mail address!):
{{{
#!/bin/bash
#
# primes.sh job-script for qsub
#
# Make sure that the .e (error) and .o (output) file arrive in the
# working directory
#$ -cwd
#
# Merge the standard out and standard error to one file
#$ -j y
#
# Set mail address and send a mail on job's start and end
#$ -M <your mail-address>
#$ -m be
#
/bin/echo Running on host: `hostname`
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`
#
# binary to execute
./primes
echo finished at: `date`
}}}
Now submit the job:
{{{
sgeisg1@rista:~/sge$ qsub primes.sh
Your job 424 ("primes.sh") has been submitted
}}}
On success the scheduler shows you the job-ID of your submitted job.
When the job has finished, you find the output file of the job in the submit directory with the name <job-script name>.o<job-ID> (here: primes.sh.o424).
As in Condor, it is also possible to start an array job. The job above would run 10 times if you put the option #$ -t 1-10 in the job-script. The repeated execution only makes sense if the executed program behaves differently depending on the array task number. The task number can be referenced through the environment variable $SGE_TASK_ID. You can pass the value of $SGE_TASK_ID or parameters derived from it to the executable. A simple solution to pass an $SGE_TASK_ID dependent input filename would look like this:
{{{
.
#$ -t 1-10
# binary to execute
<path-to-executable> data$SGE_TASK_ID.dat
}}}
Every run of the program in the array job with a different task-id produces a separate output file.
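If the program needs more than the raw task number, parameters can be derived from $SGE_TASK_ID inside the job-script. A minimal sketch, assuming a hypothetical executable myprog and an invented parameter scheme:
{{{
#!/bin/bash
#$ -cwd
#$ -j y
#$ -t 1-10
#
# derive job parameters from the task number (the scheme is just an example)
STEPS=$(( SGE_TASK_ID * 100 ))
SEED=$(( 42 + SGE_TASK_ID ))
#
# pass the derived parameters to the (placeholder) executable
./myprog --steps $STEPS --seed $SEED
}}}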
The following table describes the most common options for qsub, to be placed in the job-script in lines starting with #$ :
||'''option'''||'''description'''||
||-cwd||execute the job from the current directory and not relative to your home directory||
||-e <stderr file>||path to the job's stderr output file (relative to home directory or to the current directory if the -cwd switch is used)||
||-hold_jid <job ids>||do not start the job until the specified jobs have finished successfully||
||-i <stdin file>||path to the job's stdin input file||
||-j y||merge the job's stderr with its stdout||
||-m <b|e|a>||let Grid Engine send a mail on job's status (b : begin, e : end, a : abort)||
||-M <mail-address>||mail address for job status mails||
||-N <jobname>||specifies the job name, default is the name of the submitted script||
||-o <stdout file>||path to the job's stdout output file (relative to home directory or to the current directory if the -cwd switch is used)||
||-q <queue-name>||execute the job in the specified queue (not necessary for standard jobs)||
||-S <path to shell>||specifies the shell Grid Engine should start your job with. Default is /bin/zsh||
||-t <from-to:step>||submit an array job. The task number within the array can be accessed in the job via the environment variable $SGE_TASK_ID||
||-tc <max_running_tasks>||limits the number of concurrently running tasks of an array job||
||-V||inherit the current shell environment to the job||
A detailed explanation of all available options is given in the man-page of qsub. Several options can also be combined, as in the job chain sketched below.
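For example, -N and -hold_jid together allow a simple job chain, where a second job only starts after a first job has finished successfully. A minimal sketch; compute.sh and collect.sh are hypothetical job-scripts:
{{{
> qsub -N compute compute.sh
> qsub -N collect -hold_jid compute collect.sh
}}}
Here the second job stays in the hqw state until the job named compute has terminated.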
qstat : Examine the job queue
With the command qstat you can check the status of your submitted jobs:
{{{
sgeisg1@rista:~/sge$ qstat
job-ID  prior    name        user     state  submit/start at      queue                          slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   425  0.55500  primes.sh   sgeisg1  r      03/14/2013 16:08:32  standard.q@arton01.ee.ethz.ch      1
   426  0.55500  primes.sh   sgeisg1  r      03/14/2013 16:11:02  standard.q@arton01.ee.ethz.ch      1
   427  0.00000  aprimes_5.  sgeisg1  qw     03/14/2013 16:11:06                                     1  1-5:1
}}}
The possible states of a job are:
- r - running
- qw - queue wait
- hqw - queue wait held state (job manually set in hold state or waiting for termination of a dependent job set with the -hold_jid option)
- Eqw - error to set the job running (for example if the specified output directory doesn't exist)
- t - transfer to execution host (only for a short time)
The output above shows that two of my jobs are running and one array job is waiting. The column queue shows the name of the queue and the execution host where the job is running. If the array job can be executed, it is expanded and the output of qstat changes to
{{{
sgeisg1@rista:~/sge$ qstat
job-ID  prior    name        user     state  submit/start at      queue                          slots  ja-task-ID
-----------------------------------------------------------------------------------------------------------------
   425  0.55500  primes.sh   sgeisg1  r      03/14/2013 16:08:32  standard.q@arton01.ee.ethz.ch      1
   426  0.55500  primes.sh   sgeisg1  r      03/14/2013 16:11:02  standard.q@arton01.ee.ethz.ch      1
   427  0.55500  aprimes_5.  sgeisg1  r      03/14/2013 16:11:17  standard.q@arton01.ee.ethz.ch      1  1
   427  0.55500  aprimes_5.  sgeisg1  r      03/14/2013 16:11:17  standard.q@arton01.ee.ethz.ch      1  2
   427  0.55500  aprimes_5.  sgeisg1  r      03/14/2013 16:11:17  standard.q@arton01.ee.ethz.ch      1  3
   427  0.55500  aprimes_5.  sgeisg1  r      03/14/2013 16:11:17  standard.q@arton01.ee.ethz.ch      1  4
   427  0.55500  aprimes_5.  sgeisg1  r      03/14/2013 16:11:17  standard.q@arton01.ee.ethz.ch      1  5
}}}
You see that the job-ID of all jobs belonging to the array job is the same; they are distinguished by the task-ID.
To show the jobs of all users enter the command:
> qstat -u "*"
qdel : Deleting jobs
With qdel you can remove your waiting and running jobs from the scheduler queue. qstat gives you an overview of your jobs with the associated job-IDs. A job can be deleted with
> qdel <job-ID>
To operate on an array job you can use the following commands
{{{
> qdel <job-ID>          # all jobs (waiting or running) of the array job are deleted
> qdel <job-ID>.n        # the job with task-ID n is deleted
> qdel <job-ID>.n1-n2    # the jobs with task-ID in the range n1-n2 are deleted
}}}
qhost : Status of execution hosts
The execution host status can be obtained by using the qhost command. An example listing is shown below.
{{{
gfreudig@rista:~$ qhost
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
arton01                 lx-amd64       32    2   16   32  2.96  126.1G  806.5M  125.0G     0.0
arton02                 lx-amd64       32    2   16   32  3.20  126.1G    1.0G  125.0G     0.0
arton03                 lx-amd64       32    2   16   32  0.21  126.1G  634.0M  125.0G     0.0
arton04                 lx-amd64       40    2   20   40  0.08  126.1G  694.8M  125.0G     0.0
arton05                 lx-amd64       40    2   20   40  0.02  126.1G    1.6G  125.0G     0.0
arton06                 lx-amd64       40    2   20   40  3.24  126.1G    2.9G  125.0G     0.0
arton07                 lx-amd64       40    2   20   40  2.60  126.1G  588.2M  125.0G     0.0
arton08                 lx-amd64       40    2   20   40  0.17  126.1G  573.4M  125.0G     0.0
}}}
The LOAD value is identical to the second value of the load average triple reported by the uptime command or by the top process monitor. If LOAD is higher than NCPU, more processes than available cores want to run and they will probably get CPU time values below 100%.
{{{
sgeisg1@arton01:~$ uptime
 16:28:37 up 13 days, 6:47, 1 user, load average: 15.87, 10.10, 5.29
}}}
With qhost -q you get a more detailed status of the execution hosts, including the current slot allocation of the queues.
{{{
gfreudig@rista:~/BTCH/Sge/jobs/stress$ qhost -q
HOSTNAME                ARCH         NCPU NSOC NCOR NTHR  LOAD  MEMTOT  MEMUSE  SWAPTO  SWAPUS
----------------------------------------------------------------------------------------------
global                  -               -    -    -    -     -       -       -       -       -
arton01                 lx-amd64       32    2   16   32  2.96  126.1G  806.5M  125.0G     0.0
   standard.q           BP    0/5/20
   long.q               B     0/0/4
arton02                 lx-amd64       32    2   16   32  3.20  126.1G    1.0G  125.0G     0.0
   standard.q           BP    0/1/20
   long.q               B     0/0/4
arton03                 lx-amd64       32    2   16   32  0.20  126.1G  634.1M  125.0G     0.0
   standard.q           BP    0/0/20
   long.q               B     0/0/4
arton04                 lx-amd64       40    2   20   40  0.08  126.1G  694.8M  125.0G     0.0
   standard.q           BP    0/0/25
   long.q               B     0/0/5
arton05                 lx-amd64       40    2   20   40  0.02  126.1G    1.6G  125.0G     0.0
   standard.q           BP    0/0/25
   long.q               B     0/0/5
arton06                 lx-amd64       40    2   20   40  3.24  126.1G    2.9G  125.0G     0.0
   standard.q           BP    0/1/25
   long.q               B     0/0/5
arton07                 lx-amd64       40    2   20   40  2.60  126.1G  588.2M  125.0G     0.0
   standard.q           BP    0/4/25
   long.q               B     0/0/5
arton08                 lx-amd64       40    2   20   40  0.17  126.1G  573.4M  125.0G     0.0
   standard.q           BP    0/0/25
   long.q               B     0/0/5
}}}
Now it's time to talk about the different job queues seen in the output above.
Queue design
The Arton Grid has 3 different job queues:
||standard.q||Default queue for all jobs with standard resource requests||
||long.q||Queue for jobs requiring more than 24h wallclock time (max 96h)||
||hmem.q||Queue for jobs requiring more than 8 GByte of virtual memory||
The following table shows the configured resource limits of wallclock time (h_rt) and virtual memory (h_vmem) of the queue instances on the different compute host groups:
||'''Servers'''||'''standard.q'''||'''long.q'''||'''hmem.q'''||
||arton01 - 03||slots=16, h_rt=24h, h_vmem=8G||slots=8, h_rt=96h, h_vmem=8G||slots=4, h_rt=24h, h_vmem=64G||
||arton04 - 08||slots=20, h_rt=24h, h_vmem=8G||slots=10, h_rt=24h, h_vmem=8G||slots=5, h_rt=24h, h_vmem=64G||
||arton09 - 10||slots=20, h_rt=24h, h_vmem=16G||slots=10, h_rt=24h, h_vmem=16G||slots=10, h_rt=24h, h_vmem=128G||
||arton11||slots=10, h_rt=24h, h_vmem=48G||no instance||slots=10, h_rt=24h, h_vmem=768G||
The parameters have the following meanings:
h_rt (wall clock time) : maximal time for a job to be in the running state
h_vmem : maximal virtual memory a job may use
slots : maximal number of concurrently running jobs in the queue instance of an execution host
If you are dealing with normal sequential jobs and wall clock times <24h you should submit your jobs without specifying the execution queue.
If you have jobs with a wall clock time >24h and <96h you should place them in the long.q with the option "-q long.q".
Jobs which reach the wall clock time limit of an execution queue are killed by the grid engine scheduler. A job-script sketch requesting the long queue follows below.
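For illustration, a job with an expected runtime between 24h and 96h could request the long queue directly in its job-script; the executable name longrun is a placeholder:
{{{
#!/bin/bash
#$ -cwd
#$ -j y
# request the long queue (wall clock time limit 96h on arton01 - 03)
#$ -q long.q
#
./longrun
}}}
Alternatively the queue can be requested on the command line with "qsub -q long.q <job_script>".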
Job to core binding
Recent Linux kernels are able to bind processes to a fixed set of cores. By default a job submitted to the grid engine is bound to 1 core. Since the slot counts per queue are only slightly above the physical number of cores of an execution host, it's not possible to dramatically overload an execution host. For the correct handling of multithreaded jobs the Grid Engine has a parallel environment (pe) named "multicore". To run jobs on multiple cores of an execution host, the job must be submitted with a special pe option:
> qsub -pe multicore <n> <job_script>
n is the number of requested cores. This information allows the scheduler of the grid engine to allocate the correct number of slots for the job and also to bind the job to the requested number of cores. Two things you should avoid:
- Submitting multithreaded jobs without the -pe option
- Submitting jobs with the -pe option which use the requested cores very inefficiently
In the first case the multithreading occurs only inside one core (100% CPU in total); if you have 10 threads, every thread could reach a maximum of 10% CPU time. The second case leads to underuse of the computing capacity of the grid engine. If you request more than 12 cores, the request is reduced to 12 cores. A job-script sketch for a multithreaded job follows below.
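As a sketch, a job-script for a multithreaded program (here assumed to use OpenMP; the executable mt_prog is a placeholder) could look like this. The grid engine exports the granted slot count in the $NSLOTS variable:
{{{
#!/bin/bash
#$ -cwd
#$ -j y
# request 8 cores in the multicore parallel environment
#$ -pe multicore 8
#
# let the OpenMP runtime use exactly the granted number of cores
export OMP_NUM_THREADS=$NSLOTS
./mt_prog
}}}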
Job input/output data storage
Temporary data of a job, which is only needed while the job is running, should be placed in the /scratch directory of the execution host. The environment variables of the tools should be set accordingly. The Matlab MCR_ROOT_CACHE variable is set automatically by the sge scheduler.
The file system protection of the /scratch directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the /scratch directory from getting full and cleans it according to given policies. Therefore data you put in the /scratch directory of an execution host is not safe over time.
Small sized input and output data for the jobs is best placed in your home directory. It is available on every execution host through the /home automounter.
If you have problems with the quota limit in your home directory you can transfer data from your home or the /scratch directory of your submit host to the /scratch directories of the arton execution hosts and vice versa. To do this you are allowed to log in interactively on arton01 with your personal account. All /scratch directories of the execution hosts are available on arton01 through the /scratch_net automount system: the /scratch directory of arton<nn> is accessible under /scratch_net/arton<nn>. So you are able to transfer data between the /scratch_net directories and your home with a normal Linux file copy, and to the scratch of your submission host with scp.
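Putting these pieces together, a job-script could stage its data over the local /scratch like in the following sketch; the directory layout and the executable myprog are assumptions:
{{{
#!/bin/bash
#$ -cwd
#$ -j y
#
# per-job temporary directory on the local /scratch of the execution host
JOBTMP=/scratch/$USER/job_$JOB_ID
mkdir -p $JOBTMP
#
# stage the input data from home, run, and write the result back to home
cp $HOME/input/data.dat $JOBTMP/
./myprog $JOBTMP/data.dat > $HOME/results/result_$JOB_ID.out
#
# clean up the temporary data
rm -rf $JOBTMP
}}}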
Please do not use this login possibility on arton01 to run compute jobs interactively; our procguard system will detect this. Other data storage concepts for the arton grid are possible and will be investigated if the above solution proves not to be sufficient.
Matlab on SGE
Mandelbrot sample array job
This sample array job is the SGE version of the mandelbrot example in the condor service description. In contrast to the condor version, the task-ID dependent parameter calculations to get different fractal images are done in the matlab file "mandelplot.m". To run this sample job, download the 3 files mandelplot.m, mandelbrot.m and mandelbrot.sge to a directory under your home. To submit the job enter
> qsub mandelbrot.sge
With qstat you can track the execution of your job. If your jobs no longer appear in the qstat output, the job has completed. Now you find 10 jpeg-files and 10 job output-files in the submit directory. The last line in the job-script file
/usr/sepp/bin/matlab -nojvm -nodisplay -nodesktop -nosplash -r "mandelplot($SGE_TASK_ID, 10,'mandel_$SGE_TASK_ID.jpg');exit"
shows how the array job variable SGE_TASK_ID is used to call matlab for executing the command mandelplot. The task-ID itself and an output file name depending on the task-ID are passed to the mandelplot function.
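For orientation, the surrounding job-script has roughly this shape (a sketch, not necessarily the exact content of mandelbrot.sge; only the 10-task array range and the matlab call are taken from the text above):
{{{
#!/bin/bash
#$ -cwd
#$ -j y
#$ -t 1-10
#
/usr/sepp/bin/matlab -nojvm -nodisplay -nodesktop -nosplash -r "mandelplot($SGE_TASK_ID, 10,'mandel_$SGE_TASK_ID.jpg');exit"
}}}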