Introduction
At ITET the Condor Batch Queueing System has been used for a long time and is still used for running compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is prioritized over batch computing, so running jobs may be killed by new interactive load or by a shutdown/restart of a PC.
The SLURM system installed on the powerful ITET arton compute servers is an alternative to the Condor batch computing system and is reserved for staff of the contributing institutes (APS, IBT, IFA, NARI, TIK, WINS). It consists of a master host, where the scheduler resides, and the arton compute nodes, where the batch jobs are executed. The compute nodes are powerful servers located in server rooms and are exclusively reserved for batch processing; interactive logins are disabled.
SLURM
SLURM (Simple Linux Utility for Resource Management) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compute clusters. Slurm's design is very modular, with about 100 optional plugins. In 2010, the developers of Slurm founded SchedMD, which maintains the canonical source, provides development, level-3 commercial support and training services, and also provides very good online documentation for Slurm.
gfreudig@trollo:~/Batch$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cpu.normal.32*     up 1-00:00:00      2   idle  arton02,zampano
cpu.normal.64      up 1-00:00:00      1   idle  arton09
cpu.normal.256     up 1-00:00:00      1   idle  arton09
array.normal       up 1-00:00:00      2   idle  arton02,zampano
gpu.normal         up 1-00:00:00      1    mix  artongpu01
gfreudig@trollo:~/Batch$
SLURM Arton Grid
Hardware
At the moment the computing power of the SLURM Arton Grid is based on the following 11 CPU compute servers and 1 GPU compute server (compute nodes):
Server       | CPU                                       | Frequency | Cores | GPUs | Memory | Operating System
arton01 - 03 | Dual Octa-Core Intel Xeon E5-2690         | 2.90 GHz  | 16    | -    | 128 GB | Debian 9
arton04 - 08 | Dual Deca-Core Intel Xeon E5-2690 v2      | 3.00 GHz  | 20    | -    | 128 GB | Debian 9
arton09 - 10 | Dual Deca-Core Intel Xeon E5-2690 v2      | 3.00 GHz  | 20    | -    | 256 GB | Debian 9
arton11      | Dual Deca-Core Intel Xeon E5-2690 v2      | 3.00 GHz  | 20    | -    | 768 GB | Debian 9
artongpu01   | Dual Octa-Core Intel Xeon Silver 4208 CPU | 2.10 GHz  | 16    | 2    | 128 GB | Debian 9
The local disks (/scratch) of arton09, arton10 and arton11 are fast SSD-disks (6 GBit/s) with a size of 720 GByte.
The SLURM job scheduler runs on the Linux server itetmaster01.
Software
The arton CPU nodes offer the same software environment as all D-ITET managed Linux clients; the GPU nodes have a restricted software installation (no desktops installed).
Using SLURM
At a basic level, SLURM is very easy to use. The following sections describe the commands you need to run and manage your batch jobs on the Arton grid. The commands that will be most useful to you are the following:
sbatch - submit a job to the batch scheduler
squeue - examine running and waiting jobs
sinfo - show the status of partitions and compute nodes
scancel - delete a waiting or running job
Setting environment
The above commands only work if the environment variables for SLURM are set. Please add the following two lines to your ~/.bashrc:
export PATH=/usr/pack/slurm-19.05.0-sr/amd64-debian-linux9/bin:$PATH
export SLURM_CONF=/home/sladmitet/slurm/slurm.conf
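After reloading your ~/.bashrc you can check that the SLURM commands are found in your PATH (a quick sanity check; the exact version output may differ):
> source ~/.bashrc
> which sbatch
/usr/pack/slurm-19.05.0-sr/amd64-debian-linux9/bin/sbatch
> sinfo --version
slurm 19.05.0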
sbatch -> Submitting a job
sbatch does not allow submitting a binary program directly; please wrap the program to run in a surrounding bash script. The sbatch command has the following syntax:
> sbatch [options] job_script [job_script arguments]
The job_script is a standard UNIX shell script. The fixed options for the SLURM scheduler are placed in the job_script in lines starting with #SBATCH; the UNIX shell interprets these lines as comments and ignores them. Only temporary options should be passed on the sbatch command line instead. To test your job_script you can simply run it interactively.
Assume there is a C program primes.c which is compiled to an executable binary named primes with "gcc -o primes primes.c". The program runs for 5 seconds and calculates prime numbers; the prime numbers found and a final summary are written to standard output. A sample job_script primes.sh to perform a batch run of the binary primes on the Arton grid looks like this:
#!/bin/sh
#
#SBATCH --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --output=log/%j.out # where to store the output ( %j is the JOBID )
/bin/echo Running on host: `hostname`
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`
/bin/echo SLURM_JOB_ID: $SLURM_JOB_ID
#
# binary to execute
./primes
echo finished at: `date`
exit 0;
You can test the script by running it interactively in a terminal:
gfreudig@trollo:~/Batch$ ./primes.sh
If the script runs successfully you can now submit it as a batch job to the SLURM arton grid:
gfreudig@trollo:~/Batch$ sbatch primes.sh
sbatch: Start executing function slurm_job_submit......
sbatch: Job partition set to : cpu.normal.32 (normal memory)
Submitted batch job 931
gfreudig@trollo:~/Batch$
After the job has finished, you will find the output file of the job in the log subdirectory with a name of <JOBID>.out.
The directory for the job output must exist; it is not created automatically!
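For the sample script above you can create the log directory once in the submission directory before the first run:
> mkdir -p log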
Similar to Condor it is also possible to start an array job. The above job would run 10 times if you added the option #SBATCH --array=0-9 to the job_script. Repeated execution only makes sense if the executed program adapts its behaviour according to the changing array task ID. The task ID can be referenced through the variable $SLURM_ARRAY_TASK_ID; you can pass its value or parameters derived from it to the executable.
Here is a simple example of passing an input filename parameter changing with $SLURM_ARRAY_TASK_ID to the executable:
#SBATCH --array=0-9
#
# binary to execute
<path-to-executable> data$SLURM_ARRAY_TASK_ID.dat
Every run of the program in the array job with a different task-id will produce a separate output file.
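For illustration, a complete array job script might look like the following sketch; the binary name my_prog and the data file naming scheme are assumptions and have to be adapted to your program:
#!/bin/sh
#
#SBATCH --mail-type=ALL            # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --output=log/%A_%a.out     # %A is the job ID, %a is the array task ID
#SBATCH --array=0-9                # run 10 tasks with SLURM_ARRAY_TASK_ID = 0..9

/bin/echo Running on host: `hostname`
/bin/echo SLURM_ARRAY_TASK_ID: $SLURM_ARRAY_TASK_ID
#
# binary to execute, reading a different input file in every task (assumed naming scheme)
./my_prog data$SLURM_ARRAY_TASK_ID.dat

exit 0;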
The following table shows the most common options available for sbatch to be used in the job_script in lines starting with #SBATCH:
option              | description
--mail-type=...     | possible values: NONE, BEGIN, END, FAIL, REQUEUE, ALL
--mem=<n>G          | the job needs a maximum of <n> GByte (if omitted, the default of 12G is used)
--cpus-per-task=<n> | number of cores to be used for the job
--gres=gpu:1        | number of GPUs needed for the job (limited to 1!)
--nodes=<n>         | number of compute nodes to be used for the job
The --nodes option should only be used for MPI jobs!
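For illustration, the #SBATCH header of a job script combining several of these options could look like this (the concrete values are examples only):
#!/bin/sh
#
#SBATCH --mail-type=END,FAIL       # send mail when the job ends or fails
#SBATCH --output=log/%j.out        # where to store the output ( %j is the JOBID )
#SBATCH --mem=40G                  # the job needs a maximum of 40 GByte
#SBATCH --cpus-per-task=4          # the job gets 4 cores
##SBATCH --gres=gpu:1              # uncomment to additionally request one GPU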
squeue -> Show running/waiting jobs
The squeue command shows the current list of running and pending jobs in the system. As you can see in the following sample output, the default format is quite minimalistic:
gfreudig@trollo:~/Batch$ squeue
  JOBID  PARTITION      NAME      USER  ST  TIME  NODES  NODELIST(REASON)
    951  cpu.norma  primes.s  gfreudig   R  0:11      1  arton02
    950  cpu.norma  primes_4  gfreudig   R  0:36      1  arton02
    949  cpu.norma  primes.s  fgtest01   R  1:22      1  arton02
    948  gpu.norma  primes.s  fgtest01   R  1:39      1  artongpu01
gfreudig@trollo:~/Batch$
More detailed information can be obtained by issuing the following command:
gfreudig@trollo:~/Batch$ squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50
JOBID     STATE     PARTITION       NODELIST(REASON)  USER      TRES_ALLOC                                 TIME   COMMAND
951       RUNNING   cpu.normal.32   arton02           gfreudig  cpu=1,mem=32G,node=1,billing=1             1:20   /home/gfreudig/BTCH/Slurm/jobs/single/primes.sh 600
950       RUNNING   cpu.normal.32   arton02           gfreudig  cpu=4,mem=8G,node=1,billing=4              1:45   /home/gfreudig/BTCH/Slurm/jobs/multi/primes_4.sh 600
949       RUNNING   cpu.normal.32   arton02           fgtest01  cpu=1,mem=8G,node=1,billing=1              2:31   /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600
948       RUNNING   gpu.normal      artongpu01        fgtest01  cpu=1,mem=8G,node=1,billing=1,gres/gpu=1   2:48   /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600
gfreudig@trollo:~/Batch$
Defining an alias in your .bashrc with
alias sq1='squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50'
puts the command sq1 at your fingertips.
scancel -> Deleting a job
With scancel you can remove your waiting and running jobs from the scheduler queue by their associated JOBID. The command squeue lists your jobs including their JOBIDs. A job can then be deleted with
> scancel <JOBID>
To operate on an array job you can use the following commands:
> scancel <JOBID>           # all jobs (waiting or running) of the array job are deleted
> scancel <JOBID>_n         # the job with task-ID n is deleted
> scancel <JOBID>_[n1-n2]   # the jobs with task-ID in the range n1-n2 are deleted
sinfo -> Show partition configuration
The partition status can be obtained by using the sinfo command. An example listing is shown below.
gfreudig@trollo:~/Batch$ sinfo
PARTITION       AVAIL  TIMELIMIT  NODES  STATE  NODELIST
cpu.normal.32*     up 1-00:00:00      2   idle  arton[01-11]
cpu.normal.64      up 1-00:00:00      1   idle  arton[09-11]
cpu.normal.256     up 1-00:00:00      1   idle  arton11
array.normal       up 1-00:00:00      2   idle  arton[01-08]
gpu.normal         up 1-00:00:00      1   idle  artongpu01
gfreudig@trollo:~/Batch$
For normal jobs (single, multicore) you cannot choose the partition for the job in the sbatch command; the partition is selected by the scheduler according to your memory request. Array jobs are put in the array.normal partition, GPU jobs in the gpu.normal partition. The following table shows the job memory limits in the different partitions:
PARTITION      | max. Memory
cpu.normal.32  | 32 GB
cpu.normal.64  | 64 GB
cpu.normal.256 | 256 GB
array.normal   | 32 GB
gpu.normal     | 64 GB
Only a job with a --mem request of at most 32 GByte can run in the cpu.normal.32 partition, which contains all 11 artons.
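For example, based on the table above, a request like the following exceeds 32 GByte, so the scheduler would place the job in the cpu.normal.64 partition (the value 48G is only an illustration):
#SBATCH --mem=48G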
Multicore jobs / job-to-core binding
A modern Linux kernel is able to bind a process and all its children to a fixed number of cores. By default a job submitted to the SLURM arton grid is bound to the number of requested cores/CPUs. The default number of requested CPUs is 1; if you have an application which is able to run multithreaded on several cores, you must use the --cpus-per-task option in the sbatch command to get a binding to more than one core. To check for processes with core bindings, use the command hwloc-ps -c:
gfreudig@trollo:~/Batch$ ssh arton02 hwloc-ps -c
43369   0x00010001      slurmstepd: [984.batch]
43374   0x00010001      /bin/sh
43385   0x00010001      codebin/primes
gfreudig@trollo:~/Batch$
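As a minimal sketch, a job script for a multithreaded program could request the cores and pass the granted number on to the program; the OpenMP variable OMP_NUM_THREADS and the binary name my_threads are assumptions:
#!/bin/sh
#
#SBATCH --cpus-per-task=8          # request 8 cores; the job is bound to them
#SBATCH --output=log/%j.out

# SLURM_CPUS_PER_TASK holds the number of requested cores (assumption: an OpenMP program)
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
./my_threads

exit 0;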
Job input/output data storage
Temporary data of a job, used only while the job is running, should be placed in the /scratch directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab MCR_ROOT_CACHE variable is set automatically by the SLURM scheduler.
The file system protection of the /scratch directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the /scratch directory from filling up and cleans it according to pre-set policies. Therefore data you place in the /scratch directory of a compute node cannot be assumed to stay there forever.
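A common pattern inside a job script is to create a per-job directory under /scratch and to remove it at the end of the job; the directory layout below is only a sketch:
# create a private scratch directory for this job
SCRATCHDIR=/scratch/$USER/$SLURM_JOB_ID
mkdir -p "$SCRATCHDIR"

# ... run the program with its temporary files placed in $SCRATCHDIR ...

# clean up before the job ends
rm -rf "$SCRATCHDIR"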
Small sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the /home automounter.
If you have problems with the quota limit in your home directory, you can transfer data from your home or the /scratch directory of your submit host to the /scratch directories of the arton compute nodes and vice versa. For this purpose interactive logins with personal accounts are allowed on arton01. All /scratch directories of the compute nodes are available on arton01 through the /scratch_net automount system: you can access the /scratch directory of arton<nn> under /scratch_net/arton<nn>. This allows you to transfer data between the /scratch_net directories and your home with a normal Linux file copy, and to the /scratch of your submission host with scp.
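For example, after logging in on arton01 you could copy result data from the /scratch of a compute node to your home directory, or to the /scratch of your submission host with scp (node number, host name and file names are placeholders):
> ssh arton01
> cp /scratch_net/arton05/$USER/results.tar ~/results/
> scp /scratch_net/arton05/$USER/results.tar <submit-host>:/scratch/$USER/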
Do not log in on arton01 to run compute jobs interactively. Such jobs will be detected by our procguard system and killed. Other data storage concepts for the arton grid are possible and will be investigated if the above solution proves not to be sufficient.
In the near future ISG will provide a network-attached scratch storage system (Netscratch) which will be accessible on all managed Linux clients and compute nodes.
Matlab Distributed Computing Environment (MDCE)
The Matlab Parallel Computing Toolbox (PCT) can be configured with an interface to the SLURM cluster. To work with MDCE please import Slurm.settings in the Matlab GUI (Parallel -> Create and manage Clusters -> Import). You will now see a cluster profile Slurm beside the standard local (default) profile in the profile list. The local profile is for using multiple workers on one machine, while with the Slurm profile up to 32 worker processes distributed over all Slurm compute nodes can be used.
Please temporarily reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.
Don't forget to set the Slurm environment variables before starting Matlab!
For the Slurm profile to work you must have a working ssh key pair in ~/.ssh (id_rsa, id_rsa.pub). The Slurm cluster profile can be used with Matlab programs in batch mode, but it is also possible to use the profile in an interactive Matlab session on your client. If you open a Slurm parpool, the workers are started automatically in the cluster.
In interactive mode please always close your parpool when you are not performing calculations on the workers.
You can find sample code for the three Matlab PCT methods (parfor, spmd, tasks) using the local or Slurm cluster profile in PCTRefJobs.tar.gz.