Differences between revisions 5 and 100 (spanning 95 versions)
Revision 5 as of 2019-09-06 10:25:31
Size: 6381
Editor: gfreudig
Comment:
Revision 100 as of 2020-09-08 11:00:28
Size: 26181
Editor: stroth
Comment:
Deletions are marked like this. Additions are marked like this.
Line 4: Line 4:
At ITET the Condor Batch Queueing System is used since long time for running compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs could be killed by new interactive load or by shutdown/restart of a PC.<<BR>><<BR>>
The SLURM system installed on the powerfull ITET arton compute servers is an alternative to the Condor batch computing system and '''reserved for staff of the contributing institutes (IBT,IFA,TIK,IKT,APS)'''. It consists of a master host, where the scheduler resides and the arton compute nodes, where the batch jobs are executed. The compute nodes are powerfull servers, which resides in server rooms and are exclusively reserved for batch processing. Interactive logins are disabled.
At ITET the Condor Batch Queueing System has been used for a long time and is still used for running compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs could be killed by new interactive load or by shutdown/restart of a PC.<<BR>><<BR>>
The SLURM system installed on the powerful ITET arton compute servers is an alternative to the Condor batch computing system. It consists of a master host, where the scheduler resides, and the compute nodes, where batch jobs are executed. The compute nodes are powerful servers located in server rooms and are exclusively reserved for batch processing. Interactive logins are disabled.

=== Access ===
Access to the SLURM grid is reserved for staff of the contributing institutes '''APS, IBT, IFA, MINS, NARI, TIK'''. Access is granted on request, please contact [[mailto:support@ee.ethz.ch|ISG.EE support]].<<BR>>
If your circumstances differ and you'd still like to use the cluster, please contact [[mailto:support@ee.ethz.ch|ISG.EE support]] as well and ask for an offer. Time-limited test accounts for up to 2 weeks are also available on request.
Line 8: Line 12:
SLURM ('''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters. Slurm's design is very modular with about 100 optional plugins. In 2010, the developers of Slurm founded SchedMD (https://www.schedmd.com), which maintains the canonical source, provides development, level 3 commercial support and training services and also provide a very good online documentation to Slurm ( https://slurm.schedmd.com ).

== SLURM Arton Grid ==
SLURM ('''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compute clusters. Slurm's design is very modular with about 100 optional plugins. In 2010, the developers of Slurm founded [[https://www.schedmd.com|SchedMD]], which maintains the canonical source, provides development, level 3 commercial support and training services and also provides very good online documentation for [[https://slurm.schedmd.com|Slurm]].

== SLURM Grid ==
Line 12: Line 17:
At the moment the computing power of the SLURM Arton Grid is based on the following 11 cpu compute servers and 1 gpu compute server (compute nodes) :<<BR>><<BR>>
||'''Server'''||||'''CPU'''||||'''Frequency'''||||'''Cores'''||||'''GPUs'''||||'''Memory'''||||'''Operating System'''||
||arton01 - 03||||Dual Octa-Core Intel Xeon E5-2690||||2.90 GHz||||16||||-||||128 GB||||Debian 9||
||arton04 - 08||||Dual Deca-Core Intel Xeon E5-2690 v2||||3.00 GHz||||20||||-||||128 GB||||Debian 9||
||arton09 - 10||||Dual Deca-Core Intel Xeon E5-2690 v2||||3.00 GHz||||20||||-||||256 GB||||Debian 9||
||arton11||||Dual Deca-Core Intel Xeon E5-2690 v2||||3.00 GHz||||20||||-||||768 GB||||Debian 9||
||artongpu01||||Dual Octa-Core Intel Xeon Silver 4208 CPU||||2.10 GHz||||16||||2||||128GB||||Debian 9||
<<BR>>
The local disks (/scratch) of arton09, arton10 and arton11 are fast SSD-disks (6 GBit/s) with a size of 720 GByte.<<BR>><<BR>>
The SLURM job scheduler runs on the linux server `itetmaster01`.<<BR>>
At the moment the computing power of the SLURM grid is based on the following 11 cpu compute nodes and 1 gpu compute node:
||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''||
||arton01 - 03||Dual Octa-Core Intel Xeon E5-2690||2.90 GHz||16||128 GB||-||-||-||Debian 9||
||arton04 - 08||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||128 GB||-||-||-||Debian 9||
||arton09 - 10||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||256 GB||&#10003;||-||-||Debian 9||
||arton11||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||736 GB||&#10003;||-||-||Debian 9||
||artongpu01||Dual Octa-Core Intel Xeon Silver 4208||2.10 GHz||16||128GB||&#10003;||4 RTX 2080 Ti||11 GB||Debian 9||
 * `artongpu01` is meant to be a test system to try out GPU calculations
<<BR>>

The following 4 gpu nodes are reserved for exclusive use by TIK:
||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''||
||tikgpu01||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40GHz||28||512GB||&#10003;||5 Titan Xp, 2 GTX Titan X||12 GB||Debian 9||
||tikgpu02||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40GHz||28||512GB||&#10003;||8 Titan Xp||12 GB||Debian 9||
||tikgpu03||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40GHz||28||512GB||&#10003;||8 Titan Xp||12 GB||Debian 9||
||tikgpu04||Dual Hectakaideca-Core Xeon Gold 6242 v4||2.80GHz||32||384GB||&#10003;||8 Titan RTX||24 GB||Debian 9||

The SLURM job scheduler runs on the linux server `itetmaster01`.
Line 23: Line 37:
The artons cpu nodes offer the same software environment as all D-ITET managed Linux clients, gpu nodes have a restricted software ( no desktops installed ).
The nodes offer the same software environment as all D-ITET managed Linux clients; gpu nodes have a restricted software selection (no desktops installed, only the minimal dependencies needed for driver support).
Line 26: Line 40:
At a basic level, SLURM is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs on the Grid Engine. The commands that will be most useful to you are as follows<<BR>>
 * sbatch - submit a job to the batch scheduler
 * squeue - examine running and waiting jobs
 * sinfo - status compute nodes
 * scancel - delete a running job
At a basic level, SLURM is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs. The commands that will be most useful to you are as follows<<BR>>
 * `sbatch` - submit a job to the batch scheduler
 * `squeue` - examine running and waiting jobs
 * `sinfo` - status compute nodes
 * `scancel` - delete a running job
Line 32: Line 47:
The above commands are only working if the environment variables for SLURM are set. Please put the following to lines in your ~/.bashrc :<<BR>>
{{{
export PATH=/usr/pack/slurm-19.05.0-sr/amd64-debian-linux9/bin:$PATH
The above commands only work if the environment variables for SLURM are set. Please issue the following commands in your bash shell to start working with the cluster immediately or add them to your `~/.bashrc` to have the slurm commands available for new instances of bash:<<BR>>
{{{#!highlight bash numbers=disable
export PATH=/usr/pack/slurm-19.05.x-sr/amd64-debian-linux9/bin:$PATH
Line 37: Line 52:
==== sbatch : Submitting a job ====
sbatch doesn't allow to submit a binary program directly, please put the program to run in a surrounding bash script. The sbatch command has the following syntax:<<BR>>

==== sbatch -> Submitting a job ====
`sbatch` does not allow submitting a binary program directly; wrap the program to run in a surrounding bash script. The `sbatch` command has the following syntax:<<BR>>
Line 42: Line 58:
The job_script is a standard UNIX shell script. The fixed options for the SLURM Scheduler are placed in the job_script in lines starting with '''#SBATCH'''. The UNIX shell interpreter read this lines as comment lines and ignores them. Only temporary options should be placed outside the job_script. To test your job-script you can simply run it interactively.<<BR>><<BR>>
Assume there is a c program [[attachment:primes.c]] which is compiled to an executable binary named primes with "gcc -o primes primes.c". The program runs 5 seconds and calculates prime numbers. The found prime numbers and a final summary report are written to standard output. A sample job_script primes.sh to perform a batch run of the binary primes on the Arton grid looks like this:
{{{
#!/bin/sh
#
#SBATCH  --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH  --output=log/%j.out # where to store the output ( %j is the JOBID )
The `job_script` is a standard UNIX shell script. The fixed options for the SLURM Scheduler are placed in the `job_script` in lines starting with '''#SBATCH'''. The UNIX shell interprets these lines as comments and ignores them. Only temporary options should be placed outside the `job_script`. To test your `job-script` you can simply run it interactively.<<BR>><<BR>>
Assume there is a c program [[attachment:primes.c]] which is compiled to an executable binary named `primes` with "gcc -o primes primes.c". The program runs 5 seconds and calculates prime numbers. The found prime numbers and a final summary are sent to standard output. A sample `job_script` `primes.sh` to perform a batch run of the binary primes on the Arton grid looks like this:
{{{#!highlight bash numbers=disable
#!/bin/bash

#SBATCH --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --output=log/%j.out # where to store the output (%j is the JOBID), subdirectory must exist
#SBATCH --error=log/%j.err # where to store error messages
Line 53: Line 71:
#
# exit on errors
set -o errexit
Line 59: Line 79:
You cat test the script by running it interactively in a terminal:
You can test the script by running it interactively in a terminal:
Line 63: Line 83:
If the script runs successfully you now can submit it as a batch job to the SLURM arton grid:
If the script runs successfully you can now submit it as a batch job to the SLURM arton grid:
Line 71: Line 91:
When the job has finished, you find the output file of the job in the log subdirectory with a name of <JOBID>.out .<<BR>>
/!\ The directory for the job output must exist, it is not created automatically !
After the job has finished, you will find the output file of the job in the log subdirectory with a name of `<JOBID>.out`.<<BR>>
/!\ The directory for the job output must exist, it is not created automatically!
Line 74: Line 94:
Like in condor its also possible to start an array job. The job above would run 10 times if you put the option '''#SBATCH   --array=0-9''' in the job-script. The repeated execution makes only sense if something is changed in the executed program with the array task count number.The array count number can be referenced through the variable '''$SLURM_ARRAY_TASK_ID'''. You can pass the value of $SLURM_ARRAY_TASK_ID or some derived parameters to the executable. A simple solution to pass an $SLURM_ARRAY_TASK_ID dependent input filename parameter for the executable would look like this:
{{{
You can only submit jobs to SLURM if your account is configured in the SLURM user database. If it isn't, you'll receive this [[#Batch_job_submission_failed:_Invalid_account|error message]]

==== sbatch -> Submitting an array job ====
Similar to condor it is also possible to start an array job. The above job would run 10 times if you added the option '''#SBATCH --array=0-9''' to the job-script. A repeated execution only makes sense if the executed program adapts its behaviour according to the changing array task count number. The array count number can be referenced through the variable '''$SLURM_ARRAY_TASK_ID'''. You can pass the value of `$SLURM_ARRAY_TASK_ID` or some derived parameters to the executable.<<BR>>
Here is a simple example of passing an input filename parameter changing with `$SLURM_ARRAY_TASK_ID` to the executable:
{{{#!highlight bash numbers=disable
Line 82: Line 106:
Every run of the program in the array job with a different task-id will also produce a separate output file.<<BR>><<BR>>
Every run of the program in the array job with a different task-id will produce a separate output file.<<BR>>

The option expects a '''range of task-ids''' expressed in the form `--array=n[,k[,...]][-m[:s]]%l`<<BR>>
where `n`, `k`, `m` are discrete task-ids, `s` is a step applied to a range `n`-`m` and `l` applies a limit to the number of simultaneously running tasks. See `man sbatch` for examples.<<BR>>
Specifying '''one''' task-id instead of a range as in `--array=10` results in an array job with a single task with task-id 10.<<BR>>
The following variables will be available in the job context and reflect the option arguments given: '''$SLURM_ARRAY_TASK_MAX''', '''$SLURM_ARRAY_TASK_MIN''', '''$SLURM_ARRAY_TASK_STEP'''.

==== sbatch -> Common options ====
The following table shows the most common options available for '''sbatch''' to be used in the `job_script` in lines starting with `#SBATCH`<<BR>>
||'''option'''||'''description'''||
||--mail-type=...||Possible Values: NONE, BEGIN, END, FAIL, REQUEUE, ALL||
||--mem=<n>G||the job needs a maximum of <n> GByte ( if omitted the default of 6G is used )||
||--cpus-per-task=<n>||number of cores to be used for the job||
||--gres=gpu:1||number of GPUs needed for the job||
||--nodes=<n>||number of compute nodes to be used for the job||
||--hint=<type>||Bind tasks to CPU cores according to application hints (see `man --pager='less +/--hint' srun` and [[https://slurm.schedmd.com/mc_support.html#srun_hints|multi-core support]])||
||--constraint=<feature_name>||Request one or more [[#sinfo_-.3E_Show_available_features|features]], optionally combined by operators||

 * /!\ The `--nodes` option should only be used for MPI jobs !
 * The operators to combine `--constraint` lists are:
 . '''AND (&)''': `#SBATCH --constraint='geforce_rtx_2080_ti&titan_rtx'`
 . '''OR (|)''': `#SBATCH --constraint='titan_rtx|titan_xp'`

==== squeue -> Show running/waiting jobs ====
The squeue command shows the current list of running and pending jobs in the system. As you can see in the following sample output the default format is quite minimalistic:
{{{
gfreudig@trollo:~/Batch$ squeue
             JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
               951 cpu.norma primes.s gfreudig R 0:11 1 arton02
               950 cpu.norma primes_4 gfreudig R 0:36 1 arton02
               949 cpu.norma primes.s fgtest01 R 1:22 1 arton02
               948 gpu.norma primes.s fgtest01 R 1:39 1 artongpu01
gfreudig@trollo:~/Batch$
}}}
More detailed information can be obtained by issuing the following command:
{{{
gfreudig@trollo:~/Batch$ squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50
JOBID STATE PARTITION NODELIST(REASON) USER TRES_ALLOC TIME COMMAND
951 RUNNING cpu.normal.32 arton02 gfreudig cpu=1,mem=32G,node=1,billing=1 1:20 /home/gfreudig/BTCH/Slurm/jobs/single/primes.sh 600
950 RUNNING cpu.normal.32 arton02 gfreudig cpu=4,mem=8G,node=1,billing=4 1:45 /home/gfreudig/BTCH/Slurm/jobs/multi/primes_4.sh 600
949 RUNNING cpu.normal.32 arton02 fgtest01 cpu=1,mem=8G,node=1,billing=1 2:31 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600
948 RUNNING gpu.normal artongpu01 fgtest01 cpu=1,mem=8G,node=1,billing=1,gres/gpu=1 2:48 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600
gfreudig@trollo:~/Batch$
}}}
 * `STATE` is explained in the squeue man page in section `JOB STATE CODES`, see `man --pager='less +/^JOB\ STATE\ CODES' squeue` for details
 * `REASON` is explained there as well in section `JOB REASON CODE`, see `man --pager='less +/^JOB\ REASON\ CODES' squeue`
Defining an alias in your `.bashrc` with
{{{#!highlight bash numbers=disable
alias sq1='squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50'
}}}
puts the command `sq1` at your fingertips.

/!\ Never call squeue from any kind of loop, i.e. never do `watch squeue`. See `man --pager='less +/^PERFORMANCE' squeue` for an explanation.<<BR>>
To monitor your jobs, set the `sbatch` option `--mail-type` to send you notifications. If you absolutely have to see a live display of your jobs, use the `--iterate` option with a value of several seconds:
{{{#!highlight bash numbers=disable
squeue --user=$USER --iterate=30
}}}

==== squeue -> Show job steps ====
Individual [[#srun_-.3E_Launch_a_command_as_a_job_step|job steps]] are listed with a specific option:
{{{#!highlight bash numbers=disable
squeue -s
}}}

==== scancel -> Deleting a job ====
With `scancel` you can remove your waiting and running jobs from the scheduler queue by their associated `JOBID`. The command `squeue` lists your jobs including their `JOBID`s. A job can then be deleted with
{{{
> scancel <JOBID>
}}}
To operate on an array job you can use the following commands
{{{
> scancel <JOBID> # all jobs (waiting or running) of the array job are deleted
> scancel <JOBID>_n # the job with task-ID n is deleted
> scancel <JOBID>_[n1-n2] # the jobs with task-ID in the range n1-n2 are deleted
}}}

==== sinfo -> Show partition configuration ====
The partition status can be obtained by using the `sinfo` command. An example listing is shown below.
{{{
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
cpu.normal.32* up 2-00:00:00 11 idle arton[01-11]
cpu.normal.64 up 2-00:00:00 3 idle arton[09-11]
cpu.normal.256 up 2-00:00:00 1 idle arton11
array.normal up 2-00:00:00 10 idle arton[01-10]
gpu.normal up 2-00:00:00 1 idle artongpu01
tikgpu.normal up 2-00:00:00 3 mix tikgpu[01-03]
}}}
For normal jobs (single, multicore) you cannot choose the partition for the job with the `sbatch` command; the partition is selected by the scheduler according to your memory request. Array jobs are put in the `array.normal` partition, [[#GPU_jobs|gpu jobs]] in the `gpu.normal` partition. The following table shows the job memory limits in different partitions:<<BR>>
||'''PARTITION'''||'''Requested Memory'''||
||cpu.normal.32||< 32 GB||
||cpu.normal.64||32 - 64 GB||
||cpu.normal.256||> 64 GB||
||array.normal||< 32 GB||
||gpu.normal||< 64 GB||
||tikgpu.normal||unlimited||
Only a job with a `--mem` request of a maximum of 32 GByte can run in the `cpu.normal.32` partition, which contains all 11 artons.<<BR>>
While `tikgpu.normal` has no memory limit itself, the memory on nodes in this partition is not unlimited.

==== sinfo -> Show resources and utilization ====
Adding selected format parameters to the `sinfo` command shows the resources available on every node and their utilization:
{{{#!highlight bash numbers=disable
sinfo -Node --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100 |(sed -u 1q; sort -u)
}}}
Restricting the command to selected partitions shows only GPU nodes:
{{{#!highlight bash numbers=disable
sinfo -Node --partition=tikgpu.normal,gpu.normal --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100
}}}

==== sinfo -> Show available features ====
So-called ''features'' are used to constrain jobs to nodes with different hardware capabilities, typically GPU types. To show currently active features issue the following command sequence:
{{{#!highlight bash numbers=disable
sinfo --Format nodehost:20,features_act:80 |grep -v '(null)' |awk 'NR == 1; NR > 1 {print $0 | "sort -n"}'
}}}

==== srun -> Start an interactive shell ====
An interactive session on a compute node is possible for short tests, checking your environment or transferring data to the local scratch of a node available under `/scratch_net/arton[0-11]`. An interactive session lasting for 10 minutes on a GPU node can be started with:
{{{#!highlight bash numbers=disable
srun --time 10 --gres=gpu:1 --pty bash -i
}}}
The output will look similar to the following:
{{{
srun: Start executing function slurm_job_submit......
srun: Your job is a gpu job.
srun: Setting partition to gpu.normal
srun: job 11526 queued and waiting for resources
}}}
Omitting the parameter `--gres=gpu:1` opens an interactive session on a CPU-only node.<<BR>>
Do not use an interactive login to run compute jobs; use it only briefly as outlined above. Restrict job time to the necessary minimum with the `--time` option as shown above. For details see the related section in the `srun` man page by issuing the command `man --pager='less +/--time' srun` in your shell.

==== srun -> Attaching an interactive shell to a running job ====
An interactive shell can be opened inside a running job by specifying its job id:
{{{#!highlight bash numbers=disable
srun --time 10 --jobid=123456 --pty bash -i
}}}
A typical use case is interactive live-monitoring of a running job.

==== srun -> Selecting artongpu01 ====
The following information is only applicable for members of institutes with institute-owned nodes.
To specifically run a GPU job on `artongpu01`, the node has to be selected with the `--nodelist` parameter as in the following example:
{{{#!highlight bash numbers=disable
srun --time 10 --nodelist=artongpu01 --gres=gpu:1 --pty bash -i
}}}

==== srun -> Launch a command as a job step ====
When `srun` is used inside a `sbatch` script it spawns the given command inside a job step. This allows resource monitoring with the `sstat` command (see `man sstat`). Spawning several single-threaded commands and putting them in the background allows scheduling these commands inside the job allocation.<<BR>>
Here's an example of how to run overall GPU logging and per-process logging in job steps before starting the actual computing commands.

{{{#!highlight bash numbers=disable
...
set -o errexit

srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" &
srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT > "${SLURM_JOB_ID}.processlog" &
...
echo finished at: `date`
exit 0;
}}}

==== sstat -> Display status information of a running job ====
The status information shows your job's resource usage while it is running:
{{{#!highlight bash numbers=disable
sstat --jobs=<JOBID> --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15
}}}
 * `AveVMSize`: Average virtual memory of all tasks in the job
 * `MaxRSS`: Peak memory usage of all tasks in the job
 * `AveCPU`: Average CPU time of all tasks in the job

==== sacct -> Display accounting information of past jobs ====
Accounting information for past jobs can be displayed with various details (see man page).<<BR>>
The following example lists all jobs of the logged in user since the beginning of the year 2020:
{{{#!highlight bash numbers=disable
sacct --user ${USER} --starttime=2020-01-01 --format=JobID,Start%20,Partition%20,ReqTRES%50,AveVMSize%15,MaxRSS%15,AveCPU%15,Elapsed%15,State%20
}}}
=== GPU jobs ===

==== Selecting the correct GPUs ====
To select the GPU allocated by the scheduler, slurm sets the environment variable `CUDA_VISIBLE_DEVICES` in the context of a job to the GPUs allocated to the job. The numbering always starts at 0 and runs consecutively up to the requested number of GPUs minus 1.<<BR>>
It is '''imperative''' to work with this variable exactly as it is set by slurm, anything else leads to unexpected errors.<<BR>>
For details see the section [[https://slurm.schedmd.com/gres.html#GPU_Management|GPU Management]] in the official Slurm documentation.

==== Specifying a GPU type ====
It's possible to specify a GPU type by inserting the type description in the gres allocation:
{{{#!highlight bash numbers=disable
--gres=gpu:titan_rtx:1
}}}
Available GPU type descriptions can be filtered from an appropriate `sinfo` command:
{{{#!highlight bash numbers=disable
sinfo --noheader --Format gres:200 |tr ':' '\n' |sort -u |grep -vE '^(gpu|[0-9,\(]+)'
}}}
Multiple GPU types can be requested by using the [[#sbatch_-.3E_Common_options|--constraint]] option.

=== Multicore jobs/ job to core binding ===
A modern linux kernel is able to bind a process and all its children to a fixed number of cores. By default a job submitted to the SLURM arton grid is bound to the number of requested cores/cpus. The default number of requested cpus is 1; if you have an application which is able to run multithreaded on several cores, you must use the '''--cpus-per-task''' option in the sbatch command to get a binding to more than one core. To check for processes with core bindings, use the command `hwloc-ps -c`:
{{{
gfreudig@trollo:~/Batch$ ssh arton02 hwloc-ps -c
43369 0x00010001 slurmstepd: [984.batch]
43374 0x00010001 /bin/sh
43385 0x00010001 codebin/primes
gfreudig@trollo:~/Batch$
}}}
 
=== Job input/output data storage ===
Temporary data of a job, used only while the job is running, should be placed in the `/scratch` directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab `MCR_ROOT_CACHE` variable is set automatically by the SLURM scheduler.<<BR>>
The file system protection of the `/scratch` directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the `/scratch` directory from filling up and cleans it governed by pre-set policies. Therefore data you place in the `/scratch` directory of a compute node cannot be assumed to stay there forever.<<BR>><<BR>>
Small sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the `/home` automounter.<<BR>>
Larger amounts of data should be placed in your personal [[Services/NetScratch|netscratch]] folder and can be accessed on all compute nodes.<<BR>><<BR>>

If you have problems with the quota limit in your home directory you could transfer data from your home or the `/scratch` directory of your submit host to the `/scratch` directories of the arton compute nodes and vice versa. For this purpose interactive logins with personal accounts are allowed on `arton01`. All `/scratch` directories of the compute nodes are available on `arton01` through the `/scratch_net` automount system. You can access the `/scratch` directory of `arton<nn>` under `/scratch_net/arton<nn>`. This allows you to transfer data between the `/scratch_net` directories and your home with normal linux file copy and to the `/scratch` of your submission host with scp.<<BR>><<BR>>
Do not log in on `arton01` to run compute jobs interactively. Such jobs will be detected by our procguard system and killed.
Other data storage concepts for the arton grid are possible and will be investigated if the above solution proves not to be sufficient.<<BR>>


=== Matlab Distributed Computing Environment (MDCE) ===
The Matlab '''P'''arallel '''C'''omputing '''T'''oolbox (PCT) can be configured with an interface to the SLURM cluster. To work with MDCE please import [[attachment:Slurm.mlsettings]] in the Matlab GUI (Parallel -> Create and manage Clusters -> Import). Adjust the setting "JobStorageLocation" to your requirements. The cluster profile Slurm will now appear beside the standard local (default) profile in the profile list. With the local profile, you can use as many workers on one computer as there are physical cores, while the Slurm profile allows initiating up to 32 worker processes distributed over all slurm compute nodes.<<BR>>
/!\ Please temporarily reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.<<BR>>
/!\ Don't forget to set the Slurm environment variables before starting Matlab!<<BR>>
The Slurm cluster profile can be used with Matlab programs running as Slurm batch jobs but it's also possible to use the profile in an interactive Matlab session on your client. When you open a Slurm parpool, the workers are started automatically as jobs in the cluster.<<BR>>
/!\ In interactive mode please always close your parpool if you aren't performing any calculations on the workers.<<BR>>
Sample code for the 3 Matlab PCT methods parfor, spmd, tasks using the local or Slurm cluster profile is provided in [[attachment:PCTRefJobs.tar.gz]].

=== Frequently Asked Questions ===
If your question isn't listed below, an answer might be listed in the [[https://slurm.schedmd.com/faq.html|official Slurm FAQ]].

==== Batch job submission failed: Invalid account ====
If you receive one of the following error messages after submitting a job with `sbatch` or using `srun`
{{{
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
}}}
{{{
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
}}}
your account hasn't been registered with slurm yet. Please contact [[mailto:support@ee.ethz.ch|support]] and ask to be registered.

==== Invalid user for SlurmUser slurm ====
After executing one of the slurm executables like `sbatch` or `sinfo` the following error appears:
{{{
error: Invalid user for SlurmUser slurm, ignored
}}}
The user `slurm` doesn't exist on the host you're running your slurm executable. If this happens on a host managed by ISG.EE, please contact [[mailto:support@ee.ethz.ch|support]], tell us the name of your host and ask us to configure it as a slurm submission host.

==== Node(s) in drain state ====
If `sinfo` shows one or more nodes in ''drain'' state, the reason can be shown with
{{{
sinfo -R
}}}
or in case the reason is cut off with
{{{
sinfo -o '%60E %9u %19H %N'
}}}
Nodes are set to drain by ISG.EE to empty them of jobs in time for scheduled maintenance or by the scheduler itself in case a problem is detected on a node.

Introduction

At ITET the Condor Batch Queueing System has been used for a long time and is still used for running compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs could be killed by new interactive load or by shutdown/restart of a PC.

The SLURM system installed on the powerful ITET arton compute servers is an alternative to the Condor batch computing system. It consists of a master host, where the scheduler resides, and the compute nodes, where batch jobs are executed. The compute nodes are powerful servers located in server rooms and are exclusively reserved for batch processing. Interactive logins are disabled.

Access

Access to the SLURM grid is reserved for staff of the contributing institutes APS, IBT, IFA, MINS, NARI, TIK. Access is granted on request, please contact ISG.EE support.
If your circumstances differ and you'd still like to use the cluster, please contact ISG.EE support as well and ask for an offer. Time-limited test accounts for up to 2 weeks are also available on request.

SLURM

SLURM (Simple Linux Utility for Resource Management) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compute clusters. Slurm's design is very modular with about 100 optional plugins. In 2010, the developers of Slurm founded SchedMD, which maintains the canonical source, provides development, level 3 commercial support and training services and also provides very good online documentation for Slurm.

SLURM Grid

Hardware

At the moment the computing power of the SLURM grid is based on the following 11 cpu compute nodes and 1 gpu compute node:

||Server||CPU||Frequency||Cores||Memory||/scratch SSD||GPUs||GPU Memory||Operating System||
||arton01 - 03||Dual Octa-Core Intel Xeon E5-2690||2.90 GHz||16||128 GB||-||-||-||Debian 9||
||arton04 - 08||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||128 GB||-||-||-||Debian 9||
||arton09 - 10||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||256 GB||✓||-||-||Debian 9||
||arton11||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||736 GB||✓||-||-||Debian 9||
||artongpu01||Dual Octa-Core Intel Xeon Silver 4208||2.10 GHz||16||128 GB||✓||4 RTX 2080 Ti||11 GB||Debian 9||

  • artongpu01 is meant to be a test system to try out GPU calculations


The following 4 gpu nodes are reserved for exclusive use by TIK:

||Server||CPU||Frequency||Cores||Memory||/scratch SSD||GPUs||GPU Memory||Operating System||
||tikgpu01||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||512 GB||✓||5 Titan Xp, 2 GTX Titan X||12 GB||Debian 9||
||tikgpu02||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||512 GB||✓||8 Titan Xp||12 GB||Debian 9||
||tikgpu03||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||512 GB||✓||8 Titan Xp||12 GB||Debian 9||
||tikgpu04||Dual Hectakaideca-Core Xeon Gold 6242 v4||2.80 GHz||32||384 GB||✓||8 Titan RTX||24 GB||Debian 9||

The SLURM job scheduler runs on the linux server itetmaster01.

Software

The nodes offer the same software environment as all D-ITET managed Linux clients; gpu nodes have a restricted software selection (no desktops installed, only the minimal dependencies needed for driver support).

Using SLURM

At a basic level, SLURM is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs. The commands that will be most useful to you are as follows

  • sbatch - submit a job to the batch scheduler

  • squeue - examine running and waiting jobs

  • sinfo - status compute nodes

  • scancel - delete a running job

Setting environment

The above commands only work if the environment variables for SLURM are set. Please issue the following commands in your bash shell to start working with the cluster immediately or add them to your ~/.bashrc to have the slurm commands available for new instances of bash:

export PATH=/usr/pack/slurm-19.05.x-sr/amd64-debian-linux9/bin:$PATH
export SLURM_CONF=/home/sladmitet/slurm/slurm.conf
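A quick way to verify the setup (a minimal sanity check, not part of the official instructions) is to ask the shell where it finds the commands and to query the cluster once:

which sbatch squeue sinfo
sinfo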

sbatch -> Submitting a job

sbatch does not allow submitting a binary program directly; wrap the program to run in a surrounding bash script. The sbatch command has the following syntax:

> sbatch [options] job_script [job_script arguments]

The job_script is a standard UNIX shell script. The fixed options for the SLURM Scheduler are placed in the job_script in lines starting with #SBATCH. The UNIX shell interprets these lines as comments and ignores them. Only temporary options should be placed outside the job_script. To test your job-script you can simply run it interactively.

Assume there is a c program primes.c which is compiled to an executable binary named primes with "gcc -o primes primes.c". The program runs 5 seconds and calculates prime numbers. The found prime numbers and a final summary are sent to standard output. A sample job_script primes.sh to perform a batch run of the binary primes on the Arton grid looks like this:

#!/bin/bash

#SBATCH --mail-type=ALL                     # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --output=log/%j.out                 # where to store the output (%j is the JOBID), subdirectory must exist
#SBATCH --error=log/%j.err                  # where to store error messages

/bin/echo Running on host: `hostname`
/bin/echo In directory: `pwd`
/bin/echo Starting on: `date`
/bin/echo SLURM_JOB_ID: $SLURM_JOB_ID

# exit on errors
set -o errexit
# binary to execute
./primes
echo finished at: `date`
exit 0;

You can test the script by running it interactively in a terminal:

gfreudig@trollo:~/Batch$ ./primes.sh

If the script runs successfully you can now submit it as a batch job to the SLURM arton grid:

gfreudig@trollo:~/Batch$ sbatch primes.sh 
sbatch: Start executing function slurm_job_submit......
sbatch: Job partition set to : cpu.normal.32 (normal memory)
Submitted batch job 931
gfreudig@trollo:~/Batch$ 

After the job has finished, you will find the output file of the job in the log subdirectory with a name of <JOBID>.out.
/!\ The directory for the job output must exist, it is not created automatically!
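The log directory referenced by --output therefore has to be created once before the first submission, and resource options documented further below can also be given on the sbatch command line instead of editing the script (the values here are made up):

mkdir -p log
sbatch --mem=8G --cpus-per-task=2 primes.sh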

You can only submit jobs to SLURM if your account is configured in the SLURM user database. If it isn't, you'll receive this error message

sbatch -> Submitting an array job

Similar to condor it is also possible to start an array job. The above job would run 10 times if you added the option #SBATCH --array=0-9 to the job-script. A repeated execution only makes sense if the executed program adapts its behaviour according to the changing array task count number. The array count number can be referenced through the variable $SLURM_ARRAY_TASK_ID. You can pass the value of $SLURM_ARRAY_TASK_ID or some derived parameters to the executable.
Here is a simple example of passing an input filename parameter changing with $SLURM_ARRAY_TASK_ID to the executable:

.
#SBATCH   --array=0-9
#
# binary to execute
<path-to-executable> data$SLURM_ARRAY_TASK_ID.dat

Every run of the program in the array job with a different task-id will produce a separate output file.

The option expects a range of task-ids expressed in the form --array=n[,k[,...]][-m[:s]]%l
where n, k, m are discrete task-ids, s is a step applied to a range n-m and l applies a limit to the number of simultaneously running tasks. See man sbatch for examples.
Specifying one task-id instead of a range as in --array=10 results in an array job with a single task with task-id 10.
The following variables will be available in the job context and reflect the option arguments given: $SLURM_ARRAY_TASK_MAX, $SLURM_ARRAY_TASK_MIN, $SLURM_ARRAY_TASK_STEP.
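As a sketch of the range syntax described above (the task-ids and limits are made up):

#SBATCH --array=1-9:2%3    # tasks 1,3,5,7,9 with at most 3 tasks running simultaneously
#SBATCH --array=0,4,7      # exactly three tasks with task-ids 0, 4 and 7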

sbatch -> Common options

The following table shows the most common options available for sbatch to be used in the job_script in lines starting with #SBATCH

||option||description||
||--mail-type=...||Possible Values: NONE, BEGIN, END, FAIL, REQUEUE, ALL||
||--mem=<n>G||the job needs a maximum of <n> GByte (if omitted the default of 6G is used)||
||--cpus-per-task=<n>||number of cores to be used for the job||
||--gres=gpu:1||number of GPUs needed for the job||
||--nodes=<n>||number of compute nodes to be used for the job||
||--hint=<type>||Bind tasks to CPU cores according to application hints (see man --pager='less +/--hint' srun and multi-core support)||
||--constraint=<feature_name>||Request one or more features, optionally combined by operators||

  • /!\ The --nodes option should only be used for MPI jobs !

  • The operators to combine --constraint lists are:

  • AND (&): #SBATCH --constraint='geforce_rtx_2080_ti&titan_rtx'

  • OR (|): #SBATCH --constraint='titan_rtx|titan_xp'
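Putting several of these options together, the header of a job_script might look like the following sketch (the resource values are made up, adjust them to your job):

#!/bin/bash
#SBATCH --mail-type=ALL
#SBATCH --output=log/%j.out
#SBATCH --mem=16G
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:1
#SBATCH --constraint='titan_rtx|titan_xp'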

squeue -> Show running/waiting jobs

The squeue command shows the current list of running and pending jobs in the system. As you can see in the following sample output the default format is quite minimalistic:

gfreudig@trollo:~/Batch$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               951 cpu.norma primes.s gfreudig  R       0:11      1 arton02
               950 cpu.norma primes_4 gfreudig  R       0:36      1 arton02
               949 cpu.norma primes.s fgtest01  R       1:22      1 arton02
               948 gpu.norma primes.s fgtest01  R       1:39      1 artongpu01
gfreudig@trollo:~/Batch$ 

More detailed information can be obtained by issuing the following command:

gfreudig@trollo:~/Batch$ squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50
JOBID     STATE     PARTITION       NODELIST(REASON)  USER      TRES_ALLOC                                   TIME    COMMAND                                             
951       RUNNING   cpu.normal.32   arton02           gfreudig  cpu=1,mem=32G,node=1,billing=1               1:20    /home/gfreudig/BTCH/Slurm/jobs/single/primes.sh 600 
950       RUNNING   cpu.normal.32   arton02           gfreudig  cpu=4,mem=8G,node=1,billing=4                1:45    /home/gfreudig/BTCH/Slurm/jobs/multi/primes_4.sh 600
949       RUNNING   cpu.normal.32   arton02           fgtest01  cpu=1,mem=8G,node=1,billing=1                2:31    /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600 
948       RUNNING   gpu.normal      artongpu01        fgtest01  cpu=1,mem=8G,node=1,billing=1,gres/gpu=1     2:48    /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600 
gfreudig@trollo:~/Batch$ 
  • STATE is explained in the squeue man page in section JOB STATE CODES, see man --pager='less +/^JOB\ STATE\ CODES' squeue for details

  • REASON is explained there as well in section JOB REASON CODE, see man --pager='less +/^JOB\ REASON\ CODES' squeue

Defining an alias in your .bashrc with

alias sq1='squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50'

puts the command sq1 at your fingertips.

/!\ Never call squeue from any kind of loop, i.e. never do watch squeue. See man --pager='less +/^PERFORMANCE' squeue for an explanation.
To monitor your jobs, set the sbatch option --mail-type to send you notifications. If you absolutely have to see a live display of your jobs, use the --iterate option with a value of several seconds:

squeue --user=$USER --iterate=30

squeue -> Show job steps

Individual job steps are listed with a specific option:

squeue -s

scancel -> Deleting a job

With scancel you can remove your waiting and running jobs from the scheduler queue by their associated JOBID. The command squeue lists your jobs including their JOBIDs. A job can then be deleted with

> scancel <JOBID>

To operate on an array job you can use the following commands

> scancel <JOBID>          # all jobs (waiting or running) of the array job are deleted
> scancel <JOBID>_n        # the job with task-ID n is deleted
> scancel <JOBID>_[n1-n2]  # the jobs with task-ID in the range n1-n2 are deleted
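scancel also accepts a user filter (not shown in the examples above), which removes all of your own waiting and running jobs at once, so use it with care:

scancel --user=$USER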

sinfo -> Show partition configuration

The partition status can be obtained by using the sinfo command. An example listing is shown below.

PARTITION      AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu.normal.32*    up 2-00:00:00     11   idle arton[01-11]
cpu.normal.64     up 2-00:00:00      3   idle arton[09-11]
cpu.normal.256    up 2-00:00:00      1   idle arton11
array.normal      up 2-00:00:00     10   idle arton[01-10]
gpu.normal        up 2-00:00:00      1   idle artongpu01
tikgpu.normal     up 2-00:00:00      3    mix tikgpu[01-03]

For normal jobs (single, multicore) you cannot choose the partition for the job with the sbatch command; the partition is selected by the scheduler according to your memory request. Array jobs are put in the array.normal partition, gpu jobs in the gpu.normal partition. The following table shows the job memory limits in different partitions:

||PARTITION||Requested Memory||
||cpu.normal.32||< 32 GB||
||cpu.normal.64||32 - 64 GB||
||cpu.normal.256||> 64 GB||
||array.normal||< 32 GB||
||gpu.normal||< 64 GB||
||tikgpu.normal||unlimited||

Only a job with a --mem request of a maximum of 32 GByte can run in the cpu.normal.32 partition, which contains all 11 artons.
While tikgpu.normal has no memory limit itself, the memory on nodes in this partition is not unlimited.
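As an illustration of this routing, a job_script containing the following request (the value is made up) would be placed in the cpu.normal.64 partition according to the table above:

#SBATCH --mem=48G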

sinfo -> Show resources and utilization

Adding selected format parameters to the sinfo command shows the resources available on every node and their utilization:

sinfo -Node --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100 |(sed -u 1q; sort -u)

Restricting the command to selected partitions shows only GPU nodes:

sinfo -Node --partition=tikgpu.normal,gpu.normal --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100

sinfo -> Show available features

So-called features are used to constrain jobs to nodes with different hardware capabilities, typically GPU types. To show currently active features issue the following command sequence:

sinfo --Format nodehost:20,features_act:80 |grep -v '(null)' |awk 'NR == 1; NR > 1 {print $0 | "sort -n"}'

srun -> Start an interactive shell

An interactive session on a compute node is possible for short tests, checking your environment or transferring data to the local scratch of a node available under /scratch_net/arton[0-11]. An interactive session lasting for 10 minutes on a GPU node can be started with:

srun --time 10 --gres=gpu:1 --pty bash -i

The output will look similar to the following:

srun: Start executing function slurm_job_submit......
srun: Your job is a gpu job.
srun: Setting partition to gpu.normal
srun: job 11526 queued and waiting for resources

Omitting the parameter --gres=gpu:1 opens an interactive session on a CPU-only node.
Do not use an interactive login to run compute jobs; use it only briefly as outlined above. Restrict job time to the necessary minimum with the --time option as shown above. For details see the related section in the srun man page by issuing the command man --pager='less +/--time' srun in your shell.

srun -> Attaching an interactive shell to a running job

An interactive shell can be opened inside a running job by specifying its job id:

srun --time 10 --jobid=123456 --pty bash -i

A typical use case is interactive live-monitoring of a running job.
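For example, to glance at the GPU utilization of a running GPU job (the job id below is made up), attach to it as shown above and run nvidia-smi inside the allocation:

srun --time 10 --jobid=123456 --pty bash -i   # attach to job 123456
nvidia-smi                                    # inspect GPU utilization inside the allocation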

srun -> Selecting artongpu01

The following information is only applicable for members of institutes with institute-owned nodes. To specifically run a GPU job on artongpu01, the node has to be selected with the --nodelist parameter as in the following example:

srun --time 10 --nodelist=artongpu01 --gres=gpu:1 --pty bash -i

srun -> Launch a command as a job step

When srun is used inside a sbatch script it spawns the given command inside a job step. This allows resource monitoring with the sstat command (see man sstat). Spawning several single-threaded commands and putting them in the background allows scheduling these commands inside the job allocation.
Here's an example of how to run overall GPU logging and per-process logging in job steps before starting the actual computing commands.

...
set -o errexit

srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" &
srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT  > "${SLURM_JOB_ID}.processlog" &
...
echo finished at: `date`
exit 0;

sstat -> Display status information of a running job

The status information shows your job's resource usage while it is running:

sstat --jobs=<JOBID> --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15
  • AveVMSize: Average virtual memory of all tasks in the job

  • MaxRSS: Peak memory usage of all tasks in the job

  • AveCPU: Average CPU time of all tasks in the job

sacct -> Display accounting information of past jobs

Accounting information for past jobs can be displayed with various details (see man page).
The following example lists all jobs of the logged in user since the beginning of the year 2020:

sacct --user ${USER} --starttime=2020-01-01 --format=JobID,Start%20,Partition%20,ReqTRES%50,AveVMSize%15,MaxRSS%15,AveCPU%15,Elapsed%15,State%20

GPU jobs

Selecting the correct GPUs

To select the GPU allocated by the scheduler, slurm sets the environment variable CUDA_VISIBLE_DEVICES in the context of a job to the GPUs allocated to the job. The numbering always starts at 0 and runs consecutively up to the requested number of GPUs minus 1.
It is imperative to work with this variable exactly as it is set by slurm, anything else leads to unexpected errors.
For details see the section GPU Management in the official Slurm documentation.
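A minimal sketch of how a job_script can make use of the variable without touching it (the binary name is made up):

echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES}"   # log which GPUs slurm assigned to this job
# do not set or reorder CUDA_VISIBLE_DEVICES yourself
./my_gpu_program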

Specifying a GPU type

It's possible to specify a GPU type by inserting the type description in the gres allocation:

--gres=gpu:titan_rtx:1

Available GPU type descriptions can be filtered from an appropriate sinfo command:

sinfo --noheader --Format gres:200 |tr ':' '\n' |sort -u |grep -vE '^(gpu|[0-9,\(]+)'

Multiple GPU types can be requested by using the --constraint option.

Multicore jobs/ job to core binding

A modern linux kernel is able to bind a process and all its children to a fixed number of cores. By default a job submitted to the SLURM arton grid is bound to the number of requested cores/cpus. The default number of requested cpus is 1; if you have an application which is able to run multithreaded on several cores, you must use the --cpus-per-task option in the sbatch command to get a binding to more than one core. To check for processes with core bindings, use the command hwloc-ps -c:

gfreudig@trollo:~/Batch$ ssh arton02 hwloc-ps -c
43369   0x00010001              slurmstepd: [984.batch]
43374   0x00010001              /bin/sh
43385   0x00010001              codebin/primes
gfreudig@trollo:~/Batch$
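As a sketch, a job_script for a multithreaded (e.g. OpenMP-style) application could request four cores and hand the allocated core count to the application; SLURM_CPUS_PER_TASK is set by slurm when --cpus-per-task is used, the binary name is made up:

#!/bin/bash
#SBATCH --cpus-per-task=4
#SBATCH --output=log/%j.out
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}   # let the application use all allocated cores
./my_threaded_program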

Job input/output data storage

Temporary data of a job, used only while the job is running, should be placed in the /scratch directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab MCR_ROOT_CACHE variable is set automatically by the SLURM scheduler.
The file system protection of the /scratch directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the /scratch directory from filling up and cleans it governed by pre-set policies. Therefore data you place in the /scratch directory of a compute node cannot be assumed to stay there forever.

Small sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the /home automounter.
Larger amounts of data should be placed in your personal netscratch folder and can be accessed on all compute nodes.

If you have problems with the quota limit in your home directory you could transfer data from your home or the /scratch directory of your submit host to the /scratch directories of the arton compute nodes and vice versa. For this purpose interactive logins with personal accounts are allowed on arton01. All /scratch directories of the compute nodes are available on arton01 through the /scratch_net automount system. You can access the /scratch directory of arton<nn> under /scratch_net/arton<nn>. This allows you to transfer data between the /scratch_net directories and your home with normal linux file copy and to the /scratch of your submission host with scp.

Do not log in on arton01 to run compute jobs interactively. Such jobs will be detected by our procguard system and killed. Other data storage concepts for the arton grid are possible and will be investigated if the above solution proves not to be sufficient.
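As an illustration with made-up paths, copying an input data set from your home directory to the local scratch of arton05 via the automounter on arton01 could look like this:

ssh arton01
mkdir -p /scratch_net/arton05/$USER
cp -r ~/datasets/myinput /scratch_net/arton05/$USER/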

Matlab Distributed Computing Environment (MDCE)

The Matlab Parallel Computing Toolbox (PCT) can be configured with an interface to the SLURM cluster. To work with MDCE please import Slurm.mlsettings in the Matlab GUI (Parallel -> Create and manage Clusters -> Import). Adjust the setting "JobStorageLocation" to your requirements. The cluster profile Slurm will now appear beside the standard local (default) profile in the profile list. With the local profile, you can use as many workers on one computer as there are physical cores, while the Slurm profile allows initiating up to 32 worker processes distributed over all slurm compute nodes.
/!\ Please temporarily reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.
/!\ Don't forget to set the Slurm environment variables before starting Matlab!
The Slurm cluster profile can be used with Matlab programs running as Slurm batch jobs but it's also possible to use the profile in an interactive Matlab session on your client. When you open a Slurm parpool, the workers are started automatically as jobs in the cluster.
/!\ In interactive mode please always close your parpool if you aren't performing any calculations on the workers.
Sample code for the 3 Matlab PCT methods parfor, spmd, tasks using the local or Slurm cluster profile is provided in PCTRefJobs.tar.gz.

Frequently Asked Questions

If your question isn't listed below, an answer might be listed in the official Slurm FAQ.

Batch job submission failed: Invalid account

If you receive one of the following error messages after submitting a job with sbatch or using srun

sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified

srun: error: Unable to allocate resources: Invalid account or account/partition combination specified

your account hasn't been registered with slurm yet. Please contact support and ask to be registered.

Invalid user for SlurmUser slurm

After executing one of the slurm executables like sbatch or sinfo the following error appears:

error: Invalid user for SlurmUser slurm, ignored

The user slurm doesn't exist on the host you're running your slurm executable. If this happens on a host managed by ISG.EE, please contact support, tell us the name of your host and ask us to configure it as a slurm submission host.

Node(s) in drain state

If sinfo shows one or more nodes in drain state, the reason can be shown with

sinfo -R

or in case the reason is cut off with

sinfo -o '%60E %9u %19H %N'

Nodes are set to drain by ISG.EE to empty them of jobs in time for scheduled maintenance or by the scheduler itself in case a problem is detected on a node.
