The SLURM system installed on the powerful ITET arton compute servers is an alternative to the Condor batch computing system. It consists of a master host, where the scheduler resides, and the compute nodes, where batch jobs are executed. The compute nodes are powerful servers located in server rooms; they are exclusively reserved for batch processing. Interactive logins are disabled.

=== Access ===
Access to the SLURM grid is reserved for staff of the contributing institutes '''APS, IBT, IFA, MINS, NARI, TIK'''. Access is granted on request; please contact [[mailto:support@ee.ethz.ch|ISG.EE support]].<<BR>>
If your circumstances differ and you'd still like to use the cluster, please contact [[mailto:support@ee.ethz.ch|ISG.EE support]] as well and ask for an offer.

== SLURM Grid ==
The local disks (`/scratch`) of `arton09`, `arton10` and `arton11` are fast SSD-disks (6 GBit/s) with a size of 720 GByte.<<BR>><<BR>>
At the moment the computing power of the SLURM grid is based on the following 11 cpu compute nodes and 1 gpu compute node:
||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''||
||arton01 - 03||Dual Octa-Core Intel Xeon E5-2690||2.90 GHz||16||128 GB||-||-||-||Debian 9||
||arton04 - 08||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||128 GB||-||-||-||Debian 9||
||arton09 - 10||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||256 GB||&#10003;||-||-||Debian 9||
||arton11||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||736 GB||&#10003;||-||-||Debian 9||
||artongpu01||Dual Octa-Core Intel Xeon Silver 4208||2.10 GHz||16||128 GB||&#10003;||4 RTX 2080 Ti||11 GB||Debian 9||
 * `artongpu01` is meant to be a test system to try out GPU calculations

The following 4 gpu nodes are reserved for exclusive use by TIK:
||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''||
||tikgpu01||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||512 GB||&#10003;||5 Titan Xp, 2 GTX Titan X||12 GB||Debian 9||
||tikgpu02||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||512 GB||&#10003;||8 Titan Xp||12 GB||Debian 9||
||tikgpu03||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||512 GB||&#10003;||8 Titan Xp||12 GB||Debian 9||
||tikgpu04||Dual Hexakaideca-Core Xeon Gold 6242 v4||2.80 GHz||32||384 GB||&#10003;||8 Titan RTX||24 GB||Debian 9||

The SLURM job scheduler runs on the Linux server `itetmaster01`.
The nodes offer the same software environment as all D-ITET managed Linux clients; gpu nodes have a restricted software set (no desktops installed, minimal dependencies needed for driver support).
The above commands only work if the environment variables for SLURM are set. To start working with the cluster immediately, issue the following commands in your bash shell, or add them to your `~/.bashrc` to have the slurm commands available in new instances of bash:<<BR>>
{{{#!highlight bash numbers=disable
export PATH=/usr/pack/slurm-19.05.x-sr/amd64-debian-linux9/bin:$PATH
}}}
{{{#!highlight bash numbers=disable
#SBATCH --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --output=log/%j.out # where to store the output ( %j is the JOBID )
#SBATCH --error=log/%j.err # where to store error messages
# exit on errors
set -o errexit
}}}
You can only submit jobs to SLURM if your account is configured in the SLURM user database. If it isn't, you'll receive this [[#Batch_job_submission_failed:_Invalid_account|error message]].

==== sbatch -> Submitting an array job ====
Similar to Condor, it is also possible to start an array job. The above job would run 10 times if you added the option '''#SBATCH --array=0-9''' to the job script. Repeated execution only makes sense if the executed program adapts its behaviour to the changing task-id of the array job. The task-id can be referenced through the variable '''$SLURM_ARRAY_TASK_ID'''. You can pass the value of `$SLURM_ARRAY_TASK_ID` or some derived parameters to the executable.<<BR>>
Every run of the program in the array job with a different task-id will produce a separate output file.<<BR>>

The option expects a '''range of task-ids''' expressed in the form `--array=n[,k[,...]][-m[:s]]%l`<<BR>>
where `n`, `k`, `m` are discrete task-ids, `s` is a step applied to a range `n`-`m` and `l` applies a limit to the number of simultaneously running tasks. See `man sbatch` for examples.<<BR>>
Specifying '''one''' task-id instead of a range as in `--array=10` results in an array job with a single task with task-id 10.<<BR>>
The following variables will be available in the job context and reflect the option arguments given: '''$SLURM_ARRAY_TASK_MAX''', '''$SLURM_ARRAY_TASK_MIN''', '''$SLURM_ARRAY_TASK_STEP'''.
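
A minimal sketch of a job script that turns the task-id into a program parameter is shown here. The parameter list and the function name `param_for_task` are hypothetical and only serve as an illustration. Outside of an array job `$SLURM_ARRAY_TASK_ID` is unset, so the sketch falls back to task-id 0 to allow a local test run.

{{{#!highlight bash numbers=disable
#!/bin/bash
#SBATCH --array=0-9    # ten tasks with the task-ids 0..9

# Hypothetical parameter table: print the parameter belonging to a task-id.
param_for_task() {
    local params=(0.1 0.05 0.01 0.005 0.001 0.0005 0.0001 0.00005 0.00001 0.000005)
    echo "${params[$1]}"
}

# Fall back to task-id 0 when the script is not run as part of an array job.
task_id="${SLURM_ARRAY_TASK_ID:-0}"
param="$(param_for_task "${task_id}")"

# Replace this line with the call of your program, passing ${param} as an argument.
echo "task ${task_id} uses parameter ${param}"
}}}

Keeping the mapping from task-id to parameter inside the job script means that the program itself does not need to know that it runs as part of an array job.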

==== sbatch -> Common options ====
||--mem=<n>G||||the job needs a maximum of <n> GByte (if omitted, the default of 6G is used)||
More detailed information can be obtained by issuing the following command:
The partition status can be obtained by using the `sinfo` command. An example listing is shown below.
{{{
gfreudig@trollo:~/Batch$ sinfo
cpu.normal.32* up 2-00:00:00 11 idle arton[01-11]
cpu.normal.64 up 2-00:00:00 3 idle arton[09-11]
cpu.normal.256 up 2-00:00:00 1 idle arton11
array.normal up 2-00:00:00 10 idle arton[01-10]
gpu.normal up 2-00:00:00 1 idle artongpu01
tikgpu.normal up 2-00:00:00 3 mix tikgpu[01-03]
tikgpu.mon up 2-00:00:00 3 mix tikgpu[01-03]
}}}
For normal jobs (single, multicore) you cannot choose the partition in the `sbatch` command; it is selected by the scheduler according to your memory request. Array jobs are put in the `array.normal` partition, [[#GPU_jobs|gpu jobs]] in the `gpu.normal` partition. The following table shows the job memory limits in the different partitions:<<BR>>
||'''PARTITION'''||'''Requested Memory'''||
||cpu.normal.32||< 32 GB||
||cpu.normal.64||32 - 64 GB||
||cpu.normal.256||> 64 GB||
||array.normal||< 32 GB||
||gpu.normal||< 64 GB||
Only a job with a `--mem` request of at most 32 GByte can run in the `cpu.normal.32` partition, which contains all 11 artons.<<BR>>
While `tikgpu.normal` has no memory limit itself, the memory on nodes in this partition is not unlimited.
The purpose of `tikgpu.mon` is to allow interactive logins for live monitoring of running jobs.
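
The memory table for normal jobs can be read as a simple decision rule. The following sketch only illustrates that rule: the function name `expected_partition` is hypothetical, a request of exactly 32 GByte is assumed to still fit `cpu.normal.32`, and the actual assignment is always made by the scheduler.

{{{#!highlight bash numbers=disable
# Illustration only: print the partition a normal job is expected to land in
# for a memory request given in whole GByte. The scheduler has the final say.
expected_partition() {
    local mem_gb="$1"
    if (( mem_gb <= 32 )); then
        echo "cpu.normal.32"
    elif (( mem_gb <= 64 )); then
        echo "cpu.normal.64"
    else
        echo "cpu.normal.256"
    fi
}

# Show the expected partition for a few example requests.
for mem in 6 48 128; do
    echo "--mem=${mem}G -> $(expected_partition "${mem}")"
done
}}}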

==== sinfo -> Show resources and utilization ====
Adding selected format parameters to the `sinfo` command shows the resources available on every node and their utilization:
{{{#!highlight bash numbers=disable
sinfo --Node --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100 |(sed -u 1q; sort -u)
}}}
Restricting the command to selected partitions makes it possible to show only GPU nodes:
{{{#!highlight bash numbers=disable
sinfo --Node --partition=tikgpu.normal,gpu.normal --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100
}}}
==== srun -> Start an interactive shell ====
An interactive session on a compute node is possible for short tests, for checking your environment or for transferring data to the local scratch of a node, which is available under `/scratch_net/arton[01-11]`. An interactive session lasting for 10 minutes on a GPU node can be started with:
{{{#!highlight bash numbers=disable
srun --time 10 --gres=gpu:1 --pty bash -i
}}}
The output will look similar to the following:
{{{
srun: Start executing function slurm_job_submit......
srun: Your job is a gpu job.
srun: Setting partition to gpu.normal
srun: job 11526 queued and waiting for resources
}}}
Omitting the parameter `--gres=gpu:1` opens an interactive session on a CPU-only node.<<BR>>
Do not use an interactive login to run compute jobs; use it only briefly as outlined above. Restrict job time to the necessary minimum with the `--time` option as shown above. For details see the related section in the `srun` man page by issuing the command `man --pager='less +/--time' srun` in your shell.
==== srun -> Monitoring tik jobs ====
The following information is only applicable for institute-owned nodes.<<BR>>
The partition `tikgpu.mon` is available to monitor jobs interactively on a specific node:
{{{#!highlight bash numbers=disable
srun --time 10 --partition=tikgpu.mon --nodelist=tikgpu01 --pty bash -i
}}}

==== srun -> Selecting artongpu01 ====
The following information is only applicable for members of institutes with institute-owned nodes.
To specifically run a GPU job on `artongpu01`, the node has to be selected with the `--nodelist` parameter as in the following example:
{{{#!highlight bash numbers=disable
srun --time 10 --nodelist=artongpu01 --gres=gpu:1 --pty bash -i
}}}

==== srun -> Launch a command as a job step ====
When `srun` is used inside a `sbatch` script, it spawns the given command inside a job step. This allows resource monitoring with the `sstat` command (see `man sstat`). Spawning several single-threaded commands and putting them in the background allows these commands to be scheduled inside the job allocation.<<BR>>
Here's an example of how to run overall GPU logging and per-process logging in job steps before starting the actual computing commands.

{{{#!highlight bash numbers=disable
# exit on errors
set -o errexit

# log overall and per-process GPU usage every 5 seconds in background job steps
srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" &
srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT > "${SLURM_JOB_ID}.processlog" &
# the actual computing commands follow here
echo finished at: $(date)
exit 0;
}}}

==== sstat -> Display status information of a running job ====
The status information shows your job's resource usage while it is running:
{{{#!highlight bash numbers=disable
sstat --jobs=<JOBID> --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15
}}}
 * `AveVMSize`: Average virtual memory of all tasks in the job
 * `MaxRSS`: Peak memory usage of all tasks in the job
 * `AveCPU`: Average CPU time of all tasks in the job

==== sacct -> Display accounting information of past jobs ====
Accounting information for past jobs can be displayed with various details (see man page).<<BR>>
The following example lists all jobs of the logged-in user since the beginning of the year 2020:
{{{#!highlight bash numbers=disable
sacct --user ${USER} --starttime=2020-01-01 --format=JobID,Start%20,Partition%20,ReqTRES%50,AveVMSize%15,MaxRSS%15,AveCPU%15,Elapsed%15,State%20
}}}
=== GPU jobs ===
To select the GPU allocated by the scheduler, slurm sets the environment variable `CUDA_VISIBLE_DEVICES` in the context of a job. It is '''imperative''' to work with this variable exactly as it is set by slurm. Anything else leads to the use of wrong GPU(s), which might be in use by jobs of other users.<<BR>>
For details, see the section [[https://slurm.schedmd.com/gres.html#GPU_Management|GPU Management]] in the official slurm documentation.
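
A minimal sketch of how a job script can inspect the assigned GPUs without changing the variable is shown here. The function name `count_visible_gpus` is hypothetical. The sketch only reads `CUDA_VISIBLE_DEVICES` and assumes that it contains a comma-separated list of GPU ids.

{{{#!highlight bash numbers=disable
# Count the GPUs assigned to the job; CUDA_VISIBLE_DEVICES is read, never modified.
count_visible_gpus() {
    local devices="${CUDA_VISIBLE_DEVICES:-}"
    # An unset or empty variable means that no GPU was assigned.
    if [[ -z "${devices}" ]]; then
        echo 0
        return
    fi
    # Split the comma-separated list into an array and print its length.
    local ids
    IFS=',' read -r -a ids <<< "${devices}"
    echo "${#ids[@]}"
}

echo "GPUs assigned by slurm: ${CUDA_VISIBLE_DEVICES:-none} (count: $(count_visible_gpus))"
}}}

Logging this information at the start of a job makes it easier to relate later log output to the GPUs that were actually assigned.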

=== GPU numbering ===
The numbering of GPUs can be confusing as it is non-uniform across different sources of information. One source of information is the so-called ''PCI bus number'', the other is the ''PCI device minor number''. They are generated differently and although their order might match, this cannot be taken for granted!

==== By PCI bus number: CUDA_VISIBLE_DEVICES ====
The environment variable `CUDA_DEVICE_ORDER` controls the numbering of GPUs in a CUDA context. Its default is `FASTEST_FIRST`, which makes the fastest available GPU number 0 in `CUDA_VISIBLE_DEVICES`.<<BR>>
For details, see the section [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars|CUDA Environment Variables]] in the CUDA toolkit documentation.<<BR>>
As long as a node only has one type of GPU installed, this numbering can be identical to the ordering enforced by setting `CUDA_DEVICE_ORDER=PCI_BUS_ID`.

==== By PCI device minor number: nvidia-smi/NVML ====
The command `nvidia-smi`, which uses the [[https://developer.nvidia.com/nvidia-management-library-nvml|Nvidia Management Library]] (NVML), numbers GPUs based on the enumeration by the kernel driver. As this can change between node reboots, it should not be used as a constant value.<<BR>>
For details see the related section in the `nvidia-smi` man page by issuing the command `man --pager='less +/--id=ID' nvidia-smi` in your shell.<<BR>>
A GPU can consistently be detected by its UUID or PCI bus ID as follows:
{{{#!highlight bash numbers=disable
nvidia-smi -q |grep -E '(GPU UUID|Minor Number|Bus Id)\s+:' |paste - - - |column -t
}}}

==== By PCI device minor number: Operating system/Kernel driver ====
The GPU ID used by the operating system in `/dev/nvidia[0..n]` is based on the ''PCI device minor number''. This number is generated by the kernel driver in a non-transparent way and can change after a reboot.<<BR>>
A GPU can consistently be detected by its UUID or PCI bus ID as follows:
{{{#!highlight bash numbers=disable
grep -h -E '(GPU UUID|Device Minor|Bus Location):' /proc/driver/nvidia/gpus/*/information |paste - - - |column -t
}}}
A modern Linux kernel is able to bind a process and all its children to a fixed number of cores. By default a job submitted to the SLURM arton grid is bound to the number of requested cores/cpus. The default number of requested cpus is 1; if you have an application which is able to run multithreaded on several cores, you must use the '''--cpus-per-task''' option in the sbatch command to get a binding to more than one core. To check for processes with core bindings, use the command `hwloc-ps -c`:
Temporary data of a job, used only while the job is running, should be placed in the `/scratch` directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab `MCR_ROOT_CACHE` variable is set automatically by the SLURM scheduler.<<BR>>
The file system protection of the `/scratch` directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the `/scratch` directory from filling up and cleans it according to pre-set policies. Therefore data you place in the `/scratch` directory of a compute node cannot be assumed to stay there forever.<<BR>><<BR>>
Small-sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the `/home` automounter.<<BR>>
Larger amounts of data should be placed in your personal [[Services/NetScratch|netscratch]] folder, which can be accessed on all compute nodes.
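
The following sketch shows one way to keep temporary data of a job in a private directory below `/scratch` and to remove it when the job script ends. The directory naming scheme and the fallback to `/tmp` outside of the cluster are assumptions made for this illustration; `SLURM_JOB_ID` is set by slurm inside a job.

{{{#!highlight bash numbers=disable
# Use /scratch on a compute node; fall back to /tmp elsewhere (assumption for local tests).
scratch_base="/scratch"
[[ -d "${scratch_base}" && -w "${scratch_base}" ]] || scratch_base="${TMPDIR:-/tmp}"

# Create a private directory named after user and job id ("local" outside of slurm).
job_tmp="$(mktemp -d "${scratch_base}/${USER:-user}_${SLURM_JOB_ID:-local}_XXXXXX")"

# Remove the directory when the script exits, whether it succeeded or failed.
trap 'rm -rf "${job_tmp}"' EXIT

# Many tools honour TMPDIR, so point it to the private directory.
export TMPDIR="${job_tmp}"
echo "temporary data goes to ${TMPDIR}"
}}}

Cleaning up at the end of the job keeps the local disk free for other users and does not rely on the periodic cleanup of `/scratch`.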

If you have problems with the quota limit in your home directory, you can transfer data from your home or the `/scratch` directory of your submit host to the `/scratch` directories of the arton compute nodes and vice versa. For this purpose interactive logins with personal accounts are allowed on `arton01`. All `/scratch` directories of the compute nodes are available on `arton01` through the `/scratch_net` automount system. You can access the `/scratch` directory of `arton<nn>` under `/scratch_net/arton<nn>`. This allows you to transfer data between the `/scratch_net` directories and your home with normal Linux file copy commands and to the `/scratch` of your submission host with `scp`.<<BR>><<BR>>
Do not log in on `arton01` to run compute jobs interactively. Such jobs will be detected by our procguard system and killed.
== SLURM and Matlab ==

=== Matlab Distributed Computing Environment (MDCE) ===
The Matlab '''P'''arallel '''C'''omputing '''T'''oolbox (PCT) can be configured with an interface to the SLURM cluster. To work with MDCE, please import [[attachment:Slurm.mlsettings]] in the Matlab GUI (Parallel -> Create and manage Clusters -> Import). Adjust the setting "JobStorageLocation" to your requirements. The cluster profile Slurm will now appear beside the standard local (default) profile in the profile list. With the local profile, you can use as many workers on one computer as there are physical cores, while the Slurm profile allows you to start up to 32 worker processes distributed over all slurm compute nodes.<<BR>>
/!\ Please temporarily reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.<<BR>>
/!\ Don't forget to set the Slurm environment variables before starting Matlab!<<BR>>
The Slurm cluster profile can be used with Matlab programs running as Slurm batch jobs but it's also possible to use the profile in an interactive Matlab session on your client. When you open a Slurm parpool, the workers are started automatically as jobs in the cluster.<<BR>>
/!\ In interactive mode please always close your parpool if you aren't performing any calculations on the workers.<<BR>>
Sample code for the 3 Matlab PCT methods parfor, spmd, tasks using the local or Slurm cluster profile is provided in [[attachment:PCTRefJobs.tar.gz]].

=== Frequently Asked Questions ===

==== Batch job submission failed: Invalid account ====
If you receive one of the following error messages after submitting a job with `sbatch` or using `srun`
{{{
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
}}}
your account hasn't been registered with slurm yet. Please contact [[mailto:support@ee.ethz.ch|support]] and ask to be registered.

==== Invalid user for SlurmUser slurm ====
After executing one of the slurm executables like `sbatch` or `sinfo` the following error appears:
{{{
error: Invalid user for SlurmUser slurm, ignored
}}}
The user `slurm` doesn't exist on the host on which you're running your slurm executable. If this happens on a host managed by ISG.EE, please contact [[mailto:support@ee.ethz.ch|support]], tell us the name of your host and ask us to configure it as a slurm submission host.

==== Node(s) in drain state ====
If `sinfo` shows one or more nodes in ''drain'' state, the reason can be shown with
{{{#!highlight bash numbers=disable
sinfo -R
}}}
or, in case the reason is cut off, with
{{{#!highlight bash numbers=disable
sinfo -o '%60E %9u %19H %N'
}}}
Nodes are set to drain by ISG.EE to empty them of jobs in time for scheduled maintenance or by the scheduler itself in case a problem is detected on a node.


