26344
Comment:
|
33963
Remove reference to condor
|
Deletions are marked like this. | Additions are marked like this. |
Line 1: | Line 1: |
#rev 2020-09-10 stroth | |
Line 4: | Line 5: |
At ITET the Condor Batch Queueing System has been used for a long time and is still used for running compute-intensive jobs. It uses the free resources on the tardis-PCs of the student rooms and on numerous PCs and compute servers at ITET institutes. Interactive work is privileged over batch computing, so running jobs could be killed by new interactive load or by shutdown/restart of a PC.<<BR>><<BR>> The SLURM system installed on the powerful ITET arton compute servers is an alternative to the Condor batch computing system. It consists of a master host, where the scheduler resides and the compute nodes, where batch jobs are executed. The compute nodes are powerful servers located in server rooms, they are exclusively reserved for batch processing. Interactive logins are disabled. |
At D-ITET the Slurm job scheduling system can be used for running compute-intensive jobs. It consists of a master host, where the scheduler resides, and the compute nodes, where batch jobs are executed. The compute nodes are powerful servers located in server rooms; they are exclusively reserved for batch processing. Interactive logins are disabled. |
Line 8: | Line 8: |
Access to the SLURM grid is reserved for staff of the contributing institutes '''APS, IBT, IFA, MINS, NARI, TIK'''. Access is granted on request, please contact [[mailto:support@ee.ethz.ch|ISG.EE support]].<<BR>> If your circumstances differ and you'd still like to use the cluster, please contact [[mailto:support@ee.ethz.ch|ISG.EE support]] as well and ask for an offer. Time-limited test accounts for up to 2 weeks are also available on request. == SLURM == SLURM ('''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compute clusters. Slurm's design is very modular with about 100 optional plugins. In 2010, the developers of Slurm founded [[https://www.schedmd.com|SchedMD]], which maintains the canonical source, provides development, level 3 commercial support and training services and also provide a very good online documentation to [[https://slurm.schedmd.com|Slurm]]. == SLURM Grid == |
Access to the Slurm cluster is reserved for staff of the contributing institutes '''APS, IBT, IFA, MINS, NARI, TIK''' and '''PBL'''. Access is granted on request. * Contact [[mailto:support@ee.ethz.ch|ISG D-ITET support]] if your institute is supported by us * Members of an institute supported by [[https://www.s4d.id.ethz.ch/|ID Services for Departments]] (S4D), use the email address listed for your institute instead If your circumstances differ and you'd still like to use the cluster, please contact [[mailto:support@ee.ethz.ch|ISG D-ITET support]] as well and ask for an offer. Time-limited test accounts for up to 2 weeks are also available on request. === Additional information for institutes === Some institutes have additional setup and configuration; if you are a member of such an institute, make sure to read the information linked below after reading this article: * '''CVL''' uses its own Slurm cluster; please read its [[Services/SLURM-Biwi|documentation]] for access and specific additional information to this article. * '''TIK''' owns nodes in the Slurm cluster; please read the [[Services/SLURM-tik|additional information]] about those nodes and access. * '''PBL''' student supervisors can apply for access for their students. == Slurm == Slurm ('''S'''imple '''L'''inux '''U'''tility for '''R'''esource '''M'''anagement) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compute clusters. Slurm's design is very modular with about 100 optional plugins. In 2010, the developers of Slurm founded [[https://www.schedmd.com|SchedMD]], which maintains the canonical source, provides development, level 3 commercial support and training services and also provides very good online documentation for [[https://slurm.schedmd.com|Slurm]]. == Slurm Cluster == |
Line 17: | Line 26: |
At the moment the computing power of the SLURM grid is based on the following 11 cpu compute nodes and 1 gpu compute node: ||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''|| ||arton01 - 03||Dual Octa-Core Intel Xeon E5-2690||2.90 GHz||16||128 GB||-||-||-||Debian 9|| ||arton04 - 08||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||128 GB||-||-||-||Debian 9|| ||arton09 - 10||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||256 GB||✓||-||-||Debian 9|| ||arton11||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||736 GB||✓||-||-||Debian 9|| ||artongpu01||Dual Octa-Core Intel Xeon Silver 4208||2.10 GHz||16||128GB||✓||4 RTX 2080 Ti||11 GB||Debian 9|| * `artongpu01` is meant to be a test system to try out GPU calculations <<BR>> The following 4 gpu nodes are reserved for exclusive use by TIK: ||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''|| ||tikgpu01||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40GHz||28||512GB||✓||5 Titan Xp, 2 GTX Titan X||12 GB||Debian 9|| ||tikgpu02||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40GHz||28||512GB||✓||8 Titan Xp||12 GB||Debian 9|| ||tikgpu03||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40GHz||28||512GB||✓||8 Titan Xp||12 GB||Debian 9|| ||tikgpu04||Dual Hectakaideca-Core Xeon Gold 6242 v4||2.80GHz||32||384GB||✓||8 Titan RTX||24 GB||Debian 9|| The SLURM job scheduler runs on the linux server `itetmaster01`. |
At the moment the computing power of the Slurm cluster is based on the following 11 cpu compute nodes and 1 gpu compute node: ||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''/scratch Size'''||'''GPUs'''||'''GPU Memory'''||'''Operating System'''|| ||arton[01-03]||Dual Octa-Core Intel Xeon E5-2690||2.90 GHz||16||125 GB||-||895 GB||-||-||Debian 10|| ||arton[04-08]||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||125 GB||-||895 GB||-||-||Debian 10|| ||arton[09-10]||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||251 GB||✓||1.7 TB||-||-||Debian 10|| ||arton11||Dual Deca-Core Intel Xeon E5-2690 v2||3.00 GHz||20||535 GB||✓||1.7 TB||-||-||Debian 10|| ||artongpu01||Dual Octa-Core Intel Xeon Silver 4208||2.10 GHz||16||125 GB||✓||1.1 TB||4 RTX 2080 Ti||11 GB||Debian 10|| * '''Memory''' shows the amount available to Slurm The nodes are "weighted", which gives the scheduler an additional selection criterion among nodes which fulfill the requirements to run a job, like resources and membership in certain partitions. The idea is to prefer nodes with faster CPUs and of those, prefer those with lower RAM. For details, see `/home/sladmitet/slurm/nodes.conf`. The Slurm job scheduler runs on the Linux server `itetmaster01`. |
Line 39: | Line 41: |
=== Using SLURM === At a basic level, SLURM is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs. The commands that will be most useful to you are as follows<<BR>> |
=== Using Slurm === At a basic level, Slurm is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs. The commands that will be most useful to you are as follows: |
Line 47: | Line 49: |
The above commands only work if the environment variables for SLURM are set. Please issue the following commands in your bash shell to start working with the cluster immediately or add them to your `~/.bashrc` to have the slurm commands available for new instances of bash:<<BR>> {{{#!highlight bash numbers=disable export PATH=/usr/pack/slurm-19.05.x-sr/amd64-debian-linux9/bin:$PATH |
The above commands only work if the environment variables for Slurm are set. Please issue the following command in your bash shell to start working with the cluster immediately or add it to your `~/.bashrc` to reference the Slurm cluster for new instances of bash:<<BR>> {{{#!highlight bash numbers=disable |
Line 53: | Line 54: |
==== sbatch -> Submitting a job ==== `sbatch` doesn't allow to submit a binary program directly, please wrap the program to run into a surrounding bash script. The `sbatch` command has the following syntax:<<BR>> {{{ > sbatch [options] job_script [job_script arguments] }}} The `job_script` is a standard UNIX shell script. The fixed options for the SLURM Scheduler are placed in the `job_script` in lines starting with '''#SBATCH'''. The UNIX shell interprets these lines as comments and ignores them. Only temporary options should be placed outside the `job_script`. To test your `job-script` you can simply run it interactively.<<BR>><<BR>> Assume there is a c program [[attachment:primes.c]] which is compiled to an executable binary named `primes` with "gcc -o primes primes.c". The program runs 5 seconds and calculates prime numbers. The found prime numbers and a final summary are sent to standard output. A sample `job_script` `primes.sh` to perform a batch run of the binary primes on the Arton grid looks like this: |
==== sbatch → Submitting a job ==== `sbatch` doesn't allow submitting a binary program directly; wrap the program to run in a surrounding bash script. The `sbatch` command has the following syntax:<<BR>> {{{#!highlight console numbers=disable > sbatch [temporary_options] job_script [job_script arguments] }}} The `job_script` is a standard UNIX shell script. The fixed options for the Slurm Scheduler are placed in the `job_script` in lines starting with '''#SBATCH'''. The UNIX shell interprets these lines as comments and ignores them. * Put options into the `job_script` for easier reference. Place only temporary options outside the `job_script` as options to the `sbatch` command. * Make sure to create the directories you intend to store logfiles in before submitting the `job_script` * Use absolute paths in your scripts to ensure your log files and commands are found * Make sure the paths you use in your scripts [[Services/StorageOverview|are available]] on cluster nodes To test your `job_script`, simply run it interactively on your host.<<BR>><<BR>> Assume there is a C program [[attachment:primes.c]] which is compiled to an executable binary with "gcc -o primes primes.c" and stored as `/absolute/path/to/primes`. The program runs for 5 seconds and calculates prime numbers. The found prime numbers and a final summary are sent to standard output. A sample `job_script` placed in the same location `/absolute/path/to/primes.sh` to perform a batch run of the binary primes on the Arton cluster looks like this: |
Line 63: | Line 69: |
#SBATCH --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL #SBATCH --output=log/%j.out # where to store the output (%j is the JOBID), subdirectory must exist #SBATCH --error=log/%j.err # where to store error messages /bin/echo Running on host: `hostname` /bin/echo In directory: `pwd` /bin/echo Starting on: `date` /bin/echo SLURM_JOB_ID: $SLURM_JOB_ID # exit on errors |
#SBATCH --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL #SBATCH --output=/absolute/path/to/log/%j.out # where to store the output (%j is the JOBID), subdirectory "log" must exist #SBATCH --error=/absolute/path/to/log/%j.err # where to store error messages # Exit on errors |
Line 74: | Line 75: |
# binary to execute ./primes echo finished at: `date` exit 0; |
# Set a directory for temporary files unique to the job with automatic removal at job termination TMPDIR=$(mktemp -d) if [[ ! -d ${TMPDIR} ]]; then echo 'Failed to create temp directory' >&2 exit 1 fi trap "exit 1" HUP INT TERM trap 'rm -rf "${TMPDIR}"' EXIT export TMPDIR # Change the current directory to the location where you want to store temporary files, exit if changing didn't succeed. # Adapt this to your personal preference cd "${TMPDIR}" || exit 1 # Send some noteworthy information to the output log echo "Running on node: $(hostname)" echo "In directory: $(pwd)" echo "Starting on: $(date)" echo "SLURM_JOB_ID: ${SLURM_JOB_ID}" # Binary or script to execute /absolute/path/to/primes # Send more noteworthy information to the output log echo "Finished at: $(date)" # End the script with exit code 0 exit 0 |
Line 80: | Line 106: |
{{{ gfreudig@trollo:~/Batch$ ./primes.sh }}} If the script runs successfully you can now submit it as a batch job to the SLURM arton grid: {{{ gfreudig@trollo:~/Batch$ sbatch primes.sh |
{{{#!highlight console numbers=disable $ /absolute/path/to/primes.sh }}} If the script runs successfully you can now submit it as a batch job to the Slurm arton cluster: {{{#!highlight console numbers=disable $ sbatch /absolute/path/to/primes.sh |
Line 87: | Line 113: |
sbatch: Job partition set to : cpu.normal.32 (normal memory) | sbatch: Job partition set to : cpu.normal |
Line 89: | Line 115: |
gfreudig@trollo:~/Batch$ }}} After the job has finished, you will find the output file of the job in the log subdirectory with a name of `<JOBID>.out`.<<BR>> /!\ The directory for the job output must exist, it is not created automatically! |
}}} After the job has finished, you will find the output file of the job in the file `/absolute/path/to/log/<JOBID>.out`. If there were errors, they are stored in the file `/absolute/path/to/log/<JOBID>.err`.<<BR>> ⚠ Remember: The directory for the job output has to exist before submitting the job; it is not created automatically! |
Line 94: | Line 119: |
You can only submit jobs to SLURM if your account is configured in the SLURM user database. If it isn't, you'll receive this [[#Batch_job_submission_failed:_Invalid_account|error message]] ==== sbatch -> Submitting an array job ==== Similar to condor it is also possible to start an array job. The above job would run 10 times if you added the option '''#SBATCH --array=0-9''' to the job-script. A repeated execution only makes sense if the executed program adapts its behaviour according to the changing array task count number. The array count number can be referenced through the variable '''$SLURM_ARRAY_TASK_ID'''. You can pass the value of `$SLURM_ARRAY_TASK_ID` or some derived parameters to the executable.<<BR>> |
You can only submit jobs to Slurm if your account is configured in the Slurm user database. If it isn't, you'll receive this [[#Batch_job_submission_failed:_Invalid_account|error message]] ==== sbatch → Submitting an array job ==== It is also possible to start an array job. The above job would run 10 times if you added the option '''#SBATCH --array=0-9''' to the job-script. A repeated execution only makes sense if the executed program adapts its behaviour according to the changing array task count number. The array count number can be referenced through the variable '''$SLURM_ARRAY_TASK_ID'''. You can pass the value of `$SLURM_ARRAY_TASK_ID` or some derived parameters to the executable.<<BR>> |
Line 113: | Line 138: |
==== sbatch -> Common options ==== | ==== sbatch → Common options ==== |
Line 115: | Line 140: |
||'''option'''||||'''description'''|| ||--mail-type=...||||Possible Values: NONE, BEGIN, END, FAIL, REQUEUE, ALL|| ||--mem=<n>G||||the job needs a maximum of <n> GByte ( if omitted the default of 6G is used )|| ||--cpus-per-task=<n>||||number of cores to be used for the job|| ||--gres=gpu:1||||number of GPUs needed for the job ( limited to 1 ! )|| ||--nodes=<n>||||number of compute nodes to be used for the job|| /!\ The --nodes option should only be used for MPI jobs ! ==== squeue -> Show running/waiting jobs ==== |
||'''option'''||'''description'''|| ||--mail-type=...||Possible Values: NONE, BEGIN, END, FAIL, REQUEUE, ALL|| ||--mem=<n>G||the job needs a maximum of <n> GByte ( if omitted the default of 6G is used )|| ||--cpus-per-task=<n>||number of cores to be used for the job|| ||--gres=gpu:1||number of GPUs needed for the job|| ||--nodes=<n>||number of compute nodes to be used for the job|| ||--hint=<type>||Bind tasks to CPU cores according to application hints (See `man --pager='less +/--hint' srun` and [[https://slurm.schedmd.com/mc_support.html#srun_hints|multi-core support]])|| ||--constraint=<feature_name>||Request one or more [[#sinfo_.2BIZI_Show_available_features|features]], optionally combined by operators (See `man --pager='less +/--constraint' sbatch`)|| * ⚠ The `--nodes` option should only be used for MPI jobs ! * The operators to combine `--constraint` lists are: . '''AND (&)''': `#SBATCH --constraint='geforce_rtx_2080_ti&titan_rtx'` . '''OR (|)''': `#SBATCH --constraint='titan_rtx|titan_xp'` ==== squeue → Show running/waiting jobs ==== |
Line 125: | Line 156: |
{{{ gfreudig@trollo:~/Batch$ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 951 cpu.norma primes.s gfreudig R 0:11 1 arton02 950 cpu.norma primes_4 gfreudig R 0:36 1 arton02 949 cpu.norma primes.s fgtest01 R 1:22 1 arton02 948 gpu.norma primes.s fgtest01 R 1:39 1 artongpu01 gfreudig@trollo:~/Batch$ |
{{{#!highlight console numbers=disable $ squeue JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON) 951 cpu.norma primes.s gfreudig R 0:11 1 arton02 950 cpu.norma primes_4 gfreudig R 0:36 1 arton02 949 cpu.norma primes.s fgtest01 R 1:22 1 arton02 948 gpu.norma primes.s fgtest01 R 1:39 1 artongpu01 |
Line 135: | Line 165: |
{{{ gfreudig@trollo:~/Batch$ squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50 JOBID STATE PARTITION NODELIST(REASON) USER TRES_ALLOC TIME COMMAND 951 RUNNING cpu.normal.32 arton02 gfreudig cpu=1,mem=32G,node=1,billing=1 1:20 /home/gfreudig/BTCH/Slurm/jobs/single/primes.sh 600 950 RUNNING cpu.normal.32 arton02 gfreudig cpu=4,mem=8G,node=1,billing=4 1:45 /home/gfreudig/BTCH/Slurm/jobs/multi/primes_4.sh 600 949 RUNNING cpu.normal.32 arton02 fgtest01 cpu=1,mem=8G,node=1,billing=1 2:31 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600 948 RUNNING gpu.normal artongpu01 fgtest01 cpu=1,mem=8G,node=1,billing=1,gres/gpu=1 2:48 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600 gfreudig@trollo:~/Batch$ }}} * `STATE` is explained in the squeue manpage in section `JOB STATE CODES`, see `man --pager='less +/^JOB\ STATE\ CODES' squeue` for details |
{{{#!highlight console numbers=disable $ squeue --Format=jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50 JOBID STATE PARTITION NODELIST(REASON) USER TRES_ALLOC TIME COMMAND 951 RUNNING cpu.normal arton02 gfreudig cpu=1,mem=32G,node=1,billing=1 1:20 /home/gfreudig/BTCH/Slurm/jobs/single/primes.sh 600 950 RUNNING cpu.normal arton02 gfreudig cpu=4,mem=8G,node=1,billing=4 1:45 /home/gfreudig/BTCH/Slurm/jobs/multi/primes_4.sh 600 949 RUNNING cpu.normal arton02 fgtest01 cpu=1,mem=8G,node=1,billing=1 2:31 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600 948 RUNNING gpu.normal artongpu01 fgtest01 cpu=1,mem=8G,node=1,billing=1,gres/gpu=1 2:48 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600 }}} * `STATE` is explained in the squeue man page in section `JOB STATE CODES`, see `man --pager='less +/^JOB\ STATE\ CODES' squeue` for details |
Line 148: | Line 177: |
alias sq1='squeue -O jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50' | alias sq1='squeue --Format=jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50' |
Line 152: | Line 181: |
==== scancel -> Deleting a job ==== | ⚠ '''Never call squeue from any kind of loop''', i.e. never do `watch squeue`. See `man --pager='less +/^PERFORMANCE' squeue` for an explanation.<<BR>> To monitor your jobs, set the `sbatch` option `--mail-type` to send you notifications. If you absolutely have to see a live display of your jobs, use the `--iterate` option with a value of several seconds: {{{#!highlight bash numbers=disable squeue --user=$USER --iterate=30 }}} ==== squeue → Show job steps ==== Individual [[#srun_.2BIZI_Launch_a_command_as_a_job_step|job steps]] are listed with a specific option: {{{#!highlight bash numbers=disable squeue -s }}} ==== scancel → Deleting a job ==== |
Line 154: | Line 195: |
{{{ | {{{#!highlight console numbers=disable |
Line 158: | Line 199: |
{{{ | {{{#!highlight console numbers=disable |
Line 164: | Line 205: |
==== sinfo -> Show partition configuration ==== | ==== sinfo → Show partition configuration ==== |
Line 168: | Line 209: |
cpu.normal.32* up 2-00:00:00 11 idle arton[01-11] cpu.normal.64 up 2-00:00:00 3 idle arton[09-11] cpu.normal.256 up 2-00:00:00 1 idle arton11 array.normal up 2-00:00:00 10 idle arton[01-10] gpu.normal up 2-00:00:00 1 idle artongpu01 tikgpu.normal up 2-00:00:00 3 mix tikgpu[01-03] tikgpu.mon up 2-00:00:00 3 mix tikgpu[01-03] }}} For normal jobs (single, multicore) you can not choose the partition for the job to run in the `sbatch` command, the partition is selected by the scheduler according to your memory request. Array jobs are put in the `array.normal` partition, [[#GPU_jobs|gpu jobs]] in the `gpu.normal` partition. The following table shows the job memory limits in different partitions:<<BR>> ||'''PARTITION'''||'''Requested Memory'''|| ||cpu.normal.32||< 32 GB|| ||cpu.normal.64||32 - 64 GB|| ||cpu.normal.256||> 64 GB|| ||array.normal||< 32 GB|| ||gpu.normal||< 64 GB|| ||tikgpu.normal||unlimited|| Only a job with a `--mem` request of a maximum of 32 GByte can run in the `cpu.normal.32` partition, which contains all 11 artons.<<BR>> While `tikgpu.normal` has no memory limit itself, the memory on nodes in this partition is not unlimited. The purpose of `tikgpu.mon` is to login interactively for live monitoring of running jobs. ==== sinfo -> Show resources and utilization ==== |
cpu.normal* up 7-00:00:00 10 idle arton[01-11] gpu.normal up 2-00:00:00 1 idle artongpu01 tikgpu.all up 2-00:00:00 7 idle tikgpu[01-07] tikgpu.medium up 2-00:00:00 3 idle tikgpu[01-03] }}} The partition is chosen by the scheduler according to your resource request and memberships in Slurm accounts. The logic can be seen in `/home/sladmitet/slurm/jobsumit.lua`. ==== sinfo → Show resources and utilization ==== |
Line 193: | Line 221: |
Restricting the command to selected partitions allows to show only GPU nodes: {{{#!highlight bash numbers=disable sinfo -Node --partition=tikgpu.normal,gpu.normal --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100 }}} ==== srun -> Start an interactive shell ==== An interactive session on a compute node is possible for short tests, checking your environment or transferring data to the local scratch of a node available under `/scratch_net/arton[0-11]`. An interactive session lasting for 10 minutes on a GPU node can be started with: |
Restricting the command to a selected partition allows to show only GPU nodes: {{{#!highlight bash numbers=disable sinfo -Node --partition=gpu.normal --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100 }}} ==== sinfo → Show available features ==== So-called ''features'' are used to constrain jobs to nodes with different hardware capabilities, typically GPU types. To show currently active features issue the following command sequence: {{{#!highlight bash numbers=disable sinfo --Format nodehost:20,features_act:80 |grep -v '(null)' |awk 'NR == 1; NR > 1 {print $0 | "sort -n"}' }}} An example of feature use can be seen in section [[#Specifying_GPUs_based_on_compute_capability|Specifying GPUs based on compute capability]]. ==== srun → Start an interactive shell ==== An interactive session on a compute node is possible for short tests, checking your environment or transferring data to the local scratch of a node available under `/scratch_net/arton[0-11]`. Such sessions are limited to a maximum run time of 720 minutes (12 hours) regardless of the partition they are sent to. An interactive session lasting for 10 minutes on a GPU node can be started with: |
Line 211: | Line 249: |
==== srun -> Monitoring tik jobs ==== The following information is only applicable for institute-owned nodes.<<BR>> The partition `tikgpu.mon` is available to monitor jobs interactively on a specific node: {{{#!highlight bash numbers=disable srun --time 10 --partition=tikgpu.mon --nodelist=tikgpu01 --pty bash -i }}} ==== srun -> Selecting artongpu01 ==== The following information is only applicable for members of institutes with institute-owned nodes. To specifically run a GPU job on `artongpu01`, the node has to be selected with the `--nodelist` parameter as in the following example: {{{#!highlight bash numbers=disable srun --time 10 --nodelist=artongpu01 --gres=gpu:1 --pty bash -i }}} ==== srun -> Launch a command as a job step ==== |
==== srun → Attaching an interactive shell to a running job ==== An interactive shell can be opened inside a running job by specifying its job id: {{{#!highlight bash numbers=disable srun --time 10 --jobid=123456 --overlap --pty bash -i }}} A typical use case of the above is interactive live-monitoring of a running job. ==== srun → Launch a command as a job step ==== |
Line 233: | Line 265: |
srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" & srun --exclusive --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT > "${SLURM_JOB_ID}.processlog" & |
srun --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" & srun --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT > "${SLURM_JOB_ID}.processlog" & |
Line 240: | Line 272: |
==== sstat -> Display status information of a running job ==== | ==== sstat → Display status information of a running job ==== |
Line 243: | Line 275: |
sstat --jobs=<JOBID> --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15 | sstat --jobs=<JOBID> --allsteps --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15 |
Line 248: | Line 280: |
==== sacct -> Display accounting information of past jobs ==== |
The resource usage of all your currently running jobs can be shown with: {{{#!highlight bash numbers=disable sstat --jobs=$(squeue --noheader --me --format=%A |paste -s -d ',') --allsteps --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15 }}} ==== sacct → Display accounting information of past jobs ==== |
Line 255: | Line 291: |
==== sprio → Show priorities of pending jobs ==== Pending jobs are prioritized by the scheduler by accounting for fair sharing of resources and age of a pending job. Priorities of pending jobs and the factors comprising them can be shown with {{{#!highlight bash numbers=disable sprio --long }}} On job submission, a nice value can be added to influence priorities of your own jobs: {{{#!highlight bash numbers=disable sbatch --nice=10 job_script.sh }}} The nice value of an already pending job can be incremented with positive values: {{{#!highlight bash numbers=disable scontrol update job <jobid> nice=5 }}} Only incrementation is possible. The value can be reset to `nice=0`. The official Slurm manual contains a detailed explanation of [[https://slurm.schedmd.com/priority_multifactor.htm|job prioritization]]. ==== smon → GPU / CPU availability ==== Information about the GPU nodes and current availability of the installed GPUs as well as CPU availability of CPU-only nodes is updated every 5 minutes to the file `/home/sladmitet/smon.txt`. Here are some convenient aliases to display the file with highlighting of either free GPUs or those running the current user's jobs: {{{#!highlight bash numbers=disable alias smon_free="grep --color=always --extended-regexp 'free|$' /home/sladmitet/smon.txt" alias smon_mine="grep --color=always --extended-regexp '${USER}|$' /home/sladmitet/smon.txt" }}} For monitoring its content, the following aliases can be used: {{{#!highlight bash numbers=disable alias watch_smon_free="watch --interval 300 --no-title --differences --color \"grep --color=always --extended-regexp 'free|$' /home/sladmitet/smon.txt\"" alias watch_smon_mine="watch --interval 300 --no-title --differences --color \"grep --color=always --extended-regexp '${USER}|$' /home/sladmitet/smon.txt\"" }}} ⚠ Never use `watch` directly on `smon`, as this places considerable load on the Slurm controller!
|
Line 256: | Line 324: |
To select the GPU allocated by the scheduler, slurm sets the environment variable `CUDA_VISIBLE_DEVICES` in the context of a job. It is '''imperative''' to work with this variable exactly as it is set by slurm, anything else leads to usage of wrong GPU(s) which might be in use by jobs of other users.<<BR>> For details, see the section [[https://slurm.schedmd.com/gres.html#GPU_Management|GPU Management]] in the offical slurm documentation. === GPU numbering === The numbering of GPUs can be confusing as it is non-uniform across different sources of information. One source of information is the so-called ''PCI bus number'', the other is the ''PCI device minor number''. They are generated differently and although their order might match, this cannot be taken for granted! ==== By PCI bus number: CUDA_VISIBLE_DEVICES ==== The environment variable `CUDA_DEVICE_ORDER` controls the numbering of GPUs in a CUDA context. It's default is `FASTEST_FIRST`, which sets the fastest available GPU to be the number 0 in `CUDA_VISIBLE_DEVICES`.<<BR>> For details, see the section [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars|CUDA Environment Variables]] in the CUDA toolkit documentation.<<BR>> As long as a node only has one type of GPUs installed, this numbering can be identical to the ordering enforced by setting `CUDA_DEVICE_ORDER=PCI_BUS_ID`. ==== By PCI device minor number: nvidia-smi/NVML ==== The command `nvidia-smi` which uses the [[https://developer.nvidia.com/nvidia-management-library-nvml|Nvidia Management Library]] (NVML) numbers GPUs based on the enumeration by the kernel driver. As this can change between node reboots it should not be used as a constant value.<<BR>> For details see the related section in the `nvidia-smi` man page by issuing the command `man --pager='less +/--id=ID' nvidia-smi` in your shell.<<BR>> A GPU can consistently be detected by its UUID or PCI bus ID as follows: {{{#!highlight bash numbers=disable nvidia-smi -q |grep -E '(GPU UUID|Minor Number|Bus Id)\s+:' |paste - - - |column -t }}} ==== By PCI device minor number: Operating system/Kernel driver ==== The GPU ID used by the operating system in /dev/nvidia[0..n] is based on the ''PCI device minor number''. This number is generated by the kernel driver in a non-transparent way, it can change after a reboot.<<BR>> A GPU can consistently be detected by its UUID or PCI bus ID as follows: {{{#!highlight bash numbers=disable grep -h -E '(GPU UUID|Device Minor|Bus Location):' /proc/driver/nvidia/gpus/*/information |paste - - - |column -t }}} |
==== Selecting the allocated GPUs ==== To select the GPU allocated by the scheduler, Slurm sets the environment variable `CUDA_VISIBLE_DEVICES` in the context of a job to the GPUs allocated to the job. The numbering always starts at 0 and runs consecutively up to the requested number of GPUs - 1.<<BR>> It is '''imperative''' to work with this variable exactly as it is set by Slurm; anything else leads to unexpected errors.<<BR>> For details see the section [[https://slurm.schedmd.com/gres.html#GPU_Management|GPU Management]] in the official Slurm documentation. ==== Specifying a GPU type ==== It's possible to specify a GPU type by inserting the type description in the gres allocation: {{{#!highlight bash numbers=disable --gres=gpu:titan_rtx:1 }}} Available GPU type descriptions can be filtered from an appropriate `sinfo` command: {{{#!highlight bash numbers=disable sinfo --noheader --Format gres:200 |tr ':' '\n' |sort -u |grep -vE '^(gpu|[0-9,\(]+)' }}} Multiple GPU types can be requested by using the [[#sbatch_.2BIZI_Common_options|--constraint]] option. ==== Specifying GPUs based on compute capability ==== CUDA code compiled with `nvcc` can be optimized for ranges of so-called [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability|Compute Capabilities]] defining `generation.version` of an NVIDIA GPU.<<BR>> More information about compute capabilities can be read here: * [[https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html#gpu-feature-list|GPU feature list]] * [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capabilities|Detailed features contained in each capability]] The following table shows an abbreviated list of the compute capabilities of available GPU types selectable by [[#sinfo_.2BIZI_Show_available_features|features]]: ||'''Compute capability'''||'''Features'''|| ||3.5||tesla_k40c|| ||5.2||geforce_gtx_titan_x|| ||6.1||geforce_gtx_1080_ti, titan_x, titan_xp|| ||7.0||tesla_v100|| ||7.5||geforce_rtx_2080_ti, titan_rtx|| ||8.0||a100|| ||8.6||geforce_rtx_3090, a6000|| For more information see the [[https://developer.nvidia.com/cuda-gpus|full list of NVIDIA CUDA GPUs]].<<BR>> As GPU nodes may house different generations of GPUs, compiled CUDA code might not run on all of them and errors similar to the following can appear: {{{ RuntimeError: CUDA error: no kernel image is available for execution on the device }}} If you see a similar error: * Note the type of GPU on which your job failed * In the table above, note the features with a higher compute capability than the GPU type you noted before * Check the list of [[https://computing.ee.ethz.ch/Services/SLURM#sinfo_.2BIZI_Show_available_features|available features]] to exclude non-existing GPU types * Build a [[#sbatch_.2BIZI_Common_options|constraint]] to include only GPUs with supported compute capabilities Example: Add the constraint `--constraint='tesla_v100|geforce_rtx_2080_ti|titan_rtx|geforce_rtx_3090'` to your job submission to run a job only on nodes with GPUs of compute capability `7.0` or higher.
Line 283: | Line 372: |
A modern linux kernel is able to bind a process and all its children to a fixed number of cores. By default a job submitted to the SLURM arton grid is bound to to the numbers of requested cores/cpus. The default number of requested cpus is 1, if you have an application which is able to run multithreaded on several cores you must use the '''--cpus-per-task''' option in the sbatch command to get a binding to more than one core. To check for processes with core bindings, use the command `hwloc-ps -c`: {{{ gfreudig@trollo:~/Batch$ ssh arton02 hwloc-ps -c |
A modern Linux kernel is able to bind a process and all its children to a fixed number of cores. By default a job submitted to the Slurm arton cluster is bound to the number of requested cores/cpus. The default number of requested cpus is 1; if you have an application which is able to run multithreaded on several cores, you must use the '''--cpus-per-task''' option in the sbatch command to get a binding to more than one core. To check for processes with core bindings, use the command `hwloc-ps -c`: {{{ $ ssh arton02 hwloc-ps -c
Line 289: | Line 378: |
gfreudig@trollo:~/Batch$ | |
Line 293: | Line 381: |
Temporary data storage of a job used only while the job is running, should be placed in the `/scratch` directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab `MCR_ROOT_CACHE` variable is set automatically by the SLURM scheduler.<<BR>> The file system protection of the `/scratch` directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the `/scratch` directory from filling up and cleans it governed by pre-set policies. Therefore data you place in the `/scratch` directory of a compute node cannot be assumed to stay there forever.<<BR>><<BR>> Small sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the `/home` automounter.<<BR>> Larger amounts of data should be placed in your personal [[Services/NetScratch|netscratch]] folder and can be accessed on all compute nodes.<<BR>><<BR>> If you have problems with the quota limit in your home directory you could transfer data from your home or the `/scratch` directory of your submit host to the `/scratch` directories of the arton compute nodes and vice versa. For this purpose interactive logins with personal accounts are allowed on `arton01`. All `/scratch` directories of the compute nodes are available on `arton01` through the `/scratch_net` automount system. You can access the `/scratch` directory of `arton<nn>` under `/scratch_net/arton<nn>`. This allows you to transfer data between the `/scratch_net` directories and your home with normal linux file copy and to the `/scratch` of your submission host with scp.<<BR>><<BR>> Do not log in on `arton01` to run compute jobs interactively. Such jobs will be detected by our procguard system and killed.. Other data storage concepts for the arton grid are possible and will be investigated, if the above solution proves not to be sufficient.<<BR>> |
Temporary data of a job, used only while the job is running, should be placed in the `/scratch` directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab `MCR_ROOT_CACHE` variable is set automatically by the Slurm scheduler.<<BR>> The file system protection of the `/scratch` directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts to prevent the `/scratch` directory from filling up and cleans it governed by pre-set policies. Therefore data you place in the `/scratch` directory of a compute node cannot be assumed to stay there forever. * Small sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the `/home` automounter. * Larger amounts of data should be placed in your personal [[Services/NetScratch|netscratch]] folder and can be accessed on all compute nodes. * If you have problems with the quota limit in your home directory, you could transfer data from your home or the `/scratch` directory of your submit host to the `/scratch` directories of the arton compute nodes and vice versa. All `/scratch` directories of the compute nodes are available through the `/scratch_net` automount system. You can access the `/scratch` directory of `arton<nn>` under `/scratch_net/arton<nn>`. This allows you to transfer data between the `/scratch_net` directories and your home with normal Linux file copy and to the `/scratch` of your submission host with scp, for example from an [[#srun_.2BIZI_Start_an_interactive_shell|interactive session]] on any node. Other data storage concepts for the arton cluster are possible and will be investigated if the above solution proves insufficient.
Line 304: | Line 391: |
The Matlab '''P'''arallel '''C'''omputing '''T'''oolbox (PCT) can be configured with an interface to the SLURM cluster. To work with MDCE please import [[attachment:Slurm.mlsettings]] in Matlab GUI (Parallel -> Create and manage Clusters -> Import ). Adjust the setting "JobStorageLocation" to your requirements. The cluster profile Slurm will now appear besides the standard local(default) profile in the profile list. With the local profile, you can use as many workers on one computer as there are physical cores while the Slurm profile allows to initiate up to 32 worker processes distributed over all slurm compute nodes.<<BR>> /!\ Please temporay reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.<<BR>> /!\ Don't forget to set the Slurm environment variables before starting Matlab!<<BR>> |
The Matlab '''P'''arallel '''C'''omputing '''T'''oolbox (PCT) can be configured with an interface to the Slurm cluster. To work with MDCE, please import [[attachment:Slurm.mlsettings]] in the Matlab GUI (Parallel → Create and manage Clusters → Import). Adjust the setting ''!JobStorageLocation'' to your requirements. The cluster profile Slurm will now appear beside the standard local (default) profile in the profile list. With the local profile, you can use as many workers on one computer as there are physical cores, while the Slurm profile allows initiating up to 32 worker processes distributed over all Slurm compute nodes.<<BR>> ⚠ Please temporarily reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.<<BR>> ⚠ Don't forget to set the Slurm environment variables before starting Matlab!<<BR>> |
Line 308: | Line 395: |
/!\ In interactive mode please always close your parpool if you aren't performing any calculations on the workers.<<BR>> | ⚠ In interactive mode please always close your parpool if you aren't performing any calculations on the workers.<<BR>> |
Line 310: | Line 397: |
=== Reservations === Nodes may be reserved at certain times for courses or maintenance. If your job is pending with the reason `ReqNodeNotAvail,_May_be_reserved_for_other_job`, check reservations and adjust the `--time` parameter of your job accordingly. ==== Showing current reservations ==== Current reservations can be shown by issuing {{{#!highlight bash numbers=disable scontrol show reservation }}} ==== Using a reservation ==== If you are entitled to use a reservation, specify the reservation in your job submission by appending the parameter `--reservation=<ReservationName>`. ==== Requesting a reservation ==== Reservations are managed by Slurm administrators. Please contact [[mailto:support@ee.ethz.ch|ISG D-ITET support]] if you're in need of a reservation. |
|
Line 322: | Line 425: |
your account hasn't been registered with Slurm yet. Please contact [[mailto:support@ee.ethz.ch|support]] and ask to be registered. ==== Code runs with srun but fails with sbatch ==== This points to a difference in the environment of the interactive job and the batch job. By default both `srun` and `sbatch` forward the complete current submission environment to a job (See `man --pager='less +/--export=' srun`, `man --pager='less +/--export=' sbatch`) and also change the working directory to the current path of the submission environment (See `man --pager='less +/--chdir=' srun`, `man --pager='less +/--chdir=' sbatch`). Compare the output of `printenv` from jobs started with `srun` and `sbatch` to figure out the differences: * Environment variables starting with `SLURM_` are set by Slurm * `HOSTNAME` reflects the node a job runs on * `ENVIRONMENT=BATCH` is set by `sbatch`. Differences have to come from: * Anything done interactively in the submission session after submitting the job with `srun` and before submitting it with `sbatch` * Or the job submit script used with `sbatch`.
Line 325: | Line 444: |
After executing one of the slurm executables like `sbatch` or `sinfo` the following error appears: | After executing one of the Slurm executables like `sbatch` or `sinfo` the following error appears: |
Line 329: | Line 448: |
The user `slurm` doesn't exist on the host you're running your Slurm executable on. If this happens on a host managed by ISG D-ITET, please contact [[mailto:support@ee.ethz.ch|support]], tell us the name of your host and ask us to configure it as a Slurm submission host. |
Line 340: | Line 459: |
Nodes are set to drain by ISG.EE to empty them of jobs in time for scheduled maintenance or by the scheduler itself in case a problem is detected on a node. | Nodes are set to drain by ISG D-ITET to empty them of jobs in time for scheduled maintenance or by the scheduler itself in case a problem is detected on a node. ==== <Slurm command>: fatal: Could not establish a configuration source ==== If you receive the following error messages after using a Slurm command like `srun`, `sbatch`, `squeue` or `sinfo` (replace `squeue` with the name of the command you used to end up with the error message): {{{ squeue: fatal: Could not establish a configuration source }}} Make sure you [[#Setting_environment|set your environment]]. ==== My job was terminated by the OOM killer ==== If your job got terminated and you see a line similar to the following in your job log: {{{ slurmstepd: error: Detected 1 oom-kill event(s) in StepId=<JOB_ID>.batch cgroup. ... }}} this means a process in your job attempted to use more memory than you requested for the job, so it was killed by the OOM (__O__ut __O__f __M__emory) killer. This in turn resulted in termination of your job by the Slurm scheduler.<<BR>> Check the value of `MaxRSS` in the output of [[#sacct_.2BIZI_Display_accounting_information_of_past_jobs|sacct]] for your job to verify the maximum memory usage of your job. Run tests by adjusting your memory allocation with [[#sbatch_.2BIZI_Common_options|--mem]] until you figure out how much memory your job needs.<<BR>> Slurm's accounting samples jobs every 30 seconds, so there is no useful data if a job was killed within the first 30 seconds after it started. Also sudden spikes in memory consumption may not be recorded, but can still trigger the OOM killer.<<BR>> ⚠ This is error pertains to the onboard memory allocated to a job, a GPU allocation always contains its full GPU memory. |
Contents
- Introduction
- Slurm
- Slurm Cluster
- Hardware
- Software
- Using Slurm
- Setting environment
- sbatch → Submitting a job
- sbatch → Submitting an array job
- sbatch → Common options
- squeue → Show running/waiting jobs
- squeue → Show job steps
- scancel → Deleting a job
- sinfo → Show partition configuration
- sinfo → Show resources and utilization
- sinfo → Show available features
- srun → Start an interactive shell
- srun → Attaching an interactive shell to a running job
- srun → Launch a command as a job step
- sstat → Display status information of a running job
- sacct → Display accounting information of past jobs
- sprio → Show priorities of pending jobs
- smon → GPU / CPU availability
- GPU jobs
- Multicore jobs/ job to core binding
- Job input/output data storage
- Matlab Distributed Computing Environment (MDCE)
- Reservations
- Frequently Asked Questions
Introduction
At D-ITET the Slurm job scheduling system can be used for running compute-intensive jobs. It consists of a master host, where the scheduler resides, and the compute nodes, where batch jobs are executed. The compute nodes are powerful servers located in server rooms; they are exclusively reserved for batch processing. Interactive logins are disabled.
Access
Access to the Slurm cluster is reserved for staff of the contributing institutes APS, IBT, IFA, MINS, NARI, TIK and PBL. Access is granted on request.
Contact ISG D-ITET support if your institute is supported by us
Members of an institute supported by ID Services for Departments (S4D), use the email address listed for your institute instead
If your circumstances differ and you'd still like to use the cluster, please contact ISG D-ITET support as well and ask for an offer. Time-limited test accounts for up to 2 weeks are also available on request.
Additional information for institutes
Some institutes have additional setup and configuration, if you are a member of such an institute, make sure to read the information linked below after reading this article:
CVL uses its own Slurm cluster; please read its documentation for access and specific additional information to this article.
TIK owns nodes in the Slurm cluster, please read the additional information about those nodes and access.
PBL student supervisors can apply for access for their students.
Slurm
Slurm (Simple Linux Utility for Resource Management) is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and compute clusters. Slurm's design is very modular with about 100 optional plugins. In 2010, the developers of Slurm founded SchedMD, which maintains the canonical source, provides development, level 3 commercial support and training services and also provides very good online documentation for Slurm.
Slurm Cluster
Hardware
At the moment the computing power of the Slurm cluster is based on the following 11 cpu compute nodes and 1 gpu compute node:
Server       | CPU                                   | Frequency | Cores | Memory | /scratch SSD | /scratch Size | GPUs          | GPU Memory | Operating System
arton[01-03] | Dual Octa-Core Intel Xeon E5-2690     | 2.90 GHz  | 16    | 125 GB | -            | 895 GB        | -             | -          | Debian 10
arton[04-08] | Dual Deca-Core Intel Xeon E5-2690 v2  | 3.00 GHz  | 20    | 125 GB | -            | 895 GB        | -             | -          | Debian 10
arton[09-10] | Dual Deca-Core Intel Xeon E5-2690 v2  | 3.00 GHz  | 20    | 251 GB | ✓            | 1.7 TB        | -             | -          | Debian 10
arton11      | Dual Deca-Core Intel Xeon E5-2690 v2  | 3.00 GHz  | 20    | 535 GB | ✓            | 1.7 TB        | -             | -          | Debian 10
artongpu01   | Dual Octa-Core Intel Xeon Silver 4208 | 2.10 GHz  | 16    | 125 GB | ✓            | 1.1 TB        | 4 RTX 2080 Ti | 11 GB      | Debian 10
Memory shows the amount available to Slurm
The nodes are "weighted", which gives the scheduler an additional selection criteria between nodes which fulfill criterias to run a job, like resources and membership in certain partitions. The idea is to prefer nodes with faster CPUs and of those, prefer those with lower RAM. For details, see /home/sladmitet/slurm/nodes.conf.
The Slurm job scheduler runs on the Linux server itetmaster01.
Software
The nodes offer the same software environment as all D-ITET managed Linux clients; gpu nodes have a restricted software set (no desktops installed, only the minimal dependencies needed for driver support).
Using Slurm
At a basic level, Slurm is very easy to use. The following sections will describe the commands you need to run and manage your batch jobs. The commands that will be most useful to you are as follows:
sbatch - submit a job to the batch scheduler
squeue - examine running and waiting jobs
sinfo - show the status of the compute nodes
scancel - delete a running job
Setting environment
The above commands only work if the environment variables for Slurm are set. Please issue the following command in your bash shell to start working with the cluster immediately or add it to your ~/.bashrc to reference the Slurm cluster for new instances of bash:
export SLURM_CONF=/home/sladmitet/slurm/slurm.conf
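For example, to make this setting permanent for future shells (assuming bash is your login shell) and to check that the cluster answers, something along these lines can be used:
echo 'export SLURM_CONF=/home/sladmitet/slurm/slurm.conf' >> ~/.bashrc   # one-time setup for new shells
source ~/.bashrc                                                         # apply it to the current shell
sinfo --summarize                                                        # quick check that the scheduler responds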
sbatch → Submitting a job
sbatch doesn't allow submitting a binary program directly; wrap the program to run in a surrounding bash script. The sbatch command has the following syntax:
> sbatch [temporary_options] job_script [job_script arguments]
The job_script is a standard UNIX shell script. The fixed options for the Slurm Scheduler are placed in the job_script in lines starting with #SBATCH. The UNIX shell interprets these lines as comments and ignores them.
Put options into the job_script for easier reference. Place only temporary options outside the job_script as options to the sbatch command.
Make sure to create the directories you intend to store logfiles in before submitting the job_script
- Use absolute paths in your scripts to ensure your log files and commands are found
Make sure the paths you use in your scripts are available on cluster nodes
To test your job_script, simply run it interactively on your host.
Assume there is a C program primes.c which is compiled to an executable binary with "gcc -o primes primes.c" and stored as /absolute/path/to/primes. The program runs for 5 seconds and calculates prime numbers. The found prime numbers and a final summary are sent to standard output. A sample job_script placed in the same location /absolute/path/to/primes.sh to perform a batch run of the binary primes on the Arton cluster looks like this:
#!/bin/bash
#SBATCH --mail-type=ALL # mail configuration: NONE, BEGIN, END, FAIL, REQUEUE, ALL
#SBATCH --output=/absolute/path/to/log/%j.out # where to store the output (%j is the JOBID), subdirectory "log" must exist
#SBATCH --error=/absolute/path/to/log/%j.err # where to store error messages
# Exit on errors
set -o errexit
# Set a directory for temporary files unique to the job with automatic removal at job termination
TMPDIR=$(mktemp -d)
if [[ ! -d ${TMPDIR} ]]; then
echo 'Failed to create temp directory' >&2
exit 1
fi
trap "exit 1" HUP INT TERM
trap 'rm -rf "${TMPDIR}"' EXIT
export TMPDIR
# Change the current directory to the location where you want to store temporary files; exit if changing didn't succeed.
# Adapt this to your personal preference
cd "${TMPDIR}" || exit 1
# Send some noteworthy information to the output log
echo "Running on node: $(hostname)"
echo "In directory: $(pwd)"
echo "Starting on: $(date)"
echo "SLURM_JOB_ID: ${SLURM_JOB_ID}"
# Binary or script to execute
/absolute/path/to/primes
# Send more noteworthy information to the output log
echo "Finished at: $(date)"
# End the script with exit code 0
exit 0
You can test the script by running it interactively in a terminal:
$ /absolute/path/to/primes.sh
If the script runs successfully you can now submit it as a batch job to the Slurm arton cluster:
$ sbatch /absolute/path/to/primes.sh
sbatch: Start executing function slurm_job_submit......
sbatch: Job partition set to : cpu.normal
Submitted batch job 931
After the job has finished, you will find the output file of the job in the file /absolute/path/to/log/<JOBID>.out. If there were errors, they are stored in the file /absolute/path/to/log/<JOBID>.err.
⚠ Remember: The directory for the job output has to exist before submitting the job; it is not created automatically!
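A minimal submission sequence, reusing the placeholder paths from the example above (adapt them to your own directories):
mkdir -p /absolute/path/to/log          # create the directory referenced by --output/--error
sbatch /absolute/path/to/primes.sh      # submit; sbatch prints the job ID
cat /absolute/path/to/log/<JOBID>.out   # inspect the output once the job has finished (replace <JOBID>)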
You can only submit jobs to Slurm if your account is configured in the Slurm user database. If it isn't, you'll receive this error message
sbatch → Submitting an array job
It is also possible to start an array job. The above job would run 10 times if you added the option #SBATCH --array=0-9 to the job-script. A repeated execution only makes sense if the executed program adapts its behaviour according to the changing array task count number. The array count number can be referenced through the variable $SLURM_ARRAY_TASK_ID. You can pass the value of $SLURM_ARRAY_TASK_ID or some derived parameters to the executable.
Here is a simple example of passing an input filename parameter changing with $SLURM_ARRAY_TASK_ID to the executable:
#SBATCH --array=0-9
#
# binary to execute
<path-to-executable> data$SLURM_ARRAY_TASK_ID.dat
Every run of the program in the array job with a different task-id will produce a separate output file.
The option expects a range of task-ids expressed in the form --array=n[,k[,...]][-m[:s]]%l
where n, k, m are discrete task IDs, s is a step applied to a range n-m and l applies a limit to the number of simultaneously running tasks. See man sbatch for examples.
Specifying one task-id instead of a range as in --array=10 results in an array job with a single task with task-id 10.
The following variables will be available in the job context and reflect the option arguments given: $SLURM_ARRAY_TASK_MAX, $SLURM_ARRAY_TASK_MIN, $SLURM_ARRAY_TASK_STEP.
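Putting these pieces together, a sketch of a complete array job script could look as follows; the paths and data file names are placeholders taken from the examples above:
#!/bin/bash
#SBATCH --mail-type=END,FAIL
#SBATCH --output=/absolute/path/to/log/%A_%a.out  # %A = array job ID, %a = array task ID
#SBATCH --error=/absolute/path/to/log/%A_%a.err
#SBATCH --array=0-9%4                             # tasks 0-9, at most 4 running simultaneously

# Each task processes its own input file derived from the task ID
<path-to-executable> data${SLURM_ARRAY_TASK_ID}.dat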
sbatch → Common options
The following table shows the most common options available for sbatch to be used in the job_script in lines starting with #SBATCH
option                      | description
--mail-type=...             | Possible Values: NONE, BEGIN, END, FAIL, REQUEUE, ALL
--mem=<n>G                  | the job needs a maximum of <n> GByte (if omitted the default of 6G is used)
--cpus-per-task=<n>         | number of cores to be used for the job
--gres=gpu:1                | number of GPUs needed for the job
--nodes=<n>                 | number of compute nodes to be used for the job
--hint=<type>               | Bind tasks to CPU cores according to application hints (See man --pager='less +/--hint' srun and multi-core support)
--constraint=<feature_name> | Request one or more features, optionally combined by operators (See man --pager='less +/--constraint' sbatch)
⚠ The --nodes option should only be used for MPI jobs !
The operators to combine --constraint lists are:
AND (&): #SBATCH --constraint='geforce_rtx_2080_ti&titan_rtx'
OR (|): #SBATCH --constraint='titan_rtx|titan_xp'
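As an illustration of how these options combine, the header of a hypothetical GPU job script could look like this (the values are examples only; adjust them to the real needs of your job):
#!/bin/bash
#SBATCH --mail-type=END,FAIL
#SBATCH --mem=16G                                     # up to 16 GByte of memory
#SBATCH --cpus-per-task=4                             # 4 cores for a multithreaded program
#SBATCH --gres=gpu:1                                  # one GPU
#SBATCH --constraint='geforce_rtx_2080_ti|titan_rtx'  # accept either GPU type
#SBATCH --output=/absolute/path/to/log/%j.out
#SBATCH --error=/absolute/path/to/log/%j.err

/absolute/path/to/my_program                          # placeholder executable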
squeue → Show running/waiting jobs
The squeue command shows the actual list of running and pending jobs in the system. As you can see in the following sample output the default format is quite minimalistic:
$ squeue
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
951 cpu.norma primes.s gfreudig R 0:11 1 arton02
950 cpu.norma primes_4 gfreudig R 0:36 1 arton02
949 cpu.norma primes.s fgtest01 R 1:22 1 arton02
948 gpu.norma primes.s fgtest01 R 1:39 1 artongpu01
More detailed information can be obtained by issuing the following command:
$ squeue --Format=jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50
JOBID STATE PARTITION NODELIST(REASON) USER TRES_ALLOC TIME COMMAND
951 RUNNING cpu.normal arton02 gfreudig cpu=1,mem=32G,node=1,billing=1 1:20 /home/gfreudig/BTCH/Slurm/jobs/single/primes.sh 600
950 RUNNING cpu.normal arton02 gfreudig cpu=4,mem=8G,node=1,billing=4 1:45 /home/gfreudig/BTCH/Slurm/jobs/multi/primes_4.sh 600
949 RUNNING cpu.normal arton02 fgtest01 cpu=1,mem=8G,node=1,billing=1 2:31 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600
948 RUNNING gpu.normal artongpu01 fgtest01 cpu=1,mem=8G,node=1,billing=1,gres/gpu=1 2:48 /home/fgtest01/BTCH/Slurm/jobs/single/primes.sh 600
STATE is explained in the squeue man page in section JOB STATE CODES, see man --pager='less +/^JOB\ STATE\ CODES' squeue for details
REASON is explained there as well in section JOB REASON CODES, see man --pager='less +/^JOB\ REASON\ CODES' squeue
Defining an alias in your .bashrc with
alias sq1='squeue --Format=jobarrayid:10,state:10,partition:16,reasonlist:18,username:10,tres-alloc:45,timeused:8,command:50'
puts the command sq1 at your fingertips.
⚠ Never call squeue from any kind of loop, i.e. never do watch squeue. See man --pager='less +/^PERFORMANCE' squeue for an explanation.
To monitor your jobs, set the sbatch option --mail-type to send you notifications. If you absolutely have to see a live display of your jobs, use the --iterate option with a value of several seconds:
squeue --user=$USER --iterate=30
squeue → Show job steps
Individual job steps are listed with a specific option:
squeue -s
scancel → Deleting a job
With scancel you can remove your waiting and running jobs from the scheduler queue by their associated JOBID. The command squeue lists your jobs including their JOBIDs. A job can then be deleted with
> scancel <JOBID>
To operate on an array job you can use the following commands
> scancel <JOBID> # all jobs (waiting or running) of the array job are deleted
> scancel <JOBID>_n # the job with task-ID n is deleted
> scancel <JOBID>_[n1-n2] # the jobs with task-ID in the range n1-n2 are deleted
sinfo → Show partition configuration
The partition status can be obtained by using the sinfo command. An example listing is shown below.
PARTITION      AVAIL  TIMELIMIT   NODES  STATE  NODELIST
cpu.normal*    up     7-00:00:00  10     idle   arton[01-11]
gpu.normal     up     2-00:00:00  1      idle   artongpu01
tikgpu.all     up     2-00:00:00  7      idle   tikgpu[01-07]
tikgpu.medium  up     2-00:00:00  3      idle   tikgpu[01-03]
The partition is chosen by the scheduler according to your resource request and your memberships in Slurm accounts. The logic can be seen in /home/sladmitet/slurm/job_submit.lua.
sinfo → Show resources and utilization
Adding selected format parameters to the sinfo command shows the resources available on every node and their utilization:
sinfo -Node --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100 |(sed -u 1q; sort -u)
Restricting the command to a selected partition shows only the GPU nodes:
sinfo -Node --partition=gpu.normal --Format nodelist:12,statecompact:7,memory:7,allocmem:10,freemem:10,cpusstate:15,cpusload:10,gresused:100
sinfo → Show available features
So-called features are used to constrain jobs to nodes with different hardware capabilities, typically GPU types. To show currently active features issue the following command sequence:
sinfo --Format nodehost:20,features_act:80 |grep -v '(null)' |awk 'NR == 1; NR > 1 {print $0 | "sort -n"}'
An example of feature use can be seen in section Specifying GPUs based on compute capability.
srun → Start an interactive shell
An interactive session on a compute node is possible for short tests, checking your environment or transferring data to the local scratch of a node available under /scratch_net/arton[0-11]. Such sessions are limited to a maximum run time of 720 minutes (12 hours) regardless of the partition they are sent to.
An interactive session lasting for 10 minutes on a GPU node can be started with:
srun --time 10 --gres=gpu:1 --pty bash -i
The output will look similar to the following:
srun: Start executing function slurm_job_submit......
srun: Your job is a gpu job.
srun: Setting partition to gpu.normal
srun: job 11526 queued and waiting for resources
Omitting the parameter --gres=gpu:1 opens an interactive session on a CPU-only node.
Do not use an interactive login to run compute jobs; use it only briefly as outlined above. Restrict the job time to the necessary minimum with the --time option as shown above. For details see the related section in the srun man page by issuing the command man --pager='less +/--time' srun in your shell.
srun → Attaching an interactive shell to a running job
An interactive shell can be opened inside a running job by specifying its job id:
srun --time 10 --jobid=123456 --overlap --pty bash -i
A typical use case of the above is interactive live-monitoring of a running job.
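For example, to check the GPU and process activity of job 123456 from inside its allocation (a hedged sketch):
srun --time 10 --jobid=123456 --overlap --pty bash -i
# inside the attached shell:
nvidia-smi
top -u ${USER}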
srun → Launch a command as a job step
When srun is used inside an sbatch script it spawns the given command inside a job step. This allows resource monitoring with the sstat command (see man sstat). Spawning several single-threaded commands and putting them in the background allows scheduling these commands inside the job allocation.
Here's an example of how to run overall GPU logging and per-process logging in job steps before starting the actual computing commands; a complete sketch follows after this excerpt.
...
set -o errexit
srun --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" &
srun --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT > "${SLURM_JOB_ID}.processlog" &
...
echo "Finished at: $(date)"
exit 0
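Put together, a complete job script using this pattern could look like the following hedged sketch; the resource values and the program name my_program are assumptions:
#!/bin/bash
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --output=log/%j.out

set -o errexit

# Background job steps logging overall GPU and per-process activity:
srun --ntasks=1 --cpus-per-task=1 nvidia-smi dmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s ucm -o DT > "${SLURM_JOB_ID}.gpulog" &
srun --ntasks=1 --cpus-per-task=1 nvidia-smi pmon -i ${CUDA_VISIBLE_DEVICES} -d 5 -s um -o DT > "${SLURM_JOB_ID}.processlog" &

# The actual compute command as its own job step, monitorable with sstat:
srun --ntasks=1 --cpus-per-task=2 ./my_program

echo "Finished at: $(date)"
exit 0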
sstat → Display status information of a running job
The status information shows your job's resource usage while it is running:
sstat --jobs=<JOBID> --allsteps --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15
AveVMSize: Average virtual memory of all tasks in the job
MaxRSS: Peak memory usage of all tasks in the job
AveCPU: Average CPU time of all tasks in the job
All your currently running job's resource usages can be shown with:
sstat --jobs=$(squeue --noheader --me --format=%A |paste -s -d ',') --allsteps --format=JobID,AveVMSize%15,MaxRSS%15,AveCPU%15
sacct → Display accounting information of past jobs
Accounting information for past jobs can be displayed with various details (see man page).
The following example lists all jobs of the logged-in user since the beginning of the year 2020:
sacct --user ${USER} --starttime=2020-01-01 --format=JobID,Start%20,Partition%20,ReqTRES%50,AveVMSize%15,MaxRSS%15,AveCPU%15,Elapsed%15,State%20
sprio → Show priorities of pending jobs
Pending jobs are prioritized by the scheduler by accounting for fair sharing of resources and age of a pending job. Priorities of pending jobs and the factors comprising them can be shown with
sprio --long
On job submission, a nice value can be added to influence priorities of your own jobs:
sbatch --nice=10 job_script.sh
The nice value of an already pending job can be incremented with positive values:
scontrol update job <jobid> nice=5
Only incrementing is possible; the value cannot be decremented, but it can be reset to nice=0.
The official Slurm manual contains a detailed explanation of job prioritization.
smon → GPU / CPU availability
Information about the GPU nodes and current availability of the installed GPUs as well as CPU availability of CPU-only nodes is updated every 5 minutes to the file /home/sladmitet/smon.txt. Here are some convenient aliases to display the file with highlighting of either free GPUs or those running the current user's jobs:
alias smon_free="grep --color=always --extended-regexp 'free|$' /home/sladmitet/smon.txt"
alias smon_mine="grep --color=always --extended-regexp '${USER}|$' /home/sladmitet/smon.txt"
For monitoring its content the following aliases can be used:
alias watch_smon_free="watch --interval 300 --no-title --differences --color \"grep --color=always --extended-regexp 'free|$' /home/sladmitet/smon.txt\""
alias watch_smon_mine="watch --interval 300 --no-title --differences --color \"grep --color=always --extended-regexp '${USER}|$' /home/sladmitet/smon.txt\""
⚠ Never use watch directly on smon, as this places considerable load on the Slurm controller!
GPU jobs
Selecting the allocated GPUs
To select the GPUs allocated by the scheduler, Slurm sets the environment variable CUDA_VISIBLE_DEVICES in the context of a job to the GPUs allocated to the job. The numbering always starts at 0 and runs consecutively up to the number of requested GPUs minus 1.
It is imperative to work with this variable exactly as it is set by Slurm, anything else leads to unexpected errors.
For details see the section GPU Management in the official Slurm documentation.
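A minimal sketch of a job script respecting the variable; nvidia-smi accepts the comma-separated list directly, and most frameworks honour CUDA_VISIBLE_DEVICES automatically:
#!/bin/bash
#SBATCH --gres=gpu:2

# With 2 requested GPUs Slurm sets CUDA_VISIBLE_DEVICES=0,1
echo "Allocated GPUs: ${CUDA_VISIBLE_DEVICES}"

# Do NOT override the variable; tools and frameworks reading it
# will only see the GPUs allocated to this job.
nvidia-smi -i ${CUDA_VISIBLE_DEVICES}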
Specifying a GPU type
It's possible to specify a GPU type by inserting the type description in the gres allocation:
--gres=gpu:titan_rtx:1
Available GPU type descriptions can be filtered from an appropriate sinfo command:
sinfo --noheader --Format gres:200 |tr ':' '\n' |sort -u |grep -vE '^(gpu|[0-9,\(]+)'
Multiple GPU types can be requested by using the --constraint option.
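For example, a hedged sketch of both variants as job-script directives; a line starting with ##SBATCH is ignored by sbatch, so only one variant is active at a time:
# Variant 1: one GPU of a specific type
#SBATCH --gres=gpu:titan_rtx:1

# Variant 2 (disabled here): one GPU of either of two types via --constraint
##SBATCH --gres=gpu:1
##SBATCH --constraint='titan_rtx|titan_xp'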
Specifying GPUs based on compute capability
CUDA code compiled with nvcc can be optimized for ranges of so-called compute capabilities, which define the generation.version of an NVIDIA GPU.
More information about compute capabilities can be found in NVIDIA's CUDA documentation.
The following table shows an abbreviated list of the compute capabilities of available GPU types selectable by features:
||'''Compute capability'''||'''Features'''||
||3.5||tesla_k40c||
||5.2||geforce_gtx_titan_x||
||6.1||geforce_gtx_1080_ti, titan_x, titan_xp||
||7.0||tesla_v100||
||7.5||geforce_rtx_2080_ti, titan_rtx||
||8.0||a100||
||8.6||geforce_rtx_3090, a6000||
For more information see the full list of NVIDIA CUDA GPUs.
As GPU nodes may house different generations of GPUs, compiled CUDA code might not run on all of them and errors similar to the following can appear:
RuntimeError: CUDA error: no kernel image is available for execution on the device
If you see a similar error:
 * Note the type of GPU on which your job failed
 * In the table above, note the features with a higher compute capability than the GPU type you noted before
 * Check the list of available features to exclude non-existing GPU types
 * Build a constraint to include only GPUs with supported compute capabilities
Example: Add the constraint --constraint='tesla_v100|geforce_rtx_2080_ti|titan_rtx|geforce_rtx_3090' to your job submission to run a job only on nodes with GPUs of compute capability 7.0 or higher.
Multicore jobs / job to core binding
A modern Linux kernel is able to bind a process and all its children to a fixed set of cores. By default, a job submitted to the Slurm arton cluster is bound to the number of requested cores/CPUs. The default number of requested CPUs is 1; if you have an application which can run multithreaded on several cores, you must use the --cpus-per-task option of sbatch to get a binding to more than one core (a sketch of such a job script is shown after the example output below). To check for processes with core bindings, use the command hwloc-ps -c:
$ ssh arton02 hwloc-ps -c
43369   0x00010001   slurmstepd: [984.batch]
43374   0x00010001   /bin/sh
43385   0x00010001   codebin/primes
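A hedged sketch of a job script for a multithreaded application follows; OMP_NUM_THREADS applies to OpenMP-based programs, other programs usually offer a comparable thread-count option, and my_threaded_program is a placeholder:
#!/bin/bash
#SBATCH --cpus-per-task=8     # the job is bound to 8 cores
#SBATCH --mem=16G
#SBATCH --output=log/%j.out

# Let the program start as many threads as cores were requested:
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK}

./my_threaded_program
exit 0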
Job input/output data storage
Temporary data of a job, used only while the job is running, should be placed in the /scratch directory of the compute nodes. Set the environment variables of the tools you use accordingly. The Matlab MCR_CACHE_ROOT variable is set automatically by the Slurm scheduler.
The file system protection of the /scratch directory allows everybody to create files and directories in it. A cron job runs periodically on the execution hosts and cleans the /scratch directory according to pre-set policies to prevent it from filling up. Data you place in the /scratch directory of a compute node can therefore not be assumed to stay there forever.
Small sized input and output data for the jobs is best placed in your home directory. It is available on every compute node through the /home automounter.
Larger amounts of data should be placed in your personal netscratch folder and can be accessed on all compute nodes.
If you run into the quota limit of your home directory, you can transfer data between your home or the /scratch directory of your submit host and the /scratch directories of the arton compute nodes. All /scratch directories of the compute nodes are available through the /scratch_net automount system: the /scratch directory of arton<nn> is accessible under /scratch_net/arton<nn>. This allows you to transfer data between the /scratch_net directories and your home with normal Linux file copy, and to the /scratch of your submission host with scp, for example from an interactive session on any node.
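As an illustration, here is a hedged sketch of staging data through the local /scratch of the execution node inside a job script; all paths and the program name my_program are assumptions:
#!/bin/bash
#SBATCH --output=log/%j.out

# Stage input data from home to the local scratch of the execution node:
JOB_SCRATCH="/scratch/${USER}/${SLURM_JOB_ID}"
mkdir -p "${JOB_SCRATCH}"
cp -r "${HOME}/datasets/input" "${JOB_SCRATCH}/"

# Run the computation on the fast local disk:
./my_program --input "${JOB_SCRATCH}/input" --output "${JOB_SCRATCH}/results"

# Copy the results back to home and clean up local scratch:
cp -r "${JOB_SCRATCH}/results" "${HOME}/results_${SLURM_JOB_ID}"
rm -rf "${JOB_SCRATCH}"
exit 0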
Other data storage concepts for the arton cluster are possible and will be investigated, if the above solution proves not to be sufficient.
Matlab Distributed Computing Environment (MDCE)
The Matlab Parallel Computing Toolbox (PCT) can be configured with an interface to the Slurm cluster. To work with MDCE, import Slurm.mlsettings in the Matlab GUI (Parallel → Create and manage Clusters → Import) and adjust the setting JobStorageLocation to your requirements. The cluster profile Slurm will then appear beside the standard local(default) profile in the profile list. With the local profile you can use as many workers on one computer as there are physical cores, while the Slurm profile allows initiating up to 32 worker processes distributed over all Slurm compute nodes.
⚠ Please temporarily reduce the number of workers to 4 in the Slurm profile when performing the profile "Validation" function in the Matlab Cluster Manager.
⚠ Don't forget to set the Slurm environment variables before starting Matlab!
The Slurm cluster profile can be used with Matlab programs running as Slurm batch jobs but it's also possible to use the profile in an interactive Matlab session on your client. When you open a Slurm parpool, the workers are started automatically as jobs in the cluster.
⚠ In interactive mode please always close your parpool if you aren't performing any calculations on the workers.
Sample code for the 3 Matlab PCT methods parfor, spmd, tasks using the local or Slurm cluster profile is provided in PCTRefJobs.tar.gz.
Reservations
Nodes may be reserved at certain times for courses or maintenance. If your job is pending with the reason ReqNodeNotAvail,_May_be_reserved_for_other_job, check reservations and adjust the --time parameter of your job accordingly.
Showing current reservations
Current reservations can be shown by issuing
scontrol show reservation
Using a reservation
If you are entitled to use a reservation, specify the reservation in your job submission by appending the parameter --reservation=<ReservationName>.
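For example, on the command line (ReservationName is a placeholder):
sbatch --reservation=<ReservationName> /absolute/path/to/job_script.sh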
Requesting a reservation
Reservations are managed by Slurm administrators. Please contact ISG D-ITET support if you're in need of a reservation.
Frequently Asked Questions
If your question isn't listed below, an answer might be listed in the official Slurm FAQ.
Batch job submission failed: Invalid account
If you receive one of the following error messages after submitting a job with sbatch or using srun
sbatch: error: Batch job submission failed: Invalid account or account/partition combination specified
srun: error: Unable to allocate resources: Invalid account or account/partition combination specified
your account hasn't been registered with Slurm yet. Please contact support and ask to be registered.
Code runs with srun but fails with sbatch
This points to a difference in the environment of the interactive job and the batch job.
By default both srun and sbatch forward the complete current submission environment to a job (see man --pager='less +/--export=' srun, man --pager='less +/--export=' sbatch) and also change the working directory to the directory of the submission environment (see man --pager='less +/--chdir=' srun, man --pager='less +/--chdir=' sbatch).
Compare the output of printenv from jobs started with srun and sbatch to figure out the differences:
Environment variables starting with SLURM_ are set by Slurm
HOSTNAME reflects the node a job runs on
ENVIRONMENT=BATCH is set by sbatch.
Differences have to come from:
Anything done interactively in the submission session after submitting the job with srun and before submitting it with sbatch
Or the job submit script used with sbatch.
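One hedged way to capture both environments for comparison; the output file names are assumptions:
# Environment of an interactive job:
srun --time 5 printenv | sort > env_srun.txt

# Environment of a batch job (--wrap generates a minimal job script):
sbatch --time=5 --output=env_sbatch.txt --wrap="printenv | sort"

# Compare the two once the batch job has finished:
diff env_srun.txt env_sbatch.txt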
Invalid user for SlurmUser slurm
After executing one of the Slurm executables like sbatch or sinfo the following error appears:
error: Invalid user for SlurmUser slurm, ignored
The user slurm doesn't exist on the host you're running your Slurm executable. If this happens on a host managed by ISG D-ITET, please contact support, tell us the name of your host and ask us to configure it as a Slurm submission host.
Node(s) in drain state
If sinfo shows one or more nodes in drain state, the reason can be shown with
sinfo -R
or in case the reason is cut off with
sinfo -o '%60E %9u %19H %N'
Nodes are set to drain by ISG D-ITET to empty them of jobs in time for scheduled maintenance or by the scheduler itself in case a problem is detected on a node.
<Slurm command>: fatal: Could not establish a configuration source
If you receive the following error messages after using a Slurm command like srun, sbatch, squeue or sinfo (replace squeue with the name of the command you used to end up with the error message):
squeue: fatal: Could not establish a configuration source
Make sure you set your environment.
My job was terminated by the OOM killer
If your job got terminated and you see a line similar to the following in your job log:
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=<JOB_ID>.batch cgroup. ...
this means a process in your job attempted to use more memory than you requested for the job, so it was killed by the OOM (Out Of Memory) killer. This in turn resulted in termination of your job by the Slurm scheduler.
Check the value of MaxRSS in the output of sacct for your job to verify the maximum memory usage of your job. Run tests by adjusting your memory allocation with --mem until you figure out how much memory your job needs.
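For example, a hedged check of a finished job's requested and used memory (replace <JOBID>):
sacct --jobs=<JOBID> --format=JobID,ReqMem,MaxRSS,Elapsed,State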
Slurm's accounting samples jobs every 30 seconds, so there is no useful data if a job was killed within the first 30 seconds after it started. Also sudden spikes in memory consumption may not be recorded, but can still trigger the OOM killer.
⚠ This error pertains to the main memory allocated to a job; a GPU allocation always includes the full onboard memory of the allocated GPU.