<<TableOfContents(4)>>
= TIK Slurm information =
The Computer Engineering and Networks Laboratory (TIK) owns nodes in the Slurm cluster with restricted access. The following information is an addendum to the main Slurm article in this wiki, specific to accessing these TIK nodes.
If the information you're looking for isn't available here, please consult the main Slurm article.
== Hardware ==
The following GPU nodes are reserved for exclusive use by TIK:
||'''Server'''||'''CPU'''||'''Frequency'''||'''Cores'''||'''Memory'''||'''/scratch SSD'''||'''/scratch size'''||'''GPUs'''||'''GPU memory'''||'''GPU architecture'''||'''Operating system'''||
||tikgpu02||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||503 GB||✓||1.1 TB||8 Titan Xp||12 GB||Pascal||Debian 11||
||tikgpu03||Dual Tetrakaideca-Core Xeon E5-2680 v4||2.40 GHz||28||503 GB||✓||1.1 TB||8 Titan Xp||12 GB||Pascal||Debian 11||
||tikgpu04||Dual Hectakaideca-Core Xeon Gold 6242||2.80 GHz||32||754 GB||✓||1.8 TB||8 Titan RTX||24 GB||Turing||Debian 11||
||tikgpu05||AMD EPYC 7742||3.4 GHz||128||503 GB||✓||7.0 TB||5 Titan RTX<<BR>>2 Tesla V100||24 GB<<BR>>32 GB||Turing<<BR>>Volta||Debian 11||
||tikgpu06||AMD EPYC 7742||3.4 GHz||128||503 GB||✓||8.7 TB||8 RTX 3090||24 GB||Ampere||Debian 11||
||tikgpu07||AMD EPYC 7742||3.4 GHz||128||503 GB||✓||8.7 TB||8 RTX 3090||24 GB||Ampere||Debian 11||
||tikgpu08||AMD EPYC 7742||3.4 GHz||128||503 GB||✓||8.7 TB||8 RTX A6000||48 GB||Ampere||Debian 11||
||tikgpu09||AMD EPYC 7742||3.4 GHz||128||503 GB||✓||8.7 TB||8 RTX 3090||24 GB||Ampere||Debian 11||
||tikgpu10||AMD EPYC 7742||3.4 GHz||128||2015 GB||✓||8.7 TB||8 A100||80 GB||Ampere||Debian 11||
== Shared /scratch_net ==
Access to the local `/scratch` of each node is available as an automount (on demand) under `/scratch_net/tikgpuNM` on each node (replace `NM` with an existing hostname number).
 * ''On demand'' means: the path to a node's `/scratch` appears at first access, e.g. after issuing `ls /scratch_net/tikgpuNM`, and disappears again when unused.
 * `scratch_clean` is active on the local `/scratch` of all nodes, meaning older data will be deleted if space is needed. For details see the man page: `man scratch_clean`.
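For illustration, a minimal sketch of staging data between nodes; the node number `tikgpu06` and the paths are placeholder examples:

{{{#!highlight bash numbers=disable
# First access triggers the automount and lists the remote scratch
ls /scratch_net/tikgpu06

# Copy a dataset from tikgpu06's scratch to the local scratch of this node
# (placeholder paths; adjust to your own directories)
rsync -a /scratch_net/tikgpu06/${USER}/my_dataset/ /scratch/${USER}/my_dataset/
}}}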
== Accounts and partitions ==
The nodes are grouped in partitions to prioritize access for different accounts:
||'''Partition'''||'''Nodes'''||'''Slurm accounts with access'''||'''Account membership'''||
||tikgpu.medium||tikgpu[02-07,09]||tik-external||On request* for guests and students||
||tikgpu.all||tikgpu[02-10]||tik-internal||Automatic for staff members||
||tikgpu.all||tikgpu[02-10]||tik-highmem||On request* for guests and students||
* Please contact the person vouching for your guest access (or your supervisor if you're a student) and ask them to request account membership for you.
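You can inspect these partitions and their nodes yourself with a standard `sinfo` call; a minimal sketch, with output columns chosen for illustration:

{{{#!highlight bash numbers=disable
# Show the TIK partitions with their node lists, availability and time limit
sinfo --partition=tikgpu.medium,tikgpu.all --format="%.14P %.20N %.6a %.10l"
}}}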
=== Overflow into gpu.normal ===
Jobs from TIK users overflow into partition ''gpu.normal'' when all TIK nodes are busy, as TIK contributes to the general Slurm cluster in addition to owning its own nodes.
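To see which partition your jobs were actually scheduled in (e.g. whether one overflowed into ''gpu.normal''), a sketch using standard `squeue` options:

{{{#!highlight bash numbers=disable
# List your own jobs with job id, partition, state and node/reason
squeue --user=${USER} --format="%.10i %.14P %.10T %.20R"
}}}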
=== Dual account membership ===
Check which accounts you're a member of with the following command:
{{{#!highlight bash numbers=disable
sacctmgr show users WithAssoc Format=User%-15,DefaultAccount%-15,Account%-15 ${USER}
}}}
If you're a member of account ''tik-external'' and have also been added to ''tik-highmem'', the latter is your default account and all your jobs are by default sent to partition ''tikgpu.all''. To run jobs in partition ''tikgpu.medium'', you have to specify the account ''tik-external'' explicitly, as in the following example:
{{{#!highlight bash numbers=disable
sbatch --account=tik-external job_script.sh
}}}
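Alternatively, the account (and partition) can be set in the job script itself with `#SBATCH` directives; a minimal sketch, with the GPU request as a placeholder:

{{{#!highlight bash numbers=disable
#!/bin/bash
#SBATCH --account=tik-external
#SBATCH --partition=tikgpu.medium
#SBATCH --gres=gpu:1    # placeholder resource request

# ... your actual job commands follow here
}}}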
If you already have a PENDING job in the wrong partition, you can correct it by issuing the following command:
{{{#!highlight bash numbers=disable
scontrol update jobid=<job id> partition=tikgpu.medium account=tik-external
}}}
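To find the job id of a pending job and verify the result afterwards, a sketch using standard Slurm commands (replace `<job id>` as above):

{{{#!highlight bash numbers=disable
# List only your pending jobs with their job ids, partitions and accounts
squeue --user=${USER} --states=PENDING --format="%.10i %.14P %.12a"

# After the update, confirm the job's new partition and account
scontrol show job <job id> | grep -oE '(Partition|Account)=[^ ]+'
}}}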
== Rules of conduct ==
There are no limits imposed on resources requested by jobs. Please be polite and share available resources sensibly. If you're in need of above-average resources, please coordinate with other TIK Slurm users.
== Improving the configuration ==
If you think the current configuration of TIK nodes, partitions etc. could be improved:
 * Discuss your ideas with your team colleagues
 * Ask your [[mailto:servicedesk-tik@id.ethz.ch|ID institute support]] who the current TIK cluster coordinators are
 * Bring your suggestions for improvement to the coordinators
The coordinators will streamline your ideas into a concrete change request which we (ISG D-ITET) will implement for you.