Differences between revisions 20 and 40 (spanning 20 versions)
Revision 20 as of 2024-01-30 15:14:43
Size: 7404
Editor: stroth
Comment:
Revision 40 as of 2024-09-09 10:47:01
Size: 6936
Editor: stroth
Comment: Update course list
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
= Slurm Student Course Cluster "Snowflake" =

This is the landing page for the Slurm cluster '''Snowflake''', which will be available for official student courses at the beginning of the spring semester 2024.


Complete content will be ready in time.

<<BR>><<BR>><<BR>>
----------
<<BR>><<BR>><<BR>>
Line 15: Line 4:
The '''Snowflake''' Slurm cluster is available only for official student courses. The '''Snowflake''' Slurm cluster is available '''exclusively for official student courses'''.
Line 27: Line 16:
||'''Institute/Group''' ||'''Lecturer''' ||'''Course''' ||'''No''' ||'''Semester'''||'''# Participants'''||
||[[https://vision.ee.ethz.ch/|CVL]]||E. Konukoglu, E. Erdil, M. A. Reyes Aguirre||Medical Image Analysis ||227-0391-00L||FS ||90 ||
||[[https://vision.ee.ethz.ch/|CVL]]||C. Sakaridis ||Computer Vision and Artificial Intelligence for Autonomous Cars ||227-0560-00L||HS ||90 ||
||[[https://vision.ee.ethz.ch/|CVL]]||F. Yu ||Robot Learning ||227-0562-00L||FS ||30 ||
||[[https://vision.ee.ethz.ch/|CVL]]||L. Van Gool ||P&S: Deep Learning for Image Manipulation (DLIM) ||227-0085-11L||HS ||15 ||
||[[https://lbb.ethz.ch/|LBB]] ||J. Vörös ||P&S: Controlling Biological Neuronal Networks Using Machine Learning||227-0085-38L||HS ||60 ||
||[[https://tik.ethz.ch/|TIK]] ||R. Wattenhofer ||P&S: Hands-On Deep Learning ||227-0085-59L||FS+HS ||40 ||
||'''Institute/Group''' ||'''Lecturer''' ||'''Course''' ||'''No''' ||'''Semester'''||'''# Participants'''||
||[[https://www.mins.ee.ethz.ch/|MINS]]||M. Lerjen ||P&S: Software Defined Radio||227-0085-19P||HS || 20 ||
||[[https://tik.ethz.ch/|TIK]] ||R. Wattenhofer ||P&S: Hands-On Deep Learning ||227-0085-59L||HS ||200 ||
Line 37: Line 21:
=== Requesting course accounts === === My course needs access ===
Course coordinators receive a reminder to request course accounts before the start of each semester. If your course needs access to the Snowflake cluster, add the following information to your request for course accounts:
Line 39: Line 24:
Course accounts have to be requested with sufficient time for preparation, from ISG's side as well as the course coordinator's side. Factor in time to test a course setup after accounts have been set up.
 
At the latest 4 weeks before the course begins, course coordinators have to hand in a request for course accounts containing the following information:
 1. Amount of course accounts needed
Line 44: Line 25:
 1. Quota (available disk space) per account. The default is 2 GB, maximum is 10 GB.
Line 46: Line 26:
 1. At which date course account contents may be deleted (latest possible date: [[https://ethz.ch/staffnet/en/news-and-events/academic-calendar.html|End of exam session]] of the semester the course takes place)
 1. Does the course primarily need ''interactive'' (shell access to 1 GPU for up to 4h) or ''batch jobs'' (running submitted scripts for up to 24h)?
 1. Any additional requirements not listed here
 1. Whether your course accounts will only run ''interactive'' jobs (shell access to 1 GPU for up to 8h).<<BR>>
 Note: The default is to use mainly ''batch'' jobs (running submitted scripts for up to 24h) and a few short ''interactive'' jobs (running up to 4 hours)
Line 59: Line 38:
 * Access to an ISG managed PC, for example [[Workstations/ComputerRooms|Computer room PCs]] or the [[https://computing.ee.ethz.ch/RemoteAccess?highlight=%28login.ee%29#From_ETH_internal|D-ITET login node]]  * Access to an ISG managed PC, for example [[Workstations/ComputerRooms|Computer room PCs]] or the [[RemoteAccess#From_ETH_internal|D-ITET login node|&highlight=login.ee]]
Line 70: Line 49:
||snowflake[01-09]||Intel Xeon Gold 6240||2.60 GHz ||36 ||36 ||376 GB ||✓ ||1.8 TB ||8 !GeForce RTX 2080 Ti (11 GB)||Debian 11|| ||snowflake[01-nn]||Intel Xeon Gold 6240||2.60 GHz ||36 ||36 ||376 GB ||✓ ||1.8 TB ||8 !GeForce RTX 2080 Ti (11 GB)||Debian 11||
Line 79: Line 58:
 * Occasional interactive jobs in `gpu.normal` are allowed, but runtime is capped at 4 hours
Line 81: Line 61:
Running a script in the cluster (Job type ''batch'') or starting an interactive shell (Job type ''interactive'') on a cluster node requires a so-called job submission initiated with a Slurm command. The simplest use of these commands is the following; details can be read in the referenced main Slurm wiki article:
 * `sbatch job_script.sh`<<BR>> Main article entry for [[Services/SLURM#sbatch_.2BIZI_Submitting_a_job|sbatch]]
 * `srun --pty bash -i`<<BR>> Main article entry for [[Services/SLURM#srun_.2BIZI_Start_an_interactive_shell|srun]]
Running a script in the cluster (Job type ''batch'') or starting an interactive shell (Job type ''interactive'') on a cluster node requires a so-called job submission initiated with a Slurm command. The simplest use of these commands is the following:
 * `sbatch job_script.sh`<<BR>> More details for [[Services/SLURM#sbatch_.2BIZI_Submitting_a_job|sbatch]]
 * `srun --pty bash -i`<<BR>> More details for [[Services/SLURM#srun_.2BIZI_Start_an_interactive_shell|srun]]<<BR>>If you only need a short interactive job, specify the number of minutes needed by adding the parameter `--time=10` (10 minutes):<<BR>>`srun --time=10 --pty bash -i`
 * A useful exercise is to integrate the [[FAQ/JupyterNotebook#Example_of_a_minimal_setup_with_micromamba|example to run a Jupyter notebook]] into a job script.
Line 88: Line 69:
 * 4 GB Memory (per GPU)
The simplest change would be to request 1 additional GPU, which would then allocate 8 CPUs and 8 GB of Memory
 * 40 GB Memory (per GPU)
The simplest change would be to request 1 additional GPU, which would then allocate 8 CPUs and 80 GB of Memory. Details on how to request resources different from the defaults are listed in the [[Services/SLURM#sbatch_.2BIZI_Common_options|main Slurm article]].
Line 115: Line 96:

=== Resource availability ===
==== Reservations ====
Cluster resources may be [[Services/SLURM#Reservations|reserved]] at certain times for specific courses. Details about [[Services/SLURM#Showing_current_reservations|showing reservations]] and submitting jobs during reservations using the [[Services/SLURM#srun_.2BIZI_Start_an_interactive_shell|--time|&highlight=--time]] option are available in the main Slurm article.

==== GPU availability ====
The [[Services/SLURM#smon_.2BIZI_GPU_.2F_CPU_availability|examples to show resource availabilities]] in the main Slurm article can be used for the Snowflake cluster as well by using the Slurm configuration account name ''sladmsnow'' instead of ''sladmitet'', thus using the file `/home/sladmsnow/smon.txt`.

Snowflake Slurm cluster

The Snowflake Slurm cluster is available exclusively for official student courses.

The following information is an addendum to the main Slurm article in this wiki, specific to usage of the Snowflake cluster. Consult the main Slurm article if the information you're looking for isn't available here:

Course information

Courses with access

The following table shows courses which are currently registered to access the Snowflake cluster:

||'''Institute/Group'''||'''Lecturer'''||'''Course'''||'''No'''||'''Semester'''||'''# Participants'''||
||MINS||M. Lerjen||P&S: Software Defined Radio||227-0085-19P||HS||20||
||TIK||R. Wattenhofer||P&S: Hands-On Deep Learning||227-0085-59L||HS||200||

My course needs access

Course coordinators receive a reminder to request course accounts before the start of each semester. If your course needs access to the Snowflake cluster, add the following information to your request for course accounts:

  1. Whether course accounts need access to net_scratch or an ISG managed institute NAS (those are mutually exclusive)

  2. Whether a master account to provide course data to students is needed
  3. Whether your course accounts will only run interactive jobs (shell access to 1 GPU for up to 8h).
    Note: The default is to use mainly batch jobs (running submitted scripts for up to 24h) and a few short interactive jobs (running up to 4 hours)

After a successful request

  • Course coordinators will receive the list of course account passwords for distribution to course participants
  • Course coordinators are responsible for keeping a list mapping course participant names to course accounts

Cluster information

Access prerequisites

There are two requirements to access the cluster:

  • Access to a course account (handed out by course coordinators at the beginning of a course)
  • Access to an ISG managed PC, for example Computer room PCs or the D-ITET login node

Setting environment

The environment variable SLURM_CONF needs to be set to point to the configuration of the Snowflake cluster before running any Slurm command:

export SLURM_CONF=/home/sladmsnow/slurm/slurm.conf
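
For example, a quick sanity check could look like the following sketch; sinfo is a standard Slurm command, not mentioned on this page, and serves only as an illustration:

export SLURM_CONF=/home/sladmsnow/slurm/slurm.conf
sinfo    # with the variable set, this should list the Snowflake partitions gpu.normal and gpu.interactive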

Hardware

The nodes in the cluster have the following setup:

||'''Node name'''||'''CPU'''||'''Frequency'''||'''Physical cores'''||'''Logical processors'''||'''Memory'''||'''/scratch SSD'''||'''/scratch Size'''||'''GPUs'''||'''Operating System'''||
||snowflake[01-nn]||Intel Xeon Gold 6240||2.60 GHz||36||36||376 GB||✓||1.8 TB||8 GeForce RTX 2080 Ti (11 GB)||Debian 11||

Partitions

Nodes are members of the following partitions, which serve to channel different job requirements to dedicated resources:

||'''Name'''||'''Job type'''||'''Job runtime'''||
||gpu.normal||batch/interactive jobs||24/4h||
||gpu.interactive||interactive jobs only||8h||
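
As an illustrative sketch (the --partition option is standard Slurm; this page does not prescribe it), a job can be directed to one of these partitions explicitly:

sbatch --partition=gpu.normal job_script.sh       # batch job in gpu.normal
srun --partition=gpu.interactive --pty bash -i    # interactive shell in gpu.interactive, only if your course has booked it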

Job submission

Running a script in the cluster (Job type batch) or starting an interactive shell (Job type interactive) on a cluster node requires a so-called job submission initiated with a Slurm command. The simplest use of these commands is the following:

  • sbatch job_script.sh
    More details for sbatch

  • srun --pty bash -i
    More details for srun
    If you only need a short interactive job, specify the number of minutes needed by adding the parameter --time=10 (10 minutes):
    srun --time=10 --pty bash -i

  • A useful exercise is to integrate the example to run a Jupyter notebook into a job script.
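
A minimal job_script.sh could look like the following sketch; the #SBATCH options are standard Slurm, while the job name, output file and the python command are placeholders not taken from this page:

#!/bin/bash
#SBATCH --job-name=example     # placeholder job name
#SBATCH --output=%j.out        # write job output to <job id>.out in the submission directory
#SBATCH --time=02:00:00        # request 2 hours, within the 24h batch limit
python train.py                # placeholder for the actual course workload

Submit it with sbatch job_script.sh as shown above.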

When used in this simple form, the following default resource allocations are used:

  • 1 GPU per Job
  • 4 CPUs (per GPU)
  • 40 GB Memory (per GPU)

The simplest change would be to request 1 additional GPU, which would then allocate 8 CPUs and 80 GB of Memory. Details on how to request resources different from the defaults are listed in the main Slurm article.
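
A sketch of such a request, assuming the standard Slurm --gres syntax (this page does not state which GPU option Snowflake expects):

#SBATCH --gres=gpu:2    # in a job script: request 2 GPUs, which per the defaults above allocates 8 CPUs and 80 GB of memory

The same request can be made on the command line with sbatch --gres=gpu:2 job_script.sh.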

Fair share

  • gpu.normal is available to all courses

  • gpu.interactive is available only when booked by a course (indicated by membership in Slurm account interactive)

  • Resources are shared fairly based on usage
  • Usage accounting is reset on a weekly basis
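
To inspect your own fair-share standing, the standard Slurm tool sshare can be used; it is not mentioned on this page and is shown only as an illustration:

sshare       # fair-share information for your own associations
sshare -a    # fair-share information for all Slurm accounts and users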

Slurm account information

Slurm accounts exist only within Slurm. They serve as groups to allow inheritance of attributes to members. Members are D-ITET accounts, referred to here as course accounts.
The following commands show how to display account information for Slurm:

Show all Slurm accounts

sacctmgr show accounts Format=Account%-15,Description%-25,Organization%-15

Show all course accounts with Slurm account membership

sacctmgr show users WithAssoc Format=User%-15,DefaultAccount%-15,Account%-15

Show all Slurm accounts with course account members

sacctmgr show accounts WithAssoc Format=Account%-15,Description%-25,Organization%-16,User%-15

Resource availability

Reservations

Cluster resources may be reserved at certain times for specific courses. Details about showing reservations and submitting jobs during reservations using the --time option are available in the main Slurm article.
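
As a sketch (scontrol is standard Slurm and not specific to this page), current reservations can be listed directly, and a short --time limit lets a job be scheduled before an upcoming reservation:

scontrol show reservation       # list current reservations on the cluster
srun --time=30 --pty bash -i    # request only 30 minutes so the job can finish before a reservation begins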

GPU availability

The examples to show resource availabilities in the main Slurm article can be used for the Snowflake cluster as well by using the Slurm configuration account name sladmsnow instead of sladmitet, thus using the file /home/sladmsnow/smon.txt.
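
For example, with the file path taken from this page:

cat /home/sladmsnow/smon.txt    # show the current GPU/CPU availability overview for the Snowflake cluster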
