HPC Storage and Input/Output (I/O) best practices and guidelines
D-ITET ISG offers storage systems public to all department members as well as systems private to institutes/groups. While access to data on public systems can be restricted as well as shared, the load placed on the underlying hardware cannot be restricted and is always shared. This implies that, on the systems currently employed at D-ITET, fair I/O cannot be guaranteed by technical means.
Fair I/O can only be maintained by adhering to the following guidelines.
Compared to CPU/GPU and memory resources, the storage system is the main bottleneck for compute jobs on HPC systems. Improperly set up compute jobs can stall a storage system. The goal of this article is to explain how to:
- Maximize a job's I/O performance
- Keep I/O from compute jobs low on storage systems
Prepare your data and code
The worst imaginable case of using data on a storage system is reading/writing many small files and their metadata (creation date, modification date, size) concurrently and in parallel.
The best case is using only a few large files as containers for data and code. Such containers provide the same features as storing files directly on the file system, plus additional optimizations to speed up access to their content.
Sizewise, large files may be in the terabyte range.
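As a sketch of this principle (the project path and the directory name trainset below are placeholders, adapt them to your own storage layout): bundle many small files into a single tar archive once on the shared storage, then copy and unpack that one large file on the node-local /scratch inside your job:
# On the shared storage: pack the directory of many small files into one archive
tar -cf "/itet-stor/${USER}/project_one/trainset.tar" -C "/itet-stor/${USER}/project_one" trainset
# In the compute job: copy the single large file to the node-local scratch and unpack it there
mkdir --parents "/scratch/${USER}"
rsync -a --inplace "/itet-stor/${USER}/project_one/trainset.tar" "/scratch/${USER}/"
tar -xf "/scratch/${USER}/trainset.tar" -C "/scratch/${USER}"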
Data
Make use of an I/O library designed to parallelize, aggregate and efficiently manage I/O operations (in descending order of relevance):
- Apache Parquet (e.g. used from Python via Petastorm)
- NetCDF4 (with its Python interface, netcdf4-python)
- HDF5 (with its Python interface, h5py)
If your job generates a continuous stream of uncompressed output, consider piping it through a compressor before writing it to a file. We recommend using gzip with a low compression setting to keep CPU usage low:
my_program | gzip --fast --rsyncable > my_output.gz
There are various compressors available on Linux systems, please investigate comparisons and use cases yourself.
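For illustration only, the same idea with zstd instead of gzip (my_program and the output file name are placeholders as above); this is one possible alternative, not a recommendation:
# zstd reads the pipe on stdin and writes the compressed stream to the given output file
my_program | zstd --fast -o my_output.zst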
Code
Code could be just a single statically compiled executable file or a conda environment with thousands of small files.
Ready-made software
If you're looking for specific software, check whether it is available as an AppImage. An AppImage is already a compressed image (a read-only archive) of a directory structure containing the software and all its dependencies.
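A minimal usage sketch, assuming a hypothetical Some_Tool.AppImage that you have already downloaded to your local /scratch:
# Make the AppImage executable once, then run it directly; no installation is required
chmod +x "/scratch/${USER}/Some_Tool.AppImage"
"/scratch/${USER}/Some_Tool.AppImage"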
Custom-built software with unavailable system dependencies
If you have to build software yourself that requires root permissions to install dependencies or to modify the underlying system, the easiest solution is to deploy it in an Apptainer container.
For use of apptainer on D-ITET infrastructure, see:
- Apptainer
- Singularity Builder
Apptainer containers created on private PCs may be transferred to the D-ITET infrastructure and used there as well.
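A minimal sketch of such a workflow (the definition file, image name and command are placeholders; see the Apptainer links above for the D-ITET specifics):
# Build a container image from a definition file (may be done on a private PC)
apptainer build my_software.sif my_software.def
# Run a command from the container, with the local scratch made available inside
apptainer exec --bind "/scratch/${USER}" my_software.sif my_software --help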
Self-contained custom-built software
This is anything you can install with your permissions in a directory on the local /scratch of your ISG managed workstation, for example a conda environment.
The following script is an extensive example of creating a portable conda environment for running a jupyter notebook, with some optional optimizations.
Read the comments to decide which parts of the script match your use case and adapt them to your needs:
#!/bin/bash
# - Use Micromamba to set up a python environment using $conda_channels with $python_packages called $env_name
# - Optionally reduce space used by the environment by:
# - Deduplicating files
# - Stripping binaries
# - Removing python bytecode
# - Compressing the environment into a squashfs image
# Minimal installation, takes ~1'
env_name='jupyter_notebook'
python_packages='notebook'
conda_channels='--channel conda-forge'
# Installation with pytorch and Cuda matching GPU driver in cluster:
# Takes more than 5'
#python_packages='notebook matplotlib scipy sqlite pytorch torchvision pytorch-cuda=11.8'
#conda_channels='--channel conda-forge --channel pytorch --channel nvidia'
micromamba_installer_url='https://micro.mamba.pm/api/micromamba/linux-64/latest'
scratch="/scratch/${USER}"
MAMBA_ROOT_PREFIX="${scratch}/${env_name}"
CONDA_PKGS_DIRS="${scratch}/${env_name}_pkgs"
PYTHONPYCACHEPREFIX="${scratch}/${env_name}_pycache"
# Generate a line of the current terminal window's width
line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' '-')
# Display underlined title to improve readability of script output
function title() {
    echo
    echo "$@"
    echo "${line}"
}
mkdir -v -p "${MAMBA_ROOT_PREFIX}" &&
cd "${MAMBA_ROOT_PREFIX}" &&
title 'Downloading latest Micromamba (static linked binary)' &&
wget --output-document=- "${micromamba_installer_url}" |
tar -xjv bin/micromamba &&
# Set base path for Micromamba
export MAMBA_ROOT_PREFIX CONDA_PKGS_DIRS PYTHONPYCACHEPREFIX &&
# Initialize Micromamba
eval "$(${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
title "Creating environment '${env_name}'" &&
micromamba create --yes --name ${env_name} ${python_packages} ${conda_channels} &&
title 'Cleaning up Micromamba installation' &&
micromamba clean --all --yes &&
# Optional step
title 'Deduplicating files' &&
rdfind -makesymlinks true -makeresultsfile false . &&
# Optional step
title 'Converting absolute symlinks to relative symlinks' &&
symlinks -rc . &&
# Optional step. May break a software, use with care.
title 'Stripping binaries' &&
find . -xdev -type f -print0 |
xargs --null --no-run-if-empty file --no-pad |
grep -E '^.*: ELF.*x86-64.*not stripped.*$' |
cut -d ':' -f 1 |
xargs --no-run-if-empty strip --verbose --strip-all --discard-all &&
# Optional step
title 'Deleting bytecode files (*pyc)' &&
find . -xdev -name '*.pyc' -print0 |
xargs --null --no-run-if-empty rm --one-file-system -v &&
find . -type d -empty -name '__pycache__' -print -delete &&
# Optional step: Speed up start of jupyter server
cat <<EOF >"${MAMBA_ROOT_PREFIX}/envs/${env_name}/etc/jupyter/jupyter_server_config.d/nochecks.json" &&
{
  "ServerApp": {
    "tornado_settings": {
      "page_config_data": {
        "buildCheck": false,
        "buildAvailable": false
      }
    }
  }
}
EOF
# Create start wrapper: specific to a jupyter notebook and Python code, with the bytecode cache placed on a writable storage
cat <<EOF >"${MAMBA_ROOT_PREFIX}"/start.sh &&
#!/bin/bash
env_name="${env_name}"
scratch="/itet-stor/\${USER}"
MAMBA_ROOT_PREFIX="\${scratch}/\${env_name}"
PYTHONPYCACHEPREFIX="\${scratch}/\${env_name}_pycache"
export MAMBA_ROOT_PREFIX PYTHONPYCACHEPREFIX
eval "\$(\${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
micromamba run -n jupyter_notebook jupyter notebook --no-browser --port 5998 --ip "\$(hostname -f)"
EOF
title 'Fixing permissions' &&
chmod 755 "${MAMBA_ROOT_PREFIX}"/start.sh &&
chmod --recursive --changes go-w,go+r "${MAMBA_ROOT_PREFIX}" &&
find "${MAMBA_ROOT_PREFIX}" -xdev -perm /u+x -print0 |
xargs --null --no-run-if-empty chmod --changes go+x &&
title 'Creating squashfs image' &&
mksquashfs "${MAMBA_ROOT_PREFIX}" "${MAMBA_ROOT_PREFIX}".sqsh -no-xattrs -comp zstd &&
# Show how to start the wrapper
title 'Start the environment with the following command' &&
echo "squashfs-mount ${MAMBA_ROOT_PREFIX}.sqsh:${MAMBA_ROOT_PREFIX} -- ${MAMBA_ROOT_PREFIX}/start.sh"
Any self-contained software installed in /scratch/$USER/<software> can be compressed with mksquashfs and used with squashfs-mount as in the example above.
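As a generic sketch (mysoftware and mytool are placeholders for your own installation and executable):
# Pack the self-contained installation into a single compressed, read-only image
mksquashfs "/scratch/${USER}/mysoftware" "/scratch/${USER}/mysoftware.sqsh" -no-xattrs -comp zstd
# Mount the image over the original path for the duration of a single command
squashfs-mount "/scratch/${USER}/mysoftware.sqsh:/scratch/${USER}/mysoftware" -- "/scratch/${USER}/mysoftware/bin/mytool"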
Available storage systems
Local node scratch
Primarily use the local /scratch of a compute node. This storage offers lowest access latency, but space is limited and can differ per node. To be fair to other users it's important to clean up after use.
Available space and hard disk type
These are listed in the Hardware tables for our compute clusters:
Arton nodes in D-ITET cluster
TIK nodes in D-ITET cluster
CVL/BMIC nodes in the CVL cluster
Snowflake nodes in the D-ITET course cluster
scratch cleanup
scratch_clean is active on local /scratch of all nodes, meaning older data will be deleted automatically if space is needed. For details see the man page man scratch_clean.
This is a safety net that does automatic cleanup, where you have no control over which files are deleted.
Always create a personal directory on a local scratch and clean it up after use! This way you're in control of deletion and scratch_clean will not have to clean up after you.
Personal automatic cleanup can be achieved by adapting the following bash script snippet and adding it to your job submit script:
my_local_scratch_dir="/scratch/${USER}"
# List contents of my_local_scratch_dir to trigger automounting
if ! ls "${my_local_scratch_dir}" 1>/dev/null 2>/dev/null; then
    if ! mkdir --parents --mode=700 "${my_local_scratch_dir}"; then
        echo 'Failed to create my_local_scratch_dir' 1>&2
        exit 1
    fi
fi
# Set a trap to remove my_local_scratch_dir when the job script ends
trap "exit 1" HUP INT TERM
trap 'rm -rf "${my_local_scratch_dir}"' EXIT
# Synchronize a directory containing large files which are not in use by any other process:
rsync -av --inplace <source directory> "${my_local_scratch_dir}"
# Optional: change the current directory to my_local_scratch_dir, exit if changing didn't succeed.
cd "${my_local_scratch_dir}" || exit 1
Common node scratch
Local /scratch of nodes is available among nodes at /scratch_net/node_A as an automount (on demand). It is accessible exclusively on compute nodes from compute jobs. A use case for this kind of storage is running several compute jobs on different nodes using the same data.
- Accessing data stored on /scratch on one node A from other nodes B, C, D, ... will impact I/O latency for all jobs running on node A!
- You have to ensure that data written concurrently from nodes B, C, D, ... to /scratch on node A does not overwrite data already in use.
- scratch_clean is active (see above)!
- Automatic cleanup per job as shown above has to be replaced by a final cleanup in the last job accessing the data.
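A minimal sketch of this use case (node_a and the archive name are placeholders): copy the shared data once from node A's scratch to the local scratch of the current node, then work only on the local copy:
# One-time copy from node A's scratch to this node's local scratch
rsync -a --inplace "/scratch_net/node_a/${USER}/dataset.tar" "/scratch/${USER}/"
# All further reads happen locally and do not impact jobs running on node A
tar -xf "/scratch/${USER}/dataset.tar" -C "/scratch/${USER}"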
Public storage
Public storage is widely accessible: on personal workstations, file servers and compute nodes. It is used in the daily work of all D-ITET members.
This storage allows direct access to data from compute jobs without the need to transfer it to local /scratch. Latency is higher because of the wide use and the limits of network bandwidth.
While this may look like a convenient storage to use for compute jobs, using public storage mandates strict adherence to the guidelines here to prevent blocking other users!
There are different types of public storage available at D-ITET. Make sure you understand what is available to you and which one to use for what purpose. Details about the public storage available at D-ITET are summarized in the Storage overview.
Your supervisor or your institute's/group's administrative/technical contact will tell you:
- which storage is available to you from your institute/group
- which storage to use for intermediate, generated data
- which storage to use to store your final results
⚠ For storage without automated backup: Make sure to back up stored data yourself!
⚠ Better yet: don't store data worth backing up on a system without automated backup!
Transferring data
Transfer of a large file between any storage accessible within the D-ITET structure is most efficient with the following rsync commands:
# Minimal output
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file
# Add on-the-fly compression if your file is uncompressed
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --compress
# Add verbose output and a progress indicator
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --verbose --progress
In this example there can be a significant reduction in the use of resources (bandwidth, CPU, memory, time) if a previous version of the target file is already in place, because rsync's delta-transfer algorithm only rewrites changed blocks. Note, however, that rsync disables this algorithm by default when both source and destination are given as local paths (which includes locally mounted network storage); add --no-whole-file to enable it, as sketched below.
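A sketch of such a block-wise update of an already existing copy (same placeholder paths as above):
# Re-read both files and rewrite only the blocks that changed in the existing copy
rsync -a --inplace --no-whole-file /path/to/large/origin/file /path/to/copy/of/origin/file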
A concrete example: sync the file dataset.parquet from the project folder project_one to an (existing) directory named after your username on the local /scratch of your (ISG managed) workstation:
rsync -a --inplace /itet-stor/$USER/project_one/dataset.parquet /scratch/$USER/ -v --progress
Tuning Weights & Biases (wandb)
If you use Weights & Biases (wandb), be aware it can create intense I/O on the storage it logs its metrics to.
Quote: It is possible to log a huge amount of data quickly, and if you do that you might create disk I/O issues.
In a scenario where many HPC jobs run with wandb using the same storage system for job and wandb data, this can slow down every I/O operation for all job submitters. To prevent this, set up wandb as follows:
Use a fast local scratch disk for main and cache directory
Set environment variables to relocate main and cache directories and create these directories in your (bash) job script:
WANDB_DIR="/scratch/${USER}/wandb_dir"
WANDB_CACHE_DIR="${WANDB_DIR}/.cache"
export WANDB_DIR WANDB_CACHE_DIR
mkdir -vp "${WANDB_CACHE_DIR}"
See Environment Variables for details.
If you want to keep this data: at the end of the job, remove the cache, pack the main directory into a compressed tar archive on a backed-up location away from the local /scratch (such as a project directory), then delete it from the local /scratch disk:
rm -r "${WANDB_CACHE_DIR}" &&
tar -czf "/itet-stor/${USER}/<your_project_directory>/wandb_${SLURM_JOB_ID}.tar.gz" "${WANDB_DIR}" &&
rm -r "${WANDB_DIR}"
To automate removal, setting a trap as in the example under Local node scratch makes sense here as well.
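A minimal sketch, assuming WANDB_DIR is set as shown above and the archiving step runs at the end of the job script:
# Remove the wandb directory (including its cache) when the job script exits
trap 'rm -rf "${WANDB_DIR}"' EXIT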
Run wandb offline
Consider running wandb offline.
If necessary, sync metrics at the end of your job as explained in the link above.
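A minimal sketch of this workflow; the exact layout of the offline run directories below WANDB_DIR can differ between wandb versions, so treat the sync path as an assumption to verify:
# In the job script: keep all wandb logging local
export WANDB_MODE=offline
# After the job, from a login host: upload the locally stored runs
wandb sync "${WANDB_DIR}"/wandb/offline-run-*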
Tune metrics collection
Consider tuning your metrics collection parameters for faster logging.
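Related information
- Euler cluster best practices: https://scicomp.ethz.ch/wiki/Best_Practices
- Euler: choosing optimal storage: https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system
- D-PHYS storage advice: https://readme.phys.ethz.ch/storage/general_advice/
- Understanding Data Motion in the Modern HPC Data Center: https://www.researchgate.net/publication/338599610_Understanding_Data_Motion_in_the_Modern_HPC_Data_Center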