Differences between revisions 1 and 8 (spanning 7 versions)
Revision 1 as of 2021-04-20 05:58:42
Size: 2893
Editor: bonaccos
Comment:
Revision 8 as of 2024-08-12 14:49:33
Size: 16785
Editor: stroth
Comment:
= Data Management =

== Choosing the optimal storage system ==

Quoting [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Choosing the optimal storage system]]
from the scientific computing wiki:

When working on an HPC cluster that provides different storage
categories/systems, the choice of which system to use can have a big
influence on the performance of your workflow. In the best case you
can speed up your workflow by quite a lot, whereas in the worst case
the system administrator has to kill all your jobs and limit the
number of concurrent jobs that you can run, because your jobs slow
down the entire storage system and this can affect other users' jobs.


The same holds for the clusters and storage infrastructure maintained
in the D-ITET environment.

When working with storage systems (in particular from compute jobs in
one of the clusters), please consider the following guidelines. They
are largely inspired by the scientific computing guidelines mentioned
above.

 * Use local "scratch" disks whenever possible. Many of the nodes (but not all) meanwhile have SSD disks available as scratch storage, further improving performance.

 * [Applicable to BIWI] For working in parallel from a cluster with '''large''' files, consider using the scale-out scratch space (beegfs02).

 * Do not create large numbers of small files in the [[Services/NetScratch|D-ITET NetScratch]] service storage (or the BeeGFS scratch or project filesystems); this can slow down not only you but the whole system. Whenever possible, consider the first item when working on data stored on those services.

 * If you need to perform very high I/O on the data (e.g. opening and closing files at a high rate, reading many small files per second, doing short appends to files from various locations), this will have a severe impact on the network-attached storage or the scale-out filesystems. Work as much as possible on data copied to local storage and only move results back to appropriate places.

 * If you work a lot with large amounts of small files, keep them sensibly grouped in bigger archives which you can move to local storage in the job (as big files), unpack them there and work on them on the local storage. Do not work on large amounts of small files on network-attached storage or the cluster filesystems. Process them, group the results again in archives and move the results back to appropriate places (see the sketch below).
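
A minimal sketch of such a job-script fragment; the archive name, project path and work directory are hypothetical placeholders to adapt to your setup: {{{#!highlight bash numbers=disable
# Hypothetical paths: adapt the archive location and the local work directory
archive="/itet-stor/${USER}/project_one/dataset.tar"
workdir="/scratch/${USER}/job_workdir"

mkdir -p "${workdir}" &&
    # Move the archive to local storage as one big file and unpack it there
    rsync -a --inplace "${archive}" "${workdir}/" &&
    tar -xf "${workdir}/dataset.tar" -C "${workdir}" &&
    # ... process the unpacked files on the local disk, writing to ${workdir}/results ...
    # Re-archive the results and move them back as one big file
    tar -cf "${workdir}/results.tar" -C "${workdir}" results &&
    rsync -a --inplace "${workdir}/results.tar" "/itet-stor/${USER}/project_one/" &&
    rm -rf "${workdir}"
}}}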

Respecting these guidelines can improve '''your own''' work performance
and at the same time avoids severely impacting the performance of the
storage systems (and the jobs of other users).

== References ==

 * [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Choosing the optimal storage system]]
#rev 2024-08-12 stroth

<<TableOfContents()>>

= HPC Storage and Input/Output (I/O) best practices and guidelines =

D-ITET ISG offers storage systems public to all department members as well as systems private to institutes/groups. While access to data on public systems can be ''restricted'' as well as ''shared'', the load placed on the underlying hardware cannot be ''restricted'' and is always ''shared''. This implies that on the systems currently employed at D-ITET, fair I/O cannot be guaranteed by technical means.

'''Fair I/O can only be maintained by adhering to the following guidelines'''.

The storage system is the main bottleneck for compute jobs on HPC systems compared to CPU/GPU and memory resources. Compute jobs - if set up improperly - can stall a storage system. The goal of this article is to explain how to:
 * Maximize a job's I/O performance
 * Keep I/O from compute jobs low on storage systems


== Prepare your data and code ==

The '''worst imaginable case''' of using data on a storage system is '''reading/writing many small files''' and their metadata (creation date, modification date, size) concurrently and in parallel.<<BR>>
The '''best case''' is using only '''few large files as containers''' for data and code. Such containers provide the same features as storing files directly on the filesystem, plus additional optimizations to speed up access to their content.<<BR>>
Size-wise, large files may be in the terabyte range.


=== Data ===

Make use of an I/O library designed to parallelize, aggregate and efficiently manage I/O operations (in descending order of relevance):
 * [[https://petastorm.readthedocs.io/en/latest/|Use Parquet storage from Python]],
 . [[https://parquet.apache.org/|Apache Parquet format]],
 . [[https://saturncloud.io/blog/how-to-write-data-to-parquet-with-python/|How to write data to parquet with python]],
 . [[https://www.blog.pythonlibrary.org/2024/05/06/how-to-read-and-write-parquet-files-with-python/|How to read and write Parquet files with Python]]
 * [[https://www.unidata.ucar.edu/software/netcdf/|NetCDF4]],
 . [[https://unidata.github.io/netcdf4-python/|NetCDF4 Python interface]]
 * [[https://www.hdfgroup.org/|HDF5]],
 . [[https://docs.h5py.org/en/stable/|HDF5 Python interface]],
 . [[https://realpython.com/storing-images-in-python/|Example of storing/accessing lots of images in Python]]

If your job generates a continuous stream of uncompressed output, consider piping it through a compressor before writing it to a file. We recommend using `gzip` with a low compression setting to keep CPU usage low: {{{#!highlight bash numbers=disable
my_program | gzip --fast --rsyncable > my_output.gz
}}}
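When reading such compressed output back, stream it through a decompressor in the same way instead of writing an uncompressed copy to the storage system; `my_analysis` is a placeholder for your own program: {{{#!highlight bash numbers=disable
zcat my_output.gz | my_analysis
}}}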
There are various compressors available on Linux systems; please investigate comparisons and use cases yourself.


=== Code ===

Code could be just a single statically compiled executable file or a [[Programming/Languages/Conda|conda]] environment with thousands of small files.


==== Ready-made software ====

If you're looking for a specific piece of software, check whether it is available as an [[https://appimage.org/|AppImage]]. An AppImage is already a compressed image (read-only archive) of a directory structure containing the software and all its dependencies.
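
As an illustration, an AppImage only needs to be downloaded and made executable; the URL and filename below are placeholders for the software you actually need: {{{#!highlight bash numbers=disable
# Placeholder URL and filename: substitute the AppImage you want to use
wget --output-document="/scratch/${USER}/SomeTool.AppImage" 'https://example.org/SomeTool.AppImage'
chmod +x "/scratch/${USER}/SomeTool.AppImage"
"/scratch/${USER}/SomeTool.AppImage"
}}}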


==== Custom-built software with unavailable system dependencies ====

If you have to build software yourself which requires root permission to install dependencies and modify the underlying system, the easiest solution is to deploy it in an [[https://apptainer.org/|Apptainer container]].<<BR>>
For use of `apptainer` on D-ITET infrastructure, see:
 * [[Services/Apptainer|Apptainer]]
 * [[Services/SingularityBuilder|Singularity Builder]]
Apptainer containers created on private PCs may be transferred to the D-ITET infrastructure and used there as well.
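
A minimal sketch of pulling a public image to the local scratch disk and running a command inside it; the `docker://ubuntu:22.04` image is only an example: {{{#!highlight bash numbers=disable
# Example only: pull an image to local scratch and run a command in the container
cd "/scratch/${USER}" &&
    apptainer pull ubuntu.sif docker://ubuntu:22.04 &&
    apptainer exec ubuntu.sif cat /etc/os-release
}}}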

==== Self-contained custom-built software ====

This is anything you can install with your permissions in a directory on the local `/scratch` of your ISG managed workstation, for example a `conda` environment.<<BR>>
The following script is an extensive example to create a portable `conda` environment to run a [[https://jupyter.org/|jupyter notebook]] with some optional optimizations.<<BR>>
Read the comments to decide which parts of the script match your use case and adapt them to your needs: {{{#!highlight bash numbers=disable
#!/bin/bash

# - Use Micromamba to set up a python environment using $conda_channels with $python_packages called $env_name
# - Optionally reduce space used by the environment by:
# - Deduplicating files
# - Stripping binaries
# - Removing python bytecode
# - Compressing the environment into a squashfs image

# Minimal installation, takes ~1'
env_name='jupyter_notebook'
python_packages='notebook'
conda_channels='--channel conda-forge'

# Installation with pytorch and Cuda matching GPU driver in cluster:
# Takes more than 5'
#python_packages='notebook matplotlib scipy sqlite pytorch torchvision pytorch-cuda=11.8'
#conda_channels='--channel conda-forge --channel pytorch --channel nvidia'

micromamba_installer_url='https://micro.mamba.pm/api/micromamba/linux-64/latest'
scratch="/scratch/${USER}"
MAMBA_ROOT_PREFIX="${scratch}/${env_name}"
CONDA_PKGS_DIRS="${scratch}/${env_name}_pkgs"
PYTHONPYCACHEPREFIX="${scratch}/${env_name}_pycache"

# Generate a line of the current terminal window's width
line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' '-')

# Display underlined title to improve readability of script output
function title() {
    echo
    echo "$@"
    echo "${line}"
}

mkdir -v -p "${MAMBA_ROOT_PREFIX}" &&
    cd "${MAMBA_ROOT_PREFIX}" &&
    title 'Downloading latest Micromamba (static linked binary)' &&
    wget --output-document=- "${micromamba_installer_url}" |
    tar -xjv bin/micromamba &&
    # Set base path for Micromamba
    export MAMBA_ROOT_PREFIX CONDA_PKGS_DIRS PYTHONPYCACHEPREFIX &&
    # Initialize Micromamba
    eval "$(${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
    title "Creating environment '${env_name}'" &&
    micromamba create --yes --name ${env_name} ${python_packages} ${conda_channels} &&
    title 'Cleaning up Micromamba installation' &&
    micromamba clean --all --yes &&
    # Optional step
    title 'Deduplicating files' &&
    rdfind -makesymlinks true -makeresultsfile false . &&
    # Optional step
    title 'Converting absolute symlinks to relative symlinks' &&
    symlinks -rc . &&
    # Optional step. May break some software, use with care.
    title 'Stripping binaries' &&
    find . -xdev -type f -print0 |
    xargs --null --no-run-if-empty file --no-pad |
        grep -E '^.*: ELF.*x86-64.*not stripped.*$' |
        cut -d ':' -f 1 |
        xargs --no-run-if-empty strip --verbose --strip-all --discard-all &&
    # Optional step
    title 'Deleting bytecode files (*pyc)' &&
    find . -xdev -name '*.pyc' -print0 |
    xargs --null --no-run-if-empty rm --one-file-system -v &&
    find . -type d -empty -name '__pycache__' -print -delete &&
    # Optional step: Speed up start of jupyter server
    cat <<EOF >"${MAMBA_ROOT_PREFIX}/envs/${env_name}/etc/jupyter/jupyter_server_config.d/nochecks.json" &&
{
  "ServerApp": {
    "tornado_settings": {
      "page_config_data": {
        "buildCheck": false,
        "buildAvailable": false
      }
    }
  }
}
EOF
    # Create start wrapper: This is specific for a jupyter notebook and Python code with the byte cache code placed on a writable storage
    cat <<EOF >"${MAMBA_ROOT_PREFIX}"/start.sh &&
#!/bin/bash
env_name="${env_name}"
scratch="/itet-stor/\${USER}"
MAMBA_ROOT_PREFIX="\${scratch}/\${env_name}"
PYTHONPYCACHEPREFIX="\${scratch}/\${env_name}_pycache"
export MAMBA_ROOT_PREFIX PYTHONPYCACHEPREFIX
eval "\$(\${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
    micromamba run -n jupyter_notebook jupyter notebook --no-browser --port 5998 --ip "\$(hostname -f)"
EOF
    title 'Fixing permissions' &&
    chmod 755 "${MAMBA_ROOT_PREFIX}"/start.sh &&
    chmod --recursive --changes go-w,go+r "${MAMBA_ROOT_PREFIX}" &&
    find "${MAMBA_ROOT_PREFIX}" -xdev -perm /u+x -print0 |
    xargs --null --no-run-if-empty chmod --changes go+x &&
    title 'Creating squashfs image' &&
    mksquashfs "${MAMBA_ROOT_PREFIX}" "${MAMBA_ROOT_PREFIX}".sqsh -no-xattrs -comp zstd &&
    # Show how to start the wrapper
    title 'Start the environment with the following command' &&
    echo "squashfs-mount ${MAMBA_ROOT_PREFIX}.sqsh:${MAMBA_ROOT_PREFIX} -- ${MAMBA_ROOT_PREFIX}/start.sh"

}}}

Any self-contained software installed in `/scratch/$USER/<software>` can be compressed with `mksquashfs` and used with `squashfs-mount` as in the example above.
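
For example, for a hypothetical installation in `/scratch/$USER/mysoftware` with a start script `run.sh`, the pattern from the script above generalizes to: {{{#!highlight bash numbers=disable
# "mysoftware" and "run.sh" are placeholders for your own installation
mksquashfs "/scratch/${USER}/mysoftware" "/scratch/${USER}/mysoftware.sqsh" -no-xattrs -comp zstd
squashfs-mount "/scratch/${USER}/mysoftware.sqsh:/scratch/${USER}/mysoftware" -- "/scratch/${USER}/mysoftware/run.sh"
}}}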


== Available storage systems ==


=== Local node scratch ===

Primarily use the local `/scratch` of a compute node. This storage offers lowest access latency, but space is limited and can differ per node. To be fair to other users it's '''important to clean up after use'''.


==== Available space and hard disk type ====

These are listed in the '''Hardware''' tables for our compute clusters:
 * [[Services/SLURM#Hardware|Arton nodes]] in D-ITET cluster
 * [[Services/SLURM-tik#Hardware|TIK nodes]] in D-ITET cluster
 * [[Services/SLURM-Biwi#Hardware|CVL/BMIC nodes]] in the CVL cluster
 * [[Services/SLURM-Snowflake#Hardware|Snowflake nodes]] in the D-ITET course cluster


==== scratch cleanup ====

 1. `scratch_clean` is active on local `/scratch` of all nodes, meaning older data will be deleted automatically if space is needed. For details see the man page `man scratch_clean`.<<BR>>This is a safety net that cleans up automatically; you have no control over which files it deletes.

 1. Always create a personal directory on a local scratch and '''clean it up after use'''! This way you're in control of deletion and `scratch_clean` will not have to clean up after you.<<BR>>Personal automatic cleanup can be achieved by adapting the following bash script snippet and adding it to your [[Services/SLURM?highlight=%28slurm%29#sbatch_.2BIZI_Submitting_a_job|job submit script]]: {{{#!highlight bash numbers=disable
my_local_scratch_dir="/scratch/${USER}"

# List contents of my_local_scratch_dir to trigger automounting
if ! ls "${my_local_scratch_dir}" 1>/dev/null 2>/dev/null; then
    if ! mkdir --parents --mode=700 "${my_local_scratch_dir}"; then
        echo 'Failed to create my_local_scratch_dir' 1>&2
        exit 1
    fi
fi

# Set a trap to remove my_local_scratch_dir when the job script ends
trap "exit 1" HUP INT TERM
trap 'rm -rf "${my_local_scratch_dir}"' EXIT

# Synchronize a directory containing large files which are not in use by any other process:
rsync -av --inplace <source directory> "${my_local_scratch_dir}"

# Optional: Change the current directory to my_local_scratch_dir, exit if changing didn't succeed.
cd "${my_local_scratch_dir}" || exit 1
}}}

=== Common node scratch ===

Local `/scratch` of nodes is available among nodes at `/scratch_net/node_A` as an automount (on demand). It is accessible exclusively on compute nodes from compute jobs. A use case for this kind of storage is running several compute jobs on different nodes using the same data.

 * Accessing data stored on `/scratch` on one node `A` from other nodes `B, C, D, ...` will '''impact I/O latency for all jobs running on node `A`!'''
 * You have to ensure writing data from nodes `B, C, D, ...` concurrently to `/scratch` on node `A` does not overwrite data already in use
 * `scratch_clean` is active (see above)!
 * The automatic per-job cleanup shown above has to be replaced by a final cleanup in the last job accessing the data (see the sketch below)
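
A minimal sketch, assuming node `node_A` holds the shared data in a hypothetical directory `/scratch/${USER}/shared_data`: {{{#!highlight bash numbers=disable
# In jobs running on other nodes: read the shared data via the automount path
input_dir="/scratch_net/node_A/${USER}/shared_data"
# Trigger the automount and fail early if the data is unavailable
ls "${input_dir}" > /dev/null || exit 1

# ... run the computation reading from "${input_dir}" ...

# Only in the last job accessing the data: clean up on the node holding it
rm -rf "${input_dir}"
}}}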


=== Public storage ===

Public storage is accessible widely: On personal workstations, file servers and compute nodes. It is used in the daily work by all D-ITET members.<<BR>>
This storage allows direct access to data from compute jobs without the need to transfer it to local `/scratch`. Latency is higher because of the wide use and the limited network bandwidth.<<BR>>
While this may look like a convenient storage to use for compute jobs, '''using public storage mandates strict use of the guidelines here to prevent blocking other users'''!

There are different types of public storage available at D-ITET. Make sure you understand what is available to you and which one to use for what purpose. Details about the public storage available at D-ITET are summarized in the [[Services/StorageOverview|Storage overview]].

Your supervisor or your institute's/group's administrative/technical contact will tell you:
 * which storage is available to you from your institute/group
 * which storage to use for intermediate, generated data
 * which storage to use to store your final results
 
⚠ For storage without automated backup: '''Make sure to back up stored data yourself!'''<<BR>>
⚠ Better yet: '''don't store data worth backing up on a system without automated backup!'''


=== Transferring data ===

Transferring a large file between any storage systems accessible within the D-ITET infrastructure is most efficient with the following `rsync` commands: {{{#!highlight bash numbers=disable
# Minimal output
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file
# Add on-the-fly compression if your file is uncompressed
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --compress
# Add verbose output and a progress indicator
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --verbose --progress
}}}
In this example there is a significant reduction in resource use (bandwidth, CPU, memory, time) if a previous version of the target file is already in place, as only changed blocks will be transferred.

A concrete example syncing the file `dataset.parquet` in the project folder `project_one` to an (existing) directory with your username on the local `/scratch` of your (ISG-managed) workstation: {{{#!highlight bash numbers=disable
rsync -a --inplace /itet-stor/$USER/project_one/dataset.parquet /scratch/$USER/ -v --progress }}}


== Tuning Weights & Biases (wandb) ==

If you use [[https://wandb.ai/|Weights & Biases (wandb)]], be aware it can [[https://docs.wandb.ai/guides/technical-faq/metrics-and-performance#will-wandb-slow-down-my-training|create intense I/O]] on the storage it logs its metrics to.

Quote: ''It is possible to log a huge amount of data quickly, and if you do that you might create disk I/O issues.''

In a scenario where many HPC jobs run with `wandb` using the same storage system for job and `wandb` data, this can result in a slowdown of any I/O operation for all job submitters. To prevent this, set up `wandb` as follows:


=== Use a fast local scratch disk for main and cache directory ===

Set environment variables to relocate main and cache directories and create these directories in your (`bash`) job script: {{{#!highlight bash numbers=disable
WANDB_DIR="/scratch/${USER}/wandb_dir"
WANDB_CACHE_DIR="${WANDB_DIR}/.cache"
export WANDB_DIR WANDB_CACHE_DIR
mkdir -vp "${WANDB_CACHE_DIR}"
}}}
See [[https://docs.wandb.ai/guides/track/environment-variables|Environment Variables]] for details.

If you want to keep this data: at the end of the job, remove the cache, pack the main directory into a compressed tar archive on a backed-up location away from the local `/scratch` (such as a project directory), then '''delete it''' from the local `/scratch` disk: {{{#!highlight bash numbers=disable
rm -r "${WANDB_CACHE_DIR}" &&
tar -czf "/itet-stor/${USER}/<your_project_directory>/wandb_${SLURM_JOB_ID}.tar.gz" "${WANDB_DIR}" &&
rm -r "${WANDB_DIR}"
}}}
To automate removal, setting a `trap` as in the example under [[#Local_node_scratch|Local node scratch]] makes sense here as well.


=== Run wandb offline ===

Consider [[https://docs.wandb.ai/guides/technical-faq/setup#can-i-run-wandb-offline|running wandb offline]].<<BR>>
If necessary, sync metrics at the end of your job as explained in the link above.
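
A minimal sketch, assuming the `WANDB_DIR` setup from the section above; the exact directory layout of offline runs may differ between `wandb` versions: {{{#!highlight bash numbers=disable
# Record metrics locally instead of streaming them to the wandb servers
export WANDB_MODE=offline

# ... run your training here ...

# Optionally sync the locally recorded runs once at the end of the job
wandb sync "${WANDB_DIR}/wandb/offline-run-"*
}}}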


=== Tune metrics collection ===

Consider [[https://docs.wandb.ai/guides/track/limits|tuning your metrics collection parameters for faster logging]].


== Related information ==

 * [[https://scicomp.ethz.ch/wiki/Best_Practices|Euler cluster best practices]]
 * [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Euler: choosing optimal storage]]
 * [[https://readme.phys.ethz.ch/storage/general_advice/|D-PHYS storage advice]]
 * [[https://www.researchgate.net/publication/338599610_Understanding_Data_Motion_in_the_Modern_HPC_Data_Center|Understanding Data Motion in the Modern HPC Data Center]]
