HPC Storage and Input/Output (I/O) best practices and guidelines

D-ITET ISG offers storage systems public to all department members as well as systems private to institutes/groups. While access to data on public systems can be restricted or shared, the load placed on the underlying hardware cannot be restricted and is always shared. This implies that on the systems currently employed at D-ITET, fair I/O cannot be guaranteed by technical means.

Fair I/O can only be maintained by adhering to the following guidelines.

The storage system is the main bottleneck for compute jobs on HPC systems, compared to CPU/GPU and memory resources. Compute jobs, if set up improperly, can stall a storage system. The goal of this article is to explain how to:

Prepare your data and code

The worst imaginable case of using data on a storage system is reading/writing many small files and their metadata (creation date, modification date, size) concurrently and in parallel.
The best case is using only a few large files as containers for data and code; these provide the same features as storing files directly on the file system, plus additional optimizations to speed up access to their content.
Size-wise, such large files may well be in the terabyte range.
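
As an illustration of the container approach, a directory tree of many small files can be packed into a single squashfs image and mounted read-only for the duration of a command. This is a minimal sketch with placeholder paths (my_dataset is hypothetical); the same tools appear again in the conda environment example further below:

# Pack a directory of many small files into one compressed, read-only image
mksquashfs "/scratch/${USER}/my_dataset" "/scratch/${USER}/my_dataset.sqsh" -no-xattrs -comp zstd

# Mount the image over the original path for the duration of one command;
# files inside the image are then read as if they were stored in a normal directory
squashfs-mount "/scratch/${USER}/my_dataset.sqsh:/scratch/${USER}/my_dataset" -- ls "/scratch/${USER}/my_dataset"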

Data

Make use of an I/O library designed to parallelize, aggregate and efficiently manage I/O operations (in descending order of relevance):

If your job generates a continuous stream of uncompressed output, consider piping it through a compressor before writing it to a file. We recommend using gzip with a low compression setting to keep CPU usage low:

my_program | gzip --fast --rsyncable > my_output.gz

There are various compressors available on Linux systems; please investigate comparisons and use cases yourself.
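
Conversely, compressed output can be decompressed on the fly when reading it back, avoiding an uncompressed copy on the storage system. A minimal sketch (my_analysis is a hypothetical consumer):

zcat my_output.gz | my_analysis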

Code

Code could be just a single statically compiled executable file or a conda environment with thousands of small files.

Ready-made software

If you're looking for specific software, check whether it is available as an AppImage. An AppImage is a compressed, read-only image of a directory structure containing the software and all its dependencies.
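
A minimal sketch of using an AppImage, assuming a hypothetical my_tool.AppImage already downloaded to your local /scratch:

# Make the single image file executable and run it directly, no installation needed
chmod +x "/scratch/${USER}/my_tool.AppImage"
"/scratch/${USER}/my_tool.AppImage" --help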

Custom-built software with unavailable system dependencies

If you have to build software yourself which requires root permissions to install dependencies or to modify the underlying system, the easiest solution is to deploy it in an Apptainer container.
For use of Apptainer on the D-ITET infrastructure, see:

Apptainer containers created on private PCs may be transferred to the D-ITET infrastructure and used there as well.
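
A minimal sketch of this workflow; my_software.def, my_command and the destination host/path are placeholders:

# On your private PC: build a single container image file from a definition file
apptainer build my_software.sif my_software.def

# Transfer the image file to the D-ITET infrastructure
rsync -av --inplace my_software.sif <your_workstation>:/scratch/<your_username>/

# On the D-ITET infrastructure: run a command inside the container
apptainer exec "/scratch/${USER}/my_software.sif" my_command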

Self-contained custom-built software

This is anything you can install with your own permissions in a directory on the local /scratch of your ISG-managed workstation, for example a conda environment.
The following script is an extensive example of creating a portable conda environment to run a Jupyter notebook, with some optional optimizations.
Read the comments to decide which parts of the script match your use case and adapt them to your needs:

#!/bin/bash

# - Use Micromamba to set up a Python environment named $env_name with $python_packages from $conda_channels
# - Optionally reduce the space used by the environment by:
#   - Deduplicating files
#   - Stripping binaries
#   - Removing Python bytecode
# - Compress the environment into a squashfs image

# Minimal installation, takes ~1'
env_name='jupyter_notebook'
python_packages='notebook'
conda_channels='--channel conda-forge'

# Installation with pytorch and CUDA matching the GPU driver in the cluster:
# Takes more than 5'
#python_packages='notebook matplotlib scipy sqlite pytorch torchvision pytorch-cuda=11.8'
#conda_channels='--channel conda-forge --channel pytorch --channel nvidia'

micromamba_installer_url='https://micro.mamba.pm/api/micromamba/linux-64/latest'
scratch="/scratch/${USER}"
MAMBA_ROOT_PREFIX="${scratch}/${env_name}"
CONDA_PKGS_DIRS="${scratch}/${env_name}_pkgs"
PYTHONPYCACHEPREFIX="${scratch}/${env_name}_pycache"

# Generate a line of the current terminal window's width
line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' '-')

# Display underlined title to improve readability of script output
function title() {
    echo
    echo "$@"
    echo "${line}"
}

mkdir -v -p "${MAMBA_ROOT_PREFIX}" &&
    cd "${MAMBA_ROOT_PREFIX}" &&
    title 'Downloading latest Micromamba (static linked binary)' &&
    wget --output-document=- "${micromamba_installer_url}" |
    tar -xjv bin/micromamba &&
    # Set base path for Micromamba
    export MAMBA_ROOT_PREFIX CONDA_PKGS_DIRS PYTHONPYCACHEPREFIX &&
    # Initialize Micromamba
    eval "$(${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
    title "Creating environment '${env_name}'" &&
    micromamba create --yes --name ${env_name} ${python_packages} ${conda_channels} &&
    title 'Cleaning up Micromamba installation' &&
    micromamba clean --all --yes &&
    # Optional step
    title 'Deduplicating files' &&
    rdfind -makesymlinks true -makeresultsfile false . &&
    # Optional step
    title 'Converting absolute symlinks to relative symlinks' &&
    symlinks -rc . &&
    # Optional step. May break some software, use with care.
    title 'Stripping binaries' &&
    find . -xdev -type f -print0 |
    xargs --null --no-run-if-empty file --no-pad |
        grep -E '^.*: ELF.*x86-64.*not stripped.*$' |
        cut -d ':' -f 1 |
        xargs --no-run-if-empty strip --verbose --strip-all --discard-all &&
    # Optional step
    title 'Deleting bytecode files (*pyc)' &&
    find . -xdev -name '*.pyc' -print0 |
    xargs --null --no-run-if-empty rm --one-file-system -v &&
    find . -type d -empty -name '__pycache__' -print -delete &&
    # Optional step: Speed up start of jupyter server
    cat <<EOF >"${MAMBA_ROOT_PREFIX}/envs/${env_name}/etc/jupyter/jupyter_server_config.d/nochecks.json" &&
{
  "ServerApp": {
    "tornado_settings": {
      "page_config_data": {
        "buildCheck": false,
        "buildAvailable": false
      }
    }
  }
}
EOF
    # Create start wrapper: This is specific to a Jupyter notebook and Python code, with the bytecode cache placed on writable storage
    cat <<EOF >"${MAMBA_ROOT_PREFIX}"/start.sh &&
#!/bin/bash
env_name="${env_name}"
scratch="/scratch/\${USER}"
MAMBA_ROOT_PREFIX="\${scratch}/\${env_name}"
PYTHONPYCACHEPREFIX="\${scratch}/\${env_name}_pycache"
export MAMBA_ROOT_PREFIX PYTHONPYCACHEPREFIX
eval "\$(\${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
    micromamba run -n "\${env_name}" jupyter notebook --no-browser --port 5998 --ip "\$(hostname -f)"
EOF
    title 'Fixing permissions' &&
    chmod 755 "${MAMBA_ROOT_PREFIX}"/start.sh &&
    chmod --recursive --changes go-w,go+r "${MAMBA_ROOT_PREFIX}" &&
    find "${MAMBA_ROOT_PREFIX}" -xdev -perm /u+x -print0 |
    xargs --null --no-run-if-empty chmod --changes go+x &&
    title 'Creating squashfs image' &&
    mksquashfs "${MAMBA_ROOT_PREFIX}" "${MAMBA_ROOT_PREFIX}".sqsh -no-xattrs -comp zstd &&
    # Show how to start the wrapper
    title 'Start the environment with the following command' &&
    echo "squashfs-mount ${MAMBA_ROOT_PREFIX}.sqsh:${MAMBA_ROOT_PREFIX} -- ${MAMBA_ROOT_PREFIX}/start.sh"

Any self-contained software installed in /scratch/$USER/<software> can be compressed with mksquashfs and used with squashfs-mount as in the example above.
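
For example, a hypothetical installation in /scratch/$USER/my_software could be packed and started like this (my_software and my_tool are placeholders):

# Pack the installation directory into a single compressed image
mksquashfs "/scratch/${USER}/my_software" "/scratch/${USER}/my_software.sqsh" -no-xattrs -comp zstd

# Mount the image over the original path for the duration of one command
squashfs-mount "/scratch/${USER}/my_software.sqsh:/scratch/${USER}/my_software" -- "/scratch/${USER}/my_software/bin/my_tool"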

Available storage systems

Local node scratch

Primarily use the local /scratch of a compute node. This storage offers the lowest access latency, but space is limited and can differ per node. To be fair to other users, it's important to clean up after use.

Available space and hard disk type

These are listed in the Hardware tables for our compute clusters:

scratch cleanup

  1. scratch_clean is active on the local /scratch of all nodes, meaning older data will be deleted automatically if space is needed. For details see the man page man scratch_clean.
    This is a safety net which performs automatic cleanup, where you have no control over which files are deleted.

  2. Always create a personal directory on a local scratch and clean it up after use! This way you're in control of deletion and scratch_clean will not have to clean up after you.
    Personal automatic cleanup can be achieved by adapting the following bash script snippet and adding it to your job submit script:

    my_local_scratch_dir="/scratch/${USER}"
    
    # List contents of my_local_scratch_dir to trigger automounting
    if ! ls "${my_local_scratch_dir}" 1>/dev/null 2>/dev/null; then
        if ! mkdir --parents --mode=700 "${my_local_scratch_dir}"; then
            echo 'Failed to create my_local_scratch_dir' 1>&2
            exit 1
        fi
    fi
    
    # Set a trap to remove my_local_scratch_dir when the job script ends
    trap "exit 1" HUP INT TERM
    trap 'rm -rf "${my_local_scratch_dir}"' EXIT
    
    # Synchronize a directory containing large files which are not in use by any other process:
    rsync -av --inplace <source directory> "${my_local_scratch_dir}"
    
    # Optional: Change the current directory to my_local_scratch_dir, exit if changing didn't succeed.
    cd "${my_local_scratch_dir}" || exit 1
    

Common node scratch

The local /scratch of a node (e.g. node_A) is available on other nodes at /scratch_net/node_A as an automount (mounted on demand). It is accessible exclusively on compute nodes from compute jobs. A use case for this kind of storage is running several compute jobs on different nodes using the same data.
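
A minimal sketch of reading such shared data from a job running on a different node (node_A, my_dataset and my_program are placeholders):

# First access triggers the automount of node_A's local scratch
input_dir="/scratch_net/node_A/${USER}/my_dataset"
ls "${input_dir}" > /dev/null || exit 1
my_program --input "${input_dir}"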

Public storage

Public storage is accessible widely: on personal workstations, file servers and compute nodes. It is used in the daily work of all D-ITET members.
This storage allows direct access to data from compute jobs without the need to transfer it to a local /scratch first. Latency is higher because of the wide use and limited network bandwidth.
While this may look like a convenient storage to use for compute jobs, using public storage mandates strict adherence to the guidelines given here to prevent blocking other users!

There are different types of public storage available at D-ITET. Make sure you understand what is available to you and which one to use for which purpose. Details about the public storage available at D-ITET are summarized in the Storage overview.

Your supervisor or your institute's/group's administrative/technical contact will tell you:

⚠ For storage without automated backup: Make sure to back up stored data yourself!
⚠ Better yet, don't store data worth backing up on a system without automated backup!

Don'ts!

Avoid slowing down your own work environment by introducing dependencies on files placed on a public storage system. Concrete examples of what to avoid are:

Transferring data

Transferring a large file between any storage systems accessible within the D-ITET infrastructure is most efficient with the following rsync commands:

# Minimal output
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file
# Add on-the-fly compression if your file is uncompressed
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --compress
# Add verbose output and a progress indicator
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --verbose --progress

In these examples there is a significant reduction in resource usage (bandwidth, CPU, memory, time) if a previous version of the target file is already in place, as only changed blocks will be transferred.

A concrete example syncing the file dataset.parquet in the project folder project_one to an (existing) directory with your username on the local /scratch of your (ISG-managed) workstation:

rsync -a --inplace /itet-stor/$USER/project_one/dataset.parquet /scratch/$USER/ -v --progress 

Tuning Weights & Biases (wandb)

If you use Weights & Biases (wandb), be aware that it can create intense I/O on the storage it logs its metrics to.

Quote: It is possible to log a huge amount of data quickly, and if you do that you might create disk I/O issues.

In a scenario where many HPC jobs run with wandb using the same storage system for job data and wandb data, this can result in a slowdown of all I/O operations for all job submitters. To prevent this, set up wandb as follows:

Use a fast local scratch disk for main and cache directory

Set environment variables to relocate main and cache directories and create these directories in your (bash) job script:

WANDB_DIR="/scratch/${USER}/wandb_dir"
WANDB_CACHE_DIR="${WANDB_DIR}/.cache"
export WANDB_DIR WANDB_CACHE_DIR
mkdir -vp "${WANDB_CACHE_DIR}"

See Environment Variables for details.

If you want to keep this data, remove the cache at the end of a job, pack the main directory into a compressed tar archive in a backed-up location away from the local /scratch (such as a project directory), then delete it from the local /scratch disk:

rm -r "${WANDB_CACHE_DIR}" &&
tar -czf "/itet-stor/${USER}/<your_project_directory>/wandb_${SLURM_JOB_ID}.tar.gz" "${WANDB_DIR}" &&
rm -r "${WANDB_DIR}"

To automate removal, setting a trap as in the example under Local node scratch makes sense here as well.
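
A minimal sketch of such a trap, reusing the variables and archive commands from above (the project directory is a placeholder):

cleanup_wandb() {
    rm -r "${WANDB_CACHE_DIR}"
    tar -czf "/itet-stor/${USER}/<your_project_directory>/wandb_${SLURM_JOB_ID}.tar.gz" "${WANDB_DIR}" &&
        rm -r "${WANDB_DIR}"
}
# Archive and remove the wandb directories when the job script exits
trap cleanup_wandb EXIT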

Run wandb offline

Consider running wandb offline.
If necessary, sync metrics at the end of your job as explained in the link above.
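
A minimal sketch in a bash job script, assuming the WANDB_DIR setup from above (my_training_script is a placeholder; the offline run directory layout may differ between wandb versions):

# Run wandb in offline mode: metrics are only written locally below WANDB_DIR
export WANDB_MODE=offline
my_training_script

# At the end of the job, upload the locally collected runs to the wandb server
wandb sync "${WANDB_DIR}"/wandb/offline-run-*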

Tune metrics collection

Consider tuning your metrics collection parameters for faster logging.
