## page was renamed from Services/StorageUsageGuidelines
## rev 2024-08-12 stroth

= HPC Storage and Input/Output (I/O) best practices and guidelines =

D-ITET ISG offers storage systems public to all department members as well as systems private to institutes/groups. While access to data on public systems can be ''restricted'' as well as ''shared'', the load placed on the underlying hardware cannot be ''restricted'' and is always ''shared''. This implies that on the systems currently employed at D-ITET, fair I/O cannot be guaranteed by technical means. '''Fair I/O can only be maintained by adhering to the following guidelines'''.

Compared to CPU/GPU and memory resources, the storage system is the main bottleneck for compute jobs on HPC systems. Compute jobs - if set up improperly - can stall a storage system. The goal of this article is to explain how to:

 * Maximize a job's I/O performance
 * Keep the I/O load compute jobs place on storage systems low

== Prepare your data and code ==

The '''worst imaginable case''' of using data on a storage system is '''reading/writing many small files''' and their metadata (creation date, modification date, size) concurrently and in parallel.<<BR>>
The '''best case''' is using only '''few large files as containers''' for data and code. Such containers provide the same features as storing files directly on the file system, plus additional optimizations to speed up access to their content.<<BR>>
Size-wise, large files may be in the terabyte range.

=== Data ===

Make use of an I/O library designed to parallelize, aggregate and efficiently manage I/O operations (in descending order of relevance):

 * [[https://petastorm.readthedocs.io/en/latest/|Use Parquet storage from Python]]
  . [[https://parquet.apache.org/|Apache Parquet format]]
  . [[https://saturncloud.io/blog/how-to-write-data-to-parquet-with-python/|How to write data to parquet with python]]
  . [[https://www.blog.pythonlibrary.org/2024/05/06/how-to-read-and-write-parquet-files-with-python/|How to read and write Parquet files with Python]]
 * [[https://www.unidata.ucar.edu/software/netcdf/|NetCDF4]]
  . [[https://unidata.github.io/netcdf4-python/|NetCDF4 Python interface]]
 * [[https://www.hdfgroup.org/|HDF5]]
  . [[https://docs.h5py.org/en/stable/|HDF5 Python interface]]
  . [[https://realpython.com/storing-images-in-python/|Example of storing/accessing lots of images in Python]]
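
For illustration, here is a minimal sketch of the container-file approach, assuming `pandas` and `pyarrow` are installed in your Python environment; the paths and file names are placeholders:

{{{#!highlight bash numbers=disable
# Minimal sketch: pack many small CSV files into a single Parquet file.
# Assumes pandas and pyarrow are available; paths are placeholders.
python3 - <<EOF
import glob
import pandas as pd

# Read the many small input files once ...
frames = [pd.read_csv(f) for f in sorted(glob.glob("/scratch/${USER}/raw/part_*.csv"))]

# ... and store them as one large container file
pd.concat(frames, ignore_index=True).to_parquet("/scratch/${USER}/dataset.parquet")

# Compute jobs then read the single file instead of thousands of small ones
df = pd.read_parquet("/scratch/${USER}/dataset.parquet")
print(df.shape)
EOF
}}}

The same idea applies to HDF5 or NetCDF4: pay the cost of handling many small files once, so your compute jobs only read a few large files.
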
If your job generates a continuous stream of uncompressed output, consider piping it through a compressor before writing it to a file. We recommend using `gzip` with a low compression setting to keep CPU usage low:

{{{#!highlight bash numbers=disable
my_program | gzip --fast --rsyncable > my_output.gz
}}}

There are various compressors available on Linux systems, please investigate comparisons and use cases yourself.

=== Code ===

Code could be just a single statically compiled executable file or a [[Programming/Languages/Conda|conda]] environment with thousands of small files.

==== Ready-made software ====

If you're looking for a specific software, check if it is available as an [[https://appimage.org/|AppImage]]. An AppImage is already a compressed image (read-only archive) of a directory structure containing the software and all its dependencies.
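
A minimal sketch of using an AppImage from a local scratch directory; the download URL and file name are placeholders for whatever software you actually need:

{{{#!highlight bash numbers=disable
# Placeholders: replace the URL and file name with the AppImage you actually need
cd "/scratch/${USER}" &&
wget https://example.org/downloads/SomeTool.AppImage &&
chmod +x SomeTool.AppImage &&
./SomeTool.AppImage --help
# If FUSE is not available on a node, most AppImages can also be run with:
# ./SomeTool.AppImage --appimage-extract-and-run
}}}
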
==== Custom-built software with unavailable system dependencies ====

If you have to build software yourself which requires root permission to install dependencies and modify the underlying system, the easiest solution is to deploy it in an [[https://apptainer.org/|Apptainer container]].<<BR>>
For use of `apptainer` on D-ITET infrastructure, see:

 * [[Services/Apptainer|Apptainer]]
 * [[Services/SingularityBuilder|Singularity Builder]]

Apptainer containers created on private PCs may be transferred to the D-ITET infrastructure and used there as well.
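
A minimal sketch of running such a container in a job script, assuming the image has already been built; the image name and the script path are placeholders:

{{{#!highlight bash numbers=disable
# Copy the container image to the node's local scratch once, then run from there
rsync -a --inplace "/itet-stor/${USER}/project_one/my_container.sif" "/scratch/${USER}/" &&
# Add --nv to the exec command if the job needs access to the node's GPU(s)
apptainer exec "/scratch/${USER}/my_container.sif" python3 /path/to/my_training_script.py
}}}
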
==== Self-contained custom-built software ====

This is anything you can install with your permissions in a directory on the local `/scratch` of your ISG managed workstation, for example a `conda` environment.<<BR>>
The following script is an extensive example to create a portable `conda` environment to run a [[https://jupyter.org/|jupyter notebook]] with some optional optimizations.<<BR>>
Read the comments to decide which parts of the script match your use case and adapt them to your needs:

{{{#!highlight bash numbers=disable
#!/bin/bash

# - Use Micromamba to set up a python environment using $conda_channels with $python_packages called $env_name
# - Optionally reduce space used by the environment by:
#   - Deduplicating files
#   - Stripping binaries
#   - Removing python bytecode
#   - Compressing the environment into a squashfs image

# Minimal installation, takes ~1'
env_name='jupyter_notebook'
python_packages='notebook'
conda_channels='--channel conda-forge'
# Installation with pytorch and Cuda matching GPU driver in cluster:
# Takes more than 5'
#python_packages='notebook matplotlib scipy sqlite pytorch torchvision pytorch-cuda=11.8'
#conda_channels='--channel conda-forge --channel pytorch --channel nvidia'

micromamba_installer_url='https://micro.mamba.pm/api/micromamba/linux-64/latest'
scratch="/scratch/${USER}"
MAMBA_ROOT_PREFIX="${scratch}/${env_name}"
CONDA_PKGS_DIRS="${scratch}/${env_name}_pkgs"
PYTHONPYCACHEPREFIX="${scratch}/${env_name}_pycache"

# Generate a line of the current terminal window's width
line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' '-')

# Display underlined title to improve readability of script output
function title() {
  echo
  echo "$@"
  echo "${line}"
}

mkdir -v -p "${MAMBA_ROOT_PREFIX}" &&
cd "${MAMBA_ROOT_PREFIX}" &&
title 'Downloading latest Micromamba (static linked binary)' &&
wget --output-document=- "${micromamba_installer_url}" | tar -xjv bin/micromamba &&
# Set base path for Micromamba
export MAMBA_ROOT_PREFIX CONDA_PKGS_DIRS PYTHONPYCACHEPREFIX &&
# Initialize Micromamba
eval "$(${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
title "Creating environment '${env_name}'" &&
micromamba create --yes --name ${env_name} ${python_packages} ${conda_channels} &&
title 'Cleaning up Micromamba installation' &&
micromamba clean --all --yes &&
# Optional step
title 'Deduplicating files' &&
rdfind -makesymlinks true -makeresultsfile false . &&
# Optional step
title 'Converting absolute symlinks to relative symlinks' &&
symlinks -rc . &&
# Optional step. May break a software, use with care.
title 'Stripping binaries' &&
find . -xdev -type f -print0 | xargs --null --no-run-if-empty file --no-pad | grep -E '^.*: ELF.*x86-64.*not stripped.*$' | cut -d ':' -f 1 | xargs --no-run-if-empty strip --verbose --strip-all --discard-all &&
# Optional step
title 'Deleting bytecode files (*pyc)' &&
find . -xdev -name '*.pyc' -print0 | xargs --null --no-run-if-empty rm --one-file-system -v &&
find . -type d -empty -name '__pycache__' -print -delete &&
# Optional step: Speed up start of jupyter server
cat <<EOF >"${MAMBA_ROOT_PREFIX}/envs/${env_name}/etc/jupyter/jupyter_server_config.d/nochecks.json" &&
{
  "ServerApp": {
    "tornado_settings": {
      "page_config_data": {
        "buildCheck": false,
        "buildAvailable": false
      }
    }
  }
}
EOF
# Create start wrapper: This is specific for a jupyter notebook and Python code with the byte cache placed on a writable storage
cat <<EOF >"${MAMBA_ROOT_PREFIX}"/start.sh &&
#!/bin/bash
env_name="${env_name}"
scratch="/scratch/\${USER}"
MAMBA_ROOT_PREFIX="\${scratch}/\${env_name}"
PYTHONPYCACHEPREFIX="\${scratch}/\${env_name}_pycache"
export MAMBA_ROOT_PREFIX PYTHONPYCACHEPREFIX
eval "\$(\${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
micromamba run -n jupyter_notebook jupyter notebook --no-browser --port 5998 --ip "\$(hostname -f)"
EOF
title 'Fixing permissions' &&
chmod 755 "${MAMBA_ROOT_PREFIX}"/start.sh &&
chmod --recursive --changes go-w,go+r "${MAMBA_ROOT_PREFIX}" &&
find "${MAMBA_ROOT_PREFIX}" -xdev -perm /u+x -print0 | xargs --null --no-run-if-empty chmod --changes go+x &&
title 'Creating squashfs image' &&
mksquashfs "${MAMBA_ROOT_PREFIX}" "${MAMBA_ROOT_PREFIX}".sqsh -no-xattrs -comp zstd &&
# Show how to start the wrapper
title 'Start the environment with the following command' &&
echo "squashfs-mount ${MAMBA_ROOT_PREFIX}.sqsh:${MAMBA_ROOT_PREFIX} -- ${MAMBA_ROOT_PREFIX}/start.sh"
}}}

Any self-contained software installed in `/scratch/$USER/` can be compressed with `mksquashfs` and used with `squashfs-mount` as in the example above.
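
For example, a hypothetical tool installed under `/scratch/${USER}/mytool` could be packed and used like this (a sketch; the directory and executable names are placeholders):

{{{#!highlight bash numbers=disable
# Pack the directory into a single compressed, read-only image ...
mksquashfs "/scratch/${USER}/mytool" "/scratch/${USER}/mytool.sqsh" -no-xattrs -comp zstd

# ... and run the software with the image mounted over the original path
squashfs-mount "/scratch/${USER}/mytool.sqsh:/scratch/${USER}/mytool" -- "/scratch/${USER}/mytool/bin/mytool" --version
}}}
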
== Available storage systems ==

=== Local node scratch ===

Primarily use the local `/scratch` of a compute node. This storage offers the lowest access latency, but space is limited and can differ per node. To be fair to other users it's '''important to clean up after use'''.

==== Available space and harddisk type ====

These are listed in the '''Hardware''' tables for our compute clusters:

 * [[Services/SLURM#Hardware|Arton nodes]] in the D-ITET cluster
 * [[Services/SLURM-tik#Hardware|TIK nodes]] in the D-ITET cluster
 * [[Services/SLURM-Biwi#Hardware|CVL/BMIC nodes]] in the CVL cluster
 * [[Services/SLURM-Snowflake#Hardware|Snowflake nodes]] in the D-ITET course cluster

==== scratch cleanup ====

 1. `scratch_clean` is active on the local `/scratch` of all nodes, meaning older data will be deleted automatically if space is needed. For details see the man page `man scratch_clean`.<<BR>>This is a safety net which does automatic cleanup, where you have no control over which files are deleted.
 1. Always create a personal directory on a local scratch and '''clean it up after use'''! This way you're in control of deletion and `scratch_clean` will not have to clean up after you.<<BR>>Personal automatic cleanup can be achieved by adapting the following bash script snippet and adding it to your [[Services/SLURM#sbatch_.2BIZI_Submitting_a_job|job submit script]]:

{{{#!highlight bash numbers=disable
my_local_scratch_dir="/scratch/${USER}"

# List contents of my_local_scratch_dir to trigger automounting
if ! ls "${my_local_scratch_dir}" 1>/dev/null 2>/dev/null; then
  if ! mkdir --parents --mode=700 "${my_local_scratch_dir}"; then
    echo 'Failed to create my_local_scratch_dir' 1>&2
    exit 1
  fi
fi

# Set a trap to remove my_local_scratch_dir when the job script ends
trap "exit 1" HUP INT TERM
trap 'rm -rf "${my_local_scratch_dir}"' EXIT

# Synchronize a directory containing large files which are not in use by any other process
# (the source path is a placeholder, replace it with the location of your data):
rsync -av --inplace /path/to/your/large/files/ "${my_local_scratch_dir}"

# Optional: Change the current directory to my_local_scratch_dir, exit if changing didn't succeed.
cd "${my_local_scratch_dir}" || exit 1
}}}

=== Common node scratch ===

The local `/scratch` of nodes is available among nodes at `/scratch_net/node_A` as an automount (on demand). It is accessible exclusively on compute nodes from compute jobs. A use case for this kind of storage is running several compute jobs on different nodes using the same data.

 * Accessing data stored on `/scratch` on one node `A` from other nodes `B, C, D, ...` will '''impact I/O latency for all jobs running on node `A`!'''
 * You have to ensure writing data from nodes `B, C, D, ...` concurrently to `/scratch` on node `A` does not overwrite data already in use
 * `scratch_clean` is active (see above)!
 * Automatic cleanup per job as shown above has to be replaced by a final cleanup in the last job accessing the data
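
A minimal sketch of this access pattern; `node_A`, the directory and the file names are placeholders matching the notation above:

{{{#!highlight bash numbers=disable
# A job running on node A prepares the data on its local scratch once:
mkdir -p "/scratch/${USER}/shared_dataset" &&
rsync -a --inplace "/itet-stor/${USER}/project_one/dataset.parquet" "/scratch/${USER}/shared_dataset/"

# Jobs running on other nodes then read it through the automount path:
my_program --input "/scratch_net/node_A/${USER}/shared_dataset/dataset.parquet"
}}}

Remember that the last job using the data is responsible for the final cleanup, as noted in the list above.
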
=== Public storage ===

Public storage is accessible widely: on personal workstations, file servers and compute nodes. It is used in the daily work by all D-ITET members.<<BR>>
This storage allows direct access to data from compute jobs without the need to transfer it to local `/scratch`. Latency is higher because of the wide use and the limited network bandwidth.<<BR>>
While this may look like a convenient storage to use for compute jobs, '''using public storage mandates strict adherence to the guidelines here to prevent blocking other users'''!

There are different types of public storage available at D-ITET. Make sure you understand what is available to you and which one to use for what purpose. Details about the public storage available at D-ITET are summarized in the [[Services/StorageOverview|Storage overview]]. Your supervisor or your institute's/group's administrative/technical contact will tell you:

 * which storage is available to you from your institute/group
 * which storage to use for intermediate, generated data
 * which storage to use to store your final results

⚠ For storage without automated backup: '''Make sure to backup stored data yourself!'''<<BR>>
⚠ Better, '''don't store data worthy of a backup on a system without automated backup!'''

==== Don'ts! ====

Avoid slowing down your own work environment by introducing dependencies on public storage. Concrete examples of what to avoid:

 * Don't replace directories used by the operating system with links to a public storage!<<BR>>Such directories are for example: `~/.cache`, `~/.local`, `~/.config`
 * Don't initialize anything like `conda` environments placed on public storage in your login shell init script!<<BR>>Init scripts are for example: `~/.profile`, `~/.bashrc`, `~/.zshrc`
 * Don't export environment variables to redirect caches or temporary files to a public storage!<<BR>>Such variables are for example: `TMPDIR`, `XDG_CACHE_HOME`, `PIP_CACHE_DIR`
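
If you need to relocate such caches, point them at the node's local `/scratch` in your job script instead. A minimal sketch; the directory layout and the use of `SLURM_JOB_ID` are suggestions, not a fixed convention:

{{{#!highlight bash numbers=disable
# Redirect caches and temporary files to the node's local scratch instead of public storage.
job_scratch="/scratch/${USER}/cache_${SLURM_JOB_ID:-manual}"
TMPDIR="${job_scratch}/tmp"
XDG_CACHE_HOME="${job_scratch}/xdg"
PIP_CACHE_DIR="${job_scratch}/pip"
export TMPDIR XDG_CACHE_HOME PIP_CACHE_DIR
mkdir -p "${TMPDIR}" "${XDG_CACHE_HOME}" "${PIP_CACHE_DIR}"

# Clean up when the job script ends (see the trap example under local node scratch)
trap 'rm -rf "${job_scratch}"' EXIT
}}}
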
=== Transferring data ===

Transfer of a large file between any storage accessible within the D-ITET infrastructure is most efficient with the following `rsync` commands:

{{{#!highlight bash numbers=disable
# Minimal output
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file
# Add on-the-fly compression if your file is uncompressed
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --compress
# Add verbose output and a progress indicator
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --verbose --progress
}}}

In this example there is a significant reduction in the use of resources (bandwidth, CPU, memory, time) if a previous version of the target file is already in place, as only changed blocks will be transferred.

A concrete example syncing the file `dataset.parquet` in the project folder `project_one` to an (existing) directory with your username on the local `/scratch` of your (ISG managed) workstation:

{{{#!highlight bash numbers=disable
rsync -a --inplace /itet-stor/$USER/project_one/dataset.parquet /scratch/$USER/ -v --progress
}}}

== Tuning Weights & Biases (wandb) ==

If you use [[https://wandb.ai/|Weights & Biases (wandb)]], be aware it can [[https://docs.wandb.ai/guides/technical-faq/metrics-and-performance#will-wandb-slow-down-my-training|create intense I/O]] on the storage where it logs its metrics. Quote: ''It is possible to log a huge amount of data quickly, and if you do that you might create disk I/O issues.''

In a scenario where many HPC jobs run with `wandb` using the same storage system for job and `wandb` data, this can result in a slowdown of any I/O operation for all job submitters. To prevent this, set up `wandb` as follows:

=== Use a fast local scratch disk for main and cache directory ===

Set environment variables to relocate the main and cache directories and create these directories in your (`bash`) job script:

{{{#!highlight bash numbers=disable
WANDB_DIR="/scratch/${USER}/wandb_dir"
WANDB_CACHE_DIR="${WANDB_DIR}/.cache"
export WANDB_DIR WANDB_CACHE_DIR
mkdir -vp "${WANDB_CACHE_DIR}"
}}}

See [[https://docs.wandb.ai/guides/track/environment-variables|Environment Variables]] for details.

If you want to keep this data, remove the cache at the end of a job, copy the main directory into a compressed tar archive in a backed-up location like a project directory (away from the local `/scratch`), then '''delete it''' from the local `/scratch` disk:

{{{#!highlight bash numbers=disable
rm -r "${WANDB_CACHE_DIR}" &&
tar -czf "/itet-stor/${USER}/wandb_${SLURM_JOB_ID}.tar.gz" "${WANDB_DIR}" &&
rm -r "${WANDB_DIR}"
}}}

To automate removal, setting a `trap` as in the example under [[#Local_scratch|Local scratch]] makes sense here as well.

=== Run wandb offline ===

Consider [[https://docs.wandb.ai/guides/technical-faq/setup#can-i-run-wandb-offline|running wandb offline]].<<BR>>
If necessary, sync metrics at the end of your job as explained in the link above.
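
A minimal sketch of this pattern in a job script, assuming a recent `wandb` client and the `WANDB_DIR` setup shown above; the training script name is a placeholder and the exact run directory layout may differ between `wandb` versions:

{{{#!highlight bash numbers=disable
# Run offline: metrics are written locally instead of being streamed to the wandb servers
export WANDB_MODE=offline
python3 my_training_script.py

# At the end of the job, upload the recorded offline runs in one go
wandb sync "${WANDB_DIR}/wandb/offline-run-"*
}}}
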
=== Tune metrics collection ===

Consider [[https://docs.wandb.ai/guides/track/limits|tuning your metrics collection parameters for faster logging]].

== Related information ==

 * [[https://scicomp.ethz.ch/wiki/Best_Practices|Euler cluster best practices]]
 * [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Euler: choosing optimal storage]]
 * [[https://readme.phys.ethz.ch/storage/general_advice/|D-PHYS storage advice]]
 * [[https://www.researchgate.net/publication/338599610_Understanding_Data_Motion_in_the_Modern_HPC_Data_Center|Understanding Data Motion in the Modern HPC Data Center]]