## page was renamed from Services/StorageUsageGuidelines
## rev 2024-08-12 stroth

= HPC Storage and Input/Output (I/O) best practices and guidelines =

D-ITET ISG offers storage systems public to all department members as well as systems private to institutes/groups. While access to data on public systems can be ''restricted'' as well as ''shared'', the load placed on the underlying hardware cannot be ''restricted'' and is always ''shared''. This implies that on the systems currently employed at D-ITET, fair I/O cannot be guaranteed by technical means. '''Fair I/O can only be maintained by adhering to the following guidelines'''.

Compared to CPU/GPU and memory resources, the storage system is the main bottleneck for compute jobs on HPC systems. Compute jobs - if set up improperly - can stall a storage system. The goal of this article is to explain how to:

 * Maximize a job's I/O performance
 * Keep the I/O load compute jobs place on storage systems low

== Prepare your data and code ==

The '''worst imaginable case''' of using data on a storage system is '''reading/writing many small files''' and their metadata (creation date, modification date, size) concurrently and in parallel.<<BR>>
The '''best case''' is using only '''few large files as containers''' for data and code. Such containers provide the same features as storing files directly on the file system, plus additional optimizations to speed up access to their content.<<BR>>
Size-wise, large files may be in the terabyte range.

=== Data ===

Make use of an I/O library designed to parallelize, aggregate and efficiently manage I/O operations (in descending order of relevance):

 * [[https://petastorm.readthedocs.io/en/latest/|Use Parquet storage from Python]]
  . [[https://parquet.apache.org/|Apache Parquet format]]
  . [[https://saturncloud.io/blog/how-to-write-data-to-parquet-with-python/|How to write data to parquet with python]]
  . [[https://www.blog.pythonlibrary.org/2024/05/06/how-to-read-and-write-parquet-files-with-python/|How to read and write Parquet files with Python]]
 * [[https://www.unidata.ucar.edu/software/netcdf/|NetCDF4]]
  . [[https://unidata.github.io/netcdf4-python/|NetCDF4 Python interface]]
 * [[https://www.hdfgroup.org/|HDF5]]
  . [[https://docs.h5py.org/en/stable/|HDF5 Python interface]]
  . [[https://realpython.com/storing-images-in-python/|Example of storing/accessing lots of images in Python]]
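
For illustration, here is a minimal sketch of the container-file approach, assuming `pandas` and `pyarrow` are installed in your Python environment; the paths and file names are placeholders:

{{{#!highlight bash numbers=disable
# Minimal sketch: pack many small CSV files into a single Parquet file.
# Assumes pandas and pyarrow are available; paths are placeholders.
python3 - <<EOF
import glob
import pandas as pd

# Read the many small input files once ...
frames = [pd.read_csv(f) for f in sorted(glob.glob("/scratch/${USER}/raw/part_*.csv"))]

# ... and store them as one large container file
pd.concat(frames, ignore_index=True).to_parquet("/scratch/${USER}/dataset.parquet")

# Compute jobs then read the single file instead of thousands of small ones
df = pd.read_parquet("/scratch/${USER}/dataset.parquet")
print(df.shape)
EOF
}}}

The same idea applies to HDF5 or NetCDF4: pay the cost of handling many small files once, so your compute jobs only read a few large files.
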
If your job generates a continuous stream of uncompressed output, consider piping it through a compressor before writing it to a file. We recommend using `gzip` with a low compression setting to keep CPU usage low:

{{{#!highlight bash numbers=disable
my_program | gzip --fast --rsyncable > my_output.gz
}}}

There are various compressors available on Linux systems, please investigate comparisons and use cases yourself.

=== Code ===

Code could be just a single statically compiled executable file or a [[Programming/Languages/Conda|conda]] environment with thousands of small files.

==== Ready-made software ====

If you're looking for a specific software, check if it is available as an [[https://appimage.org/|AppImage]]. An AppImage is already a compressed image (read-only archive) of a directory structure containing the software and all its dependencies.
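
A minimal sketch of using an AppImage from a local scratch directory; the download URL and file name are placeholders for whatever software you actually need:

{{{#!highlight bash numbers=disable
# Placeholders: replace the URL and file name with the AppImage you actually need
cd "/scratch/${USER}" &&
wget https://example.org/downloads/SomeTool.AppImage &&
chmod +x SomeTool.AppImage &&
./SomeTool.AppImage --help
# If FUSE is not available on a node, most AppImages can also be run with:
# ./SomeTool.AppImage --appimage-extract-and-run
}}}
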
==== Custom-built software with unavailable system dependencies ====

If you have to build software yourself which requires root permission to install dependencies and modify the underlying system, the easiest solution is to deploy it in an [[https://apptainer.org/|Apptainer container]].<<BR>>
For use of `apptainer` on D-ITET infrastructure, see:

 * [[Services/Apptainer|Apptainer]]
 * [[Services/SingularityBuilder|Singularity Builder]]

Apptainer containers created on private PCs may be transferred to the D-ITET infrastructure and used there as well.
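
A minimal sketch of running such a container in a job script, assuming the image has already been built; the image name and the script path are placeholders:

{{{#!highlight bash numbers=disable
# Copy the container image to the node's local scratch once, then run from there
rsync -a --inplace "/itet-stor/${USER}/project_one/my_container.sif" "/scratch/${USER}/" &&
# Add --nv to the exec command if the job needs access to the node's GPU(s)
apptainer exec "/scratch/${USER}/my_container.sif" python3 /path/to/my_training_script.py
}}}
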
==== Self-contained custom-built software ====

This is anything you can install with your permissions in a directory on the local `/scratch` of your ISG managed workstation, for example a `conda` environment.<<BR>>
The following script is an extensive example to create a portable `conda` environment to run a [[https://jupyter.org/|jupyter notebook]] with some optional optimizations.<<BR>>
Read the comments to decide which parts of the script match your use case and adapt them to your needs:

{{{#!highlight bash numbers=disable
#!/bin/bash

# - Use Micromamba to set up a python environment using $conda_channels with $python_packages called $env_name
# - Optionally reduce space used by the environment by:
#   - Deduplicating files
#   - Stripping binaries
#   - Removing python bytecode
#   - Compressing the environment into a squashfs image

# Minimal installation, takes ~1'
env_name='jupyter_notebook'
python_packages='notebook'
conda_channels='--channel conda-forge'
# Installation with pytorch and Cuda matching GPU driver in cluster:
# Takes more than 5'
#python_packages='notebook matplotlib scipy sqlite pytorch torchvision pytorch-cuda=11.8'
#conda_channels='--channel conda-forge --channel pytorch --channel nvidia'

micromamba_installer_url='https://micro.mamba.pm/api/micromamba/linux-64/latest'
scratch="/scratch/${USER}"
MAMBA_ROOT_PREFIX="${scratch}/${env_name}"
CONDA_PKGS_DIRS="${scratch}/${env_name}_pkgs"
PYTHONPYCACHEPREFIX="${scratch}/${env_name}_pycache"

# Generate a line of the current terminal window's width
line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' | tr ' ' '-')

# Display underlined title to improve readability of script output
function title() {
  echo
  echo "$@"
  echo "${line}"
}

mkdir -v -p "${MAMBA_ROOT_PREFIX}" &&
cd "${MAMBA_ROOT_PREFIX}" &&
title 'Downloading latest Micromamba (static linked binary)' &&
wget --output-document=- "${micromamba_installer_url}" | tar -xjv bin/micromamba &&
# Set base path for Micromamba
export MAMBA_ROOT_PREFIX CONDA_PKGS_DIRS PYTHONPYCACHEPREFIX &&
# Initialize Micromamba
eval "$(${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
title "Creating environment '${env_name}'" &&
micromamba create --yes --name ${env_name} ${python_packages} ${conda_channels} &&
title 'Cleaning up Micromamba installation' &&
micromamba clean --all --yes &&
# Optional step
title 'Deduplicating files' &&
rdfind -makesymlinks true -makeresultsfile false . &&
# Optional step
title 'Converting absolute symlinks to relative symlinks' &&
symlinks -rc . &&
# Optional step. May break a software, use with care.
title 'Stripping binaries' &&
find . -xdev -type f -print0 | xargs --null --no-run-if-empty file --no-pad | grep -E '^.*: ELF.*x86-64.*not stripped.*$' | cut -d ':' -f 1 | xargs --no-run-if-empty strip --verbose --strip-all --discard-all &&
# Optional step
title 'Deleting bytecode files (*pyc)' &&
find . -xdev -name '*.pyc' -print0 | xargs --null --no-run-if-empty rm --one-file-system -v &&
find . -type d -empty -name '__pycache__' -print -delete &&
# Optional step: Speed up start of jupyter server
cat <<EOF >"${MAMBA_ROOT_PREFIX}/envs/${env_name}/etc/jupyter/jupyter_server_config.d/nochecks.json" &&
{
  "ServerApp": {
    "tornado_settings": {
      "page_config_data": {
        "buildCheck": false,
        "buildAvailable": false
      }
    }
  }
}
EOF
# Create start wrapper: This is specific for a jupyter notebook and Python code with the byte cache placed on a writable storage
cat <<EOF >"${MAMBA_ROOT_PREFIX}"/start.sh &&
#!/bin/bash
env_name="${env_name}"
scratch="/scratch/\${USER}"
MAMBA_ROOT_PREFIX="\${scratch}/\${env_name}"
PYTHONPYCACHEPREFIX="\${scratch}/\${env_name}_pycache"
export MAMBA_ROOT_PREFIX PYTHONPYCACHEPREFIX
eval "\$(\${MAMBA_ROOT_PREFIX}/bin/micromamba shell hook --shell=bash)" &&
micromamba run -n jupyter_notebook jupyter notebook --no-browser --port 5998 --ip "\$(hostname -f)"
EOF
title 'Fixing permissions' &&
chmod 755 "${MAMBA_ROOT_PREFIX}"/start.sh &&
chmod --recursive --changes go-w,go+r "${MAMBA_ROOT_PREFIX}" &&
find "${MAMBA_ROOT_PREFIX}" -xdev -perm /u+x -print0 | xargs --null --no-run-if-empty chmod --changes go+x &&
title 'Creating squashfs image' &&
mksquashfs "${MAMBA_ROOT_PREFIX}" "${MAMBA_ROOT_PREFIX}".sqsh -no-xattrs -comp zstd &&
# Show how to start the wrapper
title 'Start the environment with the following command' &&
echo "squashfs-mount ${MAMBA_ROOT_PREFIX}.sqsh:${MAMBA_ROOT_PREFIX} -- ${MAMBA_ROOT_PREFIX}/start.sh"
}}}

Any self-contained software installed in `/scratch/$USER/` can be compressed with `mksquashfs` and used with `squashfs-mount` as in the example above.
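
For example, a hypothetical tool installed under `/scratch/${USER}/mytool` could be packed and used like this (a sketch; the directory and executable names are placeholders):

{{{#!highlight bash numbers=disable
# Pack the directory into a single compressed, read-only image ...
mksquashfs "/scratch/${USER}/mytool" "/scratch/${USER}/mytool.sqsh" -no-xattrs -comp zstd

# ... and run the software with the image mounted over the original path
squashfs-mount "/scratch/${USER}/mytool.sqsh:/scratch/${USER}/mytool" -- "/scratch/${USER}/mytool/bin/mytool" --version
}}}
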
== Available storage systems ==

=== Local node scratch ===

Primarily use the local `/scratch` of a compute node. This storage offers the lowest access latency, but space is limited and can differ per node. To be fair to other users it's '''important to clean up after use'''.

==== Available space and harddisk type ====

These are listed in the '''Hardware''' tables for our compute clusters:

 * [[Services/SLURM#Hardware|Arton nodes]] in the D-ITET cluster
 * [[Services/SLURM-tik#Hardware|TIK nodes]] in the D-ITET cluster
 * [[Services/SLURM-Biwi#Hardware|CVL/BMIC nodes]] in the CVL cluster
 * [[Services/SLURM-Snowflake#Hardware|Snowflake nodes]] in the D-ITET course cluster

==== scratch cleanup ====

 1. `scratch_clean` is active on the local `/scratch` of all nodes, meaning older data will be deleted automatically if space is needed. For details see the man page `man scratch_clean`.<<BR>>This is a safety net which does automatic cleanup, where you have no control over which files are deleted.
 1. Always create a personal directory on a local scratch and '''clean it up after use'''! This way you're in control of deletion and `scratch_clean` will not have to clean up after you.<<BR>>Personal automatic cleanup can be achieved by adapting the following bash script snippet and adding it to your [[Services/SLURM#sbatch_.2BIZI_Submitting_a_job|job submit script]]:

{{{#!highlight bash numbers=disable
my_local_scratch_dir="/scratch/${USER}"

# List contents of my_local_scratch_dir to trigger automounting
if ! ls "${my_local_scratch_dir}" 1>/dev/null 2>/dev/null; then
  if ! mkdir --parents --mode=700 "${my_local_scratch_dir}"; then
    echo 'Failed to create my_local_scratch_dir' 1>&2
    exit 1
  fi
fi

# Set a trap to remove my_local_scratch_dir when the job script ends
trap "exit 1" HUP INT TERM
trap 'rm -rf "${my_local_scratch_dir}"' EXIT

# Synchronize a directory containing large files which are not in use by any other process
# (the source path is a placeholder, replace it with the location of your data):
rsync -av --inplace /path/to/your/large/files/ "${my_local_scratch_dir}"

# Optional: Change the current directory to my_local_scratch_dir, exit if changing didn't succeed.
cd "${my_local_scratch_dir}" || exit 1
}}}

=== Common node scratch ===

The local `/scratch` of nodes is available among nodes at `/scratch_net/node_A` as an automount (on demand). It is accessible exclusively on compute nodes from compute jobs. A use case for this kind of storage is running several compute jobs on different nodes using the same data.

 * Accessing data stored on `/scratch` on one node `A` from other nodes `B, C, D, ...` will '''impact I/O latency for all jobs running on node `A`!'''
 * You have to ensure writing data from nodes `B, C, D, ...` concurrently to `/scratch` on node `A` does not overwrite data already in use
 * `scratch_clean` is active (see above)!
 * Automatic cleanup per job as shown above has to be replaced by a final cleanup in the last job accessing the data
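
A minimal sketch of this access pattern; `node_A`, the directory and the file names are placeholders matching the notation above:

{{{#!highlight bash numbers=disable
# A job running on node A prepares the data on its local scratch once:
mkdir -p "/scratch/${USER}/shared_dataset" &&
rsync -a --inplace "/itet-stor/${USER}/project_one/dataset.parquet" "/scratch/${USER}/shared_dataset/"

# Jobs running on other nodes then read it through the automount path:
my_program --input "/scratch_net/node_A/${USER}/shared_dataset/dataset.parquet"
}}}

Remember that the last job using the data is responsible for the final cleanup, as noted in the list above.
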
=== Public storage ===

Public storage is accessible widely: on personal workstations, file servers and compute nodes. It is used in the daily work by all D-ITET members.<<BR>>
This storage allows direct access to data from compute jobs without the need to transfer it to local `/scratch`. Latency is higher because of the wide use and the limited network bandwidth.<<BR>>
While this may look like a convenient storage to use for compute jobs, '''using public storage mandates strict adherence to the guidelines here to prevent blocking other users'''!

There are different types of public storage available at D-ITET. Make sure you understand what is available to you and which one to use for what purpose. Details about the public storage available at D-ITET are summarized in the [[Services/StorageOverview|Storage overview]]. Your supervisor or your institute's/group's administrative/technical contact will tell you:

 * which storage is available to you from your institute/group
 * which storage to use for intermediate, generated data
 * which storage to use to store your final results

⚠ For storage without automated backup: '''Make sure to backup stored data yourself!'''<<BR>>
⚠ Better, '''don't store data worthy of a backup on a system without automated backup!'''

==== Don'ts! ====

Avoid slowing down your own work environment by introducing dependencies on public storage. Concrete examples of what to avoid:

 * Don't replace directories used by the operating system with links to a public storage!<<BR>>Such directories are for example: `~/.cache`, `~/.local`, `~/.config`
 * Don't initialize anything like `conda` environments placed on public storage in your login shell init script!<<BR>>Init scripts are for example: `~/.profile`, `~/.bashrc`, `~/.zshrc`
 * Don't export environment variables to redirect caches or temporary files to a public storage!<<BR>>Such variables are for example: `TMPDIR`, `XDG_CACHE_HOME`, `PIP_CACHE_DIR`
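
If you need to relocate such caches, point them at the node's local `/scratch` in your job script instead. A minimal sketch; the directory layout and the use of `SLURM_JOB_ID` are suggestions, not a fixed convention:

{{{#!highlight bash numbers=disable
# Redirect caches and temporary files to the node's local scratch instead of public storage.
job_scratch="/scratch/${USER}/cache_${SLURM_JOB_ID:-manual}"
TMPDIR="${job_scratch}/tmp"
XDG_CACHE_HOME="${job_scratch}/xdg"
PIP_CACHE_DIR="${job_scratch}/pip"
export TMPDIR XDG_CACHE_HOME PIP_CACHE_DIR
mkdir -p "${TMPDIR}" "${XDG_CACHE_HOME}" "${PIP_CACHE_DIR}"

# Clean up when the job script ends (see the trap example under local node scratch)
trap 'rm -rf "${job_scratch}"' EXIT
}}}
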
=== Transferring data ===

Transfer of a large file between any storage accessible within the D-ITET infrastructure is most efficient with the following `rsync` commands:

{{{#!highlight bash numbers=disable
# Minimal output
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file
# Add on-the-fly compression if your file is uncompressed
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --compress
# Add verbose output and a progress indicator
rsync -a --inplace /path/to/large/origin/file /path/to/copy/of/origin/file --verbose --progress
}}}

In this example there is a significant reduction in the use of resources (bandwidth, CPU, memory, time) if a previous version of the target file is already in place, as only changed blocks will be transferred.

A concrete example syncing the file `dataset.parquet` in the project folder `project_one` to an (existing) directory with your username on the local `/scratch` of your (ISG managed) workstation:

{{{#!highlight bash numbers=disable
rsync -a --inplace /itet-stor/$USER/project_one/dataset.parquet /scratch/$USER/ -v --progress
}}}

== Tuning Weights & Biases (wandb) ==

If you use [[https://wandb.ai/|Weights & Biases (wandb)]], be aware it can [[https://docs.wandb.ai/guides/technical-faq/metrics-and-performance#will-wandb-slow-down-my-training|create intense I/O]] on the storage where it logs its metrics. Quote: ''It is possible to log a huge amount of data quickly, and if you do that you might create disk I/O issues.''

In a scenario where many HPC jobs run with `wandb` using the same storage system for job and `wandb` data, this can result in a slowdown of any I/O operation for all job submitters. To prevent this, set up `wandb` as follows:

=== Use a fast local scratch disk for main and cache directory ===

Set environment variables to relocate the main and cache directories and create these directories in your (`bash`) job script:

{{{#!highlight bash numbers=disable
WANDB_DIR="/scratch/${USER}/wandb_dir"
WANDB_CACHE_DIR="${WANDB_DIR}/.cache"
export WANDB_DIR WANDB_CACHE_DIR
mkdir -vp "${WANDB_CACHE_DIR}"
}}}

See [[https://docs.wandb.ai/guides/track/environment-variables|Environment Variables]] for details.

If you want to keep this data, remove the cache at the end of a job, copy the main directory into a compressed tar archive in a backed-up location like a project directory (away from the local `/scratch`), then '''delete it''' from the local `/scratch` disk:

{{{#!highlight bash numbers=disable
rm -r "${WANDB_CACHE_DIR}" &&
tar -czf "/itet-stor/${USER}/wandb_${SLURM_JOB_ID}.tar.gz" "${WANDB_DIR}" &&
rm -r "${WANDB_DIR}"
}}}

To automate removal, setting a `trap` as in the example under [[#Local_scratch|Local scratch]] makes sense here as well.

=== Run wandb offline ===

Consider [[https://docs.wandb.ai/guides/technical-faq/setup#can-i-run-wandb-offline|running wandb offline]].<<BR>>
If necessary, sync metrics at the end of your job as explained in the link above.
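
A minimal sketch of this pattern in a job script, assuming a recent `wandb` client and the `WANDB_DIR` setup shown above; the training script name is a placeholder and the exact run directory layout may differ between `wandb` versions:

{{{#!highlight bash numbers=disable
# Run offline: metrics are written locally instead of being streamed to the wandb servers
export WANDB_MODE=offline
python3 my_training_script.py

# At the end of the job, upload the recorded offline runs in one go
wandb sync "${WANDB_DIR}/wandb/offline-run-"*
}}}
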
=== Tune metrics collection ===

Consider [[https://docs.wandb.ai/guides/track/limits|tuning your metrics collection parameters for faster logging]].

== Related information ==

 * [[https://scicomp.ethz.ch/wiki/Best_Practices|Euler cluster best practices]]
 * [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Euler: choosing optimal storage]]
 * [[https://readme.phys.ethz.ch/storage/general_advice/|D-PHYS storage advice]]
 * [[https://www.researchgate.net/publication/338599610_Understanding_Data_Motion_in_the_Modern_HPC_Data_Center|Understanding Data Motion in the Modern HPC Data Center]]