#rev 2021-04-20 bonaccos

= Data Management =

== Choosing the optimal storage system ==

Quoting [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Choosing the optimal storage system]]
from the scientific computing wiki:

 When working on an HPC cluster that provides different storage
 categories/systems, the choice of which system to use can have a big
 influence of the performance of your workflow. In the best case you
 can speedup your workflow by quite a lot, whereas in the worst case
 the system administrator has to kill all your jobs and has to limit
 the number of concurrent jobs that you can run because your jobs slow
 down the entire storage system and this can affect other users jobs.

The same holds for the clusters and storage infrastructure maintained
in the D-ITET environment.

When working with storage systems, in particular from compute jobs in
one of the clusters, please consider the following guidelines. They are
largely inspired by the scientific computing guidelines mentioned
above.

 * Use local "scratch" disks whenever possible. Most of the compute nodes now have SSDs available as scratch storage.

 * [Applicable to BIWI] When working in parallel from a cluster with '''large''' files, consider using the scale-out scratch storage (beegfs02).

 * Do not create large numbers of small files in the [[Services/NetScratch|D-ITET NetScratch]] service storage (or in the BeeGFS scratch or project filesystems). This can slow down not only your own jobs but the whole system. Whenever possible, follow the first guideline and work on a local copy of the data stored on those services.

 * If you need to perform very high I/O on the data (e.g. opening and closing files at a high rate, reading many small files per second, or doing short appends to files from various locations), this will have a severe impact on the network-attached storage and the scale-out filesystems. For such workloads, work on data copied to local storage as much as possible and only move the results back to appropriate places.

 * If you work a lot with large numbers of small files, keep them sensibly grouped in bigger archives. In the job, move these archives to local storage (as big files), unpack them there and work on the local copies. Do not work on large numbers of small files directly on network-attached storage or the cluster filesystems. After processing, group the results in archives again and move them to appropriate places.

Respecting these guidelines can improve '''your''' '''own''' work performance
and at the same time avoids severely degrading the performance of the storage
systems for other users.
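
The archive-based workflow from the guidelines above could look roughly like the following Python sketch. It is only an illustration under stated assumptions, not an official D-ITET tool: the function name `process_archive_on_scratch`, its parameters and the `process_file` callback are hypothetical, and the location of the local scratch directory (`scratch_root`) depends on the compute node and has to be supplied by you.

```python
import shutil
import tarfile
import tempfile
from pathlib import Path


def process_archive_on_scratch(archive, scratch_root, results_dir, process_file):
    """Unpack `archive` on local scratch, process every file, archive the results.

    All names here are hypothetical. `process_file` takes the bytes of one
    input file and returns the bytes of the corresponding result file.
    Returns the path of the result archive inside `results_dir`.
    """
    archive = Path(archive)
    results_dir = Path(results_dir)
    # Private working directory on the local scratch disk, removed on exit.
    with tempfile.TemporaryDirectory(dir=scratch_root) as work:
        work = Path(work)
        # One large sequential copy instead of many small network reads.
        local_archive = Path(shutil.copy2(archive, work))
        in_dir = work / "input"
        out_dir = work / "output"
        in_dir.mkdir()
        out_dir.mkdir()
        # Only unpack archives you created yourself or otherwise trust.
        with tarfile.open(local_archive) as tar:
            tar.extractall(in_dir)
        # All small-file I/O happens on the local disk.
        for src in sorted(p for p in in_dir.rglob("*") if p.is_file()):
            dst = out_dir / src.relative_to(in_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            dst.write_bytes(process_file(src.read_bytes()))
        # Group the results into one archive and move it back as a single file.
        local_result = work / f"{archive.name.split('.')[0]}_results.tar.gz"
        with tarfile.open(local_result, "w:gz") as tar:
            tar.add(out_dir, arcname="results")
        results_dir.mkdir(parents=True, exist_ok=True)
        target = results_dir / local_result.name
        return Path(shutil.move(str(local_result), str(target)))
```

A job would call this once per input archive, passing the archive on the network storage, the node's local scratch directory and a results directory on an appropriate storage system. The network storage then only sees two large sequential transfers per archive, while all per-file operations stay on the local disk.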

== References ==

 * [[https://scicomp.ethz.ch/wiki/Getting_started_with_clusters#Choosing_the_optimal_storage_system|Choosing the optimal storage system]]
 * [[https://readme.phys.ethz.ch/storage/general_advice/|D-PHYS General advice regarding storage]]
 * [[Services/StorageOverview|StorageOverview]] - Overview of our storages (with intended usage info, backup (yes/no), etc.)