Differences between revisions 3 and 47 (spanning 44 versions)
Revision 3 as of 2019-05-10 19:07:52
Size: 7408
Editor: stroth
Comment:
Revision 47 as of 2019-11-07 07:34:20
Size: 14710
Editor: stroth
Comment:
Line 3: Line 3:
= Set up a python development environment for data science =
The following procedure shows how to set up a python development environment with the [[https://conda.io/|conda]] package manager and install [[https://pytorch.org/|pytorch]] and [[https://www.tensorflow.org/|tensorflow]].

== Install conda ==
 * Time to install: ~1'
 * Space required: ~350M

To provide conda, the minimal anaconda distribution '''miniconda''' can be installed and configured for the D-ITET infrastructure with the following bash script:
= Setting up a personal python development infrastructure =
This page shows how to [[#Installing_conda|set up a personal python development infrastructure]], how to [[#Using_conda|use it]], how to [[#Maintenance|maintain it]] and [[#Backup|make backups of your project environments]].

Some [[#Installation_examples|examples for software installation]] in the field of data sciences are provided.

The infrastructure is driven by the [[https://conda.io/|conda]] package manager which accesses the [[https://repo.continuum.io/pkgs/|Anaconda repositories]] to install software.

After familiarizing yourself with `conda`, read this [[Programming/Languages/GPUCPU|collection of hints and explanations]] about available platforms on which to use your infrastructure and particularities of the software packages involved.

== Installing conda ==
 * Time to install: ~1.5 minutes
 * Space required: ~370M

To provide `conda`, the minimal anaconda distribution '''miniconda''' can be installed and configured for the D-ITET infrastructure with the following bash script:
Line 14: Line 20:
# Locations to store environments
LOCAL_SCRATCH="/scratch/${USER}"
NET_SCRATCH="/itet-stor/${USER}/net_scratch"
SPACE_MINIMUM_REQUIRED='5'

if [[ -z "${1}" ]]; then
    # Default install location
    OPTION='netscratch'
else
    OPTION="${1}"
fi

line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' |tr ' ' '-')

# Display underlined title to improve readability of script output
function title()
{
    echo
    echo "$@"
    echo "${line}"
}

case "${OPTION}" in
    h|help|'-h'|'--help')
        title 'Possible installation options are:'
        echo 'Install conda to your local scratch disk:'
        echo "${BASH_SOURCE[0]} localscratch"
        echo
        echo 'Install conda to your directory on net_scratch:'
        echo "${BASH_SOURCE[0]}"
        echo 'or'
        echo "${BASH_SOURCE[0]} netscratch"
        echo
        echo 'Provide a custom location for installation'
        echo "${BASH_SOURCE[0]} /path/to/custom/location"
        echo
        echo "The recommended minimum space requirement for installation is ${SPACE_MINIMUM_REQUIRED} G."
        exit 0
        ;;
    l|local|localscratch|'-l'|'-local'|'-localscratch')
        # If local scratch is made available through scratch_net, use its path in
        # order to be able to access it on other hosts through scratch_net
        scratch_net="/scratch_net/$(hostname -s)"
        if mountpoint -q "${scratch_net}"; then
            CONDA_BASE_DIR="${scratch_net}/${USER}"
        else
            CONDA_BASE_DIR="/scratch/${USER}"
        fi
        ;;
    n|net|netscratch|'-n'|'-net'|'-netscratch')
        CONDA_BASE_DIR="/itet-stor/${USER}/net_scratch"
        ;;
    *)
        if [[ -d "${1}" ]]; then
            CONDA_BASE_DIR="${1}"
        else
            title 'Warning!'
            echo "Directory '${1}' does not exist."
            exit 1
        fi
        ;;
esac

# Check available space on selected install location
SPACE_AVAILABLE=$(($(stat -f --format="%a*%S" "${CONDA_BASE_DIR}")/1024/1024/1024))
if [[ ${SPACE_AVAILABLE} -lt ${SPACE_MINIMUM_REQUIRED} ]]; then
    title 'Warning!'
    echo "Available space on '${CONDA_BASE_DIR}' is ${SPACE_AVAILABLE} G."
    echo "This is less than the minimum recommendation of ${SPACE_MINIMUM_REQUIRED} G."
    read -p "Press 'y' if you want to continue installing anyway: " -n 1 -r
    echo
    if [[ ! ${REPLY} =~ ^[Yy]$ ]]; then
        exit 1
    fi
fi

# Locations for conda installation, package cache and environments
CONDA_INSTALL_DIR="${CONDA_BASE_DIR}/conda"
CONDA_PACKET_CACHE_DIR="${CONDA_BASE_DIR}/conda_pkgs"
CONDA_ENV_DIR="${CONDA_BASE_DIR}/conda_envs"
Line 22: Line 102:
[[ -z ${PYTHONPATH} ]] || unset PYTHONPATH
if [[ -n ${PYTHONPATH} ]]; then
    unset PYTHONPATH
fi

# Check installation path, abort if it exists, otherwise create it
title 'Checking installation path'
if [[ -d "${CONDA_INSTALL_DIR}" ]]; then
    echo "The installation path '${CONDA_INSTALL_DIR}' already exists."
    echo "Aborting installation."
    exit 1
fi
Line 25: Line 115:
title 'Downloading and installing conda:'
Line 27: Line 118:
    && ./miniconda.sh -b -p "${NET_SCRATCH}/conda" \
    && ./miniconda.sh -b -p "${CONDA_INSTALL_DIR}" \
Line 31: Line 122:
eval "$(${NET_SCRATCH}/conda/bin/conda shell.bash hook)"
conda config --add pkgs_dirs "${NET_SCRATCH}/conda_pkgs" --system
conda config --add envs_dirs "${LOCAL_SCRATCH}/conda_envs" --system
conda config --add envs_dirs "${NET_SCRATCH}/conda_envs" --system
title 'Configuring conda'
eval "$(${CONDA_INSTALL_DIR}/bin/conda shell.bash hook)"
conda config --add pkgs_dirs "${CONDA_PACKET_CACHE_DIR}" --system
conda config --add envs_dirs "${CONDA_ENV_DIR}" --system
Line 38: Line 129:
# Update conda and conda base environment
title 'Updating conda and conda base environment:'
conda update conda --yes
conda update -n 'base' --update-all --yes

# Clean installation
title 'Removing unused packages and caches:'
conda clean --all --yes

# Display information about this conda installation
title 'Information about this conda installation:'
conda info
Line 39: Line 143:
echo
echo 'Initialize conda immediately:'
echo "eval \"\$(${NET_SCRATCH}/conda/bin/conda shell.bash hook)\""
echo
echo 'Automatically initialize conda for future shell sessions:'
echo "echo 'eval \"\$(${NET_SCRATCH}/conda/bin/conda shell.bash hook)\"' >> ${HOME}/.bashrc"
title 'Initialize conda immediately:'
echo "eval \"\$(${CONDA_INSTALL_DIR}/bin/conda shell.bash hook)\""
title 'Automatically initialize conda for future shell sessions:'
echo "echo '[[ -f ${CONDA_INSTALL_DIR}/bin/conda ]] && eval \"\$(${CONDA_INSTALL_DIR}/bin/conda shell.bash hook)\"' >> ${HOME}/.bashrc"
Line 47: Line 149:
echo
echo 'Completely remove conda:'
echo "rm -r ${NET_SCRATCH}/conda ${NET_SCRATCH}/conda_pkgs ${NET_SCRATCH}/conda_envs ${LOCAL_SCRATCH}/conda_envs ${HOME}/.conda"
}}}
Save this script as `install_conda.sh`, make it executable with
title 'Completely remove conda:'
echo "rm -r ${CONDA_INSTALL_DIR} ${CONDA_INSTALL_DIR}_pkgs ${CONDA_INSTALL_DIR}_envs ${HOME}/.conda"
}}}
and run the script to show options for choosing [[#conda-storage-locations|storage locations]] by issuing
{{{
./install_conda.sh help
}}}
Then run the script again with the option of your choosing to start the installation.
 * When the script ends it prints out information about the installation, commands to initialize `conda` immediately or every time you log in and a command to completely remove your `conda` installation.
 * Choose your preferred method of initializing `conda` as recommended by the script and note down the deletion command.

== conda storage locations ==
=== Pre-set install locations ===
The purpose of the install script's options is to store data according to its importance and prevent using up your quota. The difference between the two pre-set installation locations is:
 * '''netscratch''': fail-safe because it resides on a RAID but slower startup times as it is a network share
 * '''localscratch''': single point of failure because it is just one disk but faster startup times as it is a local disk
Neither of the pre-set locations has an automatic backup. Use the recommended [[#Backup|backup practice]] instead.

=== Custom install location ===
If you intend to use a custom install location, consult the [[Services/StorageOverview|storage overview]] to choose an appropriate one and follow these recommendations:
 * Reproducible, space-consuming data like environments and the package cache belong in storage class ''SCRATCH''.
 * Code written by yourself should be backed up regularly. It consumes little space, so its ideal location is in storage class ''HOME''; additionally, it should be checked into your [[https://git.ee.ethz.ch/users/sign_in|git repository]].
 * Data generated over a long time period which would be time consuming to recreate from scratch and is in use regularly should be stored in the storage class ''PROJECT''.
 * Data generated as a final result which is not needed for ongoing work but needs to be available for later generations should be stored in the storage class ''ARCHIVE''.

=== conda directories ===
The installation creates the following two directories in the install location:
 * '''conda''': Contains the miniconda installation
 * '''conda_pkgs''': Contains the cache for downloaded and decompressed packages
Creating the [[#Create_an_environment_called_.22my_env.22_with_packages_.22package1.22_and_.22package2.22_installed|first environment]] creates an additional directory in the install location:
 * '''conda_envs''': Contains the created environment(s)

== Using conda ==
`conda` lets you separate installed software packages from each other by creating so-called ''environments''. Using environments is best practice to generate deterministic and reproducible tools.

`conda` takes care of dependencies common to the packages it is asked to install. If two packages have a common dependency but define a differing range of version requirements of said dependency, `conda` chooses the highest common version number.
This means the dependency installed in an environment with both packages together might have a lower version number than in environments separating both packages.

It is best practice to separate packages into different environments if they don't need to interact.

For a complete guide to `conda` see the [[https://conda.io/projects/conda/en/latest/index.html|official documentation]].

The official [[https://conda.io/projects/conda/en/latest/user-guide/cheatsheet.html|cheat sheet]] contains a compact summary of common commands. An abbreviated list to get you started is shown below.

=== Installation examples ===
For `conda`, `python` itself is just a software package like any other. After analyzing all packages to be installed, it decides which `python` version works for the whole environment. This means different environments may contain differing versions of `python`.

==== Creating an environment with a specific python version ====
 * Time to install: ~1 minute
 * Space required: ~140M
{{{
conda create --name py37 python=3.7.3
}}}
==== Creating an environment with the GPU version of pytorch and CUDA toolkit 10 ====
 * Time to install: ~5 minutes
 * Space required: ~2.5G
{{{
conda create --name pytcu10 pytorch torchvision cudatoolkit=10.0 --channel pytorch
}}}
==== Creating an environment with the GPU version of tensorflow and CUDA toolkit 10 ====
 * Time to install: ~5 minutes
 * Space required: ~2G
{{{
conda create --name tencu10 tensorflow-gpu cudatoolkit=10.0
}}}

=== Environments ===
`conda` automatically installs a default environment called ''base'' with a `python` interpreter, [[https://pypi.org/project/pip/|pip]] and other tools to start coding in python. Whether you want to use and extend this environment or create your own is up to you. At the time of writing this information it is not possible to remove the base environment.
==== Create an environment called "my_env" with packages "package1" and "package2" installed ====
{{{
conda create --name my_env package1 package2
}}}
==== Activate the environment called "my_env" ====
{{{
conda activate my_env
}}}
==== Deactivate the current environment ====
{{{
conda deactivate
}}}
==== List available environments ====
{{{
conda env list
}}}
==== Remove the environment called "my_env" ====
{{{
conda remove --name my_env --all
}}}
==== Create a cloned environment named "cloned_env" from "original_env" ====
{{{
conda create --name cloned_env --clone original_env
}}}
==== Export the definition of the environment "my_env" to the file "my_env.yml" ====
This command is also the basis for [[#Backup|backing up]] an environment.
{{{
conda env export --json --name my_env > my_env.yml
}}}
==== Recreate a previously exported environment ====
{{{
conda env create --file my_env.yml
}}}
==== Create the environment "my_env" in a specified location ====
This example creates the environment on local scratch for faster disk access.
{{{
conda create --prefix /scratch/$USER/conda_envs/my_env
}}}
==== Update an active environment ====
Make sure to create a [[#Backup|backup]] by exporting the active environment before updating.
{{{
conda update --update-all
}}}

=== Packages ===
==== Search for a package named "package1" ====
{{{
conda search package1
}}}
==== Search for packages with "pack" in their name ====
{{{
conda search '*pack*'
}}}
==== Install the package named "package1" in the active environment ====
{{{
conda install package1
}}}
==== List packages installed in the active environment ====
{{{
conda list
}}}
==== Add software channels ====
The list of available software can be extended by adding channels of selected repositories.
Each `conda config --add` call places its channel at the top of the list, so the channel added last has the highest priority. In the following example, [[https://conda-forge.org/|Conda-Forge]] has the highest priority, followed by [[https://bioconda.github.io/|Bioconda]], with the default channel at the lowest priority.
{{{
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
}}}
==== Show software channels ====
The following command shows the available channels in order of priority (highest first):
{{{
conda config --show channels
}}}
=== Miscellaneous ===
==== Display information about the current conda installation ====
{{{
conda info
}}}
==== Change TMPDIR ====
If the message ''"Not enough space on partition mounted at /tmp."'' is shown during a package installation, set the TMPDIR variable to a location with enough available space:
{{{
export TMPDIR="/scratch/$USER/tmp"
}}}

=== Maintenance ===
The cache of installed packages will consume a lot of space over time. The default location set for the package cache resides on [[Services/NetScratch|NetScratch]]; the terms of use for this storage area require you to clean your cache regularly.
==== Remove index cache, lock files, unused cache packages, and tarballs ====
{{{
conda clean --all
}}}
==== Update conda without any active environment ====
{{{
conda update conda
}}}

=== Backup ===
Regular backups are recommended to be able to reproduce an environment used at a certain point in time. Before installing or updating an environment, a backup should always be created in order to be able to revert the changes.

It is not necessary to back up the environments themselves; backing up the exported environment files is sufficient to recreate them exactly.

For a simple backup of all environments the following script can be used:
Line 53: Line 320:
chmod +x install_conda.sh
}}}
and execute the script by issuing
{{{#!highlight bash numbers=disable
./install_conda.sh
}}}
Choose your preferred method of initializing `conda` as recommended by the script.

== Conda storage locations ==
The directories listed in the command for complete `conda` removal contain the following data:
||`/itet-stor/$USER/net_scratch/conda`||The miniconda installation||
||`/itet-stor/$USER/net_scratch/conda_pkgs`||Downloaded packages||
||`/itet-stor/$USER/net_scratch/conda_envs`||Virtual environments on NAS||
||`/scratch/$USER/conda_envs`||Virtual environments on local disk||
||`/home/$USER/.conda`||Personal conda configuration||
The purpose of this configuration is to store reproducible and space consuming data outside of your `$HOME` to prevent using up your quota.

== Using Conda ==
`conda` lets you separate installed software packages from each other by creating so-called ```environments```. Using environments is best practice to generate deterministic and reproducible tools.

`conda` takes care of dependencies common to the packages it is asked to install. If two packages have a common dependency but define a differing range of version requirements of said dependency, `conda` chooses the highest common version number.
This means the dependency installed in an environment with both packages together might have a lower version number than in environments separating both packages.

It is best practice to separate packages into different environments if they don't need to interact.

For a complete guide to `conda` see the [[https://conda.io/projects/conda/en/latest/index.html|official documentation]].

=== Common commands ===
Common commands to get you started are listed here:
 * `conda create --name my_env package1 package2`
 . creates an environment called "my_env" with packages "package1" and "package2" installed
 * `conda activate my_env`
 . activates the environment called ''my_env''
 * `conda deactivate`
 . deactivates the current environment
 * `conda env list`
 . lists available environments
 * `conda remove --name my_env --all`
 . removes the environment called ''my_env''
 * `conda create --name cloned_env --clone original_env`
 . creates a cloned environment named ''cloned_env'' from ''original_env''
 * `conda env export > my_env.yml`
 . exports the active environment definition to the file ''my_env.yml''
 * `conda env create --file my_env.yml`
 . recreates a previously exported environment
 * `conda list`
 . lists packages installed in the active environment
 * `conda create --prefix /scratch/$USER/conda_envs/my_env`
 . creates the environment ''my_env'' in the specified location
The name of the default environment is `base`.

=== Installation examples ===
 * Time to install: ~5' per environment
 * Space required: ~1.5G packages, 3G per environment

The following examples show how to install `pytorch` for GPU with CUDA toolkit 9, 10 and without CUDA toolkit for CPU, as well as `tensorflow` in the same three variants:

{{{#!highlight bash numbers=disable
conda create --name pytcu9 pytorch torchvision cudatoolkit=9.0 --channel pytorch
conda create --name pytcu10 pytorch torchvision cudatoolkit=10.0 --channel pytorch
conda create --name pytcpu pytorch-cpu torchvision-cpu --channel pytorch
conda create --name tencu9 tensorflow-gpu cudatoolkit=9.0
conda create --name tencu10 tensorflow-gpu cudatoolkit=10.0
conda create --name tencpu tensorflow
}}}

=== Testing installations ===

==== Testing pytorch ====
To verify the successful installation of `pytorch` run the following python code:
{{{#!highlight python numbers=disable
from __future__ import print_function
import torch
x = torch.rand(5, 3)
print(x)
}}}
The output should be similar to the following:
{{{
tensor([[0.4813, 0.8839, 0.1568],
        [0.0485, 0.9338, 0.1582],
        [0.1453, 0.5322, 0.8509],
        [0.2104, 0.4154, 0.9658],
        [0.6050, 0.9571, 0.3570]])
}}}
To verify CUDA availability in `pytorch`, run the following code:
{{{#!highlight python numbers=disable
import torch
torch.cuda.is_available()
}}}
It should return ''True''.

==== Testing TensorFlow ====
The following code prints information about your `tensorflow` installation:
{{{#!highlight python numbers=disable
import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))
}}}
Lines containing `device: XLA_` show which CPU/GPU devices are available.

A line containing `cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version` means the NVIDIA driver installed on the system on which you run the code is not compatible with the CUDA toolkit installed in the environment from which you run it.

== NVIDIA CUDA Toolkit ==
Which version of the CUDA toolkit is usable depends on the version of the NVIDIA driver installed on the machine on which you run your programs. The version can be checked by issuing the command `nvidia-smi` and looking for the number next to the text ''Driver Version''.

The CUDA compatibility document by NVIDIA shows a [[https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver|dependency matrix]] matching driver and toolkit versions.
#!/bin/bash

BACKUP_DIR="${HOME}/conda_env_backup"
MY_TIME_FORMAT='%Y-%m-%d_%H-%M-%S'

NOW=$(date "+${MY_TIME_FORMAT}")
[[ ! -d "${BACKUP_DIR}" ]] && mkdir "${BACKUP_DIR}"
ENVS=$(conda env list |grep '^\w' |cut -d' ' -f1)
for env in $ENVS; do
    echo "Exporting ${env} to ${BACKUP_DIR}/${env}_${NOW}.yml"
    conda env export --name "${env}"> "${BACKUP_DIR}/${env}_${NOW}.yml"
done
}}}

Setting up a personal python development infrastructure

This page shows how to set up a personal python development infrastructure, how to use it, how to maintain it and make backups of your project environments.

Some examples for software installation in the field of data sciences are provided.

The infrastructure is driven by the conda package manager which accesses the Anaconda repositories to install software.

After familiarizing yourself with conda, read this collection of hints and explanations about available platforms on which to use your infrastructure and particularities of the software packages involved.

Installing conda

  • Time to install: ~1.5 minutes
  • Space required: ~370M

To provide conda, the minimal anaconda distribution miniconda can be installed and configured for the D-ITET infrastructure with the following bash script:

#!/bin/bash

SPACE_MINIMUM_REQUIRED='5'

if [[ -z "${1}" ]]; then
    # Default install location
    OPTION='netscratch'
else
    OPTION="${1}"
fi

line=$(printf '%*s\n' "${COLUMNS:-$(tput cols)}" '' |tr ' ' '-')

# Display underlined title to improve readability of script output
function title()
{
    echo
    echo "$@"
    echo "${line}"
}

case "${OPTION}" in
    h|help|'-h'|'--help')
        title 'Possible installation options are:'
        echo 'Install conda to your local scratch disk:'
        echo "${BASH_SOURCE[0]} localscratch"
        echo
        echo 'Install conda to your directory on net_scratch:'
        echo "${BASH_SOURCE[0]}"
        echo 'or'
        echo "${BASH_SOURCE[0]} netscratch"
        echo
        echo 'Provide a custom location for installation'
        echo "${BASH_SOURCE[0]} /path/to/custom/location"
        echo
        echo "The recommended minimum space requirement for installation is ${SPACE_MINIMUM_REQUIRED} G."
        exit 0
        ;;
    l|local|localscratch|'-l'|'-local'|'-localscratch')
        # If local scratch is made available through scratch_net, use its path in
        # order to be able to access it on other hosts through scratch_net
        scratch_net="/scratch_net/$(hostname -s)"
        if mountpoint -q "${scratch_net}"; then
            CONDA_BASE_DIR="${scratch_net}/${USER}"
        else
            CONDA_BASE_DIR="/scratch/${USER}"
        fi
        ;;
    n|net|netscratch|'-n'|'-net'|'-netscratch')
        CONDA_BASE_DIR="/itet-stor/${USER}/net_scratch"
        ;;
    *)
        if [[ -d "${1}" ]]; then
            CONDA_BASE_DIR="${1}"
        else
            title 'Warning!'
            echo "Directory '${1}' does not exist."
            exit 1
        fi
        ;;
esac

# Check available space on selected install location
SPACE_AVAILABLE=$(($(stat -f --format="%a*%S" "${CONDA_BASE_DIR}")/1024/1024/1024))
if [[ ${SPACE_AVAILABLE} -lt ${SPACE_MINIMUM_REQUIRED} ]]; then
    title 'Warning!'
    echo "Available space on '${CONDA_BASE_DIR}' is ${SPACE_AVAILABLE} G."
    echo "This is less than the minimum recommendation of ${SPACE_MINIMUM_REQUIRED} G."
    read -p "Press 'y' if you want to continue installing anyway: " -n 1 -r
    echo
    if [[ ! ${REPLY} =~ ^[Yy]$ ]]; then
        exit 1
    fi
fi

# Locations for conda installation, package cache and environments
CONDA_INSTALL_DIR="${CONDA_BASE_DIR}/conda"
CONDA_PACKET_CACHE_DIR="${CONDA_BASE_DIR}/conda_pkgs"
CONDA_ENV_DIR="${CONDA_BASE_DIR}/conda_envs"

# Installer of choice for conda
CONDA_INSTALLER_URL='https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh'

# Unset pre-existing python paths
if [[ -n ${PYTHONPATH} ]]; then
    unset PYTHONPATH
fi

# Check installation path, abort if it exists, otherwise create it
title 'Checking installation path'
if [[ -d "${CONDA_INSTALL_DIR}" ]]; then
    echo "The installation path '${CONDA_INSTALL_DIR}' already exists."
    echo "Aborting installation."
    exit 1
fi

# Download latest version of miniconda and install it
title 'Downloading and installing conda:'
wget -O miniconda.sh "${CONDA_INSTALLER_URL}" \
    && chmod +x miniconda.sh \
    && ./miniconda.sh -b -p "${CONDA_INSTALL_DIR}" \
    && rm ./miniconda.sh

# Configure conda
title 'Configuring conda'
eval "$(${CONDA_INSTALL_DIR}/bin/conda shell.bash hook)"
conda config --add pkgs_dirs "${CONDA_PACKET_CACHE_DIR}" --system
conda config --add envs_dirs "${CONDA_ENV_DIR}" --system
conda config --set auto_activate_base false
conda deactivate

# Update conda and conda base environment
title 'Updating conda and conda base environment:'
conda update conda --yes
conda update -n 'base' --update-all --yes

# Clean installation
title 'Removing unused packages and caches:'
conda clean --all --yes

# Display information about this conda installation
title 'Information about this conda installation:'
conda info

# Show how to initialize conda
title 'Initialize conda immediately:'
echo "eval \"\$(${CONDA_INSTALL_DIR}/bin/conda shell.bash hook)\""
title 'Automatically initialize conda for future shell sessions:'
echo "echo '[[ -f ${CONDA_INSTALL_DIR}/bin/conda ]] && eval \"\$(${CONDA_INSTALL_DIR}/bin/conda shell.bash hook)\"' >> ${HOME}/.bashrc"

# Show how to remove conda
title 'Completely remove conda:'
echo "rm -r ${CONDA_INSTALL_DIR} ${CONDA_INSTALL_DIR}_pkgs ${CONDA_INSTALL_DIR}_envs ${HOME}/.conda"

Save this script as install_conda.sh, make it executable with chmod +x install_conda.sh and run it with the help option to show the options for choosing storage locations:

./install_conda.sh help

Then run the script again with the option of your choosing to start the installation.

  • When the script ends it prints out information about the installation, commands to initialize conda immediately or every time you log in and a command to completely remove your conda installation.

  • Choose your preferred method of initializing conda as recommended by the script and note down the deletion command.
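Running the printed "echo ... >> ~/.bashrc" command more than once adds the initialization line repeatedly. The following is a minimal sketch of how the line could be appended only when it is missing. The helper name append_once and the path /path/to/conda are placeholders for illustration and are not part of the install script; the demonstration writes to a temporary file instead of the real ~/.bashrc.

```shell
#!/bin/bash
# append_once FILE LINE: append LINE to FILE unless an identical line exists.
# Hypothetical helper, not part of the install script.
append_once() {
    touch "$1"
    # -q: quiet, -x: match the whole line, -F: treat the pattern as fixed text
    if ! grep -qxF -- "$2" "$1"; then
        printf '%s\n' "$2" >> "$1"
    fi
}

# Demonstration on a temporary file; /path/to/conda is a placeholder
demo_rc=$(mktemp)
init_line='[[ -f /path/to/conda/bin/conda ]] && eval "$(/path/to/conda/bin/conda shell.bash hook)"'
append_once "${demo_rc}" "${init_line}"
append_once "${demo_rc}" "${init_line}"   # second call changes nothing
cat "${demo_rc}"
```

For real use, the first argument would be "${HOME}/.bashrc" and the second the initialization line printed by the install script.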

conda storage locations

Pre-set install locations

The purpose of the install script's options is to store data according to its importance and prevent using up your quota. The difference between the two pre-set installation locations is:

  • netscratch: fail-safe because it resides on a RAID but slower startup times as it is a network share

  • localscratch: single point of failure because it is just one disk but faster startup times as it is a local disk

Neither of the pre-set locations has an automatic backup. Use the recommended backup practice instead.

Custom install location

If you intend to use a custom install location, consult the storage overview to choose an appropriate one and follow these recommendations:

  • Reproducible, space-consuming data like environments and the package cache belong in storage class SCRATCH.

  • Code written by yourself should be backed up regularly. It consumes little space, so its ideal location is in storage class HOME; additionally, it should be checked into your git repository.

  • Data generated over a long time period which would be time consuming to recreate from scratch and is in use regularly should be stored in the storage class PROJECT.

  • Data generated as a final result which is not needed for ongoing work but needs to be available for later generations should be stored in the storage class ARCHIVE.
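Before settling on a custom location, it can help to check how much space is free there. The sketch below uses df instead of the stat call from the install script, purely as an illustration; the function name free_gib and the default candidate directory are assumptions, and the threshold of 5 G is the recommendation used by the install script.

```shell
#!/bin/bash
# free_gib DIR: print the free space available on DIR's filesystem in whole GiB.
# Hypothetical helper; uses df rather than the stat call of the install script.
free_gib() {
    # -P: POSIX output (one line per filesystem), -k: sizes in KiB
    df -Pk "$1" | awk 'NR == 2 { print int($4 / 1024 / 1024) }'
}

candidate="${TMPDIR:-/tmp}"   # placeholder for the location you consider
minimum=5                     # recommendation from the install script, in G

available=$(free_gib "${candidate}")
if [ "${available}" -lt "${minimum}" ]; then
    echo "Only ${available} G free on '${candidate}', ${minimum} G are recommended."
else
    echo "'${candidate}' has ${available} G free."
fi
```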

conda directories

The installation creates the following two directories in the install location:

  • conda: Contains the miniconda installation

  • conda_pkgs: Contains the cache for downloaded and decompressed packages

Creating the first environment creates an additional directory in the install location:

  • conda_envs: Contains the created environment(s)

Using conda

conda lets you separate installed software packages from each other by creating so-called environments. Using environments is best practice to generate deterministic and reproducible tools.

conda takes care of dependencies common to the packages it is asked to install. If two packages have a common dependency but define a differing range of version requirements of said dependency, conda chooses the highest common version number. This means the dependency installed in an environment with both packages together might have a lower version number than in environments separating both packages.

It is best practice to separate packages into different environments if they don't need to interact.
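The rule of the highest common version can be illustrated with a toy model. This is only a sketch of the idea described above, not how conda's solver works internally; the function name and the version lists are made up for illustration.

```shell
#!/bin/bash
# highest_common LIST_A LIST_B: print the highest version contained in both
# whitespace-separated lists. Toy model only, conda's real solver is far
# more involved.
highest_common() {
    for a in $1; do
        for b in $2; do
            if [ "${a}" = "${b}" ]; then
                echo "${a}"
            fi
        done
    done | sort -V | tail -n 1
}

# Hypothetical versions of a shared dependency accepted by two packages
pkg_a_accepts='1.14 1.15 1.16 1.17'
pkg_b_accepts='1.13 1.14 1.15'

highest_common "${pkg_a_accepts}" "${pkg_b_accepts}"   # prints 1.15
```

Installed on its own, the first package could use version 1.17 of the dependency; together with the second package both end up with 1.15.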

For a complete guide to conda see the official documentation.

The official cheat sheet contains a compact summary of common commands. An abbreviated list to get you started is shown below.

Installation examples

For conda, python itself is just a software package like any other. After analyzing all packages to be installed, it decides which python version works for the whole environment. This means different environments may contain differing versions of python.

Creating an environment with a specific python version

  • Time to install: ~1 minute
  • Space required: ~140M

conda create --name py37 python=3.7.3

Creating an environment with the GPU version of pytorch and CUDA toolkit 10

  • Time to install: ~5 minutes
  • Space required: ~2.5G

conda create --name pytcu10 pytorch torchvision cudatoolkit=10.0 --channel pytorch

Creating an environment with the GPU version of tensorflow and CUDA toolkit 10

  • Time to install: ~5 minutes
  • Space required: ~2G

conda create --name tencu10 tensorflow-gpu cudatoolkit=10.0

Environments

conda automatically installs a default environment called base with a python interpreter, pip and other tools to start coding in python. Whether you want to use and extend this environment or create your own is up to you. At the time of writing this information it is not possible to remove the base environment.

Create an environment called "my_env" with packages "package1" and "package2" installed

conda create --name my_env package1 package2

Activate the environment called "my_env"

conda activate my_env

Deactivate the current environment

conda deactivate

List available environments

conda env list

Remove the environment called "my_env"

conda remove --name my_env --all

Create a cloned environment named "cloned_env" from "original_env"

conda create --name cloned_env --clone original_env

Export the definition of the environment "my_env" to the file "my_env.yml"

This command is also the basis for backing up an environment.

conda env export --json --name my_env > my_env.yml

Recreate a previously exported environment

conda env create --file my_env.yml

Create the environment "my_env" in a specified location

This example creates the environment on local scratch for faster disk access.

conda create --prefix /scratch/$USER/conda_envs/my_env

Update an active environment

Make sure to create a backup by exporting the active environment before updating.

conda update --update-all

Packages

Search for a package named "package1"

conda search package1

Search for packages with "pack" in their name

conda search '*pack*'

Install the package named "package1" in the active environment

conda install package1

List packages installed in the active environment

conda list

Add software channels

The list of available software can be extended by adding channels of selected repositories. Each "conda config --add" call places its channel at the top of the list, so the channel added last has the highest priority. In the following example, Conda-Forge has the highest priority, followed by Bioconda, with the default channel at the lowest priority.

conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
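Why the channel added last ends up with the highest priority can be seen in a small simulation. The sketch below only imitates the ordering produced by the three commands above and does not call conda; the function name add_channel is made up for illustration.

```shell
#!/bin/bash
# Toy model of "conda config --add channels NAME": NAME is placed in front
# of the existing list. This only imitates the ordering, it does not call conda.
channels=''
add_channel() {
    channels="$1${channels:+ ${channels}}"
}

add_channel defaults
add_channel bioconda
add_channel conda-forge

echo "${channels}"   # prints: conda-forge bioconda defaults
```

To add a channel at the lowest priority instead, conda provides "conda config --append channels NAME".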

Show software channels

The following command shows the available channels in order of priority (highest first):

conda config --show channels

Miscellaneous

Display information about the current conda installation

conda info

Change TMPDIR

If the message "Not enough space on partition mounted at /tmp." is shown during a package installation, set the TMPDIR variable to a location with enough available space:

export TMPDIR="/scratch/$USER/tmp"
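The directory assigned to TMPDIR has to exist and be writable. The following sketch wraps both checks in a small function; the name use_tmpdir is an assumption for illustration, and the demonstration uses a temporary directory so it can run on any host.

```shell
#!/bin/bash
# use_tmpdir DIR: create DIR if necessary and export it as TMPDIR.
# Hypothetical helper; returns non-zero if DIR cannot be created or written to.
use_tmpdir() {
    mkdir -p "$1" || return 1
    [ -w "$1" ] || return 1
    export TMPDIR="$1"
}

# Demonstration with a throwaway directory
demo_dir="$(mktemp -d)/tmp"
use_tmpdir "${demo_dir}" && echo "TMPDIR is now ${TMPDIR}"
```

On the hosts described here, the call would be use_tmpdir "/scratch/$USER/tmp".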

Maintenance

The cache of installed packages will consume a lot of space over time. The default location set for the package cache resides on NetScratch; the terms of use for this storage area require you to clean your cache regularly.

Remove index cache, lock files, unused cache packages, and tarballs

conda clean --all

Update conda without any active environment

conda update conda

Backup

Regular backups are recommended to be able to reproduce an environment used at a certain point in time. Before installing or updating an environment, a backup should always be created in order to be able to revert the changes.

It is not necessary to back up the environments themselves; backing up the exported environment files is sufficient to recreate them exactly.

For a simple backup of all environments the following script can be used:

#!/bin/bash

BACKUP_DIR="${HOME}/conda_env_backup"
MY_TIME_FORMAT='%Y-%m-%d_%H-%M-%S'

NOW=$(date "+${MY_TIME_FORMAT}")
[[ ! -d "${BACKUP_DIR}" ]] && mkdir "${BACKUP_DIR}"
ENVS=$(conda env list |grep '^\w' |cut -d' ' -f1)
for env in $ENVS; do
    echo "Exporting ${env} to ${BACKUP_DIR}/${env}_${NOW}.yml"
    conda env export --name "${env}"> "${BACKUP_DIR}/${env}_${NOW}.yml"
done
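To revert to a backup, the newest export of an environment has to be found first. The sketch below relies on the file naming of the backup script above (environment name, underscore, timestamp); the function name latest_backup is an assumption, and the demonstration uses empty files in a temporary directory.

```shell
#!/bin/bash
# latest_backup DIR ENV: print the newest export of ENV found in DIR.
# Hypothetical helper. The timestamp format of the backup script sorts
# chronologically as text, so the last match of the glob is the newest.
latest_backup() {
    newest=''
    for f in "$1/$2"_[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]_*.yml; do
        if [ -e "${f}" ]; then
            newest="${f}"
        fi
    done
    [ -n "${newest}" ] && printf '%s\n' "${newest}"
}

# Demonstration with empty placeholder files
demo=$(mktemp -d)
touch "${demo}/my_env_2019-05-10_19-07-52.yml" \
      "${demo}/my_env_2019-11-07_07-34-20.yml"
latest_backup "${demo}" my_env
```

After removing the changed environment, it could then be recreated with conda env create --file "$(latest_backup "${HOME}/conda_env_backup" my_env)".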

Programming/Languages/Conda (last edited 2024-11-14 16:07:38 by stroth)