Working with GPU or CPU in data sciences

This article is based on the guide "Setting up a personal python development infrastructure", which is required reading to understand some of the concepts used here. The following information is intended for users of grid computing clusters, i.e. for staff members. It is a collection of hints and explanations for using tools in the field of data sciences on the D-ITET computing infrastructure.

For an introduction to data sciences, have a look at the Guided Data Science Resources. It is a community-sourced repository containing open source learning material about data sciences in general.

Platform information

Information about platform components can be shown by issuing the following commands in a shell:
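The following is a minimal selection of standard Linux utilities covering the most relevant components; nvidia-smi is only present where an NVIDIA driver is installed:

nvidia-smi      # GPU model, driver version, memory usage
lscpu           # CPU architecture and core count
free -h         # total and available memory
lsb_release -a  # operating system release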

NVIDIA CUDA Toolkit

The CUDA toolkit provides a development environment for creating high performance GPU-accelerated applications. It is a necessary software dependency for tools used in GPU computing.

Matching toolkit versions to installed driver

The version of the NVIDIA driver installed on a platform limits the range of CUDA toolkit versions that work with it. The driver version is subject to operating system update policies and cannot be changed by a user with normal privileges. It is not uniform across servers and desktop clients.

For your projects to work it is crucial to match the CUDA toolkit version you install to the NVIDIA driver version available on the platform where your code runs.
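The installed driver version can be queried with nvidia-smi, for example:

nvidia-smi --query-gpu=driver_version --format=csv,noheader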

Installing a specific toolkit version with conda

The easiest way to install the CUDA toolkit is by using conda. Available versions can be shown with

conda search cudatoolkit

The version matching the driver can then be installed with the following command in an active environment:

conda install cudatoolkit=10.1

Missing features

The feature set of the anaconda package cudatoolkit is not identical to that of a toolkit installed manually from the package offered by NVIDIA. For example, the NVIDIA CUDA Compiler nvcc is missing. At the time of writing this article, the alternative was to install the package cudatoolkit-dev, which downloads and installs a full CUDA toolkit.
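A minimal sketch of installing it, assuming the package is available for your platform on the conda-forge channel:

conda install -c conda-forge cudatoolkit-dev

Afterwards, nvcc --version should report the compiler version.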

Installing a specific toolkit version with its official installer

A complete toolkit can be installed outside of a conda virtual environment by using the official installer for the version of choice.

Download the installer

Select your target platform on NVIDIA's CUDA Toolkit download page. This will show either a download button or a wget command with the URL to download the installer:
http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
Note that the minor versions of the toolkit and driver might not be reflected in NVIDIA's dependency matrix.

Install with normal user privileges

The following script facilitates installation by passing options to the installer so that it installs to a custom location without elevated privileges. Please adapt the variables containing version numbers and directory locations to your needs.

#!/bin/bash

# Adapt the following version numbers according to your needs
cuda_version_major='10.1'
cuda_version_minor='243'
driver_version_major='418'
driver_version_minor='87.00'
cuda_version="${cuda_version_major}.${cuda_version_minor}_${driver_version_major}.${driver_version_minor}"

# Adapt the following directory locations according to your needs
cuda_install_dir="/scratch/${USER}/cuda/${cuda_version}"
TMPDIR="/scratch/${USER}/tmp"

cuda_installer="cuda_${cuda_version}_linux.run"

mkdir -p "${cuda_install_dir}" "${TMPDIR}"
if [[ ! -f "${TMPDIR}/${cuda_installer}" ]]; then
    wget "http://developer.download.nvidia.com/compute/cuda/${cuda_version_major}/Prod/local_installers/${cuda_installer}" -O "${TMPDIR}/${cuda_installer}"
fi
if [[ ! -x "${TMPDIR}/${cuda_installer}" ]]; then
    chmod 700 "${TMPDIR}/${cuda_installer}"
fi
echo 'Installing, please be patient.'
if "${TMPDIR}/${cuda_installer}" --silent --override --toolkit --installpath="${cuda_install_dir}" --toolkitpath="${cuda_install_dir}" --no-man-page --tmpdir="${TMPDIR}"; then
    echo 'Done.'
    echo
    echo "To use CUDA Toolkit ${cuda_version_major}.${cuda_version_minor}, extend your environment as follows:"
    echo
    if [[ -z ${PATH} ]]; then
        echo "export PATH=${cuda_install_dir}/bin"
    else
        echo "export PATH=${cuda_install_dir}/bin:\${PATH}"
    fi
    if [[ -z ${LD_LIBRARY_PATH} ]]; then
        echo "export LD_LIBRARY_PATH=${cuda_install_dir}/lib64"
    else
        echo "export LD_LIBRARY_PATH=${cuda_install_dir}/lib64:\${LD_LIBRARY_PATH}"
    fi
else
    cat /tmp/cuda-installer.log
fi
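Assuming the script above is saved as install_cuda.sh (a hypothetical filename), a typical run looks like this:

bash install_cuda.sh
# Extend PATH and LD_LIBRARY_PATH as printed by the script, then verify:
nvcc --version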

Important reminder about working locally

If you're working locally, meaning on a managed Linux desktop or your private machine, always keep in mind that the installed NVIDIA driver version may differ from the one on the computing clusters, so you may need a different CUDA toolkit version locally.

cuDNN library

The cuDNN library is a GPU-accelerated library of primitives for deep neural networks. It is another dependency for GPU computing. In order to use it, NVIDIA asks you to read the Software License Agreement for the library. The library is registered by ISG for research use at D-ITET. If you use the library for any other purpose, you are obliged to register it yourself.

conda automatically installs this library if it is a dependency of another package being installed.
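To check whether cuDNN ended up in the active environment, you can list it explicitly:

conda list cudnn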

pytorch

pytorch is one of the main open source deep learning platforms in use at the time of writing this page. If you haven't done so already, read this installation example.

A good starting point for further information is the official pytorch documentation.

Testing pytorch

To verify the successful installation of pytorch, run the following code in your Python interpreter:

import torch
x = torch.rand(5, 3)
print(x)

The output should be similar to the following:

tensor([[0.4813, 0.8839, 0.1568],
        [0.0485, 0.9338, 0.1582],
        [0.1453, 0.5322, 0.8509],
        [0.2104, 0.4154, 0.9658],
        [0.6050, 0.9571, 0.3570]])
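Since the main question on a GPU platform is whether pytorch can actually see the GPU, the following additional check is useful (a minimal sketch; it only allocates on the GPU if one is available):

import torch

if torch.cuda.is_available():
    x = torch.rand(5, 3, device='cuda')
    print('Tensor allocated on', x.device)
else:
    print('No GPU available, tensors will be allocated on the CPU')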

Environment and platform information

The following example shows how to gather information you can use, for example, to decide whether to run your code on CPU or GPU:

import sys
from subprocess import call

import torch

print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION:', torch.version.cuda)
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Devices:')
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
if torch.cuda.is_available():
    print('Active CUDA Device: GPU', torch.cuda.current_device())
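Based on this information, the usual idiom for choosing between CPU and GPU at runtime is (a minimal sketch):

import torch

# Use the first GPU if one is available, otherwise fall back to the CPU
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
x = torch.rand(5, 3).to(device)
print('Running on', device)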

tensorflow

tensorflow is another popular open source platform for machine learning. If you haven't done so already, read this installation example.

Choose from the available tutorials to learn how to use it.

Platform information

The following code prints information about the capabilities of the platform you run your environment on:

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Lines containing device:XLA_ show which CPU/GPU devices are available.

A line containing cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version means the NVIDIA driver installed on the system you run the code on is not compatible with the CUDA toolkit installed in the environment you run the code from.

An extensive list of device information can be shown with:

from tensorflow.python.client import device_lib
print(device_lib.list_local_devices())

The module tf.test contains helpful functions to gather platform information.
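A minimal sketch using three of these functions (TensorFlow 1.x API, matching the examples above):

import tensorflow as tf

print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPU available:', tf.test.is_gpu_available())
print('GPU device name:', tf.test.gpu_device_name())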

Managing GPU resources

If your code is going to run on a GPU cluster, you need to manage your use of GPU resources. Use the following recommended configuration:

config = tf.ConfigProto()
# Allocate GPU memory on demand instead of reserving it all at startup
config.gpu_options.allow_growth = True
# Fall back to another device if the requested one is not available
config.allow_soft_placement = True
sess = tf.Session(config=config)
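If you need a hard upper bound on GPU memory instead of on-demand growth, ConfigProto also offers a per-process memory fraction; 0.5 below is an arbitrary example value:

config = tf.ConfigProto()
# Reserve at most half of the GPU's memory for this process
config.gpu_options.per_process_gpu_memory_fraction = 0.5
sess = tf.Session(config=config)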