Differences between revisions 3 and 34 (spanning 31 versions)
Revision 3 as of 2019-05-14 10:57:55
Size: 4758
Editor: stroth
Comment:
Revision 34 as of 2023-10-16 13:52:05
Size: 14827
Editor: alders
Comment:
Deletions are marked like this. Additions are marked like this.
Line 1: Line 1:
= Working with GPU or CPU =
Calculations in data sciences run on CPUs and/or GPUs. If you're using tools or writing code in this field, at some point you will need to decide where the calculations are executed. The following information is supposed to help with that decision.
#rev 2020-09-08 stroth

<<TableOfContents()>>

= Working with GPU or CPU in data sciences =
This article is based on the guide "[[Programming/Languages/Conda|Setting up a personal python development infrastructure]]", which is required reading to understand some of the concepts used here. The following information is intended for users of [[Services/SLURM|grid computing clusters]], i.e. for staff members. It is a collection of hints and explanations for using tools in the field of data sciences on the D-ITET computing infrastructure.

For an introduction to data sciences have a look at the [[https://github.com/Chris-Engelhardt/data_sci_guide|Guided Data Science Resources]]. It is a community-sourced repository containing open source learning material about data sciences in general.
Line 5: Line 11:
The D-ITET infrastructure managed by ISG uses NVIDIA GPUs and Intel CPUs exclusively. Available platforms are either managed Linux workstations with a single GPU or GPU clusters.

Information about these components can be shown by issuing the following commands in a shell:
Information about platform components can be shown by issuing the following commands in a shell:
Line 13: Line 17:
== GPU numbering ==
The numbering of GPUs can be confusing as it is non-uniform across different sources of information. One source of information is the so-called ''PCI bus number'', the other is the ''PCI device minor number''. They are generated differently and although their order might match, this cannot be taken for granted!

=== By PCI bus number: CUDA_VISIBLE_DEVICES ===
The environment variable `CUDA_DEVICE_ORDER` controls the numbering of GPUs in a CUDA context. Its default is `FASTEST_FIRST`, which sets the fastest available GPU to be the number 0 in `CUDA_VISIBLE_DEVICES`.<<BR>>
For details, see the section [[https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars|CUDA Environment Variables]] in the CUDA toolkit documentation.<<BR>>
As long as a node only has one type of GPU installed, this numbering can be identical to the ordering enforced by setting `CUDA_DEVICE_ORDER=PCI_BUS_ID`.

=== By PCI device minor number: nvidia-smi/NVML ===
The command `nvidia-smi`, which uses the [[https://developer.nvidia.com/nvidia-management-library-nvml|Nvidia Management Library]] (NVML), numbers GPUs based on the enumeration by the kernel driver. As this can change between node reboots, it should not be used as a constant value.<<BR>>
For details see the related section in the `nvidia-smi` man page by issuing the command `man --pager='less +/--id=ID' nvidia-smi` in your shell.<<BR>>
A GPU can consistently be detected by its UUID or PCI bus ID as follows:
{{{#!highlight bash numbers=disable
nvidia-smi -q |grep -E '(GPU UUID|Minor Number|Bus Id)\s+:' |paste - - - |column -t
}}}

=== By PCI device minor number: Operating system/Kernel driver ===
The GPU ID used by the operating system in /dev/nvidia[0..n] is based on the ''PCI device minor number''. This number is generated by the kernel driver in a non-transparent way and can change after a reboot.<<BR>>
A GPU can consistently be detected by its UUID or PCI bus ID as follows:
{{{#!highlight bash numbers=disable
grep -h -E '(GPU UUID|Device Minor|Bus Location):' /proc/driver/nvidia/gpus/*/information |paste - - - |column -t
}}}

Line 14: Line 42:
The [[https://developer.nvidia.com/cuda-toolkit|CUDA toolkit]] provides a development environment for creating high performance GPU-accelerated applications. It is a necessary software dependency for many tools used in this field.

=== Driver and toolkit versions ===
The CUDA compatibility document by NVIDIA contains a [[https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver|dependency matrix]] matching driver and toolkit versions.
The [[https://developer.nvidia.com/cuda-toolkit|CUDA toolkit]] provides a development environment for creating high performance GPU-accelerated applications. It is a necessary software dependency for tools used in GPU computing.

=== Matching toolkit versions to installed driver ===
The '''version''' of the NVIDIA '''driver''' installed on a platform limits the version range of CUDA '''toolkits''' working with the driver. The driver version is subject to operating system update policies and cannot be changed by a user with normal privileges. It is typically a lower version on desktop clients and a higher version on [[Services/SLURM|Slurm]] GPU nodes.

CUDA toolkits matching the driver of a system are provided in the SEPP package [[https://www.sepp.ee.ethz.ch/sepp-debian/cuda_toolkit-1x.x-sr.html|cuda_toolkit-1x.x-sr]]. A toolkit command like `nvcc` is started through a wrapper which selects the command version matching the driver of the system it is invoked from.<<BR>><<BR>>

If you set up your project in a conda environment, a CUDA toolkit is likely installed as a dependency of another tool you install in your environment.<<BR>>
When you install such a project it is crucial to
 * be aware that you need different environments to run a project on desktop clients and on Slurm GPU nodes
 * check the driver version on the system your project should run on with `nvidia-smi`
 * consult NVIDIA's [[https://docs.nvidia.com/deploy/cuda-compatibility/index.html#binary-compatibility__table-toolkit-driver|dependency matrix]]
 * choose the toolkit version that matches the driver installed on the platform you use and also works as a dependency in your environment
Line 20: Line 58:
Assuming the CUDA toolkit is to be installed in a [[Programming/Languages/Conda|conda environment]], available versions can be shown with
{{{
The easiest way to install the CUDA toolkit is by using [[Programming/Languages/Conda|conda]]. Available versions can be shown with
{{{#!highlight bash numbers=disable
Line 24: Line 62:
And the version matching the driver can be installed with the following command in an active environment:
{{{
conda install cudatoolkit=10.0
}}}

The following examples show how to install a specific `python` version, `pytorch` and `tensorflow` in an environment intended to run either on a managed Linux client, a GPU cluster or a Linux machine without an NVIDIA GPU. The CUDA toolkit versions in the examples are derived from the version of the NVIDIA driver available on a given platform, which always has to be determined before installing an environment. For details see [[#NVIDIA-CUDA-Toolkit|the explanation below]].


These examples install [[https://pytorch.org/|pytorch]] and [[https://www.tensorflow.org/|tensorflow]] including non-python dependencies like the [[https://developer.nvidia.com/cuda-toolkit|CUDA toolkit]] and the [[https://developer.nvidia.com/cudnn|cuDNN library]].

A [[https://software.intel.com/en-us/articles/intel-optimization-for-tensorflow-installation-guide#Anaconda_Intel|CPU version of tensorflow optimized for Intel CPUs]] exists, which might be a tempting choice. Be aware that this version of `tensorflow` and installed dependencies will differ from versions installed from the default channel in the examples above.

As shown in the examples above, environments can be tailored to a platform for optimal performance. Make sure you set up environments for each platform you intend to use. If you follow the examples, the list of installed packages and their version numbers will be identical in all environments, which ensures your environments behave identically on all platforms.



pytorch: what is it, checks, test for cuda/cpu
tensorflow



=== Testing installations ===

==== Testing pytorch ====
And the version matching the driver can be installed with the following command in an active environment:
{{{#!highlight bash numbers=disable
conda install cudatoolkit=<version number>
}}}
/!\ conda defines [[https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-virtual.html|virtual packages]] to resolve dependencies of real packages on features of the operating system it's running on. They can be shown with
{{{#!highlight bash numbers=disable
conda info
}}}
The virtual package `__cuda=<version number>` matches the NVIDIA driver installed on the system. In order to force-install an environment depending on a different driver version, the virtual package can be overridden by setting the environment variable
{{{#!highlight bash numbers=disable
export CONDA_OVERRIDE_CUDA=<version number>
}}}
A typical use case is preparing an environment locally on a Linux workstation to be run on a GPU cluster node:
 * `__cuda=10.1` is shown as the virtual package version on the Linux workstation
 * `__cuda=11.4` is shown as the virtual package version on a GPU node
 * `CONDA_OVERRIDE_CUDA=11.4` is set on the Linux workstation before force-installing the environment

==== Missing features ====
The feature set of the anaconda package `cudatoolkit` is incomplete compared to a toolkit installed with the official installer by NVIDIA. For example, the [[https://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/index.html|NVIDIA CUDA Compiler]] `nvcc` is missing.
At the time of writing this article the alternative was to install the package [[https://anaconda.org/conda-forge/cudatoolkit-dev|cudatoolkit-dev]] which downloads and installs a full CUDA toolkit.<<BR>>
Make sure to [[Programming/Languages/Conda#Change_TMPDIR|set TMPDIR to a location with enough space]] before installing `cudatoolkit-dev`.

=== Installing a specific toolkit version with its official installer ===
A complete toolkit can be installed outside of a conda virtual environment by using the official installer for the version of choice.
==== Download the installer ====
 * Select a toolkit version from the [[https://developer.nvidia.com/cuda-toolkit-archive|toolkit archive]]
 * Select the following to download the installer:
 . Operating System: '''Linux'''
 . Architecture: '''x86_64'''
 . Distribution: ''any''
 . Version: ''any''
 . Installer Type: '''runfile (local)'''
This will show either a download button or a `wget` command with the URL to download the installer:<<BR>>
`http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run`<<BR>>
Note that the minor versions of the toolkit and driver might not be reflected in NVIDIA's dependency matrix.

==== Install with normal user privileges ====
The following script facilitates installation and passes options to the installer in order to install the toolkit in a custom location without elevated privileges. Please adapt the variables containing version numbers to the version of your choice.

{{{#!highlight bash numbers=disable
#!/bin/bash

# Adapt the following version numbers according to your needs
cuda_version_major='10.1'
cuda_version_minor='243'
driver_version_major='418'
driver_version_minor='87.00'
cuda_version="${cuda_version_major}.${cuda_version_minor}_${driver_version_major}.${driver_version_minor}"

# Adapt the following directory locations according to your needs
cuda_install_dir="/scratch/${USER}/cuda/${cuda_version}"
TMPDIR="/scratch/${USER}/tmp"

cuda_installer="cuda_${cuda_version}_linux.run"

mkdir -p "${cuda_install_dir}" "${TMPDIR}"
if [[ ! -f "${TMPDIR}/${cuda_installer}" ]]; then
    wget "http://developer.download.nvidia.com/compute/cuda/${cuda_version_major}/Prod/local_installers/${cuda_installer}" -O "${TMPDIR}/${cuda_installer}"
fi
if [[ ! -x "${TMPDIR}/${cuda_installer}" ]]; then
    chmod 700 "${TMPDIR}/${cuda_installer}"
fi
echo 'Installing, please be patient.'
if "${TMPDIR}/${cuda_installer}" --silent --override --toolkit --installpath="${cuda_install_dir}" --toolkitpath="${cuda_install_dir}" --no-man-page --tmpdir="${TMPDIR}"; then
    echo 'Done.'
    echo
    echo "To use CUDA Toolkit ${cuda_version_major}.${cuda_version_minor}, extend your environment as follows:"
    echo
    if [[ -z ${PATH} ]]; then
        echo "export PATH=${cuda_install_dir}/bin"
    else
        echo "export PATH=${cuda_install_dir}/bin:\${PATH}"
    fi
    if [[ -z ${LD_LIBRARY_PATH} ]]; then
        echo "export LD_LIBRARY_PATH=${cuda_install_dir}/lib64"
    else
        echo "export LD_LIBRARY_PATH=${cuda_install_dir}/lib64:\${LD_LIBRARY_PATH}"
    fi
else
    cat /tmp/cuda-installer.log
fi
}}}

== Important reminder about working locally ==
If you're working locally, meaning on a managed Linux desktop or your private machine, always keep in mind:
 * '''The local GPU might not have enough memory for your project'''
 * '''The CUDA version you're using in your project environment might be too new for the driver installed locally'''

== cuDNN library ==
The [[https://developer.nvidia.com/cudnn|cuDNN library]] is a GPU-accelerated library of primitives for deep neural networks. It is another dependency for GPU computing.
In order to use it NVIDIA asks you to read the [[https://docs.nvidia.com/deeplearning/sdk/cudnn-sla/index.html|Software License Agreement]] for the library. The library is registered by ISG D-ITET to be used for research at D-ITET. If you use the library differently you are obliged to register it yourself.

[[Programming/Languages/Conda|conda]] automatically installs this library if it's a dependency of another package installed.

== pytorch ==
[[https://pytorch.org/|pytorch]] is one of the main open source deep learning platforms in use at the time of writing this page. If you haven't done so already, read this [[Programming/Languages/Conda#Creating_an_environment_with_the_GPU_version_of_pytorch_and_CUDA_toolkit_10|installation example]].

A good starting point for further information is the [[https://pytorch.org/docs/stable/index.html|official pytorch documentation]].

=== Testing pytorch ===
Line 55: Line 164:
from __future__ import print_function
Line 68: Line 176:
To verify CUDA availability in `pytorch`, run the following code:
=== Environment and platform information ===
The following example shows how to gather information you can use, for example, to decide whether to run your code on CPU or GPU:
Line 71: Line 181:
torch.cuda.is_available()
}}}
It should return ''True''.

==== Testing TensorFlow ====
The following code prints information about your `tensorflow` installation:
{{{#!highlight python numbers=disable
import torch
import sys
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION', torch.version.cuda)
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Device capability:', torch.cuda.get_device_capability())
print('__Devices:')
from subprocess import call
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print('Available devices:', torch.cuda.device_count())
print('Current cuda device:', torch.cuda.current_device())
}}}

== tensorflow ==
[[https://www.tensorflow.org/|tensorflow]] is another popular open source platform for machine learning. If you haven't done so already, read this [[Programming/Languages/Conda#Creating_an_environment_with_the_GPU_version_of_tensorflow_and_CUDA_toolkit_10|installation example]].

Choose from the [[https://www.tensorflow.org/tutorials/|available tutorials]] to learn how to use it.

=== Platform information ===
The following code prints information about the capabilities of the platform you run your environment on:
Line 81: Line 207:
Lines containing `device: XLA_` show which CPU/GPU devices are available.

A line containing `cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version` means the NVIDIA driver installed on the system you run the code is not compatible with the CUDA toolkit installed in the environment you run the code from.




== Additional buzzwords to find this article ==
 * Deep learning
 * Machine learning
 * Neural networks
 * Big Data
 
Lines containing `device:XLA_` show which CPU/GPU devices are available.

A line containing `cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version` means the NVIDIA driver installed on the system you run the code on is not compatible with the CUDA toolkit installed in the environment you run the code from.

An extensive list of device information can be shown with:
{{{#!highlight python numbers=disable
from tensorflow.python.client import device_lib
device_lib.list_local_devices()
}}}

The module [[https://www.tensorflow.org/api_docs/python/tf/test|tf.test]] contains helpful functions to gather platform information:
 * [[https://www.tensorflow.org/api_docs/python/tf/test/is_gpu_available|tf.test.is_gpu_available]]
 * [[https://www.tensorflow.org/api_docs/python/tf/test/gpu_device_name|tf.test.gpu_device_name]]

=== Managing GPU resources ===
If your code is going to run on a GPU cluster you need to make sure you [[https://www.tensorflow.org/guide/using_gpu|manage your use of GPU resources]] and use the following recommended configuration:
{{{#!highlight python numbers=disable
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True
sess = tf.Session(config=config)
}}}

Working with GPU or CPU in data sciences

This article is based on the guide "Setting up a personal python development infrastructure", which is required reading to understand some of the concepts used here. The following information is intended for users of grid computing clusters, i.e. for staff members. It is a collection of hints and explanations for using tools in the field of data sciences on the D-ITET computing infrastructure.

For an introduction to data sciences have a look at the Guided Data Science Resources. It is a community-sourced repository containing open source learning material about data sciences in general.

Platform information

Information about platform components can be shown by issuing the following commands in a shell:

  • lscpu
    Shows information about the CPUs, most relevantly the number of CPU cores available in the line starting with CPU(s):

  • nvidia-smi
    Shows the NVIDIA driver version, the CUDA toolkit version and GPUs with their available memory

GPU numbering

The numbering of GPUs can be confusing as it is non-uniform across different sources of information. One source of information is the so-called PCI bus number, the other is the PCI device minor number. They are generated differently and although their order might match, this cannot be taken for granted!

By PCI bus number: CUDA_VISIBLE_DEVICES

The environment variable CUDA_DEVICE_ORDER controls the numbering of GPUs in a CUDA context. Its default is FASTEST_FIRST, which sets the fastest available GPU to be the number 0 in CUDA_VISIBLE_DEVICES.
For details, see the section CUDA Environment Variables in the CUDA toolkit documentation.
As long as a node only has one type of GPU installed, this numbering can be identical to the ordering enforced by setting CUDA_DEVICE_ORDER=PCI_BUS_ID.
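
These variables are read when CUDA is first initialised, so they must be set before the first CUDA call. A minimal python sketch, assuming pytorch is installed (any CUDA-using library behaves the same way):

import os

# Force PCI bus order so CUDA numbering matches nvidia-smi,
# and expose only the first GPU on the bus.
os.environ['CUDA_DEVICE_ORDER'] = 'PCI_BUS_ID'
os.environ['CUDA_VISIBLE_DEVICES'] = '0'

import torch  # import after setting the variables
print(torch.cuda.device_count())  # reports 1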

By PCI device minor number: nvidia-smi/NVML

The command nvidia-smi, which uses the Nvidia Management Library (NVML), numbers GPUs based on the enumeration by the kernel driver. As this can change between node reboots, it should not be used as a constant value.
For details see the related section in the nvidia-smi man page by issuing the command man --pager='less +/--id=ID' nvidia-smi in your shell.
A GPU can consistently be detected by its UUID or PCI bus ID as follows:
For details see the related section in the nvidia-smi man page by issuing the command man --pager='less +/--id=ID' nvidia-smi in your shell.
A GPU can consistently be detected by its UUID or PCI bus ID as follows:

nvidia-smi -q |grep -E '(GPU UUID|Minor Number|Bus Id)\s+:' |paste - - - |column -t

By PCI device minor number: Operating system/Kernel driver

The GPU ID used by the operating system in /dev/nvidia[0..n] is based on the PCI device minor number. This number is generated by the kernel driver in a non-transparent way and can change after a reboot.
A GPU can consistently be detected by its UUID or PCI bus ID as follows:
A GPU can consistently be detected by its UUID or PCI bus ID as follows:

grep -h -E '(GPU UUID|Device Minor|Bus Location):' /proc/driver/nvidia/gpus/*/information |paste - - - |column -t

NVIDIA CUDA Toolkit

The CUDA toolkit provides a development environment for creating high performance GPU-accelerated applications. It is a necessary software dependency for tools used in GPU computing.

Matching toolkit versions to installed driver

The version of the NVIDIA driver installed on a platform limits the version range of CUDA toolkits working with the driver. The driver version is subject to operating system update policies and cannot be changed by a user with normal privileges. It is typically a lower version on desktop clients and a higher version on Slurm GPU nodes.

CUDA toolkits matching the driver of a system are provided in the SEPP package cuda_toolkit-1x.x-sr. A toolkit command like nvcc is started through a wrapper which selects the command version matching the driver of the system it is invoked from.

If you set up your project in a conda environment, a CUDA toolkit is likely installed as a dependency of another tool you install in your environment.
When you install such a project it is crucial to

  • be aware that you need different environments to run a project on desktop clients and on Slurm GPU nodes

  • check the driver version on the system your project should run on with nvidia-smi

  • consult NVIDIA's dependency matrix

  • choose the toolkit version that matches the driver installed on the platform you use and also works as a dependency in your environment (a minimal version check is sketched below)
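
The following python sketch (assuming pytorch is already installed in the active environment) prints the driver version and the CUDA version the environment was built with, for comparison against the dependency matrix:

import subprocess
import torch  # assumption: pytorch is installed in the active environment

# Ask the driver for its version via nvidia-smi.
driver = subprocess.run(
    ['nvidia-smi', '--query-gpu=driver_version', '--format=csv,noheader'],
    capture_output=True, text=True).stdout.strip()
print('NVIDIA driver version:', driver)
print('CUDA version pytorch was built with:', torch.version.cuda)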

Installing a specific toolkit version with conda

The easiest way to install the CUDA toolkit is by using conda. Available versions can be shown with

conda search cudatoolkit

And the version matching the driver can be installed with the following command in an active environment:

conda install cudatoolkit=<version number>

/!\ conda defines virtual packages to resolve dependencies of real packages on features of the operating system it's running on. They can be shown with

conda info

The virtual package __cuda=<version number> matches the NVIDIA driver installed on the system. In order to force-install an environment depending on a different driver version, the virtual package can be overridden by setting the environment variable

export CONDA_OVERRIDE_CUDA=<version number>

A typical use case is preparing an environment locally on a Linux workstation to be run on a GPU cluster node:

  • __cuda=10.1 is shown as the virtual package version on the Linux workstation

  • __cuda=11.4 is shown as the virtual package version on a GPU node

  • CONDA_OVERRIDE_CUDA=11.4 is set on the Linux workstation before force-installing the environment

Missing features

The feature set of the anaconda package cudatoolkit is incomplete compared to a toolkit installed with the official installer by NVIDIA. For example, the NVIDIA CUDA Compiler nvcc is missing. At the time of writing this article the alternative was to install the package cudatoolkit-dev which downloads and installs a full CUDA toolkit.
Make sure to set TMPDIR to a location with enough space before installing cudatoolkit-dev.

Installing a specific toolkit version with its official installer

A complete toolkit can be installed outside of a conda virtual environment by using the official installer for the version of choice.

Download the installer

  • Select a toolkit version from the toolkit archive

  • Select the following to download the installer:
  • Operating System: Linux

  • Architecture: x86_64

  • Distribution: any

  • Version: any

  • Installer Type: runfile (local)

This will show either a download button or a wget command with the URL to download the installer:
http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
Note that the minor versions of the toolkit and driver might not be reflected in NVIDIA's dependency matrix.

Install with normal user privileges

The following script facilitates installation and passes options to the installer in order to install the toolkit in a custom location without elevated privileges. Please adapt the variables containing version numbers to the version of your choice.

#!/bin/bash

# Adapt the following version numbers according to your needs
cuda_version_major='10.1'
cuda_version_minor='243'
driver_version_major='418'
driver_version_minor='87.00'
cuda_version="${cuda_version_major}.${cuda_version_minor}_${driver_version_major}.${driver_version_minor}"

# Adapt the following directory locations according to your needs
cuda_install_dir="/scratch/${USER}/cuda/${cuda_version}"
TMPDIR="/scratch/${USER}/tmp"

cuda_installer="cuda_${cuda_version}_linux.run"

mkdir -p "${cuda_install_dir}" "${TMPDIR}"
if [[ ! -f "${TMPDIR}/${cuda_installer}" ]]; then
    wget "http://developer.download.nvidia.com/compute/cuda/${cuda_version_major}/Prod/local_installers/${cuda_installer}" -O "${TMPDIR}/${cuda_installer}"
fi
if [[ ! -x "${TMPDIR}/${cuda_installer}" ]]; then
    chmod 700 "${TMPDIR}/${cuda_installer}"
fi
echo 'Installing, please be patient.'
if "${TMPDIR}/${cuda_installer}" --silent --override --toolkit --installpath="${cuda_install_dir}" --toolkitpath="${cuda_install_dir}" --no-man-page --tmpdir="${TMPDIR}"; then
    echo 'Done.'
    echo
    echo "To use CUDA Toolkit ${cuda_version_major}.${cuda_version_minor}, extend your environment as follows:"
    echo
    if [[ -z ${PATH} ]]; then
        echo "export PATH=${cuda_install_dir}/bin"
    else
        echo "export PATH=${cuda_install_dir}/bin:\${PATH}"
    fi
    if [[ -z ${LD_LIBRARY_PATH} ]]; then
        echo "export LD_LIBRARY_PATH=${cuda_install_dir}/lib64"
    else
        echo "export LD_LIBRARY_PATH=${cuda_install_dir}/lib64:\${LD_LIBRARY_PATH}"
    fi
else
    cat /tmp/cuda-installer.log
fi

Important reminder about working locally

If you're working locally, meaning on a managed Linux desktop or your private machine, always keep in mind:

  • The local GPU might not have enough memory for your project (a quick check is sketched after this list)

  • The CUDA version you're using in your project environment might be too new for the driver installed locally
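
A quick way to check both points (the second indirectly, since an incompatible toolkit usually makes CUDA unavailable) is the following python sketch, assuming pytorch is installed:

import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f'{props.name}: {props.total_memory / 2**30:.1f} GiB of GPU memory')
else:
    print('No usable GPU; check the local driver against your CUDA toolkit version')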

cuDNN library

The cuDNN library is a GPU-accelerated library of primitives for deep neural networks. It is another dependency for GPU computing. In order to use it NVIDIA asks you to read the Software License Agreement for the library. The library is registered by ISG D-ITET to be used for research at D-ITET. If you use the library differently you are obliged to register it yourself.

conda automatically installs this library if it's a dependency of another package installed.

pytorch

pytorch is one of the main open source deep learning platforms in use at the time of writing this page. If you haven't done so already, read this installation example.

A good starting point for further information is the official pytorch documentation.

Testing pytorch

To verify the successful installation of pytorch run the following python code in your python interpreter:

import torch
x = torch.rand(5, 3)
print(x)

The output should be similar to the following:

tensor([[0.4813, 0.8839, 0.1568],
        [0.0485, 0.9338, 0.1582],
        [0.1453, 0.5322, 0.8509],
        [0.2104, 0.4154, 0.9658],
        [0.6050, 0.9571, 0.3570]])

Environment and platform information

The following example shows how to gather information you can use, for example, to decide whether to run your code on CPU or GPU:

import torch
import sys
print('__Python VERSION:', sys.version)
print('__pyTorch VERSION:', torch.__version__)
print('__CUDA VERSION', torch.version.cuda)
print('__CUDNN VERSION:', torch.backends.cudnn.version())
print('__Number CUDA Devices:', torch.cuda.device_count())
print('__Device capability:', torch.cuda.get_device_capability())
print('__Devices:')
from subprocess import call
call(["nvidia-smi", "--format=csv", "--query-gpu=index,name,driver_version,memory.total,memory.used,memory.free"])
print('Active CUDA Device: GPU', torch.cuda.current_device())
print('Available devices:', torch.cuda.device_count())
print('Current cuda device:', torch.cuda.current_device())
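
Based on such checks, a common pattern is to select the device once and move data and models to it. A minimal sketch (the toy model is purely illustrative):

import torch

# Pick the GPU when one is usable, otherwise fall back to the CPU.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model = torch.nn.Linear(3, 1).to(device)  # hypothetical toy model
x = torch.rand(5, 3, device=device)
print(model(x))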

tensorflow

tensorflow is another popular open source platform for machine learning. If you haven't done so already, read this installation example.

Choose from the available tutorials to learn how to use it.

Platform information

The following code prints information about the capabilities of the platform you run your environment on:

import tensorflow as tf
sess = tf.Session(config=tf.ConfigProto(log_device_placement=True))

Lines containing device:XLA_ show which CPU/GPU devices are available.

A line containing cudaGetDevice() failed. Status: CUDA driver version is insufficient for CUDA runtime version means the NVIDIA driver installed on the system you run the code on is not compatible with the CUDA toolkit installed in the environment you run the code from.

An extensive list of device information can be shown with:

from tensorflow.python.client import device_lib
device_lib.list_local_devices()

The module tf.test contains helpful functions to gather platform information:

  • tf.test.is_gpu_available

  • tf.test.gpu_device_name
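
A minimal sketch of both calls (TensorFlow 1.x-era API, matching the examples above):

import tensorflow as tf

print(tf.test.is_gpu_available())  # True if TensorFlow can use a GPU
print(tf.test.gpu_device_name())   # e.g. '/device:GPU:0', empty string without a GPU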

Managing GPU resources

If your code is going to run on a GPU cluster you need to make sure you manage your use of GPU resources and use the following recommended configuration:

config = tf.ConfigProto()
config.gpu_options.allow_growth = True
config.allow_soft_placement = True
sess = tf.Session(config=config)
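
If you use a newer TensorFlow 2.x release, where Session and ConfigProto no longer exist, the equivalent settings can be expressed as follows (a sketch, untested on this infrastructure):

import tensorflow as tf

# Allocate GPU memory on demand instead of grabbing it all at once.
for gpu in tf.config.list_physical_devices('GPU'):
    tf.config.experimental.set_memory_growth(gpu, True)
# Fall back to the CPU when an operation has no GPU kernel.
tf.config.set_soft_device_placement(True)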
