The focus of this guide is to cover the particulars of using the GPU nodes of Tetralith in a hands-on, example-driven manner. It is not intended to cover GPU computing in general. References to more exhaustive background sources are included for convenience.
The guide assumes familiarity with the “Tetralith - Getting Started Guide”; if you have not already done so, please acquaint yourself with it. It is also assumed that you have a valid login account on Tetralith and a project with access to the GPU nodes of the cluster; the process is described in “Becoming a member of a project” and “Getting a login account”.
The examples in the guide also assume that you are using the ThinLinc VNC solution to access Tetralith, as this provides the best user experience when running the examples, or indeed anything else requiring X-windows graphics. Furthermore, ThinLinc provides session management, letting you suspend and resume your session on the Tetralith login node much like terminal multiplexers (or their front-ends) such as screen, byobu or tmux do when using an SSH login. That said, it is perfectly allowed to use SSH logins and work that way if that is your preference; most parts of this user guide should work very well in that context too.
A few GPU nodes (as of 2020-11-17, 8 nodes, but this can change) are reserved for test and development. Jobs using these nodes cannot be longer than 60 minutes, and you cannot use more than two such nodes at any one time. To use these nodes, add --reservation=now to your interactive/sbatch command and make sure to request a time limit of less than 60 minutes (e.g. -t 00:30:00).
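For example, a short interactive development session on one GPU node could be requested like this (the 30-minute time limit is just an example):

```shell
interactive -N 1 --gpus-per-task=1 --reservation=now -t 00:30:00
```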
From your desktop to an interactive prompt on a Tetralith GPU node in three simple steps:
interactive -n 1 -c 32 --gpus-per-task=1 -t 60
This allocates 1 task comprising 32 CPU cores and 1 GPU for 60 minutes. Tetralith nodes are equipped with at most one GPU, so allocating one task and one GPU per task is the recommended way to use the GPU nodes. Naturally, you should allocate as many cores per task as appropriate for your situation, but we recommend allocating the full node to have exclusive access to all of the node’s resources, including the scratch disk, and to ensure you can make full use of the GPU. This can alternatively be accomplished with the -N 1 switch instead of -n 1 -c 32 above. You should now be presented with an interactive prompt on a GPU node like
[username@tetralith1 ~]$ interactive -n 1 -c 32 --gpus-per-task=1 -t 60
salloc: Granted job allocation 969878
srun: Step created for job 969878
[username@n1296 ~]$
It is sometimes prudent to check that you have access to the GPU devices you have allocated, for instance
[username@n1296 ~]$ echo $CUDA_VISIBLE_DEVICES
0
The above means you have access to GPU device number 0, not that you have zero devices. You can also check using the nvidia-smi tool
[username@n1296 ~]$ nvidia-smi
Mon Aug 24 15:23:20 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01 Driver Version: 440.33.01 CUDA Version: 10.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla T4 On | 00000000:3B:00.0 Off | 0 |
| N/A 36C P8 9W / 70W | 0MiB / 15109MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
An important point to be aware of is that data access between centre storage under /home and /proj and the Tetralith GPU (and CPU) compute nodes is not suitable for I/O intensive loads of the small random-read/write character common in, for instance, Machine Learning. For this reason, the GPU nodes are equipped with scratch disks both larger (2 TB) and faster (2.5 GiB/s) than those of other Tetralith nodes. These disks are suitable for this type of I/O load and are available to each job under /scratch/local, like all scratch disks on Tetralith. Just as on the rest of Tetralith, this scratch space is volatile, accessible on a per-job basis, and will be cleared after each SLURM job.
Transferring data sets to the node local scratch space should preferably be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) in an uncompressed tar archive and transfer it with direct unpacking to scratch in your SLURM job on the allocated GPU node, like
[username@n1296 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local
Avoid using compression of the tar archive; it will add a timing overhead to the data set unpacking on the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives).
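For completeness, the pack-and-unpack workflow can be sketched as below. A temporary directory stands in for your /proj area and the node scratch disk here, so the paths are illustrative only:

```shell
# Stand-in locations; on Tetralith these would be under /proj and /scratch/local
workdir=$(mktemp -d)
mkdir -p "$workdir/DATA_SET" "$workdir/scratch"
echo "example" > "$workdir/DATA_SET/sample.txt"

# Pack the data set into an uncompressed tar archive (done once, on /proj)
tar cf "$workdir/DATA_SET.tar" -C "$workdir" DATA_SET

# In the job: unpack the archive directly to scratch in one large sequential read
tar xf "$workdir/DATA_SET.tar" -C "$workdir/scratch"
```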
Depending on your computational workload, it can sometimes be beneficial to run several jobs concurrently on the GPU. When running interactively, this can be done using a terminal multiplexer such as GNU screen. In a shell script context such as a SLURM batch job, this is accomplished with the & (background) operator combined with wait, for instance like
# <script commands before>
some_program_A input1 &
some_program_B input2 &
wait
# <script commands after>
The wait command makes the script pause at that point and wait for the backgrounded processes to finish before the script progresses.
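Put together, a complete batch script running two programs concurrently on the allocated GPU node might look like the sketch below (the program names and input files are placeholders):

```shell
#!/bin/bash
#SBATCH -N 1
#SBATCH --gpus-per-task=1
#SBATCH -t 1:00:00

# Start both programs in the background so they run concurrently on the GPU
some_program_A input1 &
some_program_B input2 &

# Do not let the job end until both background processes have finished
wait
```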
Access to the GPU nodes is provided to all Tetralith users, much like the large memory nodes are. For login access to the Tetralith GPU nodes NSC recommends using the “ThinLinc” VNC solution, as this will provide the most convenient and reliable access to graphical applications on Tetralith, but SSH can be used as well. Data transfers to Tetralith are described at “Getting data to and from the cluster”.
Project storage is hosted on the NSC Centre Storage, under the /proj and /home directories. /home is a smaller, backed-up area intended for precious data, while /proj is non-backed-up volume storage, typically where projects should store volume data not requiring backup to tape. See NSC’s Centre Storage for more detailed information.
While NSC Centre Storage is generally high-performance, some data access patterns are much less well suited to it than others. In particular, input/output (I/O) loads characterized by predominantly small random read-write operations will not perform well on Centre Storage. For these loads, node local scratch disks are available. General information about scratch space is available at Node local scratch storage. With respect to scratch space, the GPU nodes differ from the CPU-only nodes only in size and performance: the GPU nodes have 2 TB of scratch storage space and performance around 2.5 GB/s and 600k IOPS random-read.
The GPU nodes of Tetralith do not have any special status with respect to SLURM, i.e. they will also be filled with jobs not requesting a GPU, but only after the regular nodes, much in the same way as the large memory nodes of Tetralith. Allocating GPU resources is done via the SLURM resource manager like on the rest of Tetralith, see “Tetralith Batch Jobs”, with the addition of the --gpus-per-task switch. For example, allocating a single node with a GPU for interactive work via a terminal on Tetralith can be done as follows
[username@tetralith1 ~] $ interactive -n 1 -c 32 -t 1:00:00 --gpus-per-task=1
A batch job script analogous to the interactive command above would look like
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 32
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=1
# Job script commands follow
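Assuming the script above is saved as, say, job.sh (a placeholder name), it is submitted and monitored with the standard SLURM commands:

```shell
sbatch job.sh        # submit the job; prints the job ID
squeue -u $USER      # list your queued and running jobs
scancel <jobid>      # cancel a job if needed
```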
Using a GPU incurs no extra “charge” to your CPU hour allocation, making the use of the GPU nodes transparent to you as a user and equivalent to using other special features of the cluster, such as the large memory nodes.
Working interactively at the CLI prompt with SLURM-allocated resources is most conveniently done using the NSC command interactive, which is a wrapper around the SLURM salloc and srun commands, and as such accepts the switches supported by these. In short, the interactive command launches a shell process on the first allocated node of a job, drops you into the shell and lets you interact with the allocated resources from there, much like any other CLI prompt. More information about interactive can be found at Running applications under the “Interactive jobs” heading. More information about batch jobs can be found at “Batch jobs and scheduling”.
Running GPU-only jobs (i.e. with no MPI communications involved) is typically done exactly as you would run them on your laptop or workstation, after having allocated resources via SLURM. It becomes a bit more involved when using MPI. In principle, when running applications built at NSC with the NSC provided build environments, the general process is described at “Running parallel jobs with mpprun”, but there are many application specific caveats to take into account, and the application documentation should be consulted when running GPU+MPI applications. Performance is currently not expected to scale well (i.e. 2x GPUs yielding close to 2x performance) when running GPU+MPI jobs on more than one node, due to limitations in the interconnect driver and the data transport (PSM2) libraries in use. It should be functional, though; only the GPUdirect feature over the OmniPath interconnect is currently unsupported at NSC.
Checking various aspects of your job’s status at NSC as it is running is normally done using the jobsh script to access the nodes of the job. However, this is not possible at present when it comes to monitoring GPU jobs, e.g. any SLURM job allocated with --gpus-per-task, due to limitations in the resource manager. Instead, it is suggested that, at least for interactive jobs, you use a terminal multiplexer on the GPU node such as tmux or screen before starting your job. These multiplexers will allow you to open a second shell prompt within the job (check the tmux or screen man pages for how) to check GPU usage with nvidia-smi or other tools as you go.
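As a sketch, a tmux-based monitoring session on the GPU node could look like this (the session name is arbitrary; see the tmux man page for the key bindings):

```shell
# On the GPU node, start a named tmux session before launching your job
tmux new -s gpujob

# In the first pane, start your GPU program; then open a second pane
# with Ctrl-b % and monitor GPU utilisation there, e.g.
watch -n 1 nvidia-smi

# Detach with Ctrl-b d; re-attach later with
tmux attach -t gpujob
```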
Transferring data sets to the node local scratch space should preferably be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) in an uncompressed tar archive and transfer it with direct unpacking to scratch in your SLURM job on the allocated GPU node, like
[username@n1296 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local
Carrying out your data transfers this way, you can expect a data transfer speed of about 700 MB/s to these scratch disks, e.g. a 42 GB tar archive should unpack to local disk in around a minute on the GPU nodes of Tetralith. On regular “thin” Tetralith nodes, the performance is bottlenecked by the slower local disk of those nodes to something like 200 MB/s.
Avoid using compression of the tar archive; it will add a timing overhead to the data set unpacking on the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives). If you absolutely must use a compressed tar archive, you can perform the decompression using a parallelized implementation of your compression program, for instance pigz when using gzip-compressed archives
tar xf /proj/some/path/to/DATA_SET.tar.gz --use-compress-program=pigz -C /scratch/local
The different parallel compression programs vary in how well their parallelization is implemented, so your mileage may vary with respect to unpacking times.
The user environment at NSC is handled with the module system, as described on our Module system page. Find out if there is GPU support for the software title you’re interested in using
module avail YOUR_SW_TITLE
and see if it has a string indicating any GPU capability, like “cuda”, “opencl” or “openacc” for instance. Also, check out the software-specific documentation for your title, for details on how to use it at NSC.
All CUDA enabled modules suitable for use on the GPU nodes can be listed with module avail cuda, where cuda is a case insensitive search string. An important module is the buildenv-gcccuda/<CUDA_VERSION>-<BASE_BUILDENV_VERSION> module, which sets up a CUDA toolkit build environment along with a GCC compiler, MPI and math libraries custom built against this CUDA version. Initially, there is one module available, called buildenv-gcccuda/10.2-7.3.0-bare (which will be complemented down the line). To use it, issue module load buildenv-gcccuda/10.2-7.3.0-bare at the CLI prompt.
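With the build environment loaded, a CUDA source file can be compiled with nvcc. The source file name below is a placeholder; the -arch flag targets the Turing architecture (CUDA compute capability 7.5) of the Tetralith GPUs:

```shell
module load buildenv-gcccuda/10.2-7.3.0-bare
# Compile for the Tesla T4 (Turing, CUDA compute capability 7.5)
nvcc -arch=sm_75 -O2 my_program.cu -o my_program
```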
You can of course also install whatever CUDA version you need to your own directories and use that (with a suitable compiler). Note, though, that NSC will be less able to support you if you use your own CUDA installations. If you have no requirement on MPI working with your own CUDA installation, you should be good to go as soon as you set the following CUDA-specific environment variables: CUDA_HOME, CUDA_ROOT and CUDA_PATH, all pointing to the root of your CUDA installation, e.g. (assuming you have downloaded the CUDA toolkit .run file)
[username@tetralith1 ~] $ cuda_installpath="/proj/<project_name>/some/path/CUDA/10.2.89_440.33.01"
[username@tetralith1 ~] $ sh ./cuda_10.2.89_440.33.01_linux.run --toolkit --silent --installpath=${cuda_installpath}
[username@tetralith1 ~] $ export CUDA_HOME=${cuda_installpath} CUDA_ROOT=${cuda_installpath} CUDA_PATH=${cuda_installpath}
There are also other ways you may need to modify your environment; consult the CUDA toolkit installation manual. The reason for the many different variables pointing at the CUDA installation root is historical; in practice you may only need to set one of them for your purposes.
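Typically you will also want the toolkit binaries and libraries of your own installation on the relevant search paths, for instance (using the cuda_installpath variable from the example above):

```shell
export PATH=${cuda_installpath}/bin:${PATH}
export LD_LIBRARY_PATH=${cuda_installpath}/lib64:${LD_LIBRARY_PATH}
```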
NSC most often recommends building your software on the login nodes, as these have the most complete build environment and (for all practical purposes) hardware identical to the compute nodes. This simplifies the hardware and software discovery procedure when building often very complex software. However, the situation is different with respect to the GPUs, which are of different hardware generations on the compute nodes and the login nodes. NSC still recommends building your software on the login nodes, but if you encounter problems where you cannot set the appropriate GPU compute capability (the Turing architecture, CUDA compute capability 7.5) at build time and you depend on hardware discovery, build your code on the GPU compute nodes instead. This may very well produce its own set of problems, and in such cases, don’t hesitate to contact support.
A highly recommended development environment on the GPU nodes is the Singularity container solution. Singularity is a container environment much like Docker (it can import Docker images directly), but adapted to work in an HPC setting. All compute nodes of the Tetralith cluster have Singularity installed; see our Singularity page for more details on how it is set up at NSC, and https://sylabs.io for the canonical information. The NSC documentation has some notes on Singularity security and trusted sources of containers. In addition to the trusted sources mentioned there, you may trust the NVIDIA NGC container registry used in the examples below.
Being a container solution, using a suitable Singularity image as your user environment brings many advantages, such as a familiar operating system and CUDA environment of choice, convenient portability between systems, and reproducibility of results. Additionally, Singularity can import well-optimized Docker containers directly from the NVIDIA NGC registry or Docker Hub, and also offers the possibility of modifying these to fit your needs. Examples of how to do this are provided in the Development Tools section.
If you are using Python as your development platform, another approach to managing your user environment is the Conda package management system. NSC recommends using Conda over vanilla Python virtual environments, as this seems to be the more favoured solution when it comes to Python codes on the GPU, and it has several functional advantages as well. Python virtualenvs certainly work, though, if that is your preference. When installing Python modules requiring compilation into your Conda environment, be sure to have loaded a suitable build environment module, e.g. buildenv-gcccuda/10.2-7.3.0-bare.
This guide will not cover Conda further; instead, consult the official documentation at https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html for its use. Check which Python Conda modules are available with module avail conda, or install some other version to your own directories. Since Conda environments can be rather voluminous, a tip is to make your /home/<username>/.conda directory a symbolic link to some place in your project directory (in principle /proj/some/path/to/your/conda/dir).
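Setting up such a symbolic link can be done as follows; the /proj path is a placeholder for a directory you have created in your project storage:

```shell
# Move any existing Conda data to project storage, then link it back
mv ~/.conda /proj/some/path/to/your/conda/dir
ln -s /proj/some/path/to/your/conda/dir ~/.conda
```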
Debuggers and profilers specific to CUDA are provided by NVIDIA and can be accessed by loading a CUDA-enabled build environment like buildenv-gcccuda/10.2-7.3.0-bare. Documentation on these can be found at https://developer.nvidia.com/cuda-gdb, https://developer.nvidia.com/nsight-eclipse-edition and https://docs.nvidia.com/cuda/profiler-users-guide/index.html.
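For instance, with the build environment module loaded, an application (here the placeholder my_app) can be profiled or debugged from the command line:

```shell
# Command line profiling summary with the CUDA profiler
nvprof ./my_app

# Interactive CUDA-aware debugging session
cuda-gdb ./my_app
```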
At the time of writing (2020-09), the version of Singularity installed on the GPU nodes is from the 3.6 series, but you should use a version compatible with whatever is current on the cluster when making your own images. You can check the current version on a compute node with rpm -qi singularity. It is important to understand that you will need to build your Singularity image on a computer where you have administrator (i.e. super user or root) privileges. Typically you would have this on your own laptop or workstation, but not anywhere on Tetralith (sudo does not work here), i.e. any sudo invocation in the following examples assumes you are working on your local computer where you have root privileges. Installing a recent enough version of Singularity on your computer will not be covered here; check out the official documentation at https://sylabs.io.
Here follow a few examples of what you can do with Singularity. The examples barely scratch the surface of what can be done; consult the official documentation for a comprehensive guide.
This pulls a Docker image from the NVIDIA NGC registry and makes a Singularity image (.sif) of it
sudo singularity pull tensorflow-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
[sudo] password for user:
WARNING: Authentication token file not found : Only pulls of public images will succeed
INFO: Starting build...
Getting image source signatures
Copying blob sha256:423ae2b273f4c17ceee9e8482fa8d071d90c7d052ae208e1fe4963fceb3d6954
25.46 MiB / 25.46 MiB [====================================================] 5s
Copying blob sha256:de83a2304fa1f7c4a13708a0d15b9704f5945c2be5cbb2b3ed9b2ccb718d0b3d
34.54 KiB / 34.54 KiB [====================================================] 0s
---8<------ SNIP ------->8----
Copying config sha256:fabc6c87fbf06db2bbd63455d5e2e95ce5d7cfcc732dc3af632f89cb25627d7f
37.30 KiB / 37.30 KiB [====================================================] 0s
Writing manifest to image destination
Storing signatures
INFO: Creating SIF file...
INFO: Build complete: tensorflow-20.03-tf2-py3.sif
Check the NVIDIA NGC registry at https://ngc.nvidia.com/catalog/all for other images of interest. For example, a PyTorch image using Python 3 can be downloaded with
sudo singularity pull pytorch_20.03-py3.sif docker://nvcr.io/nvidia/pytorch:20.03-py3
As a side note, pulling from the official PyTorch Docker repository, the same can be accomplished with
sudo singularity pull pytorch_latest.sif docker://pytorch/pytorch:latest
The image pulled from the NGC or elsewhere may need to be adapted to suit your requirements better. In that case you will want to create a writable sandboxed directory from your .sif image, enter and modify it, and finally create an updated version of your image. For instance, adding a package (here vim-gtk) can be done with
sudo singularity build --sandbox pytorch_latest pytorch_latest.sif
sudo singularity shell --writable pytorch_latest/
Singularity pytorch_latest/:~> apt-get update
Singularity pytorch_latest/:~> apt-get install vim-gtk
Singularity pytorch_latest/:~> exit
sudo singularity build pytorch_latest.v2.sif pytorch_latest
You can make any installations into the image, including with pip or using the container image’s OS-native build tools (gcc etc.) to manually build and install whatever you may need. If you are doing very elaborate installs, you may want to use persistent overlays instead; see https://sylabs.io/guides/3.6/user-guide/persistent_overlays.html.
If you need more control over what goes into your image, you will need to use your own Singularity definition files (image recipes, if you will). Using the following Singularity definition file, we can for instance build a Singularity image containing PyTorch, CUDA 10.2 and lots of added extras
Bootstrap: library
From: ubuntu:18.04
%post
# Get the packages from a nearby location (well, if Sweden is close to you at least)
cat << EOF > /etc/apt/sources.list
deb http://se.archive.ubuntu.com/ubuntu/ bionic main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu bionic-security main restricted
deb http://security.ubuntu.com/ubuntu bionic-security universe
deb http://security.ubuntu.com/ubuntu bionic-security multiverse
EOF
# Downloads the latest package lists
apt-get update -y
# Install required and reasonable extra distro packages
DEBIAN_FRONTEND=noninteractive apt-get -y --no-install-recommends install \
build-essential \
wget \
git \
software-properties-common \
python3 \
python3-tk \
python3-pip \
gdb \
freeglut3-dev \
dirmngr \
gpg-agent \
python3-setuptools \
python-dev \
python3-dev \
python3-wheel \
vim-gtk \
nano \
openmpi-bin \
libopenmpi-dev \
openssh-client
# Install extras, the atom editor for instance like here
add-apt-repository ppa:webupd8team/atom
apt-get update -y
apt-get -y install atom
# Get the NVIDIA repos
apt-key adv --fetch-keys \
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub
wget -O /etc/apt/preferences.d/cuda-repository-pin-600 \
https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin
add-apt-repository \
"deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"
# Install CUDA (everything and the kitchen sink style)
apt-get update -y
# Make sure to match the CUDA version to whatever other packages are being
# installed
DEBIAN_FRONTEND=noninteractive apt-get -y install \
cuda-10-2 \
cuda-toolkit-10-2 \
cuda-samples-10-2 \
cuda-documentation-10-2
# Reduce the size of the image by deleting the package lists we downloaded,
# which are no longer needed.
rm -rf /var/lib/apt/lists/*
# Install Python modules. Make sure the CUDA-utilising python packages are
# compatible with whatever CUDA version was installed above.
pip3 install \
numpy \
matplotlib \
jupyter \
torch \
torchvision \
tqdm
# Set up the environment to be used within the container.
%environment
export LC_ALL=C
export PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin
export CPATH=/usr/local/cuda/include:$CPATH
export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
Build your image from your definition file (here my_image.def) like
sudo singularity build my_image.sif my_image.def
NB: Singularity images can only be launched from under the /proj file system, i.e. the image must be placed there when launched.
Using a toy PyTorch learning example train_xor.py, we can execute it on the GPU by way of the PyTorch-containing image (in effect executing it in that operating system environment), for instance like this
[username@n1296 ~] $ singularity shell --nv pytorch_20.03-py3.sif
Singularity> python train_xor.py
CUDA is available -- using GPU
iteration #1
loss: 0.24764549732208252
accuracy: 50.00%
iteration #2
loss: 0.2470039427280426
accuracy: 50.00%
---8<---- SNIP ---->8---
iteration #100
loss: 0.006982346531003714
accuracy: 100.00%
Singularity> exit
exit
[username@n1296 ~] $
The --nv switch to singularity above is crucial for GPU execution, as it imports the relevant host OS NVIDIA runtime libraries and devices (i.e. from CentOS 7 of the GPU node) into the image to permit execution on the GPUs from within the Singularity container.
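As an alternative to an interactive shell, a script can also be run non-interactively with singularity exec, which is convenient in batch jobs:

```shell
singularity exec --nv pytorch_20.03-py3.sif python train_xor.py
```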