Tetralith - GPU User Guide

The focus of this guide is to cover the particulars of using the GPU nodes of Tetralith conveniently in a hands-on, practical manner focused on examples. It is not intended to cover GPU compute usage in general. References to more exhaustive background sources are included as a matter of convenience.

The guide assumes familiarity with the “Tetralith - Getting Started Guide”; if you have not read it already, please acquaint yourself with it. It is also assumed that you have a valid login account on Tetralith and a project with access to the GPU nodes of the cluster. The process is described in “Becoming a member of a project” and “Getting a login account”.

The examples in the guide will also assume that you are using the ThinLinc VNC solution to access Tetralith, as this will provide the best user experience running the examples or indeed anything else requiring X-windows graphics. Furthermore, ThinLinc will provide session management where you can suspend and resume your session on the Tetralith login node in a way akin to that of terminal multiplexers (or their front-ends) such as screen, byobu or tmux when using an SSH login. That said, you are perfectly free to use SSH logins and work that way if that is your preference; most parts of this user guide should work well in that context too.

A few GPU nodes (as of 2020-11-17, 8 nodes, but this can change) are reserved for test and development. Jobs using these nodes cannot be longer than 60 minutes, and you cannot use more than two such nodes at any one time. To use these nodes, add --reservation=now to your interactive/sbatch command and make sure to request a time limit of less than 60 minutes (e.g. -t 00:30:00).
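As a minimal sketch of the rules above, the hypothetical helper below (devel_gpu_args is not an NSC tool) prints the switches for a one-node allocation on the test and development nodes and refuses time limits of 60 minutes or more. The core and GPU counts are borrowed from the Quick Start example and are assumptions you may need to adjust.

```shell
# Hypothetical helper, not an NSC-provided command: print the switches for a
# short allocation on the nodes reserved for test and development.
devel_gpu_args() {
    minutes="$1"
    # The guide asks for a time limit of less than 60 minutes.
    if [ "$minutes" -ge 60 ]; then
        echo "error: time limit must be below 60 minutes" >&2
        return 1
    fi
    # SLURM interprets a bare number given to -t as minutes.
    printf '%s\n' "-n 1 -c 32 --gpus-per-task=1 -t ${minutes} --reservation=now"
}

devel_gpu_args 30
```

On a login node the output can be passed on directly, for instance as interactive $(devel_gpu_args 30).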

Quick Start Guide

From your desktop to an interactive prompt on a Tetralith GPU node in three simple steps:

  1. Log in to Tetralith via the ThinLinc client (or SSH)
  2. Open a terminal console (via the desktop menu for instance) in the ThinLinc desktop. If you used SSH to log in, this step doesn’t apply since you’re already in a console.
  3. In the console, allocate a GPU node with the command: interactive -n 1 -c 32 --gpus-per-task=1 -t 60

This allocates 1 task comprising 32 CPU cores and 1 GPU for 60 minutes. Tetralith only has nodes equipped with at most one GPU, so allocating one task and one GPU per task is the recommended way to use the GPU nodes. Naturally, you should allocate as many cores per task as is appropriate for your situation, but we recommend allocating the full node to have exclusive access to all of the node’s resources, including the scratch disk, to ensure you can make full use of the GPU. This can alternatively be accomplished with the -N 1 switch instead of -n 1 -c 32 above. You should now be presented with an interactive prompt on a GPU node like

[username@tetralith1 ~]$ interactive -n 1 -c 32 --gpus-per-task=1 -t 60
salloc: Granted job allocation 969878
srun: Step created for job 969878
[username@n1296 ~]$ 

It is sometimes prudent to check that you have access to the GPU devices you have allocated, for instance

[username@n1296 ~]$ echo $CUDA_VISIBLE_DEVICES
0

The above means you have access to GPU device number 0, not that you have 0 devices. You can also check using the nvidia-smi tool

[username@n1296 ~]$ nvidia-smi 
Mon Aug 24 15:23:20 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla T4            On   | 00000000:3B:00.0 Off |                    0 |
| N/A   36C    P8     9W /  70W |      0MiB / 15109MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
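The CUDA_VISIBLE_DEVICES check can also be scripted, for instance at the top of a job script. The sketch below is only an illustration: count_visible_gpus is a hypothetical helper name, and it assumes the variable holds a comma-separated list of device identifiers. On Tetralith, with at most one GPU per node, a correctly allocated job would be expected to report 1.

```shell
# Hypothetical helper: count the devices listed in a
# CUDA_VISIBLE_DEVICES-style value (comma-separated identifiers).
count_visible_gpus() {
    devices="$1"
    if [ -z "$devices" ]; then
        # Empty or unset means no GPU is visible to the job.
        echo 0
        return
    fi
    # Split on commas: "0" gives one device, "0,1" gives two.
    old_ifs="$IFS"
    IFS=','
    set -- $devices
    IFS="$old_ifs"
    echo "$#"
}

count_visible_gpus "${CUDA_VISIBLE_DEVICES:-}"
```

Note that the value 0 in the variable names device number 0, which this helper counts as one device.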

An important point to be aware of is that data access between centre storage under /home and /proj and the Tetralith GPU (and CPU) compute nodes is not suitable for I/O intensive loads of small random-read/write character common in for instance Machine Learning. For this reason, the GPU nodes are equipped with scratch disks both larger (2 TB) and faster (2.5 GiB/s) compared to other Tetralith nodes. These disks are suitable for this type of I/O load and are available to each job under /scratch/local like all scratch disks on Tetralith. Just as on the rest of Tetralith, this scratch space is volatile, accessible on a per-job basis, and will be cleared after each SLURM job.

Transferring data sets to the node local scratch space should by preference be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) in an uncompressed tar archive and transfer this with direct unpacking to scratch in your SLURM job on the allocated GPU node like

[username@n1296 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local

Avoid compressing the tar archive, as decompression adds a timing overhead to the data set unpacking in the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives).
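For completeness, here is a small sketch of how such an uncompressed archive can be created and unpacked. It uses temporary directories as stand-ins for /proj/some/path/to and /scratch/local so that it can be tried anywhere; the file name inside the data set is made up for the illustration.

```shell
# Stand-ins for the real locations (assumptions for this sketch).
proj_dir=$(mktemp -d)      # plays the role of /proj/some/path/to
scratch_dir=$(mktemp -d)   # plays the role of /scratch/local

# A tiny dummy data set.
mkdir -p "$proj_dir/DATA_SET"
printf 'sample\n' > "$proj_dir/DATA_SET/file_0001.txt"

# Create the archive: c = create, f = archive file. Leaving out z and j
# keeps the archive uncompressed. -C makes the paths inside the archive
# relative to the project directory.
tar cf "$proj_dir/DATA_SET.tar" -C "$proj_dir" DATA_SET

# Unpack directly into the scratch stand-in, as in the command above.
tar xf "$proj_dir/DATA_SET.tar" -C "$scratch_dir"
ls "$scratch_dir/DATA_SET"
```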

Depending on your computational workload, it can sometimes be beneficial to run several jobs concurrently on the GPU. When running interactively, this can be done using a terminal multiplexer such as GNU screen. In a shell script context such as a SLURM batch job, this is accomplished with the & (background) operator combined with wait, for instance like

# <script commands before>

some_program_A input1 &
some_program_B input2 &
wait

# <script commands after>

The wait command makes the script pause at that point and wait for the backgrounded processes to finish before the script progresses.
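A plain wait does not tell you whether any of the backgrounded programs failed. One possible refinement, sketched below, is to remember each process ID and wait for them one by one. The two shell functions are stand-ins for real GPU programs and exist only for this illustration.

```shell
# Stand-ins for real programs (hypothetical): one succeeds, one fails.
task_a() { sleep 1; return 0; }
task_b() { sleep 1; return 3; }

# Start both in the background and remember their process IDs.
task_a &
pid_a=$!
task_b &
pid_b=$!

# Wait for each process separately to collect its exit status.
status_a=0
status_b=0
wait "$pid_a" || status_a=$?
wait "$pid_b" || status_b=$?

echo "task_a exit status: $status_a"
echo "task_b exit status: $status_b"
```

With the statuses in hand, the script can decide whether to continue with post-processing or stop early.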

Getting Access

Access to the GPU nodes is provided to all Tetralith users, much like the large memory nodes are. For login access to the Tetralith GPU nodes NSC recommends using the “ThinLinc” VNC solution, as this will provide the most convenient and reliable access to graphical applications on Tetralith, but SSH can be used as well. Data transfers to Tetralith are described at “Getting data to and from the cluster”.

Storage Available

Project storage is hosted on the NSC Centre Storage, under the /proj and /home directories. /home is a smaller, backed-up area intended for precious data, whereas /proj is volume storage without backup, which is typically where projects should store volume data not requiring backup to tape. See NSC’s Centre Storage for more detailed information.

While NSC Centre Storage is generally high-performance, some data access patterns are much less well suited for it than others. In particular, input-output (I/O) loads characterized by predominantly small random read-write operations will not perform well on Centre Storage. For these loads, node local scratch disks are available. General information about scratch space is available at Node local scratch storage. With respect to scratch space, the GPU nodes differ from the CPU-only nodes only in size and performance: the GPU nodes have 2 TB of scratch storage space with performance of around 2.5 GB/s and 600k IOPS random-read.

Allocating GPU Resources

The GPU nodes of Tetralith do not have any special status with respect to SLURM, i.e. they will also be filled with jobs not requesting a GPU, but only after the regular nodes, much in the same way as the large memory nodes of Tetralith. Allocating GPU resources is done via the SLURM resource manager like on the rest of Tetralith, see “Tetralith Batch Jobs”, with the addition of the --gpus-per-task switch. For example, allocating a single node with a GPU for interactive work via a terminal on Tetralith can be done as follows

[username@tetralith1 ~] $ interactive -n 1 -c 32 -t 1:00:00 --gpus-per-task=1

A batch job script analogous to the interactive command above would look like

#!/bin/bash
#SBATCH -n 1
#SBATCH -c 32
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=1

# Job script commands follow

Using a GPU, there is no extra “charge” to your CPU hour allocation, making the use of the GPU nodes transparent to you as a user and equivalent to using other special features of the cluster such as large memory nodes.
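Putting the pieces together, a complete job script might stage the data set to scratch, run the application, and copy results back before the scratch space is cleared. The sketch below writes such a script to a file and syntax-checks it without running it. The program name, its options and the results directory are hypothetical placeholders, and the /proj paths are the elided example paths used elsewhere in this guide.

```shell
# Write an example job script to gpu_job.sh (quoted EOF: nothing is expanded).
cat > gpu_job.sh << 'EOF'
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 32
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=1

# Stop at the first failing command, so a failed unpack does not start the run.
set -e

# Stage the data set to the fast node-local scratch disk.
tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local

# Run the application on the local copy (hypothetical program and options).
some_program --input /scratch/local/DATA_SET --output /scratch/local/results

# Scratch is cleared when the job ends, so copy the results back first.
cp -r /scratch/local/results /proj/some/path/to/
EOF

# Check the script for syntax errors without executing it.
bash -n gpu_job.sh
```

On a Tetralith login node the script would then be submitted with sbatch gpu_job.sh.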

Running Jobs

Working interactively at the CLI prompt with SLURM allocated resources is most conveniently done by using the NSC command interactive, which is a wrapper around the SLURM salloc and srun commands, and as such accepts the switches supported by these. In short, the interactive command launches a shell process on the first allocated node of a job, drops you in the shell and lets you interact with the allocated resources from there, much like any other CLI prompt. More information about interactive can be found at Running applications under the “Interactive jobs” heading. More information about batch jobs can be found at “Batch jobs and scheduling”.

Running GPU jobs without MPI (i.e. no MPI communication involved) is typically done exactly as you would run them on your laptop or workstation, after having allocated resources via SLURM. It becomes a bit more involved when using MPI. In principle, when running applications built at NSC with the NSC provided build environments, the general process is described at “Running parallel jobs with mpprun”, but there are many application-specific caveats to take into account, and the application documentation should be consulted when running GPU+MPI applications. GPU+MPI jobs running on more than one node are currently not expected to scale well (i.e. 2x GPUs will not yield close to 2x performance), due to limitations in the interconnect driver and the data transport (PSM2) libraries in use. Such jobs should still be functional; only the GPUdirect feature over the OmniPath interconnect is currently unsupported at NSC.

Checking Your Job

Checking various aspects of your job’s status at NSC as it is running is normally done using the jobsh script to access nodes of the job. However, this is not possible at present for GPU jobs, i.e. any SLURM job allocated with --gpus-per-task, due to limitations in the resource manager. Instead, at least for interactive jobs, we suggest starting a terminal multiplexer such as tmux or screen on the GPU node before starting your job. These multiplexers will allow you to open a second shell prompt within the job (check the tmux or screen man pages for how) to check GPU usage with nvidia-smi or other tools as you go.

Data Transfer Between Centre Storage and Local Disk

Transferring data sets to the node local scratch space should by preference be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) in an uncompressed tar archive and transfer this with direct unpacking to scratch in your SLURM job on the allocated GPU node like

[username@n1296 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local

Carrying out your data transfers this way, you can expect a data transfer speed of about 700 MB/s to these scratch disks, e.g. a 42 GB tar archive should be unpacked to local disk in around a minute on the GPU nodes of Tetralith. On regular “thin” Tetralith nodes, the performance is bottle-necked by the slower local disk of these nodes to something like 200 MB/s.
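The estimate above is simple arithmetic, which can be repeated for your own data set sizes. The sketch below assumes 1 GB = 1000 MB and uses the transfer rates quoted above; estimate_unpack_seconds is a hypothetical helper name, and the result is truncated to whole seconds by the shell's integer division.

```shell
# Hypothetical helper: estimated unpacking time in whole seconds.
#   $1 = archive size in GB, $2 = sustained transfer rate in MB/s
estimate_unpack_seconds() {
    echo $(( $1 * 1000 / $2 ))
}

estimate_unpack_seconds 42 700   # GPU node scratch disk: prints 60
estimate_unpack_seconds 42 200   # regular "thin" node:   prints 210
```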

Avoid compressing the tar archive, as decompression adds a timing overhead to the data set unpacking in the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives). If you absolutely must use a compressed tar archive, you can perform the decompression using a parallelized implementation of your compression program, for instance pigz when using gzip-compressed archives

tar --use-compress-program=pigz -xf /proj/some/path/to/DATA_SET.tar.gz -C /scratch/local

The quality of the parallelization varies between the different parallel compression programs, so your mileage may vary with respect to unpacking times.

Setting Up Your User Environment

The user environment at NSC is handled with the module system as described on our Module system page. Find out if there is GPU support for the software title you’re interested in using module avail YOUR_SW_TITLE and see if it has a string in it indicating any GPU capability, like “cuda”, “opencl” or “openacc” for instance. Also, check out the software specific documentation for your title, for details on how to use it at NSC.

CUDA Development Environment

All CUDA enabled modules suitable for use on the GPU nodes can be listed with module avail cuda, where cuda is a case insensitive search string. An important module is the buildenv-gcccuda/<CUDA_VERSION>-<BASE_BUILDENV_VERSION> module, which sets up a CUDA toolkit build environment along with a GCC compiler, MPI and math libraries custom built against this CUDA version. Initially, there is one module available called buildenv-gcccuda/10.2-7.3.0-bare (which will be complemented down the line). To use it issue module load buildenv-gcccuda/10.2-7.3.0-bare on the CLI prompt.

You can of course also install whatever CUDA version you need to your own directories and use that (with a suitable compiler). Note though, that NSC will be less able to support you if you use your own CUDA installations. If you have no requirement on MPI working with your own CUDA installation, you should be good to go as soon as you set the following CUDA-specific environment variables: CUDA_HOME, CUDA_ROOT and CUDA_PATH, all pointing to the root of your CUDA installation, e.g. (assuming you have downloaded the CUDA toolkit .run file)

[username@tetralith1 ~] $ cuda_installpath="/proj/<project_name>/some/path/CUDA/10.2.89_440.33.01"
[username@tetralith1 ~] $ sh ./cuda_10.2.89_440.33.01_linux.run --toolkit --silent --installpath=${cuda_installpath}
[username@tetralith1 ~] $ export CUDA_HOME=${cuda_installpath} CUDA_ROOT=${cuda_installpath} CUDA_PATH=${cuda_installpath}

There are also other ways in which you need to modify your environment; consult the CUDA toolkit installation manual. The reason for the many different variables pointing to the CUDA installation root is historical; in practice you may only need to set one of them for your purposes.
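The most common further modifications are adding the toolkit's bin and lib64 directories to PATH and LD_LIBRARY_PATH. The sketch below shows one way of doing this, assuming the installation path from the example above and the standard layout of a CUDA toolkit installation; treat it as a starting point rather than a complete setup.

```shell
# Same installation path as in the example above (placeholder project name).
cuda_installpath="/proj/<project_name>/some/path/CUDA/10.2.89_440.33.01"
export CUDA_HOME="${cuda_installpath}"

# Make nvcc and the other toolkit binaries available on the command line.
export PATH="${CUDA_HOME}/bin:${PATH}"

# Let the dynamic linker find the CUDA runtime libraries. The ${VAR:+...}
# form avoids a trailing colon when LD_LIBRARY_PATH was previously empty.
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

echo "PATH now starts with: ${CUDA_HOME}/bin"
```

With a real installation in place, nvcc --version is a quick way of checking that the intended toolkit is being picked up.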

Building Software

NSC most often recommends building your software on the login nodes as these have the most complete build environment and (for all practical purposes) identical hardware to the compute nodes. This simplifies the hardware and software discovery procedure when building often very complex software. However, the situation is different with respect to GPUs, where the compute nodes and the login nodes have different hardware generations. NSC still recommends building your software on the login nodes, but if you encounter problems where you cannot set the appropriate GPU compute capability (to Turing architecture, CUDA compute capability 7.5) at build time and you depend on hardware discovery, build your code on the GPU compute nodes instead. This may very well produce its own set of problems, and in that case, don’t hesitate to contact support.
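When building directly with nvcc, the compute capability can be set explicitly with the -gencode switch, so that no hardware discovery is needed. The sketch below only assembles and prints such a command line. The source file name my_kernel.cu is a made-up example, and nvcc itself is only available after loading a CUDA build environment.

```shell
# Compute capability of the GPUs in the Tetralith GPU nodes (Turing).
compute_capability="7.5"

# nvcc expects the capability without the dot, e.g. 75.
arch=$(printf '%s' "$compute_capability" | tr -d '.')

# Generate native code for exactly this architecture.
NVCC_ARCH_FLAGS="-gencode arch=compute_${arch},code=sm_${arch}"

# Print the build command rather than running it (hypothetical source file).
echo "nvcc ${NVCC_ARCH_FLAGS} -o my_kernel my_kernel.cu"
```

Build systems usually expose a similar setting of their own, so check the documentation of the software you are building for the corresponding option.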

Singularity

A highly recommended development environment on the GPU nodes is the Singularity container solution. Singularity is a container environment much like Docker and can import Docker images directly, but is adapted to work in an HPC setting. All compute nodes of the Tetralith cluster have Singularity installed; see our Singularity page for more details on how it is set up at NSC, and https://sylabs.io for the canonical information. The NSC documentation has some notes on Singularity security and trusted sources of containers. In addition to the trusted sources mentioned there, you may trust the NVIDIA NGC container registry used in the examples below.

Being a container solution, using a suitable Singularity image as your user environment brings many advantages, for example a familiar operating system and CUDA environment of your choice, portability between systems, and reproducibility of results. Additionally, Singularity can import well-optimized Docker containers directly from the NVIDIA NGC registry or Docker Hub, and also offers the possibility of modifying these to fit your needs. Examples of how to do this are provided in the Development Tools section.

Conda Python

If you are using Python as your development platform, another approach to managing your user environment is the Conda package management system. NSC recommends using Conda over vanilla Python virtual environments, as this seems to be a more favoured solution when it comes to Python codes on the GPU, and it has several functional advantages as well. Python virtualenvs certainly work though, if that’s your preference. When installing Python modules that require compilation into your Conda environment, be sure to have loaded a suitable build environment module, e.g. buildenv-gcccuda/10.2-7.3.0-bare.

This guide will not cover Conda further, instead consult official documentation at https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html for its use. Check which Python Conda modules are available with module avail conda or install some other version to your own directories. Since Conda environments can be rather voluminous, a tip is to make your /home/<username>/.conda directory a symbolic link to some place in your project directory (in principle /proj/some/path/to/your/conda/dir).
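The symbolic link tip can be sketched as follows. To keep the example safe to try, it uses temporary directories as stand-ins for /home/<username> and for the Conda directory in your project storage; substitute the real paths once you are comfortable with the steps.

```shell
# Stand-ins for the real locations (assumptions for this sketch).
home_dir=$(mktemp -d)             # plays the role of /home/<username>
conda_dir="$(mktemp -d)/conda"    # plays the role of your Conda dir under /proj

if [ -L "$home_dir/.conda" ]; then
    echo ".conda is already a symbolic link, nothing to do"
else
    if [ -d "$home_dir/.conda" ]; then
        # Keep existing environments by moving them to project storage.
        mv "$home_dir/.conda" "$conda_dir"
    else
        mkdir -p "$conda_dir"
    fi
    # Make .conda in the home directory point at the project storage copy.
    ln -s "$conda_dir" "$home_dir/.conda"
fi

ls -ld "$home_dir/.conda"
```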

Development Tools

Debuggers and Profilers

Debuggers and profilers specific to CUDA are provided by NVIDIA, and can be accessed when loading a CUDA-enabled build environment like buildenv-gcccuda/10.2-7.3.0-bare. Documentation on these can be found at https://developer.nvidia.com/cuda-gdb, https://developer.nvidia.com/nsight-eclipse-edition and https://docs.nvidia.com/cuda/profiler-users-guide/index.html.

Singularity

At the time of writing (2020-09), the version of Singularity installed on the GPU nodes is from the 3.6 series, but you should use a version compatible with whatever is current on the cluster when making your own images. You can check the current version on a compute node with rpm -qi singularity. It is important to understand that you will need to build your Singularity image on a computer where you have administrator (i.e. super user or root) privileges. Typically you would have this on your own laptop or workstation but not anywhere on Tetralith (sudo does not work here), i.e. any sudo invocation in the following examples assumes you are working on your local computer where you have root privileges. Installing a recent enough version of Singularity on your computer will not be covered here, check out the official documentation at https://sylabs.io.

Here follow a few examples of what you can do with Singularity. The examples barely scratch the surface of what can be done; consult the official documentation for a comprehensive guide.

Convert an NVIDIA NGC Docker Image to Singularity

This pulls a docker image from the NVIDIA NGC registry and makes a Singularity image (.sif) of it

sudo singularity pull tensorflow-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
[sudo] password for user:

WARNING: Authentication token file not found : Only pulls of public images will succeed
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:423ae2b273f4c17ceee9e8482fa8d071d90c7d052ae208e1fe4963fceb3d6954
 25.46 MiB / 25.46 MiB [====================================================] 5s
Copying blob sha256:de83a2304fa1f7c4a13708a0d15b9704f5945c2be5cbb2b3ed9b2ccb718d0b3d
 34.54 KiB / 34.54 KiB [====================================================] 0s

---8<------ SNIP ------->8----

Copying config sha256:fabc6c87fbf06db2bbd63455d5e2e95ce5d7cfcc732dc3af632f89cb25627d7f
 37.30 KiB / 37.30 KiB [====================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
INFO:    Build complete: tensorflow-20.03-tf2-py3.sif

Check the NVIDIA NGC registry at https://ngc.nvidia.com/catalog/all for other images of interest. For example a PyTorch image using Python3 can be downloaded with

sudo singularity pull pytorch_20.03-py3.sif docker://nvcr.io/nvidia/pytorch:20.03-py3

As a side note, pulling from the official PyTorch Docker repository, the same can be accomplished with

sudo singularity pull pytorch_latest.sif docker://pytorch/pytorch:latest

Adapt a Singularity Image to Suit Your Needs

The image pulled from the NGC or other place can potentially need to be adapted to suit your requirements better. Then you will want to create a writeable sandboxed directory from your .sif image, enter and modify, and finally create an updated version of your image. For instance adding a package (here vim-gtk) can be done with

sudo singularity build --sandbox pytorch_latest pytorch_latest.sif
sudo singularity shell --writable pytorch_latest/
Singularity pytorch_latest/:~> apt-get update
Singularity pytorch_latest/:~> apt-get install vim-gtk
Singularity pytorch_latest/:~> exit
sudo singularity build pytorch_latest.v2.sif pytorch_latest

You can make any installations into the image, including with pip or using the container image OS native build tools (gcc etc.) to manually build and install whatever you may need. If you are doing very elaborate installs, you may want to use persistent overlays instead, see https://sylabs.io/guides/3.6/user-guide/persistent_overlays.html.

Build an Image From Scratch Using a Singularity Definition File

If you need more control over what goes into your image, you will need to use your own Singularity definition files (image recipes if you will). Using the following Singularity definition file we can for instance build a Singularity image containing PyTorch, CUDA 10.2 and lots of added extras

Bootstrap: library
From: ubuntu:18.04

%post

# Get the packages from a nearby location (well, if Sweden is close to you at least)
cat << EOF > /etc/apt/sources.list
deb http://se.archive.ubuntu.com/ubuntu/ bionic main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu bionic-security main restricted
deb http://security.ubuntu.com/ubuntu bionic-security universe
deb http://security.ubuntu.com/ubuntu bionic-security multiverse
EOF

# Downloads the latest package lists
apt-get update -y

# Install required and reasonable extra distro packages
DEBIAN_FRONTEND=noninteractive apt-get -y --no-install-recommends install \
    build-essential \
    wget \
    git \
    software-properties-common \
    python3 \
    python3-tk \
    python3-pip \
    gdb \
    freeglut3-dev \
    dirmngr \
    gpg-agent \
    python3-setuptools \
    python-dev \
    python3-dev \
    python3-wheel \
    vim-gtk \
    nano \
    openmpi-bin \
    libopenmpi-dev \
    openssh-client

# Install extras, the atom editor for instance like here
add-apt-repository ppa:webupd8team/atom
apt-get update -y
apt-get -y install atom

# Get the NVIDIA repos
apt-key adv --fetch-keys \
    https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub

wget -O /etc/apt/preferences.d/cuda-repository-pin-600 \
    https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

add-apt-repository \
    "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"

# Install CUDA (everything and the kitchen sink style)
apt-get update -y

# Make sure to match the CUDA version to whatever other packages are being
# installed
DEBIAN_FRONTEND=noninteractive apt-get -y install \
    cuda-10-2 \
    cuda-toolkit-10-2 \
    cuda-samples-10-2 \
    cuda-documentation-10-2

# Reduce the size of the image by deleting the package lists we downloaded,
# which are no longer needed.
rm -rf /var/lib/apt/lists/*

# Install Python modules. Make sure the CUDA-utilising python packages are
# compatible with whatever CUDA version was installed above.

pip3 install \
    numpy \
    matplotlib \
    jupyter \
    torch \
    torchvision \
    tqdm

# Set up the environment to be used within the container.
%environment
    export LC_ALL=C
    export PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin
    export CPATH=/usr/local/cuda/include:$CPATH
    export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Build your image from your definition file (here my_image.def) like

sudo singularity build my_image.sif my_image.def

Using Your Singularity Image on the GPU Nodes

NB: Singularity images can only be launched from under the /proj file system, i.e. the image must be placed there when launched.

Using a toy PyTorch learning example train_xor.py, we can execute it on the GPU by way of the PyTorch containing image (in effect executing it in that operating system environment) for instance like this

[username@n1296 ~] $ singularity shell --nv pytorch_20.03-py3.sif
Singularity> python train_xor.py 
CUDA is available -- using GPU
iteration #1
  loss: 0.24764549732208252
  accuracy: 50.00%
iteration #2
  loss: 0.2470039427280426
  accuracy: 50.00%

---8<---- SNIP ---->8---

iteration #100
  loss: 0.006982346531003714
  accuracy: 100.00%
Singularity> exit
exit
[username@n1296 ~] $ 

The --nv switch above to singularity is crucial for GPU execution as it imports the relevant server host OS NVIDIA runtime libraries and devices (i.e. from CentOS 7 of the GPU node) into the image to permit execution on the GPUs from within the singularity container.
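For non-interactive use, such as in a batch job, singularity exec can run a single command in the image instead of opening a shell. The sketch below is a hypothetical helper (singularity_gpu_cmd is not an NSC or Singularity tool) that checks that the image is given as a path under /proj and then prints the corresponding command line instead of running it. The image location is an example path.

```shell
# Hypothetical helper: print the singularity command for running a program
# in an image, after checking that the image path is under /proj.
singularity_gpu_cmd() {
    image="$1"
    shift
    case "$image" in
        /proj/*) ;;
        *)
            echo "error: image must be given as an absolute path under /proj" >&2
            return 1
            ;;
    esac
    # --nv makes the host NVIDIA libraries and devices available in the container.
    echo "singularity exec --nv $image $*"
}

singularity_gpu_cmd /proj/some/path/to/pytorch_20.03-py3.sif python train_xor.py
```

The printed line is what you would place in a job script after the #SBATCH lines.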

