Sigma - GPU User Guide

This guide is currently very much a work in progress, published during the pilot phase of the Sigma GPU nodes, which runs between July and September of 2020. As the GPU nodes go into operation in September, this guide should be fairly complete and this disclaimer will be removed. In its current state, the guide refers to potentially incomplete or temporary features put in place to facilitate use during the pilot phase, and may not perfectly reflect the resource once in final operation.

The focus of this guide is to cover the particulars of using the GPU nodes of Sigma in a hands-on, practical, example-focused manner. It is not intended to cover GPU compute usage in general. References to more exhaustive background sources are included as a matter of convenience.

The guide assumes familiarity with the "Sigma - Getting Started Guide"; if you have not already done so, please acquaint yourself with it. It is also assumed that you have a valid login account on Sigma and a project with access to the GPU nodes of the cluster; the process is described in "Becoming a member of a project" and "Getting a login account".

The examples in the guide also assume that you are using the ThinLinc VNC solution to access Sigma, as this provides the best user experience when running the examples, or indeed anything else requiring X Window System graphics. Furthermore, ThinLinc provides session management: you can suspend and resume your session on the Sigma login node much as with terminal multiplexers (or their front-ends) such as screen, byobu or tmux over an SSH login. That said, it is perfectly allowed to use SSH logins and work that way if that is your preference; most parts of this user guide work just as well in that context.

Quick Start Guide

From your desktop to an interactive prompt on a Sigma GPU node in three simple steps:

  1. Log in to Sigma via the ThinLinc client (or SSH)
  2. Open a terminal console (via the desktop menu for instance) in the ThinLinc desktop
  3. In the console, allocate a GPU with the command: interactive -n 1 -c 9 --gpus-per-task=v100:1 -t 60 -A <your_account_string> --reservation=gpu

This allocates 1 task comprising 9 CPU cores and 1 V100 GPU for 60 minutes, i.e. a quarter of a node, charged to your project account (for <your_account_string>, fill in something like LiU-gpu-XXXX-YYYYY). It also specifies the reservation gpu, which is the name of the reservation containing the GPU nodes. Only specific users or project accounts are allowed to submit jobs (like this one) to a reservation. You should now be presented with an interactive prompt on a GPU node like

[username@sigma ~]$ interactive -n 1 -c 9 --gpus-per-task=v100:1 -t 60 -A <your_account_string> --reservation=gpu
salloc: Granted job allocation 958371
srun: Step created for job 958371
[username@n2017 ~]$ 

It is sometimes prudent to check that you have access to the GPU devices you have allocated, for instance

[username@n2017 ~]$ echo $CUDA_VISIBLE_DEVICES
0

The above means you have access to GPU device number 0 (out of devices 0–3), not that you have 0 devices. You can also check using the nvidia-smi tool; in the following example, 2 GPUs are allocated.

[username@sigma ~]$ interactive -n 1 -c 9 -A <your_account_string> -t 60 --reservation=gpu --gpus-per-task=v100:2
salloc: Granted job allocation 958413
srun: Step created for job 958413
[username@n2017 ~]$ nvidia-smi 
Thu Jun 25 17:36:34 2020       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 440.64.00    Driver Version: 440.64.00    CUDA Version: 10.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:61:00.0 Off |                    0 |
| N/A   39C    P0    41W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  Off  | 00000000:62:00.0 Off |                    0 |
| N/A   43C    P0    55W / 300W |      0MiB / 32510MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
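If you need the number of allocated GPUs in a script, for instance to pass it on to a training command, you can count the entries in CUDA_VISIBLE_DEVICES. A minimal sketch; the variable is exported by hand here only to simulate an allocation of two GPUs, inside a real job SLURM sets it for you:

```shell
# Simulated allocation; inside a real SLURM job this variable is already set
export CUDA_VISIBLE_DEVICES=0,1

# Count the comma-separated device entries
num_gpus=$(echo "$CUDA_VISIBLE_DEVICES" | tr ',' '\n' | wc -l)
echo "Allocated GPUs: $num_gpus"
```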

An important point to be aware of, especially on Sigma, is that data access between centre storage (under /home and /proj) and the Sigma GPU (and CPU) compute nodes is not suitable for I/O-intensive loads dominated by small random reads and writes, common in, for instance, machine learning. Fortunately, the GPU nodes are equipped with large and fast NVMe SSD disks (14 TB in total) suitable for this type of I/O load, available to each job under /scratch/local. Note, though, that this scratch space is volatile, accessible on a per-job basis, and cleared after each SLURM job.

Transferring data sets for training to the node-local scratch space should preferably be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) as an uncompressed tar archive and, in your SLURM job, unpack it directly to scratch on the allocated GPU node like

[username@n2017 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local

Avoid compressing the tar archive; it adds an unpacking overhead on the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives).

Getting Access

Access to the GPU nodes is provided to members of projects with a time allocation on the GPU nodes. Applying for such a project is described at "Applying for a new project" under the "LiU Local projects" heading, and becoming a member of a project with a time allocation on the GPU nodes is described at "Becoming a member of a project". Members also need a login account on Sigma, a process described at "Getting a login account".

Since the GPUs of Sigma are currently a scarce resource, project proposals will need to justify the use of the GPUs from a technical perspective, i.e. describe how the project will use the GPUs and why GPUs are required as opposed to other compute resources. New projects will also need to renew the project application (a simple procedure in SUPR) after a few months, after which project renewal follows the regular Sigma schedule of once per year. Reasonable motivations for a project's use of the GPU nodes range from (in principle) the lighter-weight "We want to test software X's GPU capabilities" to the more compelling "Software Y and Z perform 10x as well on a GPU as on the CPUs, and cannot competitively be run in other ways".

For login access to the Sigma GPU nodes, NSC recommends the "ThinLinc" VNC solution, as this provides the most convenient and reliable access to graphical applications on Sigma, but SSH can be used as well. Data transfers to Sigma are described at "Getting data to and from the cluster".

Storage Available

Project storage is hosted on the NSC Centre Storage under the /proj and /home directories, where /home is a smaller, backed-up area intended for precious data, and /proj is non-backed-up volume storage, which is typically where projects should store volume data not requiring backup to tape. See NSC's Centre Storage for more detailed information.

While NSC Centre Storage is generally high-performance, some data access patterns are much less well suited for it than others. In particular, input/output (I/O) loads characterized by predominantly small random read-write operations will not perform well on Centre Storage. For these loads, node-local scratch disks are available. General information about scratch space is available at Node local scratch storage. With respect to scratch space, the GPU nodes differ from the CPU-only nodes only in size and performance: the GPU nodes have 14 TB of scratch storage space and deliver around 6.3 GB/s and 500k IOPS random-read.

Allocating GPU Resources

The GPU nodes are intended for jobs that use the GPUs they are equipped with. That is, jobs not requiring a GPU should not be run here, but on the regular CPU nodes of Sigma. This is quite natural, but is stated to make clear that any systematic CPU-only use of the nodes constitutes a misuse of the resource.

Allocating GPU resources is done via the SLURM resource manager, as on the rest of Sigma; see "Sigma Batch Jobs" and "Tetralith Batch Jobs". There are, however, a few extras to bring up here. In addition to other SLURM allocation switches, three more pieces of information are required to allocate a GPU resource:

  1. The --gpus-per-task=<num_gpus> switch, e.g. --gpus-per-task=2
  2. The -A <project account> switch, where the project account must have an allocation on the GPUs.
  3. The --reservation=gpu switch.

The --gpus-per-task switch can be further qualified with a GPU type and a count separated by a ":". At present, the only GPUs available in Sigma are Tesla V100, but if other GPU types are added in the future, say a Tesla A100, specifying which one to allocate becomes important. The V100 GPUs in Sigma are specified with the v100 label to --gpus-per-task, and allocating, for instance, three V100s per task can be done with --gpus-per-task=v100:3.

For example, allocating a single V100 GPU for interactive work via a terminal on Sigma can be done as follows

[username@sigma ~] $ interactive -n 1 -c 9 -t 1:00:00 --gpus-per-task=v100:1 -A <account_string> --reservation=gpu

A batch job script analogous to the interactive command above would look like

#!/bin/bash
#SBATCH -n 1
#SBATCH -c 9
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=v100:1
#SBATCH -A <account_string>
#SBATCH --reservation=gpu

# Job script commands follow

Recent SLURM releases have added many more features for controlling allocations containing GPUs. This has brought added complexity to the allocation of these resources, and there are now very many ways to give conflicting allocation directives. Often, SLURM gives no clear error message in such cases, or indeed any at all, and after an unfortunate allocation the job may wait indefinitely for a non-existent combination of resources to become available.

The NSC-recommended way to allocate resources is to specify how many tasks to allocate and what resources each task should have, as shown in the examples. This also extends to the switches --mem, --mem-per-cpu and --mem-per-gpu, which may be needed, but can cause allocation conflicts and are therefore mutually exclusive. NSC recommends avoiding the --mem switch, since it acts on a per-node basis, and using the --mem-per-gpu or --mem-per-cpu options instead. Following the example allocation with -n 1 -c 9 --gpus-per-task=v100:1, to allocate one quarter of the available memory per node (~90 GB), you would add the option --mem-per-gpu=90G or --mem-per-cpu=10G. The default is to allocate tot_mem_avail / num_cpus (i.e. 360 GB / 36) to each allocated CPU core, so there should most often be no need to specify these switches.
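Concretely, a batch script allocating a quarter node with the per-GPU memory option could look like the following sketch; the account string is a placeholder, and the trailing echo merely stands in for real job commands:

```shell
#!/bin/bash
#SBATCH -n 1
#SBATCH -c 9
#SBATCH -t 1:00:00
#SBATCH --gpus-per-task=v100:1
#SBATCH --mem-per-gpu=90G
#SBATCH -A <account_string>
#SBATCH --reservation=gpu

# Job script commands follow (placeholder)
msg="Job running on $(hostname)"
echo "$msg"
```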

Running Jobs

Working interactively at the CLI prompt with SLURM-allocated resources is most conveniently done using the NSC command interactive, which is a wrapper around the SLURM salloc and srun commands and, as such, accepts the switches these support. In short, the interactive command launches a shell process on the first allocated node of a job, drops you into the shell, and lets you interact with the allocated resources from there, much like any other CLI prompt. More information about interactive can be found at Running applications under the "Interactive jobs" heading, and more information about batch jobs at "Batch jobs and scheduling".

Running GPU jobs that involve no MPI communication, be they single or multi GPU, is typically done exactly as you would run them on your laptop or workstation, after having allocated resources via SLURM. It becomes a bit more involved when using MPI. In principle, when running applications built at NSC with the NSC-provided build environments, the general process is described at "Running parallel jobs with mpprun", but there are many application-specific caveats to take into account, and the application documentation should be consulted when running GPU+MPI applications. Performance is currently not expected to be very high when running GPU+MPI jobs on more than one node, due to limitations in the interconnect driver and the data transport (PSM2) libraries in use. It should be functional, though; only the GPUDirect feature over the Omni-Path interconnect is unsupported. This is an area NSC is actively working on.

Checking Your Job

The SLURM job queue for the GPU nodes can be checked with squeue -R gpu. Checking various aspects of your job's status at NSC as it runs is normally done using the jobsh script to access the nodes of the job. However, this is not possible at present for GPU jobs, due to limitations in the resource manager. Instead, it is suggested that, at least for interactive jobs, you start a terminal multiplexer such as tmux or screen on the GPU node before starting your job. These multiplexers allow you to open a second shell prompt within the job (check the tmux or screen man pages for how) to check GPU usage with nvidia-smi or other tools as you go.

Data Transfer To and From Centre Storage to Local Disk

Transferring data sets to the node-local scratch space should preferably be done in large contiguous chunks to maximize transfer speed. A good way to achieve this is to store your data set on centre storage (/proj is suggested) as an uncompressed tar archive and, in your SLURM job, unpack it directly to scratch on the allocated GPU node like

[username@n2017 ~]$ tar xf /proj/some/path/to/DATA_SET.tar -C /scratch/local

Carrying out your data transfers this way, you can expect a transfer speed of about 1 GB/s, e.g. a 60 GB tar archive should unpack to local disk in around a minute on the GPU nodes. On regular Sigma nodes, performance is bottlenecked by the slower local disk of those nodes to something like 200 MB/s.
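The corresponding packing step on centre storage can be sketched as follows. The paths are placeholders; temporary directories stand in for /proj and /scratch/local here only to make the example self-contained:

```shell
# Placeholder directories standing in for /proj storage and the data set source
proj_dir=$(mktemp -d)
data_dir=$(mktemp -d)

# Create a small mock data set
mkdir -p "$data_dir/DATA_SET"
echo "sample" > "$data_dir/DATA_SET/file0001.dat"

# Pack it WITHOUT compression; -C makes the paths in the archive relative
tar cf "$proj_dir/DATA_SET.tar" -C "$data_dir" DATA_SET

# Later, in the job: unpack straight to node-local scratch
# (a temp dir stands in for /scratch/local here)
scratch=$(mktemp -d)
tar xf "$proj_dir/DATA_SET.tar" -C "$scratch"
ls "$scratch/DATA_SET"
```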

Avoid compressing the tar archive; it adds an unpacking overhead on the order of 4x for gunzip (.tar.gz archives) and 20x for bunzip2 (.tar.bz2 archives). If you absolutely must use a compressed tar archive, you can perform the decompression with a parallelized implementation of your compression program, for instance pigz for gzip-compressed archives:

tar --use-compress-program=pigz -xf /proj/some/path/to/DATA_SET.tar.gz -C /scratch/local

The various parallel compression programs implement parallelization with varying degrees of success, so your mileage may vary with respect to unpacking times.

Setting Up Your User Environment

CUDA Development Environment

The user environment at NSC is handled with the module system, as described on our Module system page. All CUDA-enabled modules suitable for use on the GPU nodes can be listed with module avail cuda, where cuda is a case-insensitive search string. Alternatively, search for your software title using module avail <your_sw_title> and see if it has a string indicating GPU capability, like "cuda", "opencl" or "openacc" for instance.

An important module is buildenv-gcccuda/<CUDA_VERSION>-<BASE_BUILDENV_VERSION>, which sets up a CUDA toolkit build environment along with a GCC compiler, MPI and math libraries custom-built against that CUDA version. There is a hidden module called buildenv-gcccuda/.10.2-7.3.0-bare, which due to its preliminary (hidden) status is not listed by module avail cuda. To use it, issue module load buildenv-gcccuda/.10.2-7.3.0-bare at the CLI prompt. To list all available gcccuda modules, hidden ones included, issue module --show-hidden avail gcccuda. Be aware, though, that hidden modules may be removed or changed without notice (they are hidden for a reason).

You can of course also install whatever CUDA version you need to your own directories and use that (with a suitable compiler). Note, though, that NSC will be less able to support you if you use your own CUDA installations. If you have no requirement on MPI working with your own CUDA installation, you should be good to go as soon as you set the following CUDA-specific environment variables: CUDA_HOME, CUDA_ROOT and CUDA_PATH, all pointing to the root of your CUDA installation, e.g. (assuming you have downloaded the CUDA toolkit .run file)

[username@sigma ~] $ cuda_installpath="/proj/<project_name>/some/path/CUDA/10.2.89_440.33.01"
[username@sigma ~] $ sh ./cuda_10.2.89_440.33.01_linux.run --toolkit --silent --installpath=${cuda_installpath}
[username@sigma ~] $ export CUDA_HOME=${cuda_installpath} CUDA_ROOT=${cuda_installpath} CUDA_PATH=${cuda_installpath}

There are also other ways you may need to modify your environment; consult the CUDA toolkit installation manual. The reason for the many different variables pointing to the CUDA installation root is historical; in practice you may only need to set one of them for your purposes.

Singularity

A highly recommended development environment on the GPU nodes is the Singularity container solution. All compute nodes of the Sigma cluster have the container solution Singularity installed, see our Singularity page for more details on how it is set up at NSC and https://sylabs.io for the canonical information. The NSC documentation has some notes on Singularity security and trusted sources of containers. In addition to the trusted sources mentioned there, you may trust the NVIDIA NGC container registry used in the examples below.

Being a container solution, using a suitable Singularity image as your user environment brings many advantages, for example: a familiar operating system and CUDA environment of your choice, the convenience of portability between systems, and reproducibility of results. Additionally, Singularity can import well-optimized Docker containers directly from the NVIDIA NGC registry, and also offers the possibility of modifying these to fit your needs. Examples of how to do this are provided in the Development Tools section.

Conda Python

If you are using Python as your development platform, another approach to managing your user environment is the Conda package management system. NSC recommends Conda over vanilla Python virtual environments, as it seems to be the more favoured solution for Python codes on the GPU, and it has several functional advantages as well. When installing Python modules requiring compilation into your Conda environment, be sure to have loaded a suitable build environment module, e.g. buildenv-gcccuda/.10.2-7.3.0-bare.

This guide will not cover Conda further; instead, consult the official documentation at https://docs.conda.io/projects/conda/en/latest/user-guide/getting-started.html. Check which Conda Python modules are available with module avail conda, or install some other version to your own directories. Since Conda environments can be rather voluminous, a tip is to make your /home/<username>/.conda directory a symbolic link to some place in your project directory (in principle /proj/some/path/to/your/conda/dir).
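The relocation tip above can be sketched as follows. Temporary directories stand in for your home and project directories here, only to make the example self-contained; in real use you would operate on /home/<username>/.conda and a directory under /proj:

```shell
# Placeholders standing in for project storage and the home directory
conda_target=$(mktemp -d)   # stands in for /proj/some/path/to/your/conda/dir
home_dir=$(mktemp -d)       # stands in for /home/<username>

# Relocate the contents of an existing .conda directory (if any) and
# replace the directory with a symbolic link into project storage
mkdir -p "$home_dir/.conda"
mv "$home_dir/.conda"/* "$conda_target"/ 2>/dev/null || true
rmdir "$home_dir/.conda"
ln -s "$conda_target" "$home_dir/.conda"

readlink "$home_dir/.conda"
```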

Development tools

Singularity

The version of Singularity currently installed on the GPU nodes is from the 3.5 series. It is important to understand that you need to build your Singularity images on a computer where you have administrator (i.e. super user or root) privileges. Typically you have this on your own laptop or workstation, but not anywhere on Sigma (sudo does not work here), i.e. any sudo invocation in the following examples assumes you are running it on your local computer. Installing a recent enough version of Singularity on your computer is not covered here; check the official documentation at https://sylabs.io.

Here follow a few examples of what you can do with Singularity. They barely scratch the surface of what can be done; consult the official documentation for a comprehensive guide.

Convert an NVIDIA NGC Docker Image to Singularity

This pulls a docker image from the NVIDIA NGC registry and makes a Singularity image (.sif) of it

sudo singularity pull tensorflow-20.03-tf2-py3.sif docker://nvcr.io/nvidia/tensorflow:20.03-tf2-py3
[sudo] password for user:

WARNING: Authentication token file not found : Only pulls of public images will succeed
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:423ae2b273f4c17ceee9e8482fa8d071d90c7d052ae208e1fe4963fceb3d6954
 25.46 MiB / 25.46 MiB [====================================================] 5s
Copying blob sha256:de83a2304fa1f7c4a13708a0d15b9704f5945c2be5cbb2b3ed9b2ccb718d0b3d
 34.54 KiB / 34.54 KiB [====================================================] 0s

---8<------ SNIP ------->8----

Copying config sha256:fabc6c87fbf06db2bbd63455d5e2e95ce5d7cfcc732dc3af632f89cb25627d7f
 37.30 KiB / 37.30 KiB [====================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
INFO:    Build complete: tensorflow-20.03-tf2-py3.sif

Check the NVIDIA NGC registry at https://ngc.nvidia.com/catalog/all for other images of interest. For example a PyTorch image using Python3 can be downloaded with

sudo singularity pull pytorch_20.03-py3.sif docker://nvcr.io/nvidia/pytorch:20.03-py3

As a side note, pulling from the official PyTorch Docker repository, the same can be accomplished with

sudo singularity pull pytorch_latest.sif docker://pytorch/pytorch:latest

Adapt a Singularity Image to Suit Your Needs

An image pulled from NGC or elsewhere may need to be adapted to better suit your requirements. In that case, you will want to create a writable sandbox directory from your .sif image, enter and modify it, and finally build an updated version of your image. For instance, adding a package (here vim-gtk) can be done with

sudo singularity build --sandbox pytorch_latest pytorch_latest.sif
sudo singularity shell --writable pytorch_latest/
Singularity pytorch_latest/:~> apt-get update
Singularity pytorch_latest/:~> apt-get install vim-gtk
Singularity pytorch_latest/:~> exit
sudo singularity build pytorch_latest.v2.sif pytorch_latest

You can make any installations into the image, including with pip, or use the container OS's native build tools (gcc etc.) to manually build and install whatever you may need. If you are doing very elaborate installs, you may want to use persistent overlays instead; see https://sylabs.io/guides/3.5/user-guide/persistent_overlays.html.

Build an Image From Scratch Using a Singularity Definition File

If you need more control over what goes into your image, you will need to use your own Singularity definition files (image recipes if you will). Using the following Singularity definition file we can for instance build a Singularity image containing PyTorch, CUDA 10.2 and lots of added extras

Bootstrap: library
From: ubuntu:18.04

%post

# Get the packages from a nearby location (well, if Sweden is close to you at least)
cat << EOF > /etc/apt/sources.list
deb http://se.archive.ubuntu.com/ubuntu/ bionic main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates main restricted
deb http://se.archive.ubuntu.com/ubuntu/ bionic universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates universe
deb http://se.archive.ubuntu.com/ubuntu/ bionic multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-updates multiverse
deb http://se.archive.ubuntu.com/ubuntu/ bionic-backports main restricted universe multiverse
deb http://security.ubuntu.com/ubuntu bionic-security main restricted
deb http://security.ubuntu.com/ubuntu bionic-security universe
deb http://security.ubuntu.com/ubuntu bionic-security multiverse
EOF

# Downloads the latest package lists
apt-get update -y

# Install required and reasonable extra distro packages
DEBIAN_FRONTEND=noninteractive apt-get -y --no-install-recommends install \
    build-essential \
    wget \
    git \
    software-properties-common \
    python3 \
    python3-tk \
    python3-pip \
    gdb \
    freeglut3-dev \
    dirmngr \
    gpg-agent \
    python3-setuptools \
    python-dev \
    python3-dev \
    python3-wheel \
    python3-pip \
    vim-gtk \
    nano \
    openmpi-bin \
    libopenmpi-dev \
    openssh-client

# Install extras, the atom editor for instance like here
add-apt-repository ppa:webupd8team/atom
apt-get update -y
apt-get -y install atom

# Get the NVIDIA repos
apt-key adv --fetch-keys \
    https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/7fa2af80.pub

wget -O /etc/apt/preferences.d/cuda-repository-pin-600 \
    https://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/cuda-ubuntu1804.pin

add-apt-repository \
    "deb http://developer.download.nvidia.com/compute/cuda/repos/ubuntu1804/x86_64/ /"

# Install CUDA (everything and the kitchen sink style)
apt-get update -y

# Make sure to match the CUDA version to whatever other packages are being
# installed
DEBIAN_FRONTEND=noninteractive apt-get -y install \
    cuda-10-2 \
    cuda-toolkit-10-2 \
    cuda-samples-10-2 \
    cuda-documentation-10-2

# Reduce the size of the image by deleting the package lists we downloaded,
# which are no longer needed.
rm -rf /var/lib/apt/lists/*

# Install Python modules. Make sure the CUDA-utilising python packages are
# compatible with whatever CUDA version was installed above.

pip3 install \
    numpy \
    matplotlib \
    jupyter \
    torch \
    torchvision \
    tqdm

%environment
    export LC_ALL=C
    export PATH=/usr/local/cuda/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/sbin:/usr/local/bin
    export CPATH=/usr/local/cuda/include:$CPATH
    export LIBRARY_PATH=/usr/local/cuda/lib64:$LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH

Build your image from your definition file (here my_image.def) like

sudo singularity build my_image.sif my_image.def

Using Your Singularity Image on the GPU Nodes

Using a toy PyTorch learning example, train_xor.py, we can execute it by way of the PyTorch-containing image (in effect executing it in that operating system environment) like this

[username@n2017 ~] $ singularity shell --nv pytorch_20.03-py3.sif
Singularity> python train_xor.py 
CUDA is available -- using GPU
iteration #1
  loss: 0.24764549732208252
  accuracy: 50.00%
iteration #2
  loss: 0.2470039427280426
  accuracy: 50.00%

---8<---- SNIP ---->8---

iteration #100
  loss: 0.006982346531003714
  accuracy: 100.00%
Singularity> exit
exit
[username@n2017 ~] $ 

The --nv switch to singularity above is very important, as it imports the relevant NVIDIA runtime libraries and device files from the host OS (i.e. CentOS 7 on the GPU node) into the container to permit execution on the GPUs from within it.

