Berzelius GPU User Guide

1. CUDA
2. SLURM
3. Interactive Sessions
- 3.1 Default Behavior
- 3.2 Common Usage Examples
4. Submitting Batch Jobs
5. NVIDIA Multi-Instance GPU (MIG)
6. Multi-node Jobs
- 6.1 MPI-based Applications
- 6.2 Multi-node GPU Jobs with Apptainer
7. GPU Reservations
8. Resource Allocation Costs
9. GPU Usage Efficiency Policy
10. Running Multiple Tasks Concurrently within a Single Job

1. CUDA

CUDA (Compute Unified Device Architecture) is NVIDIA’s parallel computing platform and programming model. On Berzelius, both the CUDA driver and the CUDA Toolkit are installed. Their versions may differ, which is normal.

1.1 CUDA Driver Version

The CUDA driver version is reported by the nvidia-smi tool. This version corresponds to the GPU driver installed on the system, which manages the GPU hardware and enables GPU acceleration for compatible software.

Note: The driver version does not have to match the CUDA Toolkit version, but it must be new enough to support the toolkit version you plan to use.

1.2 CUDA Toolkit Version

The CUDA Toolkit includes compilers (nvcc), libraries, and development tools for building and running GPU-accelerated applications. You can check the version of the currently loaded toolkit with nvcc -V.

Many Python frameworks (such as PyTorch and TensorFlow) bundle the necessary CUDA runtime libraries with their installation packages. This means you do not need to install or load the full CUDA Toolkit module just to use these frameworks.

1.3 Current Driver on Berzelius

At the time of writing, the CUDA driver on the compute nodes corresponds to the CUDA 12.0 release, but it includes compatibility support for CUDA 12.2. Here’s a sample nvidia-smi output:

Fri Apr 18 21:15:34 2025       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.230.02             Driver Version: 535.230.02   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA A100-SXM4-80GB          On  | 00000000:07:00.0 Off |                    0 |
| N/A   25C    P0              59W / 400W |      0MiB / 81920MiB |      0%      Default |
|                                         |                      |             Disabled |
+-----------------------------------------+----------------------+----------------------+
                                                                                         
+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|  No running processes found                                                           |
+---------------------------------------------------------------------------------------+
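
To compare the driver and toolkit versions on a node, the relevant fields can be pulled out of the tool output. The following is a minimal sketch, not an NSC-provided tool: the helper names driver_cuda_version and toolkit_cuda_version are made up for illustration, and the parsing assumes the nvidia-smi header layout shown above and the usual "release X.Y," line printed by nvcc -V.

```shell
# Extract the "CUDA Version" field from the nvidia-smi header (read from stdin).
driver_cuda_version() {
    sed -n 's/.*CUDA Version: *\([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Extract the release number from `nvcc -V` output (read from stdin).
toolkit_cuda_version() {
    sed -n 's/.*release \([0-9][0-9.]*\),.*/\1/p' | head -n 1
}

# Only query the tools that are actually available in this environment.
if command -v nvidia-smi >/dev/null 2>&1; then
    echo "Driver supports up to CUDA $(nvidia-smi | driver_cuda_version)"
fi
if command -v nvcc >/dev/null 2>&1; then
    echo "Loaded toolkit is CUDA $(nvcc -V | toolkit_cuda_version)"
fi
```

If no toolkit module is loaded, the second check prints nothing, which is expected when you rely on the CUDA runtime bundled with a framework.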

2. SLURM

SLURM (Simple Linux Utility for Resource Management) is an open-source, highly configurable workload manager and job scheduler widely used on high-performance computing (HPC) systems. On Berzelius, SLURM is used to allocate compute resources—such as GPUs, CPU cores, and memory—and to manage both batch and interactive jobs.

2.1 Resource Allocation Guidelines

When allocating a partial node, NSC recommends keeping the number of CPU cores proportional to the number of GPUs you request. Each compute node has 8 GPUs and 128 CPU cores, so each GPU corresponds to 16 CPU cores (1/8 of a node).

For example:

Requesting 1 GPU -> allocate 16 CPU cores (128 / 8)
Requesting 2 GPUs -> allocate 32 CPU cores
Requesting 8 GPUs -> allocate 128 CPU cores (full node)

The default memory allocation scales with the CPU core count:

On thin nodes, each CPU core corresponds to approximately 7995 MB of RAM
On fat nodes (with 2 TB RAM), each core gets approximately 16 GB
Use the flag -C fat to specifically request a fat node

If you simply use the --gpus=X flag, SLURM will automatically allocate a proportional number of CPU cores and RAM for you, according to the defaults above.
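
The proportional defaults are simple arithmetic and can be reproduced in a few lines of shell. This is only a sketch of the calculation described above; the function names are hypothetical, and the memory figure uses the approximate thin-node value of 7995 MB per core.

```shell
CORES_PER_GPU=16        # 128 cores / 8 GPUs per node
THIN_MB_PER_CORE=7995   # approximate default on thin nodes

# Print the proportional number of CPU cores for a given GPU count.
cores_for_gpus() {
    echo $(( $1 * CORES_PER_GPU ))
}

# Print the approximate default memory (in MB) on a thin node.
thin_mem_mb_for_gpus() {
    echo $(( $1 * CORES_PER_GPU * THIN_MB_PER_CORE ))
}

cores_for_gpus 2          # prints 32
thin_mem_mb_for_gpus 1    # prints 127920
```

The 127920 MB result for one GPU is where the figure of roughly 128 GB of RAM per GPU on thin nodes comes from.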

CPU-only nodes are available in the partition berzelius-cpu. To use a CPU node, set the flag --partition=berzelius-cpu in your job script or on the command line.

2.2 Multi-Node Jobs

If your job requests more than 8 GPUs, it will span multiple nodes. In such cases, your software must support multi-node execution (e.g., through MPI or distributed frameworks like PyTorch DDP or Horovod). Be sure to verify this capability in your environment before submitting large-scale jobs.

2.3 Conflicting Resource Directives

There are many ways to unintentionally provide conflicting SLURM directives (e.g., requesting 4 GPUs but only 8 cores), which can result in job failures or incorrect resource allocation.

Recommendation: Before submitting a large or long-running batch job, start with an interactive session to verify that your SLURM settings behave as expected. Use the interactive tool to launch a temporary job and inspect the allocated resources.
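
Once inside the interactive session, a quick way to inspect what was actually allocated is to print the environment variables SLURM sets for the job. The sketch below assumes the standard variables SLURM_JOB_ID, SLURM_JOB_NODELIST and SLURM_CPUS_ON_NODE, and that CUDA_VISIBLE_DEVICES is set for GPU jobs, which depends on the site configuration. The function name show_allocation is hypothetical.

```shell
# Print a short summary of what SLURM allocated to the current session.
show_allocation() {
    echo "Job ID:        ${SLURM_JOB_ID:-not inside a SLURM job}"
    echo "Node list:     ${SLURM_JOB_NODELIST:-n/a}"
    echo "CPUs on node:  ${SLURM_CPUS_ON_NODE:-n/a}"
    echo "Visible GPUs:  ${CUDA_VISIBLE_DEVICES:-none}"
    # nproc reports the cores available to this process, not the whole node.
    if command -v nproc >/dev/null 2>&1; then
        echo "Usable cores:  $(nproc)"
    fi
    # List the GPUs the job can see, if the tool is available.
    if command -v nvidia-smi >/dev/null 2>&1; then
        nvidia-smi -L || echo "nvidia-smi could not query any GPU"
    fi
}

show_allocation
```

If the numbers differ from what you intended (for example 8 cores for 4 GPUs), adjust the flags before submitting the real batch job.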

3. Interactive Sessions

An interactive session allows you to run commands on the cluster in real time, making it ideal for tasks like development, testing, debugging, and exploring data.

3.1 Default Behavior

Running a basic interactive command like:

interactive --gpus=1

will by default allocate:

1 GPU
16 CPU cores
128 GB RAM
2 hours wall time

These defaults follow the proportional resource allocation policy (1/8 of a node per GPU).

3.2 Common Usage Examples

Request 1 GPU with default settings (16 CPU cores, 128 GB RAM, 2 hours):
```
interactive --gpus=1
```

Request 1 GPU on a specific node:

interactive --gpus=1 --nodelist=node045

Request 1 GPU using a specific project account (use projinfo to list available accounts):
```
interactive --gpus=1 -A <your-project-account>
```
Request 2 GPUs for 30 minutes (the time limit format is days-hours:minutes:seconds):
```
interactive --gpus=2 -t 00-00:30:00
```

Request a full node (8 GPUs) for 6 hours:

interactive -N 1 --exclusive -t 6:00:00

Request 1 GPU from a fat node for 30 minutes:
```
interactive --gpus=1 -C fat -t 30
```

The feature flag -C fat restricts job placement to fat nodes (2 TB RAM). The equivalent for thin nodes is -C thin. If no flag is specified, SLURM may place your job on either type.

Note on GPUh Cost: When using -C fat, your job is assigned 254 GB of system memory per GPU. However, fat node GPUs are charged at double the GPU-hour rate. This also applies if your job landed on a fat node without explicitly requesting -C fat.

Request 12 CPU cores from a CPU node for 180 minutes:
```
interactive -p berzelius-cpu -n1 -c12 -t 180
```

Note on GPUh Cost: When using -p berzelius-cpu, your job is assigned to a node without GPUs but with substantially faster CPUs than those in the GPU nodes. The cost for 16 CPU cores is the same as for 1 thin node GPU. Your jobs cannot land on a CPU node by chance; you must specify the partition with the flag -p berzelius-cpu or --partition=berzelius-cpu.

4. Submitting Batch Jobs

In the context of HPC clusters, batch jobs are non-interactive computational tasks submitted to a job scheduler for deferred execution. Batch submission is the standard way to efficiently manage long-running or resource-intensive workloads on Berzelius.

4.1 Writing a Batch Job Script

To submit a batch job, you first need to create a job script—typically named something like batch_script.sh. Here’s a basic example:

#!/bin/bash
# SLURM batch job script for Berzelius

#SBATCH -A <your-project-account>     # Replace with your project account name
#SBATCH --gpus=4                      # Request 4 GPUs
#SBATCH -t 3-00:00:00                 # Wall time: 3 days (72h)

# Load your environment
module load Miniforge3/24.7.1-2-hpc1-bdist
mamba activate pytorch-2.6.0

# Execute your code
python train_model.py

A more detailed introduction to batch jobs can be found in the NSC documentation.

4.2 Submitting the Job

Once your script is ready, submit it using:

sbatch batch_script.sh

You can monitor the job with:

squeue -u <your-username>

4.3 Wall Time Policy

The maximum wall time for jobs on Berzelius is 3 days (72 hours). This limit ensures fair scheduling and reasonable job turnover for all users. If your work requires more time, consider splitting it into multiple shorter runs or using checkpointing.
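
Splitting a long run into shorter segments can be automated with job dependencies. The following is a minimal sketch, assuming your program saves checkpoints and resumes from the latest one when restarted. The helper submit_chain is hypothetical, while --parsable and --dependency=afterok are standard sbatch options.

```shell
# Submit the same job script several times so that each segment starts
# only after the previous one has finished successfully.
# Usage: submit_chain <script> <number-of-segments>
submit_chain() {
    script=$1
    segments=$2
    prev=""
    i=1
    while [ "$i" -le "$segments" ]; do
        if [ -z "$prev" ]; then
            jobid=$(sbatch --parsable "$script")
        else
            jobid=$(sbatch --parsable --dependency=afterok:"$prev" "$script")
        fi
        # Stop the chain if a submission failed and returned no job ID.
        [ -n "$jobid" ] || { echo "sbatch failed for segment $i" >&2; return 1; }
        # --parsable prints "jobid" or "jobid;cluster"; keep only the ID.
        prev=${jobid%%;*}
        echo "Submitted segment $i as job $prev"
        i=$((i + 1))
    done
}

# Example (commented out so that sourcing this file submits nothing):
# submit_chain batch_script.sh 3
```

With afterok, a segment that fails leaves the remaining segments pending with an unsatisfied dependency, so check the queue and cancel them if a run crashes.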

4.4 NSC boost-tools

To provide greater flexibility when working with SLURM on Berzelius, NSC offers a set of utilities collectively known as NSC boost-tools. We currently provide three tools:

nsc-boost-priority: Increase the priority of a job.
nsc-boost-timelimit: Extend the time limit of a job.
nsc-boost-reservation: Reserve nodes for a specific time period.

5. NVIDIA Multi-Instance GPU (MIG)

NVIDIA Multi-Instance GPU (MIG) is a feature which allows a single GPU to be partitioned into multiple smaller GPU instances, each of which can be allocated to different tasks or users. This technology helps improve GPU utilization and resource allocation in multi-user and multi-workload environments.

Nodes in the reservation 1g.10gb have the MIG feature enabled. Each 1g.10gb instance is equipped with:

1/7 of an A100’s compute capacity,
10 GB VRAM,
2 cores / 4 threads,
32 GB RAM.

If your job requires more resources than this fixed amount, do not use this reservation. To start an interactive session on a MIG instance:

interactive --reservation=1g.10gb

6. Multi-node Jobs

Running multi-node jobs on Berzelius is fully supported and follows standard practices for MPI-parallel applications and distributed GPU workloads.

6.1 MPI-based Applications

For traditional MPI applications, you can use:

mpirun or mpiexec, available via the module:

module load buildenv-gcccuda/11.8.0-gcc11.3.0

srun --mpi=<type>, where <type> is one of pmi2, pmix, or pmix_v3 (all supported by SLURM).
mpprun, the standard NSC wrapper for MPI job launching, compatible with applications built using NSC’s toolchains.

6.2 Multi-node GPU Jobs with Apptainer

Running multi-node jobs with GPUs inside Apptainer containers (formerly Singularity) can be more complex due to MPI and GPU passthrough requirements. If you’re using NVIDIA NGC containers, you may still be able to use mpirun with proper environment setup.

Steps:

Load the recommended module:

module load buildenv-gcccuda/11.8.0-gcc11.3.0

Follow Apptainer’s MPI guidelines, especially for binding host MPI to the container: Apptainer MPI Documentation.

We have a few examples of multi-node jobs available for your reference.

7. GPU Reservations

Please refer to the NSC boost-tools for how to reserve GPUs/nodes for a specific time period.

8. Resource Allocation Costs

The cost in GPUh depends on the type of resources allocated to a job. Feature flags let you select either “thin” (A100 40GB) or “fat” (A100 80GB) nodes; a job that specifies neither may run on either type. MIG GPUs are accessed through the MIG reservation.

GPU	Internal SLURM cost	GPUh cost per hour	Accessed through
MIG 1g.10gb	4	0.25	--reservation=1g.10gb
A100 40GB	16	1	-C thin, or no flag
A100 80GB	32	2	-C fat, or no flag

CPU nodes are accessible in the berzelius-cpu partition and the cost for 16 CPU cores is the same as for one A100 40GB.
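
To estimate what a job will be charged, multiply the rate from the table by the number of GPUs and the run time in hours. The sketch below assumes that the charge scales linearly with GPU count and hours, and that CPU jobs are charged in proportion to the number of cores (16 cores = 1 GPUh per hour). The function names are hypothetical.

```shell
# Print the GPUh cost of a GPU job.
# Usage: gpuh_cost <mig|thin|fat> <number-of-GPUs> <hours>
gpuh_cost() {
    case $1 in
        mig)  rate=0.25 ;;
        thin) rate=1 ;;
        fat)  rate=2 ;;
        *)    echo "unknown GPU type: $1" >&2; return 1 ;;
    esac
    awk -v r="$rate" -v n="$2" -v h="$3" 'BEGIN { printf "%g\n", r * n * h }'
}

# Print the GPUh cost of a job on a CPU node.
# Usage: cpu_gpuh_cost <number-of-cores> <hours>
cpu_gpuh_cost() {
    awk -v c="$1" -v h="$2" 'BEGIN { printf "%g\n", c / 16 * h }'
}

gpuh_cost thin 4 72     # prints 288
gpuh_cost fat 1 0.5     # prints 1
cpu_gpuh_cost 12 3      # prints 2.25
```

Remember that a job submitted without a feature flag may land on a fat node, in which case the fat rate applies.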

9. GPU Usage Efficiency Policy

As the demand for time on Berzelius is high, we must ensure that allocated GPU resources are used efficiently. The performance of all running jobs is continuously monitored by automated systems, and users are encouraged to monitor their own jobs as well.

9.1 Monitoring GPU Efficiency

You can use the tool jobgraph to visualize GPU usage:

jobgraph -j <jobID>

This command generates a .png file showing how the job is utilizing resources. For job arrays, make sure to use the raw job ID (i.e., the base job ID, not the individual array task ID).

You can also log into the node running your job using:

jobsh -j <jobID>

Inside the job environment, tools such as nvidia-smi and nvtop provide detailed, real-time GPU statistics.

9.2 Expected Power Usage

Jobs that use the GPUs well typically draw 200 W or more per GPU, with many AI/ML workloads reaching 300 W or more. In contrast, idle GPUs consume approximately 50–60 W. Jobs with consistently low power draw are likely not using the allocated resources effectively.

9.3 Automatic Termination of Inefficient Jobs

To maintain fair usage and cluster efficiency, jobs that fall below certain thresholds may be automatically canceled by the system. The key criteria are:

The exponential moving average power utilization per GPU falls below 100W.
The job is scheduled without any GPU allocation.
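
To get a feeling for how an exponential moving average reacts to your job's power draw, you can compute one yourself from nvidia-smi samples. This is only an illustration: the smoothing factor and sampling interval used by the monitoring system are not stated here, so ALPHA below is a hypothetical value, and power_ema is a made-up helper name.

```shell
# Read one power sample (in watts) per line and print the running
# exponential moving average: ema = alpha * sample + (1 - alpha) * ema.
# ALPHA is a hypothetical smoothing factor, not the value used on Berzelius.
power_ema() {
    awk -v alpha="${ALPHA:-0.1}" '
        NR == 1 { ema = $1 }
        NR > 1  { ema = alpha * $1 + (1 - alpha) * ema }
        { printf "%.1f\n", ema }
    '
}

# Example: sample GPU 0 every 60 seconds inside a running job
# (commented out because it loops until interrupted):
# nvidia-smi -i 0 --query-gpu=power.draw --format=csv,noheader,nounits -l 60 | power_ema
```

A short dip in power (for example while loading data) moves the average only slightly, whereas a long idle phase pulls it steadily toward the idle level.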

9.4 Exceptions to Auto-Termination

The following job types are exempt from this automatic cancellation policy:

Jobs that have not yet run for one hour, allowing time for setup and data preprocessing.
Interactive jobs started using the interactive tool, for up to 8 hours.
Jobs running within reservations, including devel and safe.
Jobs running on the MIG node.
Projects that are explicitly whitelisted — please contact us if this may apply to your use case.

Note: These criteria are intentionally simplified and will become stricter over time. For example, the grace period for interactive jobs may be shortened in future policy updates.

9.5 Notifications of Auto-Termination

Notifications about job cancellations caused by inefficiency are sent hourly rather than per job. This avoids excessive spam, especially when many jobs in a job array are affected. We recommend reviewing canceled job details to identify and fix potential inefficiencies in your workload.

10. Running Multiple Tasks Concurrently within a Single Job

In some cases, individual tasks do not fully utilize a GPU (e.g., low power usage), resulting in poor resource efficiency. One way to increase throughput is to run multiple such tasks concurrently within the same job. This strategy can make better use of allocated GPU time, even if each task runs slightly slower on its own.

This method is useful for:

Lightweight inference tasks (e.g., LLM prompt generation)
Small batch-size training runs
AlphaFold jobs with low GPU utilization
Jobs that benefit from node-local scratch (to amortize data transfer)

Example Scenario

Step 1: Prepare the Input File

We create a simple file data.txt, with one word per line:

[user@berzelius1 xargs-example]$ cat data.txt | wc -l
24

[user@berzelius1 xargs-example]$ head -n 3 data.txt
anniversary
annotated
annotation

Step 2: The Poem Generator Script

We have a script poet.sh that generates a poem from a single word:

[user@berzelius1 xargs-example]$ srun --gpus=1 ./poet.sh "hello" 2>/dev/null
Hello,
I'm a simple greeting
...

Step 3: Use a Wrapper to Store Output

A wrapper script poet_wrapper.sh runs poet.sh and saves the output to a file:

[user@berzelius1 xargs-example]$ cat poet_wrapper.sh
#!/bin/bash

mkdir -p results                # Create output directory
./poet.sh "$1" > "results/$1.txt" # Run task and redirect output (quoted so inputs with spaces work)

Step 4: SLURM Batch Script with Concurrent Execution

We now submit a job with a script like concurrent_poet.sh, which launches multiple tasks using xargs:

[user@berzelius1 xargs-example]$ cat concurrent_poet.sh
#!/bin/bash
#SBATCH -J concurrent_poet
#SBATCH --gpus=1

CONCURRENT_TASKS=4

# -P $CONCURRENT_TASKS: Run this many tasks in parallel.
# -I {}: Placeholder for each line of input.
# -d '\n': Treat input lines as newline-delimited.
cat data.txt | \
  xargs -d '\n' -I {} -P $CONCURRENT_TASKS ./poet_wrapper.sh "{}"

Step 5: Results

After the job finishes, all output is available under the results/ directory:

[user@berzelius1 xargs-example]$ ls results/ | wc -l
24

[user@berzelius1 xargs-example]$ ls results/ | head -n 3
allergy.txt
anniversary.txt
annotated.txt

[user@berzelius1 xargs-example]$ head -n 5 results/annotated.txt
And as I go through life, with each passing year,
I find that there's so much more to learn and share.
Each memory becomes an annotation,
A marking of my thoughts and emotions,
As I explore and grow along the way.

Best Practices

Benchmark your workload: Test different concurrency levels (-P) to find the most efficient setup.
Use restart logic: Skip already-completed outputs to make jobs restartable.
Split long task lists: If your job would exceed the 72-hour limit, divide input into smaller batches per job.
Move long pre- and postprocessing steps into separate jobs that run on CPU nodes.
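
The restart logic mentioned above can be as simple as checking whether the output file already exists. The following sketch is a restartable variant of poet_wrapper.sh. The function name run_once and the TASK_CMD override are hypothetical and exist only to keep the example self-contained; by default it calls ./poet.sh as in the example scenario.

```shell
# Restartable variant of poet_wrapper.sh: skip inputs whose result exists.
# TASK_CMD is a hypothetical override; by default ./poet.sh is executed.
run_once() {
    word=$1
    out="results/$word.txt"
    mkdir -p results
    if [ -s "$out" ]; then
        echo "skipping $word (already done)"
        return 0
    fi
    # Write to a temporary file first so an interrupted task never
    # leaves a partial result that would be mistaken for a finished one.
    if "${TASK_CMD:-./poet.sh}" "$word" > "$out.tmp"; then
        mv "$out.tmp" "$out"
    else
        rm -f "$out.tmp"
        return 1
    fi
}

# When saved as a wrapper script for use with xargs, end the file with:
# run_once "$1"
```

Resubmitting the same job after a timeout then only processes the inputs that are still missing.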

If your job benefits from an exclusive node (e.g., large scratch usage), and you want to utilize all 8 GPUs:

Add #SBATCH --exclusive and #SBATCH -N 1
Modify your script to distribute tasks across multiple GPUs
Pin tasks to GPUs, for example by setting CUDA_VISIBLE_DEVICES per task or by running a separate xargs block per GPU
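
One possible way to spread the task list over all GPUs of an exclusive node is to start one xargs pipeline per GPU and give each pipeline every N-th input line. This is a sketch under several assumptions: the GPUs are numbered 0 to NUM_GPUS-1 inside the job, GNU xargs is available (as in the example above), and the names run_on_all_gpus, NUM_GPUS, TASKS_PER_GPU and WRAPPER are hypothetical.

```shell
NUM_GPUS=${NUM_GPUS:-8}            # GPUs in the allocation (full node)
TASKS_PER_GPU=${TASKS_PER_GPU:-4}  # concurrent tasks sharing one GPU

# Usage: run_on_all_gpus <input-file-with-one-task-per-line>
run_on_all_gpus() {
    input=$1
    gpu=0
    while [ "$gpu" -lt "$NUM_GPUS" ]; do
        # Round-robin split: lines gpu+1, gpu+1+NUM_GPUS, ... go to this GPU.
        # CUDA_VISIBLE_DEVICES restricts every task of this pipeline to one GPU.
        awk -v n="$NUM_GPUS" -v g="$gpu" '(NR - 1) % n == g' "$input" |
            CUDA_VISIBLE_DEVICES=$gpu xargs -d '\n' -I {} -P "$TASKS_PER_GPU" \
                "${WRAPPER:-./poet_wrapper.sh}" "{}" &
        gpu=$((gpu + 1))
    done
    # Block until the pipelines of all GPUs have finished.
    wait
}

# Example for a batch script with --exclusive and -N 1 (commented out here):
# run_on_all_gpus data.txt
```

The round-robin split keeps the per-GPU task lists roughly equal in length; if task run times vary a lot, a shared work queue would balance the load better than this static split.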