Berzelius

Berzelius is the premier AI/ML cluster at NSC. It was donated to NSC by the Knut and Alice Wallenberg Foundation in 2020 and installed in the spring of 2021. In the spring of 2023 it was expanded to its current size. It is used by Swedish academic research groups.

Project access to Berzelius is granted by NSC via an application process in the project repository SUPR. Berzelius is open for project applications from Swedish academic researchers, as described in more detail in Resource Allocations on Berzelius.

Berzelius is an NVIDIA® SuperPOD consisting of 94 NVIDIA® DGX-A100 compute nodes supplied by Atos/Eviden and 8 CPU nodes also supplied by Eviden. The original 60 “thin” DGX-A100 nodes are each equipped with 8 NVIDIA® A100 Tensor Core GPUs, 2 AMD Epyc™ 7742 CPUs, 1 TB RAM and 15 TB of local NVMe SSD storage. The A100 GPUs in these nodes each have 40 GB of on-board HBM2 VRAM. The 34 newer “fat” DGX-A100 nodes are each equipped with 8 NVIDIA® A100 Tensor Core GPUs, 2 AMD Epyc™ 7742 CPUs, 2 TB RAM and 30 TB of local NVMe SSD storage. The A100 GPUs in these nodes each have 80 GB of on-board HBM2 VRAM. The CPU nodes are each equipped with 2 AMD Epyc™ 9534 CPUs, 1.1 TB RAM and 6.4 TB of local NVMe SSD storage.
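The per-node figures above can be added up to get cluster-wide totals. The following is a minimal sketch of that arithmetic, not an official specification: the `NodeType` class and the `totals` helper are illustrative names introduced here, and the numbers are copied from the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NodeType:
    """One class of Berzelius node, with figures taken from the text above."""
    name: str
    count: int
    gpus_per_node: int
    gpu_mem_gb: int  # on-board memory per GPU


# Illustrative table of the three node types described above.
NODE_TYPES = [
    NodeType("thin", count=60, gpus_per_node=8, gpu_mem_gb=40),
    NodeType("fat", count=34, gpus_per_node=8, gpu_mem_gb=80),
    NodeType("cpu", count=8, gpus_per_node=0, gpu_mem_gb=0),
]


def totals(node_types):
    """Sum node count, GPU count and GPU memory over all node types."""
    return {
        "nodes": sum(n.count for n in node_types),
        "gpus": sum(n.count * n.gpus_per_node for n in node_types),
        "gpu_mem_gb": sum(
            n.count * n.gpus_per_node * n.gpu_mem_gb for n in node_types
        ),
    }


if __name__ == "__main__":
    print(totals(NODE_TYPES))
    # {'nodes': 102, 'gpus': 752, 'gpu_mem_gb': 40960}
```

Under these figures the cluster has 752 A100 GPUs in total: 480 with 40 GB each and 272 with 80 GB each.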

Fast compute interconnect is provided via eight NVIDIA® Mellanox® HDR InfiniBand links per DGX node, connected in a non-blocking fat-tree topology. In addition, every node is equipped with a dedicated NVIDIA® Mellanox® HDR storage interconnect.

Shared, central storage accessible from all compute nodes of the cluster is provided by a DDN A³I storage cluster consisting of 4 AI400X and 2 AI400X2 all-NVMe SSD storage servers, which use the high-bandwidth interconnect end-to-end to the GPUs. The total accessible storage space is 1.5 PB and is shared between all projects. Aggregate data read bandwidth from the storage is ~320 GB/s.
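Because the read bandwidth is an aggregate figure shared by the whole cluster, it can help to estimate what a single node might see under load. The sketch below assumes an idealized even split between consumers, which real workloads will not follow exactly; the function names are illustrative, and the ~320 GB/s figure is approximate.

```python
AGGREGATE_READ_GB_PER_S = 320  # approximate aggregate figure from the text above
DGX_NODES = 94                 # DGX-A100 compute nodes
GPUS_PER_NODE = 8


def even_share_gb_per_s(consumers, aggregate=AGGREGATE_READ_GB_PER_S):
    """Bandwidth per consumer if the aggregate were split perfectly evenly."""
    if consumers <= 0:
        raise ValueError("consumers must be positive")
    return aggregate / consumers


def read_time_seconds(dataset_gb, bandwidth_gb_per_s):
    """Idealized time to stream a dataset once at a given bandwidth."""
    return dataset_gb / bandwidth_gb_per_s


if __name__ == "__main__":
    per_node = even_share_gb_per_s(DGX_NODES)
    per_gpu = even_share_gb_per_s(DGX_NODES * GPUS_PER_NODE)
    print(f"per DGX node: {per_node:.2f} GB/s, per GPU: {per_gpu:.2f} GB/s")
    # Hypothetical 1 TB dataset read by one node at its even share.
    print(f"1 TB at per-node share: {read_time_seconds(1000, per_node):.0f} s")
```

These numbers are lower bounds for the fully loaded case. A job running while the cluster is lightly used may see considerably more than its even share.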

Compute resources are allocated via the SLURM resource manager. User access to the system login nodes is provided via SSH and the ThinLinc remote desktop solution.
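As an illustration of requesting resources through SLURM, the sketch below builds a small batch script as text. It uses only standard `sbatch` options (`--job-name`, `--gpus`, `--time`, `--account`). The job name, time limit, account and the command `python train.py` are hypothetical placeholders, and the options recommended for Berzelius may differ, so consult the Berzelius GPU User Guide before relying on it.

```python
def make_batch_script(job_name, gpus, time_limit, command, account=None):
    """Return the text of a SLURM batch script.

    All argument values are supplied by the caller; nothing here is
    specific to Berzelius beyond using standard sbatch directives.
    """
    if gpus < 1:
        raise ValueError("request at least one GPU")
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --gpus={gpus}",
        f"#SBATCH --time={time_limit}",
    ]
    if account is not None:
        # Hypothetical project account name, if one must be given.
        lines.append(f"#SBATCH --account={account}")
    lines += ["", command, ""]
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder values for illustration only.
    script = make_batch_script(
        job_name="demo",
        gpus=1,
        time_limit="01:00:00",
        command="python train.py",
    )
    print(script)
```

The resulting text can be saved to a file on a login node and submitted with `sbatch <file>`.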

All nodes have a local disk where applications can store temporary files. The size of this disk (available to jobs as /scratch/local) is 15 TB on “thin” nodes, 30 TB on “fat” nodes, and 6.4 TB on CPU nodes, and is shared between all jobs using the node.
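A job can place its temporary files on the node-local disk and clean them up when it finishes. The following is a minimal sketch, assuming only that `/scratch/local` is writable inside a job as described above; the helper names are illustrative, and the code falls back to the system temporary directory when run elsewhere. Since the disk is shared between all jobs on the node, it also checks free space rather than assuming the full capacity is available.

```python
import os
import shutil
import tempfile

NODE_LOCAL_SCRATCH = "/scratch/local"  # path given in the text above


def scratch_base():
    """Return the node-local scratch path if usable, else the system temp dir."""
    if os.path.isdir(NODE_LOCAL_SCRATCH) and os.access(NODE_LOCAL_SCRATCH, os.W_OK):
        return NODE_LOCAL_SCRATCH
    return tempfile.gettempdir()


def free_space_gb(path):
    """Free space on the filesystem holding `path`, in GB (10**9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


if __name__ == "__main__":
    base = scratch_base()
    print(f"using {base}, {free_space_gb(base):.1f} GB free")
    # The directory and its contents are removed when the block exits.
    with tempfile.TemporaryDirectory(dir=base) as workdir:
        tmp_file = os.path.join(workdir, "intermediate.bin")
        with open(tmp_file, "wb") as fh:
            fh.write(b"\0" * 1024)
        print(f"wrote {os.path.getsize(tmp_file)} bytes to {tmp_file}")
```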

Additional Information

Berzelius Getting Started

Berzelius Resource Allocation

Berzelius GPU User Guide

Berzelius Software Guide

Berzelius Common Datasets

Berzelius Press Release
