Job-local storage using BeeGFS

On some of our clusters (currently Tetralith and Sigma, but this feature could be made available on other systems), we offer a way to combine the local disks or the RAM in each compute node of a multi-node job into one large area for temporary files.

This feature uses BeeOND to create an on-demand BeeGFS file system that exists for as long as your job is running, mounted on /scratch/job on all nodes in the job.

NOTE: you should NOT use this feature if you just want local scratch space on a single compute node. All compute nodes automatically have /scratch/local available for all jobs.

Using the local disk in the compute nodes

To enable this feature for a job, request the feature (constraint) "scratchjob" for your job.

BeeGFS will only have optimal performance if all the compute node disks are of the same size and type. On Tetralith and Sigma, different node types have different disk types. Therefore I suggest that you request nodes of one type for your job.

Some examples:

Four "diskS" nodes, giving a total of ~800 GiB (4 * 200 GB SSD): sbatch -N4 -C scratchjob,diskS

Two "diskM" nodes, giving a total of ~1700 GiB (2 * 960 GB SSD) : sbatch -N2 -C scratchjob,diskM

Three "diskL" nodes (only in Tetralith), giving a total of ~5000 GiB (3 * 2 TB NVME): sbatch -N3 -C scratchjob,diskL

Since the number of nodes with large disks is limited [1], only use them if you really need the larger (and, in the case of the "gpu" nodes, faster) disks.
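
Putting this together, a batch script using the job-local file system could look like the sketch below. The project account, storage paths and program name are placeholders; adapt them to your own setup.

  #!/bin/bash
  #SBATCH -A myproject              # placeholder: your project/account name
  #SBATCH -N 2
  #SBATCH -C scratchjob,diskM       # two diskM nodes; BeeGFS mounted on /scratch/job
  #SBATCH -t 04:00:00

  # Stage input data onto the job-local BeeGFS file system (visible on all nodes)
  cp -r /proj/myproject/input /scratch/job/

  # Run the application against /scratch/job (placeholder program name)
  srun ./my_solver /scratch/job/input /scratch/job/output

  # Copy results back before the job script ends
  cp -r /scratch/job/output /proj/myproject/results/

Anything you want to keep must be copied back to permanent storage at the end of the job script, since /scratch/job only exists while the job is running.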

Using the RAM in the compute nodes

To enable this feature, request the feature (constraint) "scratchjobram" for your job. This will create a file system using a RAM disk (if your application only uses a small amount of RAM, approximately 90% of the RAM in each compute node can be used for /scratch/job).

BeeGFS will only have optimal performance if all the compute nodes have the same amount of RAM. Therefore I suggest that you request nodes of one type for your job.

Some examples:

Four "thin" nodes, giving a total of ~180 GiB (96 GB RAM per node): sbatch -N4 -C scratchjobram,thin

Two "fat", giving a total of ~370 GiB (384 GB RAM per node): sbatch -N2 -C scratchjobram,fat

As the Tetralith GPU nodes only have 96 GB RAM each, it does not make sense to request them for scratchjobram unless you will also use the GPUs for your application.
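
Once a scratchjobram job has started, you can check how much space you actually got and how much RAM remains for the application. A minimal sketch, run from the job script or an interactive shell on one of the job's nodes:

  df -h /scratch/job    # total size of the RAM-backed BeeGFS file system spanning all job nodes
  free -g               # RAM still available on this node for your application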

Performance

The RAM disk will usually be much faster, but less space is available per node, and it also limits how much RAM your application can use.

Like most distributed file systems, BeeGFS gives you best performance if you use large files.

BeeGFS on a single node has approximately the same performance as using the local disk directly (i.e. /scratch/local): 300-500 MiB/s, but with some CPU overhead, so you probably want to use /scratch/local instead.

BeeGFS on multiple nodes will generally give aggregated performance [2] of up to 500 MiB/s per node in the job. If you use the RAM disk, you can get several GiB/s per node.
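
If you want to see what your own job actually gets, a simple streaming-write test with large blocks gives a rough idea. This is only a sketch, not a rigorous benchmark, and the file names are arbitrary:

  # Single writer: one 10 GiB file written with 1 MiB blocks
  dd if=/dev/zero of=/scratch/job/ddtest bs=1M count=10240

  # Aggregated bandwidth usually needs several writers, e.g. one per node in the job
  srun --ntasks-per-node=1 bash -c 'dd if=/dev/zero of=/scratch/job/ddtest.$SLURM_PROCID bs=1M count=10240'

Remove the test files afterwards so they do not take space from your actual job data.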

Please note that BeeGFS can use significant amounts of CPU when the application is doing I/O to /scratch/job. For optimal overall application performance it might make sense to not use all the CPU cores for the application, but leave some for BeeGFS. How many? That depends on how much I/O you will be doing. If you will be using job-local storage for a large number of jobs, it might make sense to do some benchmarking first with different numbers of cores reserved for BeeOND to find out where the optimum is for your job type.
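
As an illustration (the core counts are only an example; benchmark to find what suits your application), on nodes with 32 cores you could start fewer MPI ranks than there are cores and leave the rest for BeeOND:

  #SBATCH -N 4
  #SBATCH -C scratchjob,diskM

  # 30 ranks per node instead of 32, leaving 2 cores per node for BeeGFS/BeeOND
  srun --ntasks-per-node=30 ./my_app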

If you need help in making your jobs use /scratch/job, please contact NSC Support.


  1. Tetralith has 1678 "diskS", 60 "diskM" and 170 "diskL" nodes. Sigma has four "diskM" nodes and 104 "diskS" nodes.

  2. You might need to perform I/O from several processes/threads/MPI ranks per node and/or from multiple nodes to reach the maximum performance available.

