Differences between Bi and Krypton

The Bi cluster is the replacement for Krypton. This page outlines the key differences between Bi and Krypton and documents some experiences from the pilot testing phase. If you have used Krypton before, the information here should help you migrate your jobs to Bi.

Hardware differences

Bi has 16 cores per compute node, just like Krypton. If you have a working job configuration for Krypton, you should be able to run exactly the same job on Bi -- it will just run much faster (a typical improvement is 50%).

Bi has hyper-threading available, making each physical core appear as two virtual cores. You enable it using --ntasks-per-core=2; mpprun then automatically starts 32 MPI ranks per compute node. If you do not do that, you get 16 MPI ranks per compute node (as long as you don't change that using other parameters). Hyper-threading makes some applications, e.g. Arome, run about 10% faster. See below for more information about hyper-threading and Slurm. Note: during the pilot phase until 2015-02-25, hyper-threading was on by default.

Bi has Intel Xeon E5 v3 processors of the "Haswell" generation. Haswell CPUs have improved vectorization with AVX2 instructions. In theory, up to 8 floating-point instructions can be handled per clock cycle (up from 4 with AVX). To benefit from this, you need to recompile your software with high optimization (e.g. -O2 -xCORE-AVX2) or at least link with an external library that has AVX2 support (e.g. Intel's MKL).
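For example, recompiling a Fortran code with AVX2 optimization using the Intel compiler could look like this (a minimal sketch; mycode.f90 and the binary name are placeholders, and the buildenv module is the one recommended under "Suggestions of what to test" below):

module load buildenv/2015-1                   # new compiler suite and libraries
ifort -O2 -xCORE-AVX2 -o binary.x mycode.f90  # enable AVX2 code generation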

Bi has 64 GB of memory in the thin compute nodes, twice the amount of Krypton. The memory speed has also improved: Bi has 1866 MHz DDR4 memory. In low-level memory benchmarks like STREAM, we see up to 30% improvement. For certain applications, this can lead to a substantial speed-up, even without recompiling them.

Bi has Intel Truescale Infiniband (previously known as QLogic Truescale) -- earlier clusters at NSC have had Infiniband from Mellanox. As a user, you will probably not notice this, but if you are using your own MPI library, you may have to supply special flags or recompile it with "PSM" or "TMI" support to get the best performance. In low-level benchmarks, we have seen that Truescale Infiniband is especially strong for small messages (high "packet rate").
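If you run Intel MPI yourself rather than through mpprun, the Truescale fabric can be selected explicitly. A sketch, assuming Intel MPI 5.x (shm:tmi means shared memory within a node and TMI/PSM between nodes; the rank count is illustrative):

export I_MPI_FABRICS=shm:tmi
mpiexec.hydra -n 128 ./binary.x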

Software/configuration differences

  • Node sharing is available, so you can run more than one job on a node. See Scheduling policy on Bi.
  • You cannot use a normal ssh NODENAME to log in to a node where you are running a job. Use jobsh -j JOBID NODENAME instead (see the example after this list).
  • The newer compiler wrappers and module system (same as on Triolith) are used.
  • A new implementation of interactive is used.
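For example, to open a shell on one of the nodes of a running job (the job ID 123456 and node name n77 are hypothetical):

jobsh -j 123456 n77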

Examples on how to launch jobs

Bi has the Slurm job scheduling system, like earlier clusters at NSC. Below, we present some examples of how to launch parallel jobs with different kinds of parallelization.

Pure MPI (without hyper-threading)

This is the simplest way of running. The job script below will launch the job on e.g. 8 compute nodes, and you will get 16 MPI ranks per node (1 per core). Run like this if you want everything to be as similar as possible to Krypton.

#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
...
mpprun binary.x

Pure MPI (with hyper-threading)

If you want to activate hyper-threading and run using MPI only, you need to tell Slurm that you want 2 MPI ranks per core. Mpprun will then launch 32 MPI ranks per node automatically. You also need to pass a special option to the underlying MPI library (PSM_RANKS_PER_CONTEXT=auto). Please note that this is not a recommended way of running; it is better to use MPI+OpenMP parallelization with hyper-threading (see below).

#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-core=2
...
export PSM_RANKS_PER_CONTEXT=auto
mpprun binary.x

Hybrid MPI + OpenMP parallelization

In this case, each MPI rank will spawn a number of OpenMP threads. You can have up to 2 OpenMP threads per core. There are many possible combinations. We expect the following combinations to run well:

16 MPI ranks x 2 OpenMP threads = 1 MPI rank per physical core, with 2 OpenMP threads per rank (one per virtual core). Job script:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-node=16
...

export OMP_NUM_THREADS=2
mpprun binary.x

2 MPI ranks x 16 OpenMP threads = 1 MPI rank per socket and 16 OpenMP threads on each socket. Job script:

#!/bin/bash
#SBATCH -J jobname
#SBATCH -t HH:MM:SS
#SBATCH -N 8
#SBATCH --ntasks-per-node=2
...

export OMP_NUM_THREADS=16
mpprun binary.x

Instead of giving the flag --ntasks-per-node, you can also affect the number of tasks per node indirectly by giving e.g. --ntasks-per-core=2. This effectively enables hyper-threading and starts 32 MPI ranks per node.

Various tricks

  • IntelMPI is rather slow at launching big jobs on Bi and Frost (big as in 100+ nodes) if you use just mpiexec.hydra. The startup time can be improved by setting export I_MPI_HYDRA_PMI_CONNECT=alltoall in the job script (see the sketch after this list). Please note that the IntelMPI module and the mpprun program do this automatically for you.
  • With OpenMP threading, try export KMP_AFFINITY=scatter to change thread affinity.
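Put together in a job script, these settings could look like this (a minimal sketch; binary.x is a placeholder):

export I_MPI_HYDRA_PMI_CONNECT=alltoall   # faster startup on 100+ nodes (mpprun already sets this)
export KMP_AFFINITY=scatter               # spread OpenMP threads over the cores
mpprun binary.x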

Suggestions of what to test

  • Try to recompile your software on Bi with the new compilers. Use the module buildenv/2015-1.
  • Run the job like you did on Krypton (using 16 cores/node). Check the output for correctness and then look at the speed. The job should run faster on Bi. If you have problems with numerical reproducibility, try running on Bi with export MKL_CBWR=AVX2 or export MKL_CBWR=AVX (see the sketch after this list).
  • Next, try hyper-threading (32 cores/node), especially if your program has OpenMP parallelization. You may have to test different combinations of the number of OpenMP threads and MPI ranks to get the best performance.
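A first test session along these lines could look like this (a minimal sketch; mycode.f90 and binary.x are placeholders, and -Nmpi is the NSC wrapper flag used elsewhere on this page):

module load buildenv/2015-1                          # new compilers, MPI and MKL
ifort -Nmpi -O2 -xCORE-AVX2 -o binary.x mycode.f90   # recompile with AVX2 enabled
export MKL_CBWR=AVX2                                 # only if you need reproducible MKL results
mpprun binary.x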

Nemo-specific information

These are some specific tips for the Nemo code, supplied by Torgny and based on the vendor's own testing. Suitable compiler options are:

%FC          ifort  -c -cpp -Nmpi
%FCFLAGS     -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive 
%FFLAGS      -r8 -i4 -O3 -fp-model precise -xCORE-AVX2 -ip -unroll-aggressive 
%LD          ifort -O3 -fp-model precise -assume byterecl -convert big_endian -Nmpi

Example batch-script for a 16-node Nemo run. Here, we are not using hyper-threading, as Nemo does not benefit from it. There is also no OpenMP usage.

#!/bin/sh 
#SBATCH -N 16
#SBATCH -t 01:00:00 
...
time mpprun -np 255 ./nemo.exe
...

Arome-specific information

Some early experiences from the Arome benchmarking.

Compiler options

See the example Arome "makeup" file below.

Example batch-script

Suppose we want to run a 48-node Arome job using Intel MPI. In this case, we want to:

  • use hyper-threading, which Arome benefits from
  • have the I/O server on one node
  • use OpenMP with two threads per MPI rank.

The script would look like:

#!/bin/sh 
#SBATCH  -J Forecast 
#SBATCH  -N 49 
#SBATCH --ntasks-per-node=16 
#SBATCH  -t 01:00:00 
...
export NPROCX=16 
export NPROCY=48 
export NPROC_IO=16 
export NPROC=$(( $NPROCX * $NPROCY )) 
export TOTPROC=$(( $NPROCX * $NPROCY + $NPROC_IO )) 
export NSTRIN=$NPROC 
export NSTROUT=$NPROC 
export OMP_NUM_THREADS=2 
export KMP_STACKSIZE=128m 

... NAMELIST etc. ...

time mpprun  LINK_TO_MASTERODB -maladin -vmeteo -eHARM -c001 -t$TSTEP -fh$FCLEN -asli || exit 

Tuning tips for performance

NPROMA=-32 seems to work fine.

Speed up the launching of MPI jobs:

export I_MPI_HYDRA_PMI_CONNECT=alltoall 

Improve MPI performance by selecting alternative implementations of some MPI routines:

export I_MPI_ADJUST_ALLREDUCE=6 
export I_MPI_ADJUST_BARRIER=1 
export I_MPI_ADJUST_ALLTOALLV=2 

Improve dynamic memory allocation:

export MALLOC_MMAP_MAX_=0 
export MALLOC_TRIM_THRESHOLD_=-1

Improve performance for larger values of OMP_NUM_THREADS (4 or more):

export KMP_AFFINITY=compact 
export I_MPI_PIN_DOMAIN=omp:platform

Sometimes it can be beneficial to reduce the number of ranks: for example, running 15 ranks on each node, each with 2 OpenMP threads, seems to reduce the variability of the runtime. See the example in the table below for 96 nodes, and the sketch that follows.
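Requesting 15 ranks per node could look like this in the job script (a minimal sketch showing only the relevant lines; the node count is illustrative):

#SBATCH -N 96
#SBATCH --ntasks-per-node=15
export OMP_NUM_THREADS=2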

Miscellaneous environment variables

To enable reproducible output, independent of MPI-rank distribution and number of OpenMP-threads:

export MKL_CBWR=SSE4_1

Outstanding issues (2015-02-02)

For very large numbers of MPI ranks (ca. 2500 and more), there is additional overhead for each I/O step. It is not yet clear why this happens.

Arome processor distribution suggestions

With I/O-server:

Total number of nodes  49      65      97      97      145      194
I/O nodes              1       1       1       1       1        2
Compute nodes          48      64      96      96      144      192
NPROCX                 16      16      30      32      48       32
NPROCY                 48      64      48      48      48       48
OMP_NUM_THREADS        2       2       2       2       2        4
NPROC_IO               16      16      16      16      16       32
NSTRIN                 NPROC   NPROC   NPROC   NPROC   NPROC/2  NPROC/4
NSTROUT                NPROC   NPROC   NPROC   NPROC   NPROC    NPROC
MPI-ranks/node         16      16      15      16      16       8
Hyperthreading         Yes     Yes     Yes     Yes     Yes      Yes

Without I/O-server:

Total number of nodes  48      64      96      96      144      192
Compute nodes          48      64      96      96      144      192
NPROCX                 16      16      30      32      48       32
NPROCY                 48      64      48      48      48       48
OMP_NUM_THREADS        2       2       2       2       2        4
NPROC_IO               0       0       0       0       0        32
NSTRIN                 NPROC   NPROC   NPROC   NPROC   NPROC/2  NPROC/4
NSTROUT                NPROC   NPROC   NPROC   NPROC   NPROC    NPROC
MPI-ranks/node         16      16      15      16      16       8
Hyperthreading         Yes     Yes     Yes     Yes     Yes      Yes

Example Arome "makeup" file

MOD=mod 
 
FOPT=-noauto -convert big_endian -assume byterecl -openmp -openmp-threadprivate=compat -O3 -fpe0 -fp-model precise -fp-speculation=safe -ftz 
COPT=-O2 -fp-model precise -openmp -fp-speculation=safe -openmp-threadprivate=compat 

DEFS=-DLINUX -DLITTLE -DLITTLE_ENDIAN -DHIGHRES -DADDRESS64 -DPOINTER_64 -D_ABI64 -DBLAS \ 
     -DSTATIC_LINKING -DINTEL -D_RTTOV_DO_DISTRIBCOEF -DINTEGER_IS_INT \ 
     -DREAL_8 -DREAL_BIGGER_THAN_INTEGER -DUSE_SAMIO -D_RTTOV_DO_DISTRIBCOEF -DNO_CURSES \ 
     -DFA=fa -DLFI=lfi -DARO=aro -DOL=ol -DASC=asc -DTXT=txt 

CC=icc -g -traceback -Nmpi 
CCFLAGS=$(COPT) $(DEFS) -Dlinux -DFOPEN64 

FC=ifort -Nmpi -g -traceback 
FCFLAGS=$(FOPT) $(DEFS) 

FCFREE=-free 
FCFIXED=-nofree 
AUTODBL=-r8 

LD=ifort -Nmpi -O3 -g -traceback -fp-model precise -fpe0 -ftz 
LDFLAGS=-pc 64 -openmp 
MKLROOT=/software/apps/intel/composer_xe_2015.1.133/mkl 

# System-dependent libraries - ALWAYS LOADED - (absolute filename or short name) : 
LD_SYS01 = -lpthread -lm 

# INTEL Math Kernel Library 
LD_LANG01 = $(MKLROOT)/lib/intel64/libmkl_blas95_lp64.a 
LD_LANG02 = $(MKLROOT)/lib/intel64/libmkl_lapack95_lp64.a 
LD_LANG03 = -L$(MKLROOT)/lib/intel64 -lmkl_intel_lp64 
LD_LANG04 = -lmkl_core 
LD_LANG05 = -lmkl_intel_thread 
# MPI: 
LD_MPI01 = -L$(I_MPI_ROOT)/intel64/lib -ldl -lrt -lpthread 

SYSLIBS= $(LD_SYS01) \ 
         $(LD_LANG01) $(LD_LANG02) $(LD_LANG03) $(LD_LANG04) $(LD_LANG05) $(LD_MPI01) \ 
         $(GRIB_API_LIB) 

#INCLDIRS=$(GRIB_API_INCLUDE) -I$(NETCDFINCLUDE) 
INCLDIRS=$(GRIB_API_INCLUDE) 

RANLIB=ls -l 

PRESEARCH=-Wl,--start-group 
POSTSEARCH=-Wl,--end-group 

MPIDIR=/software/apps/intel/impi/5.0.2.044/intel64/lib 
MPIDIR_INCL=/software/apps/intel/impi/5.0.2.044/intel64/include 

YACCLEX_LIBS=-lm 

LDCC=icc -Nmpi -O3 -DLINUX -w -lifcore $(LD_MPI01) 

NPES=1 

AUXSOURCES=sources.linux 

# comma-separated list of external module references 
EXTMODS=hdf5 
