Properties | NSC Triolith | LUNARC Aurora |
---|---|---|
Number of cores per compute node | 16 | 20 |
Filesystem | GPFS | Lustre |
Read cache | static, permanent | dynamic |
Table 1. Triolith vs Aurora. NSC Triolith has 16 cores per compute node while LUNARC Aurora has 20. NSC Triolith uses the GPFS filesystem and LUNARC Aurora the Lustre filesystem. GPFS has a static (and permanent) read cache, which on Triolith is kept small because it “removes” RAM from the compute node. Lustre, in contrast, has a dynamic read cache: if the file you read fits into the RAM of the compute node and that RAM is not needed for the data processing, then the second time you read the file you will effectively read it from RAM instead, which is much faster than reading from disk.
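A quick way to see the Lustre read-cache effect is to time the same read twice; a minimal sketch (the filename is from the benchmark dataset below, and the timing method is our addition, not part of the benchmark):

```
# First read comes from disk; if the file fits in free RAM, a second
# read on a Lustre node is served from the page cache and is much faster
time dd if=insu6_1_data_000001.h5 of=/dev/null bs=1M
time dd if=insu6_1_data_000001.h5 of=/dev/null bs=1M
```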
The different core counts and filesystems at Aurora and Triolith have a few interesting consequences when running XDS and its derivatives XDSGUI, XDSAPP, autoPROC and XDSme.
For all XDS benchmark runs listed in Tables 2-5 below we use the 1.3Å EIGER_16M_Nov2015.tar.bz2 dataset from Dectris (900 frames), as before - see XDS benchmarking.
The XDS benchmark runs were done using the now-outdated version 1 of forkxds, which had stricter SLURM allocation requirements:

number of nodes × number of tasks per node = MAXIMUM_NUMBER_OF_JOBS

number of nodes × number of tasks per node × number of cpus per task = MAXIMUM_NUMBER_OF_JOBS × MAXIMUM_NUMBER_OF_PROCESSORS
The current version of forkxds is more forgiving with respect to the SLURM allocation requirement:

MAXIMUM_NUMBER_OF_JOBS × MAXIMUM_NUMBER_OF_PROCESSORS = number of nodes × cores per node = total number of cores
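Inside a running batch job this relation can be sanity-checked from SLURM's environment variables. A minimal sketch (the check itself is our addition; the variables are standard SLURM and are only set when the corresponding sbatch options were given):

```
#!/bin/sh
# Derive the XDS.INP values implied by the current SLURM allocation
# (assumes --ntasks-per-node and --cpus-per-task were passed to sbatch)
JOBS=$(( SLURM_JOB_NUM_NODES * SLURM_NTASKS_PER_NODE ))  # MAXIMUM_NUMBER_OF_JOBS
PROCS=$SLURM_CPUS_PER_TASK                               # MAXIMUM_NUMBER_OF_PROCESSORS
echo "allocation supports MAXIMUM_NUMBER_OF_JOBS=$JOBS"
echo "allocation supports MAXIMUM_NUMBER_OF_PROCESSORS=$PROCS"
```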
The XDS benchmark below was done by first creating XDS.INP:

```
module load generate_XDS.INP
generate_XDS.INP insu6_1_master.h5
```

then adding 2-3 lines to XDS.INP, the first two being:

```
MAXIMUM_NUMBER_OF_JOBS=8
MAXIMUM_NUMBER_OF_PROCESSORS=10
```
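For scripted, non-interactive runs the same two lines can simply be appended; a minimal sketch:

```
# Append the job/processor split to a freshly generated XDS.INP
echo "MAXIMUM_NUMBER_OF_JOBS=8" >> XDS.INP
echo "MAXIMUM_NUMBER_OF_PROCESSORS=10" >> XDS.INP
```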
In the absence of the Dectris Neggia library, H5ToXds is used to read the Eiger HDF5 frames. To use the Neggia library instead of H5ToXds, add

```
LIB=/sw/pkg/presto/software/Neggia/1.0.1-goolf-PReSTO-1.7.20/lib/dectris-neggia.so
```

to XDS.INP at LUNARC Aurora, or

```
LIB=/proj/xray/presto/software/Neggia/1.0.1-goolf-PReSTO-1.7.20/lib/dectris-neggia.so
```

to XDS.INP at NSC Triolith.
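When the same processing script is used at both clusters, the matching LIB= line can be picked at run time; a minimal sketch (the hostname patterns are assumptions, adjust them to your login environment):

```
# Append the Neggia LIB= line matching the current cluster
# (hostname patterns are hypothetical; adjust as needed)
case "$(hostname -f)" in
  *lunarc*) LIBDIR=/sw/pkg/presto/software ;;    # LUNARC Aurora
  *nsc*)    LIBDIR=/proj/xray/presto/software ;; # NSC Triolith
esac
echo "LIB=$LIBDIR/Neggia/1.0.1-goolf-PReSTO-1.7.20/lib/dectris-neggia.so" >> XDS.INP
```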
Then edit xds.script to match the XDS.INP parameters above:

```
#!/bin/sh
#SBATCH -t 0:15:00
# 4 nodes x 2 tasks per node = 8 JOBS (MAXIMUM_NUMBER_OF_JOBS)
#SBATCH --nodes=4
#SBATCH --ntasks-per-node=2
# 10 cpus per task = 10 PROCESSORS (MAXIMUM_NUMBER_OF_PROCESSORS)
#SBATCH --cpus-per-task=10
#SBATCH -A snic2017-1-199
#SBATCH --mail-type=ALL
#SBATCH --mail-user=martin.moche@ki.se
module load XDS
xds_par
```
Then run xds.script by:

```
sbatch xds.script
```
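The INIT, COLSPOT and INTEGRATE wall-clock times reported in Tables 2-5 can be pulled from the corresponding .LP files once the job finishes; a minimal sketch (we assume each step's .LP file reports an "elapsed wall-clock time" line, as recent XDS versions do; adjust the pattern to your XDS version):

```
# Collect per-step wall-clock times after the run
# (the grep pattern is an assumption about the .LP wording)
for step in INIT COLSPOT INTEGRATE; do
  printf '%-10s ' "$step"
  grep -i "wall-clock time" "$step.LP" | tail -1
done
```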
Aurora with Neggia | nodes | cores | JOBS | PROCESSORS | INIT (s) | COLSPOT (s) | INTEGRATE (s) | TOTAL (s) |
---|---|---|---|---|---|---|---|---|
--nodes=1 | 1 | 20 | 1 | 20 | 7.1 | 24.8 | 91.7 | 136 |
--nodes=1 | 1 | 20 | 2 | 10 | 8.0 | 22.7 | 65.2 | 109 |
--nodes=1 | 1 | 20 | 4 | 5 | 11.8 | 21.7 | 63.5 | 111 |
--nodes=2 | 2 | 40 | 2 | 20 | 6.8 | 16.2 | 46.9 | 83 |
--nodes=2 | 2 | 40 | 4 | 10 | 8.2 | 13.0 | 37.4 | 72 |
--nodes=2 | 2 | 40 | 8 | 5 | 12.1 | 12.1 | 33.8 | 72 |
--nodes=4 | 4 | 80 | 4 | 20 | 7.0 | 9.1 | 24.4 | 52 |
--nodes=4 | 4 | 80 | 8 | 10 | 8.3 | 7.7 | 20.9 | 50 |
--nodes=4 | 4 | 80 | 16 | 5 | 12.3 | 7.5 | 84.4 | 118 |
--nodes=8 | 8 | 160 | 8 | 20 | 7.0 | 7.0 | 14.5 | 40 |
--nodes=8 | 8 | 160 | 16 | 10 | 8.1 | 5.3 | 12.2 | 38 |
Table 2. LUNARC Aurora with the Neggia library gave the shortest TOTAL wall-clock times; compare with Tables 3-5.
Aurora with H5ToXds | nodes | cores | JOBS | PROCESSORS | INIT (s) | COLSPOT (s) | INTEGRATE (s) | TOTAL (s) |
---|---|---|---|---|---|---|---|---|
--nodes=1 | 1 | 20 | 1 | 20 | 8.2 | 33.0 | 105.7 | 159 |
--nodes=1 | 1 | 20 | 2 | 10 | 9.7 | 29.8 | 81.7 | 135 |
--nodes=1 | 1 | 20 | 4 | 5 | 15.0 | 28.4 | 78.1 | 136 |
--nodes=2 | 2 | 40 | 2 | 20 | 7.7 | 17.9 | 56.8 | 95 |
--nodes=2 | 2 | 40 | 4 | 10 | 9.6 | 17.6 | 44.1 | 85 |
--nodes=2 | 2 | 40 | 8 | 5 | 9.5 | 16.4 | 43.5 | 83 |
--nodes=4 | 4 | 80 | 4 | 20 | 8.0 | 11.0 | 27.3 | 59 |
--nodes=4 | 4 | 80 | 8 | 10 | 11.4 | 9.4 | 23.6 | 58 |
--nodes=4 | 4 | 80 | 16 | 5 | 15.1 | 9.3 | 25.4 | 64 |
Table 3. At LUNARC Aurora with H5ToXds the TOTAL wall-clock times are longer than with the Neggia library; compare with Table 2.
Triolith with Neggia | nodes | cores | JOBS | PROCESSORS | INIT (s) | COLSPOT (s) | INTEGRATE (s) | TOTAL (s) |
---|---|---|---|---|---|---|---|---|
--nodes=1 | 1 | 16 | 1 | 16 | 29.8 | 41.3 | 306.9 | 393 |
--nodes=1 | 1 | 16 | 2 | 8 | 36.3 | 44.2 | 229.5 | 325 |
--nodes=1 | 1 | 16 | 4 | 4 | 31.2 | 50.0 | 120 | 220 |
--nodes=2 | 2 | 32 | 2 | 16 | 23.4 | 29.5 | 253.2 | 321 |
--nodes=2 | 2 | 32 | 4 | 8 | 33.3 | 37.8 | 110.3 | 197 |
--nodes=2 | 2 | 32 | 8 | 4 | 22.8 | 26.9 | 58.1 | 125 |
--nodes=4 | 4 | 64 | 4 | 16 | 24.6 | 38.3 | 123.6 | 201 |
--nodes=4 | 4 | 64 | 8 | 8 | 21.2 | 23.2 | 59.2 | 119 |
Table 4. NSC Triolith with the Neggia library.
Triolith with H5ToXds | nodes | cores | JOBS | PROCESSORS | INIT (s) | COLSPOT (s) | INTEGRATE (s) | TOTAL (s) |
---|---|---|---|---|---|---|---|---|
--nodes=1 | 1 | 16 | 1 | 16 | 14.6 | 83.4 | 266.1 | 379 |
--nodes=1 | 1 | 16 | 2 | 8 | 17.7 | 58.5 | 189.8 | 282 |
--nodes=1 | 1 | 16 | 4 | 4 | 26.9 | 57.5 | 152.9 | 255 |
--nodes=2 | 2 | 32 | 2 | 16 | 14.4 | 39.4 | 136.5 | 205 |
--nodes=2 | 2 | 32 | 4 | 8 | 18.0 | 32.0 | 94.0 | 162 |
--nodes=2 | 2 | 32 | 8 | 4 | 27.0 | 30.6 | 76.6 | 152 |
--nodes=4 | 4 | 64 | 4 | 16 | 14.9 | 21.6 | 68.6 | 122 |
--nodes=4 | 4 | 64 | 8 | 8 | 17.8 | 19.5 | 47.3 | 102 |
Table 5. NSC Triolith with H5ToXds.
For all XDSAPP benchmarks below we again use the 1.3Å EIGER_16M_Nov2015.tar.bz2 dataset from Dectris (900 frames) and an sbatch script running xdsapp instead of the GUI. In the xdsapp sbatch script:

-j corresponds to MAXIMUM_NUMBER_OF_JOBS in XDS.INP

-c corresponds to MAXIMUM_NUMBER_OF_PROCESSORS in XDS.INP

j × c = number of nodes × cores per node = total number of cores

as explained in detail under XDS multi-node-scripts.
Example script running xdsapp with 4 JOBS and 10 PROCESSORS per job, allocating 40 cores (2 nodes) at LUNARC Aurora:
```
#!/bin/sh
#SBATCH -t 00:20:00
#SBATCH --nodes=2 --exclusive
#SBATCH -A snic2017-1-xxx
#SBATCH --mail-type=ALL
#SBATCH --mail-user=name.surname@lu.se
module load XDSAPP
# -j 4 jobs x -c 10 processors = 40 cores = 2 Aurora nodes
xdsapp --cmd \
  --dir /lunarc/nobackup/users/mochma/test_suite_NSC/bench_xdsapp/2-2-10 \
  -j 4 \
  -c 10 \
  -i /lunarc/nobackup/users/mochma/test_suite_NSC/eiger/empty/2015_11_10/insu6_1_data_000001.h5
```
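To reproduce the 2-node rows of Table 6, the same script can be submitted once per JOBS × PROCESSORS split; a minimal sketch assuming the script above is saved as xdsapp.script and modified to take -j and -c from its arguments (both the filename and the argument handling are our assumptions):

```
# Submit one xdsapp job per way of splitting 40 cores (2 Aurora nodes)
# into JOBS x PROCESSORS; combinations match the 40-core rows of Table 6
for combo in "2 20" "4 10" "8 5"; do
  set -- $combo                   # $1 = JOBS (-j), $2 = PROCESSORS (-c)
  sbatch xdsapp.script "$1" "$2"  # xdsapp.script must pass $1/$2 to -j/-c
done
```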
cores = JOBS × PROCESSORS | NSC Triolith runtime | LUNARC Aurora runtime |
---|---|---|
16 = 1 × 16 | 18 min 17 sec | not tested |
20 = 1 × 20 | not tested | 9 min 12 sec |
16 = 2 × 8 | 12 min 50 sec | not tested |
20 = 2 × 10 | not tested | 7 min 36 sec |
16 = 4 × 4 | 11 min 22 sec | not tested |
20 = 4 × 5 | not tested | 6 min 11 sec |
20 = 5 × 4 | not tested | 6 min 23 sec |
32 = 2 × 16 | 12 min 43 sec | not tested |
40 = 2 × 20 | not tested | 5 min 11 sec |
32 = 4 × 8 | 9 min 2 sec | not tested |
40 = 4 × 10 | not tested | 5 min 15 sec |
32 = 8 × 4 | 7 min 17 sec | not tested |
40 = 8 × 5 | not tested | 4 min 35 sec |
Table 6. The 1.3Å Eiger dataset from Dectris (EIGER_16M_Nov2015.tar.bz2) processed by XDSAPP via an sbatch script. For this dataset XDSAPP makes three XDS runs, and LUNARC Aurora is faster than NSC Triolith by roughly a factor of 2. The Dectris Neggia library was added in XDSAPP version 2.99 and is therefore always used when processing Eiger data with XDSAPP.