Optimizing NPAR and NSIM

Our VASP users at NSC sometimes ask how to set NSIM and NPAR for good parallel performance in VASP. I wrote about NPAR before. But what about the NSIM parameter? The VASP manual says that NSIM=4 by default. This means that 4 bands are optimized at the same time in the RMM-DIIS algorithm, which allows VASP to use matrix-matrix BLAS operations instead of matrix-vector operations.
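To make the matrix-matrix idea concrete, here is a toy sketch in plain Python. It is not VASP code: `matvec` and `apply_blocked` are hypothetical names, and the small list-of-lists "Hamiltonian" is only a stand-in. It shows that handling bands in blocks of `nsim` gives the same result as handling them one at a time. In a real BLAS library the blocked form tends to be faster because the matrix is reused for several bands while it is still in cache; plain Python shows the structure, not the speed.

```python
def matvec(H, v):
    """Apply matrix H to a single band vector v (matrix-vector product)."""
    return [sum(h * x for h, x in zip(row, v)) for row in H]


def apply_blocked(H, bands, nsim=4):
    """Apply H to all bands, nsim bands at a time (matrix-matrix products)."""
    out = []
    for start in range(0, len(bands), nsim):
        block = bands[start:start + nsim]  # up to nsim band vectors
        # One matrix-matrix product: each row of H meets every band in the block.
        result = [
            [sum(h * x for h, x in zip(row, band)) for band in block]
            for row in H
        ]
        # Columns of the result are the individual H*band vectors.
        out.extend(list(col) for col in zip(*result))
    return out
```

Calling `apply_blocked(H, bands, nsim=2)` returns the same vectors as `[matvec(H, b) for b in bands]`; only the grouping of the work changes.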

But the NSIM and NPAR parameters should be adjusted to the underlying hardware (network, type of processor, caches, etc.). Here are some results for the 24-atom PbSO4 cell running on a single Kappa compute node. Each bar in the chart below represents the average of three runs.

NSIM on Kappa for PbSO4

It looks like NPAR and NSIM are largely independent factors, with NPAR being the more important one. Varying NPAR can give a performance boost of up to about 50%, while varying NSIM gives about 10%. The run-to-run variability is less than 1% in this case, so the differences are real. We can conclude that NPAR=1 is optimal for a single-node job, as expected, and that NSIM=2 might be beneficial compared with the default NSIM=4.
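If you want to repeat this kind of comparison on your own cluster, the bookkeeping can be sketched as follows. The function names are my own, and the wall-clock timings are something you have to collect yourself (for example from repeated runs of the same job); nothing here reads VASP output.

```python
from statistics import mean


def summarize(timings):
    """Return (average time, relative spread) for repeated runs of one setting.

    The relative spread is (max - min) / average, a simple stand-in for
    run-to-run variability.
    """
    avg = mean(timings)
    spread = (max(timings) - min(timings)) / avg
    return avg, spread


def speedup_percent(baseline_time, candidate_time):
    """Percent speed gain of the candidate setting over the baseline."""
    return (baseline_time / candidate_time - 1.0) * 100.0
```

A difference between two settings is only worth acting on when the value from `speedup_percent` is clearly larger than the relative spread of the individual runs.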

A more realistic example is a 128-atom Li2FeSiO4 supercell. We ran this one on 4 nodes (32 cores) on Matter. It is a highly symmetric case with 512 bands. As before, three runs were made for each data point.

NSIM on Matter for Li128

We find the best performance for NPAR=2 or 4, in line with previous results. But here, the default NSIM=4 setting seems to produce the worst performance, and the influence of NSIM is larger (up to 20% faster). The optimal choice seems to be NSIM=16.

It is tempting to conjecture that NSIM should be increased even more for larger jobs. To investigate the upper limit of VASP jobs, let us look at the NiSi supercell with 504 atoms. It takes about 23 minutes to run a full SCF cycle on 32 Matter nodes.

The outcome is not so encouraging, however:

NSIM on Matter for NiSi

NSIM=16 does not deliver an increase in performance, and the influence of smaller NSIM values is dwarfed by other sources of measurement error. In this case, a likely culprit is the variation in MPI performance when running over many InfiniBand switches. So for large jobs, NSIM seems to make less difference. You can leave it at the default value.

In conclusion:

  • Use NPAR = 1 and NSIM = 2 for single-node jobs
  • Use NPAR = nodes/2 (or nodes) and NSIM = 2 for medium jobs. If you have enough bands per core and want to optimize, you can try NSIM = 8 or 16 and see if it helps.
  • Use NPAR = sqrt(nodes) and NSIM = 4 for large jobs.
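
The rules of thumb above can be written down as a small helper that prints the two INCAR lines. This is only a sketch under a few assumptions of mine: the function names are made up, the cut-off between "medium" and "large" jobs (`large_job_nodes=16`) is a guess since the tests above only cover 4 and 32 nodes, and snapping sqrt(nodes) to the nearest divisor of the node count is my own way of getting an integer NPAR.

```python
import math


def nearest_divisor(n, target):
    """Return the divisor of n closest to target (smaller one on ties)."""
    divisors = [d for d in range(1, n + 1) if n % d == 0]
    return min(divisors, key=lambda d: (abs(d - target), d))


def suggest_npar_nsim(nodes, large_job_nodes=16):
    """Return (NPAR, NSIM) following the rules of thumb listed above.

    large_job_nodes is an assumed boundary between medium and large jobs.
    """
    if nodes < 1:
        raise ValueError("nodes must be a positive integer")
    if nodes == 1:
        return 1, 2  # single-node job
    if nodes < large_job_nodes:
        return max(1, nodes // 2), 2  # medium job: NPAR = nodes/2
    # Large job: NPAR = sqrt(nodes), snapped to a divisor of nodes (assumption).
    return nearest_divisor(nodes, math.sqrt(nodes)), 4


def bands_per_core(nbands, cores):
    """Bands per core, e.g. 512 bands on 32 cores gives 16."""
    return nbands / cores


def incar_fragment(nodes, large_job_nodes=16):
    """Format the suggestion as two INCAR lines."""
    npar, nsim = suggest_npar_nsim(nodes, large_job_nodes)
    return f"NPAR = {npar}\nNSIM = {nsim}\n"
```

For the 4-node Li2FeSiO4 job this suggests NPAR = 2 and NSIM = 2, and for the 32-node NiSi job NPAR = 4 and NSIM = 4. Treat the output as a starting point and time a few alternatives, especially NSIM = 8 or 16 when `bands_per_core` is high.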