Understanding the Limits of VASP Parallel Scaling

In a previous post, Running big on Lindgren, we saw how VASP could be run with up to 2000 cores if a sufficiently big supercell was used. But where does this limit come from?

What happens is that the fraction of time that a program spends waiting for network communication grows as you increase the number of MPI ranks. When there are too many ranks, there is no gain anymore, and adding more cores only creates communication overhead. This is where our calculation “stops scaling” in everyday speak. In many cases, performance might even go down when you add more cores.

We can see this effect by measuring the amount of time spent in MPI calls. One tool that can do this is the mpiP library. MPIP is simple library that you need to link to your program. It will then intercept any MPI calls from your program and collect statistics. When increasing the number of ranks in a VASP calculation, it can look like this:

MPI communciations overhead in VASP

(The graph above was generated from runtimes on the Matter cluster at NSC, but the picture should be the same on Lindgren.) It is clear that this problem cannot be subdivided to arbitrary degree. With 256 and more cores, the individual chunk of work that each core has to do is so small that it takes longer time send to results to other cores than to actually calculate them.

Another way to look at it is to pinpoint where in the program the bad scaling arises. It is often a performance critical serial routine, or an MPI parallelized subroutine, which is heavily dependent on network performance. Conveniently, VASP has built-in timing libraries that measures contributions of several important subroutines in the program. As far as I know, the timings are accurate.

Timings of subroutines in VASP

This graph shows the share of the total CPU time spent in each subroutine. “24c” means 24 cpu cores, or one computer node in Lindgren. The maximum number of 1008 cores is equal to 42 compute nodes.

To understand this data, first note that each column is normalized to 100% runtime, because we are not interested in the runtime per se, but rather what happens to individual parts when we run a wide parallel job. Ideally, all the bars should be level. This is because if there is no (or constant) communication overhead, a certain subroutine will always requires the same amount of aggregated compute time, because the actual compute work (without communication) is the same regardless of how many cores we are running on. Instead, we see that some subroutines (EDDIAG/ORTHCH) increase their share of time as we run on more tanks. These are the ones that exhibit poor parallel scaling.

What does EDDIAG and ORTHCH do? They are involved in the so-called “sub-space” diagonalization step. ORTHCH does the LU decomposition and EDDIAG diagonalizes the Hamiltonian in a space spanned by the trial Kohn-Sham orbitals. This procedure is necessary because the RMM-DIIS algorithm in VASP does not give the exact orbitals, but rather a linear combination of the eigenfunctions with the lowest energy. The problem is that this procedure is usually trivial and fast for a small system (the matrix size is NBANDS x NBANDS), but it becomes a bottleneck for large cells.

There is, however, a way parallelize this operation with SCALAPACK, which VASP also employs, but it is well-known that the scalability of SCALAPACK in this sense is limited. It is also mentioned in the VASP manual in several places (link). We can see this effect in the graph, where the yellow fraction stays relatively flat up to 192 cores, but then starts to eat up computational time. So, for now, the key to running VASP in parallel on > 1000 cores is to have many bands. A more long-term solution would be to find an alternative to using SCALAPACK for the sub-space diagonalization.

ABINIT vs VASP Round 2

In the second round of ABINIT vs VASP comparison, I am using a bigger 128-atom supercell of Li2FeSiO4 with one k-point, as is customary for big supercells. Compared to the silicon supercell that was studied in round 1, this system is big enough to be able to scale well across a few nodes in a cluster. This will tell us something about the parallel scaling. These are the results using optimized settings for both VASP and ABINIT on the Matter cluster at NSC:

Speed ABINIT vs VASP for 128 atoms

The parallel scaling of ABINIT is quite remarkable. It is superlinear for 1-4 nodes, and with 16 nodes the efficiency is still above 100%. Clearly, ABINIT is superior to VASP in terms of parallel scaling.

So should we all run ABINIT and enjoy tremendous speed-ups? The astute reader will notice that I have played a trick and only showed the normalized results. The chart above does not show the actual speed, only the parallel scaling relative to a single node run. The reason is that I want to compare to ideal linear scaling. Comparing actual speed, however, reveals another picture:

Speed ABINIT vs VASP for 128 atoms

VASP is 4x faster than ABINIT, despite worse parallel scaling. It makes me wonder if I missed some setting in ABINIT that magically reduces all the data structures by half. In particular, I was expecting gamma-point only optimization options, but it does not seem to exist in ABINIT? So in order to make it at all comparable, I used the normal -DNGZhalf version of VASP. For actual calculations with the gamma-point only version, VASP would be even faster than shown here.

Parallelization settings in ABINIT

ABINIT can parallelize over k-points, spin, bands and FFTs. Here, it is just 1 k-point, and no spin polarization, so there are two influential parameters which affects parallelization: npbands and npfft. The general recommendation in the ABINIT manual is to use npbands = number of cores, and npftt=1 for runs of up to a few nodes, and to aim for npband >= 4*npfft for wide parallel runs. Just like NPAR needs to be optimized for VASP, different npband and npfft combinations should be tested for ABINIT. These are the values I found, and what I used to generate the comparison above:

Nodes Best combination Speed (Jobs/h)
1 npbands=8/npfft=1 0.48 |
2 npbands=8/npfft=2 1.30 |
4 npbands=8/npfft=4 3.00 |
8 npbands=8/npfft=8 5.32 |
16 npbands=8/npfft=16 10.13 |

The general rule from the manual does not hold exactly here. Using only band parallelization is optimal for a single node, but once we run across nodes in the network, we need to activate the FFT parallelization for best performance. The rule seems to be:

npbands = 8 (cores per node?)
npfft = number of compute nodes

The choice can be quite influential: with 16 nodes, npband=128/npfft=1 is almost three times slower. But please keep in mind that these settings may not be universal.

Running Big on Lindgren

It is often claimed that VASP cannot “scale” to more than 1000 cores, but this is not entirely correct, as can be seen in the graph below.

Scaling of NiSi 1200-atoms on Lindgren

What you see are results from Lindgren, the Cray XE6 system at PDC, running a full SCF cycle on a 1200-atom NiSi surface. With 384 cores, the job requires about one hour without using too aggressive settings. This is a respectable run-time by itself, but we can also see that we can rougly triple the speed by increasing the number cores up to the plateau value of ca 2300 cores. But we pay a price for this, which is spending almost 2x more core hours in total to get the job done. So if you ask about the upper limit for VASP in terms of number of cores, the answer is that it depends the size of simulation cell, and the loss of core hours that one is willing to accept to get results quickly.

For everyday calculations, my personal preference is to use the node count where the computational efficiency drops below 50%. In this case, gettting > 50% efficiency for more than say 16 compute nodes (128-384c) is usually a challenge, even for big simulations with many hundred atoms. A small investigation of differently sized supercells revealed the following relationship:

VASP weak scaling on Lindgren

Technically, this is an example of weak scaling, where a large simulation size is necessary to get good parallel scaling. Judging by this study, we find that there is indeed no reasonably sized simulation cell where VASP can be run on > 1000 cores without poor efficiency.

But for large one-shot jobs, where time to solution is important, a lower efficiency is acceptable. There are several reasons why you would want to do this:

  • You need the results fast due to a deadline, or the sequential nature of the study.
  • It is the only way to get your job through due to practical wall-time limits (typically less than one week at most supercomputing center).

ABINIT vs VASP Round 1

VASP is a popular implementation of the projector augmented wave method (“PAW”). The most important reason for using VASP is likely to get access to their PAW atomic data set library (“POTCARs”), but what about the speed? My first test system is a 31 atom silicon supercell (with one vacancy).

Speed ABINIT vs VASP for 31 atoms

Conclusion: My VASP calculation runs about 1.7x faster than ABINIT for this system. I was using one compute node (8 cores) on the Kappa cluster.

The real space optimization (“RSO”) scheme turned out to be a significant factor. It is not so efficient for this cell, which we can see by turning it off for VASP. In case of ABINIT, I am uncertain whether it was used or not – I find no mention of RSO in the manual, and the tutorials imply that this option has to be enabled when generating the atomic data file. Inspecting the file on the ABINIT website Si_GGA_input suggests it was not used. It is therefore reasonable to compare vs VASP with LREAL=.FALSE. (no real space evaluation).

Caveat: I am an experienced VASP user, but not so experienced with ABINIT. Still, I took great care to make sure that the settings are comparable, and the intention is to gain knowledge of ABINIT tuning while continuing this project.

For reference, the settings were:

Parameter VASP ABINIT
Plane-wave basis set cut-off 250 eV 250 eV |
Main FFT grid 723 723 |
Fine FFT grid 1443 1443 |
FFT algorithm Intel MKL Goedecker “401” |
Bands 80 80 |
Electrons 124 124 |
K-points 6 6 |
Parallelization NPAR=1 npband=8/npfft=1 |
SCF algorithm ALGO=Fast wftoptalg=14 |
SCF mixing default iscf=17 |
Mixing accuracy  LMAXMIX=2 pawlmix=2 |
Real space optimization varied no? |

I believe that the most significant difference was that the ABINIT calculation converged in 9 SCF steps vs 15 in VASP, presumably due to a better mixing scheme. On the other hand, being able to use FFTs from Intel’s MKL in VASP should improve the speed significantly (I have seen it give speed boosts of ca. 25% vs FFTW for smaller systems on Intel processors.)

Optimizing NPAR and NSIM

Our VASP users at NSC are sometimes asking about how to set NSIM and NPAR for good parallel performance in VASP. I wrote about NPAR before. But what about the NSIM parameter? The VASP manual says that NSIM=4 by default. It means that 4 bands are optimized in parallel in the RMM-DIIS algorithm, which allows VASP to exploit matrix-matrix BLAS operations instead of matrix-vector operations.

But the NSIM/NPAR parameters should be adjusted based on actual underlying hardware (network, typ of processor, caches etc). Here are some results for the 24-atom PbSO4 cell running on a single Kappa compute node. Each bar in the chart below represents the average of three runs.

NSIM on Kappa for PbSO4

It looks like NPAR and NSIM are largely indepedent factors, with NPAR being the most important one. Varying NPAR can give a performance boost of up to ca 50%, while varying NSIM gives about 10%. The internal variability between runs is less than 1% in this case, so the differences are real. We can conclude that NPAR=1 is optimal for a single-node job, as expected, and that NSIM=2 might be beneficial, instead of keeping the default NSIM.

A more realistic example is a 128-atom Li2FeSiO4 supercell. This one we run on 4 nodes (32 cores) on Matter. It is a highly symmetric case with 512 bands. Like before, 3 runs were made for each data point.

NSIM on Matter for Li128

We find the best performance for NPAR=2/4, in line with previous results. But here, the default NSIM=4 setting seems to produce the worst performance, and the influence of NSIM is higher (up to +20% speed). The optimal choice seems to be NSIM=16.

It is tempting to conjecture that NSIM should be increased even more for larger jobs. To investigate the upper limit of VASP jobs, let us look at the NiSi supercell with 504-atoms. It takes about 23 minutes to run a full SCF cycle on 32 Matter nodes.

The outcome is not so encouraging, however:

NSIM on Matter for NiSi

NSIM=16 does not deliver an increase in performance, and the influence of smaller NSIM values is dwarfed by other measurement errors. In this case, a likely culprit is the variation in MPI performance when running over many Infiniband switches. So for large jobs, NSIM seems to make less difference. You can leave it at default value.

In conclusion:

  • Use NPAR = 1 and NSIM = 2 for single-node jobs
  • Use NPAR = nodes/2 (or nodes) and NSIM=2 for medium jobs. If you have enough bands per core and want to optimize, you can try NSIM=8/16 and see if it helps.
  • Use NPAR = sqrt(nodes) and NSIM=4 for large jobs.

VASP and the NPAR Setting

It is very important to set a value of NPAR explicitly when running MPI-parallel calculations with VASP. VASP’s default choice is NPAR=number of MPI ranks (or cores), which is generally a terrible choice on today’s clusters. You can get at least 30% higher speed by specifying a lower NPAR value in the INCAR. At NSC, we have even seen cases where using NPAR=cores produces incorrect results.

What does NPAR do? You should check the section about NPAR in VASP manual. Basically, it controls how the parallelization over bands is done.

Below is an example from NSC’s Mozart, an SGI Altix 3700 BX2 shared memory machine, running a 24-atom PbSO4 cell:

NPAR on Mozart

Second example is an FeCo (Wairauite) supercell with 53 atoms on NSC’s Neolith cluster:

NPAR on Neolith

In general, the influence of NPAR increases as you run wider parallel jobs. For big jobs, a bad NPAR choice might kill performance. In the example above, you could get the same speed running on 32 cores with good NPAR value, as you could on 64 cores using the default value.

So far, I have made studies of the optimal value of NPAR for the following SNIC resources: Neolith, Mozart, Matter, and PDC’s Lindgren.

Mozart

NSC’s Mozart will soon be obsolete and replaced by new fat nodes in the Kappa cluster, but here are the NPAR recommendations:

  • 1 core & NPAR = 1
  • 8 cores & NPAR = 2
  • 12 cores & NPAR = 3
  • 16 cores & NPAR = 2
  • 24 cores & NPAR = 4
  • 32 cores & NPAR = 4

Neolith & Matter clusters

Neolith and Matter are clusters with compute nodes having dual socket Xeon processors (8 cores in total) connected with Infiniband. I expect the results the apply for other similar systems as well (such as NSC’s Kappa and UPPMAX’s Kalkyl).

The formula is roughly:

NPAR = number of compute nodes / 2.

Some examples:

  • 1 node & NPAR = 1
  • 2 nodes & NPAR = 1
  • 4 nodes & NPAR = 2
  • 8 nodes & NPAR = 4
  • 16 nodes & NPAR = 8
  • 24 nodes & NPAR = 16
  • 32 nodes & NPAR = 16

For more than 32 nodes, keep NPAR at 16.

Lindgren (Cray XE6)

The Cray XE6 has a different interconnect, and also fatter nodes with four sockets. The relationship I found there is:

NPAR = number of compute nodes

Note that one compute node corresponds to 24 cores here. So for a 96 core batch job, you would be using 4 nodes and NPAR = 4.

NPAR and NBANDS

Another thing which is confusing about NPAR is that the value of NPAR will influence the number of bands used in your VASP calculation, unless you hard-code the number of bands (NBANDS tag) in the INCAR. The reason being that VASP tries to make sure that there is an equal number of bands on each node, when parallelizing over bands. Thus, one can unknowingly perform calculations with different number of bands when changing the number of cores in the job script. So, once again, set NPAR and NBANDS in the INCAR file. Alternatively, make sure that you have converged your calculation with respect to the number of bands. VASP guesses the number bands with quite a large safety margin, but it is not always enough!

How to Compile VASP

Please note that this is an old guide and not completely up-to-date. I am keeping it around mostly for reference. For up-to-date information, you should check one of the system-specific compile instructions and the VASP compiler status page.

Introduction

VASP is distributed as source code and everyone that works with VASP will have to rebuild the program from scratch sooner or later. Typical situations are: moving to a new supercomputer or cluster, activating certain nonstandard features of VASP, and performance optimization. This working paper collects some experiences of doing this on different computing platforms.

Prerequisites

To compile VASP, one needs a Unix-compatible system with:


  • Standard Unix/Linux tools: make, tar, gunzip, cpp etc, i.e. you should be able to compile source tarballs from scratch. Most installations should support this. The exceptions might be “consumer” installations of some Linux distributions, and Mac OS X (one must install the Xcode development environment separately).
  • A Fortran90-capable compiler (ifort, PGF, Pathscale, IBM XLF, gcc/gfortran have worked one time or another).
  • The MPI parallel version requires an MPI installation, or at least the MPI development libraries and headers.
  • BLAS and LAPACK libraries. An FFT-library is optional, because a decent one is included with VASP.

Getting and preparing the source code

The source of VASP comes in two files (“tarballs”) named something along the lines of: vasp.4.6.tar.gz, vasp.4.lib.gz. Usually one acquires these files from the official FTP server of the VASP Group.

[~/vasp]$ tar zxvf vasp.4.6.tar.gz
.... 
[~/vasp]$ tar zxvf vasp.4.lib.tar.gz
....
[~/vasp]$ ls
vasp.4.6  vasp.4.6.tar.gz  vasp.4.lib  vasp.4.lib.tar.gz

The folder vasp.4.6 contains the main source code of VASP, and vasp.4.lib contains a library with some subroutines necessary for compatibility. You can also find source files for lapack and linpack inside there.

Possible issue: Many VASP source files are meant to be preprocessed by cpp, they are named xxx.F, as opposed to the normal Fortran files that are named xxx.f. Obviously, this requires a case sensitive file system to work. This might be an issue on some Macs and old Unix systems (like OpenVMS). There is a straightforward workaround: edit the makefile to use .f90 instead of .f for preprocessed Fortran source files.

Building the VASP 4 library (libdmy)

Generally, the VASP 4 library builds without difficulties in most installations. There might be some complains about ranlib, but they are harmless. Create a new makefile by copying one of the template makefiles. They are called “makefile.something”. Use one that would seem to fit your computer system (makefile.linux for GNU/Linux etc).

[~/vasp]$ cd vasp.4.lib
[~/vasp/vasp.4.lib]$ cp makefile.linux makefile
[~/vasp/vasp.4.lib]$ nano makefile
[~/vasp/vasp.4.lib]$ make

Edit the lines containing FC and FFLAGS. These are for telling what Fortran compiler to use and which flags. Like this for the Intel compiler:

FC = ifort
...
FFLAGS = -O1
FREE = -free

or like this for gfortran:

FC = gfortran
...
FFLAGS = -O3
FREE = -ffree-form

In general, you would like to use a vendor supplied BLAS and LAPACK, so remove the lapack_double.o part from the makefile.

libdmy.a: $(DOBJ) lapack_double.o linpack_double.o
   -rm libdmy.a
   ar vq libdmy.a $(DOBJ)

When make has been run, you should be able to find a file called “libdmy.a” in this directory, if everything worked out ok. This library will later be used to build the vasp executable.

First build of VASP 4.6: basic version

The first aim is to manage to compile a version of VASP at all, i.e. no compilation errors, it should link correctly and run simple tests without crashing.

Finding the external libraries to use

VASP needs fast BLAS and LAPACK libraries. It is possible to download the source for these libraries from netlib.org and compile your own versions, but this is not the recommended way – they will be too slow. Instead one should use a library optimized for the particular hardware platform. The common choices are ACML (for AMD Opteron), MKL (for Intel cpu:s), GotoBLAS (multi-platform), ATLAS (multi-platform), ESSL (for IBM/PowerPC), Accelerate Framework (Mac OS X; based on ATLAS). You need to determine which one to use (see further notes below under Performance Optimization) and locate where it is installed in your computer system. If you are running at a supercomputer center, there is usually some kind of web documentation describing what is installed and where. If not, you should call support and/or look for libraries in standard places. These are: in the standard system directories for libraries, inside the compiler distribution, or in the site’s local software director.

Some examples below:

  • Standard directories /usr/lib, /usr/local/lib, /lib
  • Inside compiler distribution /opt/intel/mkl/10.0.3.020/lib/em64t, /sw/pgi/linux86-64/7.1/lib
  • Local software /opt/gotoblas2, /sw/numlib/atlas, /software/acml4.0.1, /usr/local/lib

Once a library has been found, consult its documentation for how to use it. Sometimes several files need to be included for everything to work, and the linking order is very important. Especially Intel’s MKL is infamous in this matter. They even have a interactive web page for assisting you with linking MKL. It can also be a good idea to inspect the makefiles that come with VASP for clues on how to use the different BLAS and LAPACK libraries. E.g.: what have people used for ATLAS?

[~/vasp]$ cd vasp.4.6
[~/vasp/vasp.4.6]$ ls makefile.*

makefile.altrix           makefile.hp            makefile.linux_efc_itanium  
makefile.linux_ifc_P4     makefile.rs6000        makefile.t3d
makefile.cray             makefile.hpux_itanium  makefile.linux_gfortran
...
[~/vasp/vasp.4.6]$ grep ATLAS makefile.linux_gfortran
#ATLASHOME= /usr/lib/blas/threaded-atlas
#BLAS=   -L$(ATLASHOME)  -lf77blas -latlas
#BLAS= $(ATLASHOME)/libf77blas.a $(ATLASHOME)/libatlas.a

This tells us that we will probably need to include two libraries for ATLAS to work: f77blas and atlas. (In fact, we also need to include ../vasp.4.lib/lapack_atlas.o)

The structure of the main VASP makefile

Most of the makefiles supplied with VASP have a special structure that is good to be familiar with when editing. In most cases, it is enough to copy an existing makefile and modify it, instead of writing your own from scratch. The general structure is:

  • Comments with (sometimes) relevant info. Not all of the statements about certain libraries are up-to-date. Read for information and tips, but exercise caution.
  • Settings for the serial version and libraries
  • Settings for the parallel version (usually disabled by commenting)
  • Main build part with specification of object files etc.
  • Special rules (not always present). This section specify rules for compiling some of the source files with lower (or higher) optimization. Very aggressive optimization will break or degrade performance of some subroutines – these source files must be included in this section with different compiler flags.


The makefiles are laid out in a way that you will be able to use the same makefile for building both the serial and the parallel version. The idea is that first you make the appropriate changes in the serial section, build the serial executable, and then uncomment the lines in the parallel section (without changing the serial section), edit them, and build the parallel executable.

Basic settings

Again, start from the top and look for the lines containing FC and FFLAGS. This time, we will compile with no optimizations. The reason is that we want to exclude compiler optimizations as a source of error in case VASP does not work. It also makes compilation faster during the testing stage, which save time, because you may have to recompile everything several times before everything works. The following shows a relatively safe compilation mode for Intel Fortran compiler:

FC = ifort
FCL = $(FC)
...
FFLAGS = -free -fpe0 -fp-model strict -fpstkchk
...
OFLAG = -O0

The FFLAGS statements enforce strict floating point arithmetics and checks. The code will run slower, but at this stage we are only interested in getting it running at all. The OFLAG statement tells the compiler to not do any compiler optimizations. Other Fortran compilers have similar settings (inspect manual pages and documentation). If there are no flags for controlling floating point precision, at least specify the lowest optimization level. (Note: here it is very tempting to enable all sorts of compiler checks, such as array bounds checking. Doing this, however, may break VASP, because the source has not always been well-behaved in these aspects of sound programming).

Features – preprocessor flags

The next step is to review and edit the preprocessor flags. They are used to customize your VASP program for different platforms and job types. Look through the makefile for lines with “CPP = “ looking like this:

#-----------------------------------------------------------------------
# possible options for CPP:
# NGXhalf             charge density   reduced in X direction
# wNGXhalf            gamma point only reduced in X direction
# avoidalloc          avoid ALLOCATE if possible
# IFC                 work around some IFC bugs ugs
# CACHE_SIZE          1000 for PII,PIII, 5000 for Athlon, 8000 P4
# RPROMU_DGEMV        use DGEMV instead of DGEMM in RPRO (usually  faster)
# RACCMU_DGEMV        use DGEMV instead of DGEMM in RACC (faster on P4)
#-----------------------------------------------------------------------

CPP     = $(CPP_)  -DHOST=\"MacXLF\" \
-Dkind8 -DNGXhalf -DCACHE_SIZE=2000 -Davoidalloc

There is often a commented (non-exhaustive) section above with some explanations of different choices. Consult the VASP manual for full descriptions. The syntax for each instruction is “-DKeyword”. Many of the options available here are advanced features and do not need to be used unless you are optimizing for maximum performance (see later section on optimization), but two of the settings are very important: NGXHALF and CACHE_SIZE. Enabling NGXhalf will reduce the memory requirements when running ordinary collinear calculations (i.e. no spin-orbit coupling). The option wNGXhalf reduces memory requirements even further, but only works for gamma point-only calculations, while NGXhalf works for arbitrary number of k-points. The standard way to compile VASP is to compile three different versions: one gamma-point only (with -DwNGXhalf and -DNGXhalf), one “half” version (with -DNGXhalf), and a “full” version for non-collinear calculations. Use the gamma-point only version for large supercells and molecules with only 1 k-point. It is roughly twice as fast and uses half the memory.

The second option that should be set is CACHE_SIZE. It sets the size of the cache memory used for calculating FFTs with Jürgen Furthmüller’s library. The size is given in number of 16-byte units (i.e. Fortran DOUBLE COMPLEX). A good starting guess for this number is to adapt the CACHE_SIZE to the size of the processor’s L1 data cache. (For optimal performance, one needs to make test runs). L1 cache information can be found by googling or visiting the processor manufacturers’ webpage (http://www.intel.com, http://www.amd.com, etc). Typical values range between CACHE_SIZE=1000 to 8000 (16KB to 128KB).

It is not necessary, but a nice gesture, to set the HOST flag to something else then what is already written in the makefile you use as template. VASP writes this information in the top of all OUTCAR files so that you can see which executable that was running. This applies especially if it contains someone else’s name. Put your own instead! Other useful information that can be put in the HOST flag is hostname and compiler used.

Compiling the serial version

It is now time to trying compiling using the makefile. Make sure that search paths for compilers and libraries are set up if it is necessary (e.g. PATH and LD_LIBRARY_PATH environment variables). Many compiler distributions include scripts to do this automatically.


[~/vasp/vasp.4.6]$ make
./preprocess <base.F | /usr/bin/cpp -P -C -traditional >base.f90 -DHOST=\"DarwinGCC\" -Dkind8 -DNGXhalf -DCACHE_SIZE=2000 -DGfortran -Davoidalloc -DRPROMU_DGEMV  
gfortran -ffree-form -ffree-line-length-none  -O0  -c base.f90
./preprocess <mpi.F | /usr/bin/cpp -P -C -traditional >mpi.f90 -DHOST=\"DarwinGCC\" -Dkind8 -DNGXhalf -DCACHE_SIZE=2000 -DGfortran -Davoidalloc -DRPROMU_DGEMV  
gfortran -ffree-form -ffree-line-length-none  -O0  -c mpi.f90
./preprocess <smart_allocate.F | /usr/bin/cpp -P -C -traditional >smart_allocate.f90 -DHOST=\"DarwinGCC\" -Dkind8 -DNGXhalf -DCACHE_SIZE=2000 -DGfortran -Davoidalloc -DRPROMU_DGEMV  

The building process will spew out lines looking like above (This one is from compiling VASP on with gfortan). The first line sends the .F files for preprocessing through cpp to generate a .f90 file. The second compiles the .f90 file with gfortran. It is repeated for all source files. If the compilation and linking finishes without errors, you should see last line that links all object files (*.o) together to form the executable:

gfortran -o vasp  main.o base.o mpi.o smart_allocate.o xml.o constant.o jacobi.o main_mpi.o  scala.o
asa.o lattice.o poscar.o ini.o setex.o radial.o pseudo.o   mgrid.o mkpoints.o wave.o wave_mpi.o
symmetry.o symlib.o lattlib.o random.o nonl.o nonlr.o dfast.o choleski2.o mix.o charge.o xcgrad.o
xcspin.o potex1.o potex2.o metagga.o  constrmag.o pot.o cl_shift.o force.o dos.o elf.o tet.o hamil.o
steep.o chain.o dyna.o relativistic.o LDApU.o sphpro.o paw.o us.o ebs.o wavpre.o wavpre_noio.o broyden.o
dynbr.o rmm-diis.o reader.o writer.o tutor.o xml_writer.o brent.o stufak.o fileio.o opergrid.o stepver.o
dipol.o xclib.o chgloc.o subrot.o optreal.o davidson.o edtest.o electron.o shm.o pardens.o  
paircorrection.o optics.o  constr_cell_relax.o stm.o finite_diff.o elpol.o setlocalpp.o aedens.o diolib.o
dlexlib.o drdatab.o fft3dfurth.o fft3dlib.o -L../vasp.4.lib -ldmy ../vasp.4.lib/linpack_double.o
-L/System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A -lblas -llapack
[~/vasp/vasp.4.6]$  

You should now be able to find the executable file called “vasp” in the directory. It is a good idea to invent a scheme and rename this file so that you can keep different builds apart.
Personally, I like to use the scheme [vasp]-[serial/mpi]-[build][XY]-[gamma/half/full]

[~/vasp/vasp.4.6]$ ls vasp
vasp
[~/vasp/vasp.4.6]$ mv vasp vasp-serial-build01-half
[~/vasp/vasp.4.6]$ cp makefile makefile.build01

This way, you can try several builds and have the binaries and the associated makefiles preserved.

Compiling the MPI parallel version

An MPI (Message Passing Interface) implementation needs to be installed on the computer system in order to compile (and run) the parallel version of VASP. Again, if you are using a supercomputer center, check if and where their MPI is installed. If different implementations are available, one is usually the recommended one. Nowadays, the popular implementations are “OpenMPI” (open source), “Intel MPI” (commercial), and “Scali MPI” (commercial). Special systems, like Cray’s may come with their own MPI library. On older systems, you might have some version of either “MPICH” or “LAM”. Beware that sometimes the installed MPI is compiled with Fortran77, which will creating linking errors related to underscores with VASP (which is compiled with Fortran90). Having an MPI compiled with Fortran90 makes life easier from the beginning.

To adapt the makefile for parallel compilation, look for the MPI section. It should be clearly marked in the makefile.

#=======================================================================
# MPI section, uncomment the following lines
#

We now need to uncomment and make changes to the lines containing FC, FCL, CPP and FFT3D. The idea in many of the makefiles provided with VASP is that these lines will overwrite the previous ones in the serial section. If there is no MPI section, make a copy of the serial makefile and edit the same lines. For OpenMPI, one should normally use the mpif90 Fortran compiler wrapper:


FC = mpif90
FCL = $(FC)
...
CPP     = $(CPP_) -DMPI -DHOST=\"DarwinIntel\" \
          -Dkind8 -DNGZhalf -DCACHE_SIZE=5000 -Davoidalloc
...
# FFT: fftmpi.o with fft3dlib of Juergen Furtmueller
FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o

Note the addition of the “-DMPI” flag, and also that we should use “NGZhalf” instead of “NGXhalf”. In the FFT3D line, we changed from using the serial version of the FFT library to the MPI parallel one. There is usually no need to explicitly add the MPI libraries to the linking options, since we already use the mpif90 wrapper as the linking command, which will take care of this part automatically.

Just like before, we issue the “make” command, after cleaning up old object files. Example output from building parallel version on Mac OS X with gfortran 4.3.0 and openmpi 1.2.6:

[~/vasp/vasp.4.6]$ make clean
rm -f *.g *.f90 *.o *.L *.mod; touch *.F
[~/vasp/vasp.4.6]$ make
./preprocess <base.F | /usr/bin/cpp -P -C -traditional >base.f90 -DMPI -DHOST=\"DarwinGCC\" -Dkind8 -DNGZhalf -DCACHE_SIZE=2000 -DGfortran -Davoidalloc -DRPROMU_DGEMV
mpif90 -ffree-form -ffree-line-length-none  -O0  -c base.f90
./preprocess <mpi.F | /usr/bin/cpp -P -C -traditional >mpi.f90 -DMPI -DHOST=\"DarwinGCC\" -Dkind8 -DNGZhalf -DCACHE_SIZE=2000 -DGfortran -Davoidalloc -DRPROMU_DGEMV
mpif90 -ffree-form -ffree-line-length-none  -O0  -c mpi.f90
./preprocess <smart_allocate.F | /usr/bin/cpp -P -C -traditional >smart_allocate.f90 -DMPI -DHOST=\"DarwinGCC\" -Dkind8 -DNGZhalf -DCACHE_SIZE=2000 -DGfortran -Davoidalloc -DRPROMU_DGEMV
...

The executable file is called “vasp” as usual. We rename it, to keep it apart from the serial version.


[~/vasp/vasp.4.6]$ ls vasp
vasp
[~/vasp/vasp.4.6]$ mv vasp vasp-mpi-build02-half

To run VASP in parallel mode, you need to launch the executable using a special run command for MPI. For OpenMPI, it is called mpirun.

[~/vasp/vasptest/Tests/Si-cd]$ mpirun -np 4 ~/vasp/vasp.4.6/vasp-mpi-build02-half

The command above would run VASP on 4 cpu cores.

Testing the executables

Once you have compiled your first versions of VASP, it is time to test your compilations and verify the correctness of your executables. This is an important step. It is not good scientific practice to start running production calculations without having any idea if your “scientific instrument” is calibrated and is working correctly. Unfortunately, validating VASP is difficult because the source code (unlike many other ab initio programs) does not include either regression tests or functional tests. I consider this one of the weakest point of VASP. Possible ways to test your VASP installation are:

  • Assuming that you trust your current production system, do regression testing with some previous calculations and compare output, especially things like total energies, forces and number of SCF iterations. Make sure that the tests cover features you would like to use, like different choices of ALGO, ISMEAR and LDA+U.
  • Do functional testing by calculating some standard systems like Si(cd), Fe(bcc) etc, and compare with published values from experiments and other ab initio calculations. Make sure the calculations are converged well. If the values are close to what could be reasonably expected from DFT, chances are high that the program is working.

Save the test results you get, because you will need to repeat the tests and compare when doing performance optimization in the next chapter. The goal is to have the optimized version reproduce the same output as our safe compilation.

Note: I am currently making an automatic test suite for VASP (Link: here). It is yet far from comprehensive, but a passing mark should ensure, at least, basic operation of VASP.

Optimized build of VASP 4.6

Compiler optimizations and flags

It is tempting to spend a lot of time tinkering with different compiler optimization settings. This, however, is seldom time well spent, as the table below shows. Even the most basic optimization level gives almost all possible performance improvement (> 90%) from compiler optimizations. Later, we will learn to reason why. Many compilers have a flag for generating as fast code as possible, usually called “-fast” or similar. Otherwise, just try “-O” with the highest available level. You are unlikely to improve much upon these settings. Example makefile settings using Portland Group compiler, targeting AMD Opteron (rev E):


FC = pgf90
FCL = $(FC)
...
FFLAGS = -Mfree
...
OFLAG = -tp k8-64e -fast

Table 1. 10 SCF iterations of Fe13 icosahedral cluster (gamma-point only) using VASP executables compiled with gfortran 4.3.0 (2008-01-25) and different optimization settings. Running 2 MPI processes on 1.83 GHz Intel Core 2 Duo:

Optimization level

Time

Speed

-O0 (no optimization)

151 s

1

-O1

107 s

1.41

-O2

106 s

1.42

-Os (-O2 but size restrictive)

106 s

1.42

-O3

105 s

1.44

-O3 -ffast_math (unsafe)

104 s

1.45

In some cases, higher optimization will actually reduce performance or lead to numerical instability, like the example below:

Table 2. Same calculation as in Table 1. Intel Fortran Compiler 10 running with 2 MPI processes on 2.33 GHz Intel Xeon E5345 processor.

Optimization level

Time

Speed

-O0

102 s

1

-O1

33 s

3.09

-O2

unstable

 

-O1 / -O2 (some files)

38 s

2.68

-O3

unstable

 

It is possible to circumvent this problem by putting special rules in the end of the makefile. They will enforce lower (or higher) optimization when compiling certain source files, to avoid breaking the sensitive source files. Many of the makefiles supplied with VASP have special rules of this kind. Look in the end of a makefile for the special rules section:


# special rules
# these special rules are cumulative (that is once failed in one
# compiler version, stays in the list forever)
#-----------------------------------------------------------------------

fft3dlib.o : fft3dlib.F
    $(CPP)
    $(F77) -FR -lowercase -O1 -c $*$(SUFFIX)
    fft3dfurth.o : fft3dfurth.F
    $(CPP)
    $(F77) -FR -lowercase -O1 -c $*$(SUFFIX)

    ...

Here, “ff3dlib.F” and “fft3dfurth.F” and some other files are compiled with lower optimization (-O1), since -O2 and -O3 give worse performance, just like above. Again, since no regression tests are provided with VASP, pinpointing and diagnosing these kinds of compiler errors are very difficult. Sometimes, errors are obviously related to a single source file (e.g. corrupt XML output, try reducing optimization on xml.F and xml_writer.F), which you can then put in the special section for less aggressive optimization, but usually it is not that easy. You can use the rules in the existing makefile as clues before trying a brute force search (but don’t test one by one, use a divide-and-conquer approach).

Optimizing FFT “CACHE_SIZE”

As mentioned previously, the preprocessor option “-DCACHE_SIZE” influences the data structures used for computing fast Fourier transforms. My experience is that the best value is usually the one that is close to the L1 data cache of the CPU used. The chart below shows VASP run times using different cache size values, tested on an Intel Xeon E5345 processor with 32 KB of L1 data cache per core. Based on that value, we would use the initial guess of CACHE_SIZE=2000 (2000*16 bytes ≈ 32 KB). This choice give almost optimal performance, but 3000 and 4000 are marginally faster. AMD Opterons have 64KB of L1 cache per core, and there I, and others, have found CACHE_SIZE=5000 to be the optimal choice. The most important thing is to not settle with CACHE_SIZE=0 just because you don’t know what value to use (as is suggested in the comments of the source code). At least test a couple of values like 0, 2000 and 5000.

Speed vs CACHE_SIZE

Collecting data and profiling

“Premature optimization is the root of all evil” says the well-known proverb by Donald Knuth. Before we start to optimize our VASP program, we should know exactly in which subroutines the program spends its time, i.e what parts are in need of optimization. The best way to do this is to profile VASP running on your actual production system with a profiling program. Any decent supercomputing center will have such tools installed somewhere. If you are compiling a version for your laptop or workstation, you can use the GNU Profiler (“gprof”) or some vendor specific tool like Apple’s “Shark” profiler if you are on a Mac OS X system. As usual, it is good to profile both the optimized and unoptimized versions to get the complete picture.

Below is shown the output from the Shark profiler when running an optimized version of VASP compiled with IBM XL Fortran (PowerPC architecture). The FFT library provided with VASP is used, together with the BLAS/LAPACK library provided with Mac OS X (based on ATLAS). Heavily used subroutines are listed in the top of the window. In the top, we mostly find calls to BLAS (libBLAS.dylib) and the FFT routines (ipassm, fpassm). They account for more than 80% of the runtime in this (very short) calculation. This is the explanation why we could not improve the speed of the code that much using compiler optimizations. The BLAS library is external to the VASP source code and is hopefully already optimized, and the FFT source code is not helped by aggressive compiler optimizations (-O1 or similar is enough). From this we draw the conclusion the performance of VASP is mainly determined by the choice of BLAS/LAPACK and FFT libraries. (This is of course also stated in the VASP manual). Which library that is most important depends on the exact system architecture and the size of the calculation that VASP is running. Usually, RMM-DIIS (which employs BLAS calls) will be the dominant subroutine in larger production calculations. There is a good discussion of scaling behavior in one of the original VASP papers (Kresse & Furthmüller, Phys. Rev. B 54, 1169).

Profiling run of VASP

BLAS & LAPACK

The easiest way to improve your VASP performance is to choose and test several BLAS/LAPACK installations. The general recommendations for best speed are:

  • Intel Core2/Nehalem: Intel MKL
  • AMD Opteron: Intel MKL (sic!)
  • POWER/PowerPC: IBM ESSL

MKL and ESSL are commercial products. The free, but not as fast, alternatives are GotoBLAS2 (no longer maintained?), ATLAS and ACML. You can usually make VASP work with all of these libraries, but I recommend staying away from ATLAS “developer” versions.

Below are some results from the Matter cluster at NSC (Xeon E5520 processors) running a medium sized 53-atom test job with VASP versions linked to different BLAS/LAPACK libraries (7-9 runs per data set):

Run times with MKL, GotoBLAS2, and ACML

Here, the difference is quite dramatic. Using Intel MKL (including FFTs from MKL) produces a speed-up of about 30% compared to the slowest combination, with most of the speed gain coming from using the MKL FFTs. In isolation, ACML and GotoBLAS2 are both about 5% slower than MKL when compared using the same FFT library.

But sometimes, the results are not as conclusive. In the chart below, VASP was compiled on the (now retired) Isis cluster at UPPMAX with different numerical libraries. Isis had AMD Opteron rev. E processors. As above, 7 runs of each combination. The only statistically significant difference is that GOTO & ATLAS is faster than ACML, No statistically significant difference between FFTW and the built-in FFT library was found. Probably, Intel MKL should have been tested here as well, because recent tests on PDC’s Lindgren produced faster binaries with Intel MKL than the supplied Cray math library.

Run times on Isis cluster with different math libraries

Caveat Do not make generalized conclusions from the numbers published here. The relative performance, and especially stability, of different math libraries often changes over time with new versions or new CPU:s. You really need to test on each new cluster. While big speed gains are sometimes possible, such as using Intel MKL on Intel processors, it must be emphasized that all BLAS/LAPACK libraries have been highly tuned by now, sometimes down to the point where one has to use statistical tests to able to distinguish test results. Still, 1-2% may translate into thousands of dollars in saved computer time for a large cluster installation.

FFT libraries

VASP uses fast Fourier transforms to switch between real space and reciprocal space. VASP has a built-in FFT library (“FURTH”) written by Jürgen Furthmüller which is very competitive with other libraries. The other choice is FFTW, or any other FFT library that can be called using the same interface (such as the FFT routines inside MKL).

The choice of FFT library is controlled by the FFT3D variable in the makefiles:

# FFT: fftmpi.o with fft3dlib of Juergen Furthmueller
FFT3D = fftmpi.o fftmpi_map.o fft3dlib.o

To switch to using FFTW, write instead


FFT3D = fftmpiw.o fftmpi_map.o fft3dlib.o /(path to FFTW library)/libfftw3.a

Note that you are using a different object file called “fftmpiw.o”. As usual, one needs to locate where FFTW is installed in the current computer system.

On machines with older Intel processors (P4/Core2 era), using FFTW gave a statistically significant increase in speed of about 1% on Intel systems, but on modern systems (Nehalem) FURTH is faster by ca 9%, when compared to FFTW 3.2.1. But as seen in the chart, using Intel’s FFT:s inside MKL is by far the best choice for speed.

No statistically significant difference was found on older Opteron systems.

For the Cray XE6 at PDC (with Opteron Magny-Cours chips), Intel MKL FFTs and FFTW are almost tied (FFTW ca 0.5% faster), and FURTH library is about 7% slower when compiled with Intel Fortran compiler.