In a previous post, “How accurate are different DFT codes?”, I looked at the preliminary outcome of a study comparing the accuracy and precision of various DFT codes using a single measure called the “delta gauge”. A comprehensive study using this method has now been published in the journal Science. It features a long author list, as the work is a joint effort by many of the big research groups that employ or develop DFT.
The main purpose of the paper is to show that today’s DFT calculations are precise and also reasonably accurate. Given sufficient (sometimes very high) convergence settings, DFT calculations performed using different software implementations do in fact arrive at the same answer. There is a certain error margin, but it is shown to be comparable to experimental uncertainties.
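For readers who have not encountered the measure, the idea behind the delta gauge can be sketched numerically. The published gauge compares Birch-Murnaghan equation-of-state fits over a fixed volume interval per element; the simplified sketch below (the function name and toy data are mine, not from the paper) just integrates the squared difference between two raw energy-volume curves:

```python
import numpy as np

def delta_gauge(volumes, energies_a, energies_b):
    """Root-mean-square difference between two energy-volume curves,
    in the spirit of the delta gauge (simplified: no equation-of-state
    fit). Curves are aligned at their minima, since only the shape
    matters, not the absolute energy zero."""
    v = np.asarray(volumes, dtype=float)
    ea = np.asarray(energies_a) - np.min(energies_a)
    eb = np.asarray(energies_b) - np.min(energies_b)
    diff2 = (ea - eb) ** 2
    # trapezoidal integral of (E_a - E_b)^2 over V ...
    integral = np.sum(0.5 * (diff2[1:] + diff2[:-1]) * np.diff(v))
    # ... normalized by the volume range, then square-rooted
    return np.sqrt(integral / (v[-1] - v[0]))

# toy example: two slightly different parabolic E(V) curves
v = np.linspace(15.0, 20.0, 101)       # volumes, e.g. in Å^3/atom
e_code_a = 0.10 * (v - 17.5) ** 2      # energies, e.g. in eV/atom
e_code_b = 0.11 * (v - 17.5) ** 2
print(delta_gauge(v, e_code_a, e_code_a))  # identical curves -> 0.0
print(delta_gauge(v, e_code_a, e_code_b) > 0.0)  # -> True
```

Two codes that trace out nearly identical energy-volume curves thus get a delta value near zero, regardless of any constant offset between their total energies.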
My observations from a quick read of the paper are:
It confirms my old hypothesis that high-quality PAW calculations are as precise as all-electron calculations in practice. The delta gauge values for the best all-electron codes (LAPW methods) are 0.5-0.6 meV/atom, which is very close to what you can achieve with VASP and Abinit using the most recent PAW libraries.
A very important practical aspect not investigated in this study is the computational resources required to arrive at the results. Even rough estimates would have been interesting to see, both from a user perspective and a technical HPC perspective.
Among the all-electron codes, RSPt, which is based on the FP-LMTO method, does not fare as well. I asked Torbjörn Björkman, one of the authors who ran the RSPt calculations, about this. He believed that the results could be improved to some degree. These were among the first data sets that were run, and the resulting delta values were deemed sufficiently good, compared with the preliminary Wien2k results available at the time, not to warrant further refinement. With the hindsight of the more recent results, the RSPt numbers could probably be improved further, but there are still some outliers in the data set that would prevent the delta relative to Wien2k from reaching zero.
I have a few reservations, though, about whether this study finally settles the debate on reproducibility in modern computational materials science:
The paper shows what is possible in the hands of an expert user or a developer of the software. That represents a best-case scenario: in everyday scientific practice, calculations are often produced either by relatively inexperienced users such as PhD students, or in a completely unsupervised process by a computer algorithm that itself runs and analyzes the calculations. In my opinion, the ultimate goal of reproducibility should be a simulation process that can be automated and specified to such a level that a computer program can perform the calculations with the same accuracy as an expert. I think we are not there yet, and perhaps will not be until we see strong artificial intelligence.
The numerical settings used in the different programs are not shown in the paper, but are available in the supplementary information. They are in general very high and not representative of many research calculations. I think it cannot be assumed a priori that the predictions of all software packages degrade equally gracefully as the settings are decreased. I believe that would be an interesting topic for further investigation.
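To make that concrete, such an investigation could track how far apart two codes' predictions drift as a shared convergence parameter is lowered. A minimal sketch of the bookkeeping (the `run_code_a`/`run_code_b` wrappers are hypothetical stand-ins for actual DFT runs, not a real API):

```python
def precision_vs_setting(run_code_a, run_code_b, settings):
    """For each value of a convergence setting (e.g. a basis-set
    cutoff), record how far apart two codes' predictions are.
    run_code_a / run_code_b are hypothetical callables that take the
    setting and return a scalar prediction, such as an equilibrium
    volume or a cohesive energy per atom."""
    return {s: abs(run_code_a(s) - run_code_b(s)) for s in settings}

# toy stand-ins: both "codes" converge to the same answer (10.0) but
# approach it at different rates, so the gap shrinks with the setting
code_a = lambda cutoff: 10.0 + 1.0 / cutoff
code_b = lambda cutoff: 10.0 + 3.0 / cutoff
gap = precision_vs_setting(code_a, code_b, [10, 50, 250])
print(gap[250] < gap[10])  # -> True: higher setting, better agreement
```

The interesting question is whether the real curves for different packages shrink at comparable rates, or whether some implementations lose precision much faster than others at everyday production settings.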