
Conversation

@andysim (Member) commented Aug 24, 2021

Description

This addresses at least one of the problems (see details below) associated with #2279.

Status

  • Ready for review
  • Ready for merge

@andysim (Member, Author) commented Aug 24, 2021

There's a lot going on here, so bear with me. In DF algorithms we approximate (ab|cd) ≈ (ab|P) M_PQ (Q|cd), where M = (P|Q)^{-1} is the inverse of the metric tensor. For calculations involving symmetric densities (such as SCF energies), we can instead write M as a product of two inverse square roots; multiplying one into the left factor and the other into the right gives two identical intermediates, saving memory. That symmetric approach is used in the conventional algorithms that existed pre-1.4. The inverse square root is computed by the power() function, which respects a cutoff that eliminates (near) linear dependencies and increases stability; this cutoff handling was implemented throughout the code by @susilehtola as part of an effort to improve its numerical stability.
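For concreteness, here is a minimal NumPy sketch of an eigenvalue-screened inverse square root. The function name, the relative-threshold convention, and the default cutoff are illustrative assumptions, not Psi4's actual power() implementation:

```python
import numpy as np

def inverse_sqrt(metric, cutoff=1e-10):
    """M^{-1/2} with (near) linear dependencies discarded (sketch only)."""
    # Symmetric eigendecomposition: metric = V diag(w) V^T
    w, V = np.linalg.eigh(metric)
    # Drop eigenvalues below the cutoff, relative to the largest one
    keep = w > cutoff * w.max()
    # Assemble M^{-1/2} from the retained spectrum
    return (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T
```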

In cases where the density involved is not totally symmetric, such as excited states, the symmetric DF intermediates cannot be used and the decomposition is instead asymmetric: (ab|cd) ≈ [(ab|P)] [M_PQ (Q|cd)]. This asymmetric decomposition is also used in the recently introduced wcombine algorithm, which combines Coulomb-attenuated integrals with conventional ERIs to speed up exchange evaluation. Generating the second intermediate in this approach requires a full inverse of the metric rather than the inverse square root, and this is obtained by calling general_invert(). That matrix inversion does not eliminate (near) linear dependencies, and the system reported in #2279 is susceptible to them (see comments below for a demo). The wcombine algorithm was implemented after @susilehtola's cleanup efforts and inadvertently omitted the removal of linear dependencies during metric inversion; this PR corrects that oversight.
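As a sketch of what the fix amounts to on the asymmetric path, the full inverse can be built from the same screened eigendecomposition rather than a plain LU inverse (again illustrative NumPy, not the actual patch):

```python
import numpy as np

def screened_inverse(metric, cutoff=1e-10):
    """Full M^{-1} via a screened eigendecomposition (sketch only)."""
    w, V = np.linalg.eigh(metric)
    keep = w > cutoff * w.max()
    # A plain inverse (as from general_invert()) would retain the
    # near-null space; discarding it removes the instability.
    return (V[:, keep] / w[keep]) @ V[:, keep].T
```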

@andysim (Member, Author) commented Aug 24, 2021

Some tests from the original issue, run on a 48-core Intel Xeon Gold 5120 Linux box with the OneAPI compilers (version 2021.2.0) and with GCC 7.1, with and without the bug fixed:

| Compiler | #Threads | wcombine | Bug fixed? | Energy (Eh)          | Still bad? |
|----------|---------:|----------|------------|----------------------|------------|
| OneAPI   | 1        | True     | False      | -1963.757782061252*  | yes        |
| OneAPI   | 8        | True     | False      | did not converge     | yes        |
| OneAPI   | 1        | False    | False      | -1963.761590098066   |            |
| OneAPI   | 8        | False    | False      | -1962.686430821442   | yes        |
| OneAPI   | 1        | True     | True       | -1963.761522346426   |            |
| OneAPI   | 8        | True     | True       | -1963.761524178940   |            |
| OneAPI   | 1        | False    | True       | -1963.761558994243   |            |
| OneAPI   | 8        | False    | True       | -1963.761558983752   |            |
| GCC      | 1        | True     | False      | -1963.758770380348   | yes        |
| GCC      | 8        | True     | False      | -1963.758770380374   | yes        |
| GCC      | 1        | False    | False      | -1963.761582943193   |            |
| GCC      | 8        | False    | False      | -1963.761582943197   |            |
| GCC      | 1        | True     | True       | -1963.761584603790   |            |
| GCC      | 8        | True     | True       | -1963.761584603791   |            |
| GCC      | 1        | False    | True       | -1963.761558840375   |            |
| GCC      | 8        | False    | True       | -1963.761558840376   |            |

@andysim (Member, Author) commented Aug 24, 2021

Discarding the near-singular values during the eigendecomposition cleans things up a lot, so this PR should be merged. However, there's possibly more going on that I'm still investigating. The full inversion using OneAPI is pretty bad with 1 thread, but completely disastrous (like, billions of Hartrees bad) with 8. I tried to reproduce this by writing a standalone code that inverts the exact same metric tensor using LAPACK and OneAPI's MKL implementation, but the two gave the same results, both with and without threads. This strongly hints at a memory bug, as @susilehtola suggested on the original post.
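The standalone reproducer isn't included inline; as a rough stand-in, a cross-check of that kind could use the raw LAPACK dgetrf/dgetri pair that backs a general inverse (the synthetic matrix here is an assumption; the real test inverted the actual metric tensor):

```python
import numpy as np
from scipy.linalg import lapack

# Synthetic SPD stand-in for the DF metric tensor
rng = np.random.default_rng(0)
A = rng.standard_normal((400, 400))
M = A @ A.T + 400.0 * np.eye(400)

# LU factorize, then invert, as a general (non-symmetry-aware) inverse does
lu, piv, info = lapack.dgetrf(M)
assert info == 0
Minv, info = lapack.dgetri(lu, piv)
assert info == 0

# Compare against a reference inverse; agreement regardless of thread
# count is what the standalone test described above observed.
print(np.max(np.abs(Minv - np.linalg.inv(M))))
```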

Valgrind takes a long time to run, but gives the following backtrace when run with the setup described above:

```
==2802==
==2802== Thread 4:
==2802== Invalid read of size 32
==2802==    at 0x224ADCAD: __intel_avx_memmove (in /v/apps/intel/oneapi-2021.2.0.2883/compiler/2021.2.0/linux/compiler/lib/intel64_lin/libintlc.so.5)
==2802==    by 0x1CA81F58: __copy_m (stl_algobase.h:384)
==2802==    by 0x1CA81F58: __copy_move_a (stl_algobase.h:401)
==2802==    by 0x1CA81F58: __copy_move_a2 (stl_algobase.h:438)
==2802==    by 0x1CA81F58: copy (stl_algobase.h:470)
==2802==    by 0x1CA81F58: __uninit_copy (stl_uninitialized.h:93)
==2802==    by 0x1CA81F58: uninitialized_copy (stl_uninitialized.h:123)
==2802==    by 0x1CA81F58: __uninitialized_copy_a (stl_uninitialized.h:281)
==2802==    by 0x1CA81F58: _M_allocate_and_copy (stl_vector.h:1227)
==2802==    by 0x1CA81F58: operator= (vector.tcc:194)
==2802==    by 0x1CA81F58: psi::TwoBodyAOInt::TwoBodyAOInt(psi::TwoBodyAOInt const&) (twobody.cc:107)
==2802==    by 0x1CABEB22: Libint2TwoElectronInt (eribase.cc:3091)
==2802==    by 0x1CABEB22: psi::Libint2TwoElectronInt::Libint2TwoElectronInt(psi::Libint2TwoElectronInt const&) (eri.h:255)
==2802==    by 0x1CC3A0EE: Libint2ErfERI (eri.h:300)
==2802==    by 0x1CC3A0EE: psi::Libint2ErfERI::clone() const (eri.h:305)
==2802==    by 0x1C740620: psi::DFHelper::prepare_AO_wK_core() (dfhelper.cc:571)
==2802==    by 0xD265ED2: __kmp_invoke_microtask (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0xD228725: __kmp_invoke_task_func (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0xD22771B: __kmp_launch_thread (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0xD26630A: _INTERNAL_26_______src_z_Linux_util_cpp_20354e55::__kmp_launch_worker(void*) (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0x4E3EEA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==2802==    by 0x51519FC: clone (in /usr/lib64/libc-2.17.so)
==2802==  Address 0x2866b120 is 0 bytes after a block of size 73,440 alloc'd
==2802==    at 0x4C2A593: operator new(unsigned long) (vg_replace_malloc.c:344)
==2802==    by 0x1CA85F62: allocate (new_allocator.h:104)
==2802==    by 0x1CA85F62: allocate (alloc_traits.h:491)
==2802==    by 0x1CA85F62: _M_allocate (stl_vector.h:170)
==2802==    by 0x1CA85F62: _M_fill_insert (vector.tcc:491)
==2802==    by 0x1CA85F62: std::vector<long, std::allocator<long> >::insert(__gnu_cxx::__normal_iterator<long const*, std::vector<long, std::allocator<long> > >, unsigned long, long const&) (stl_vector.h:1054)
==2802==    by 0x1CA7E37F: resize (stl_vector.h:696)
==2802==    by 0x1CA7E37F: psi::TwoBodyAOInt::create_sieve_pair_info(std::shared_ptr<psi::BasisSet>, std::vector<std::pair<int, int>, std::allocator<std::pair<int, int> > >&, bool) (twobody.cc:244)
```

This is a harmless copy construction of a well-defined vector. I tried to remove the AVX2 code path using the MKL options, but the problem persisted. Finding an old pre-AVX box and running there did the trick, and Valgrind came back clean; it was also clean for the GCC build. It therefore appears that there is a memory problem, but it's likely either an Intel compiler bug (I think OneAPI is technically still in beta) or a problem with my mixing libiomp5 from Conda with the natively installed OneAPI compiler suite. I'll try the commercial Intel compilers next...

@loriab (Member) left a comment

Formidable detective work.

@loriab (Member) commented Aug 25, 2021

FWIW, there is no difference in test-system behavior between the two originally reported environments: both fail to converge with nthreads=8 and show the same small error with nthreads=1. Psi4 is the 1.4 conda package, so the bugfix in this PR is not applied. The MKLs and iomp5s are in the 2019–2021 range. The defaults-based env has gomp present (it has been since the gcc 7.3 → 7.5 transition a couple of months ago), but it's not linked into psi4 or showing trouble.

```
(trial140def) psilocaluser@bash:psinet:/psi/gits/trial: conda list -n trial140def | grep -e gcc -e "\-ng" -e gomp -e mkl -e intel -e "1\.4+94"
_libgcc_mutex             0.1                        main
blas                      1.0                         mkl
intel-openmp              2021.3.0          h06a4308_3350
libgcc-ng                 9.3.0               h5101ec6_17
libgfortran-ng            7.5.0               ha8ba4b0_17
libgomp                   9.3.0               h5101ec6_17
libstdcxx-ng              9.3.0               hd4cf53a_17
mkl                       2020.2                      256
mkl-service               2.3.0            py38he904b0f_0
mkl_fft                   1.3.0            py38h54f3939_0
mkl_random                1.1.1            py38h0573a6f_0
psi4                      1.4+9485035      py38hd63c0ca_0    psi4
(trial140def) psilocaluser@bash:psinet:/psi/gits/trial: conda list -n trial140ana | grep -e gcc -e "\-ng" -e gomp -e mkl -e intel -e "1\.4+94"
blas                      1.0                         mkl    anaconda
intel-openmp              2020.2                      254    anaconda
libgcc-ng                 9.1.0                hdf63c60_0    anaconda
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libstdcxx-ng              9.1.0                hdf63c60_0    anaconda
mkl                       2019.4                      243    anaconda
mkl-service               2.3.0            py38he904b0f_0    anaconda
mkl_fft                   1.2.0            py38h23d657b_0    anaconda
mkl_random                1.1.0            py38h962f231_0    anaconda
psi4                      1.4+9485035      py38hd63c0ca_0    psi4
```

@andysim (Member, Author) commented Aug 25, 2021

Thanks, @loriab. I have tried a number of things, including ensuring that MKL and libiomp5 from the native compilers are used instead of the conda versions of those libs, and nothing seems to change the outcome of the unpatched code with multiple threads. It's worth noting that running on a non-AVX platform removes the Valgrind errors but doesn't change the answer, so I think those errors are probably just the AVX memmove somehow confusing Valgrind. At this point, my only guess is that some weird nested-threading issue is causing diagonalization problems, but I have no idea why that would affect the unpatched code and not the patched code. In case anybody wants to check the results, here's a standalone code that diagonalizes the exact same matrix, using the exact same LAPACK calls as the unpatched code. So far things look to be working in Psi4 after the fix, but I wanted to document all of the suspicious behavior on this PR in case we see similar problems in the future.
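The attached standalone code isn't reproduced here. As a rough stand-in, a diagonalization check along these lines could be run; both the DSYEV routine choice and the synthetic matrix are assumptions (the real test used the actual metric and the unpatched code's exact LAPACK calls):

```python
import numpy as np
from scipy.linalg import lapack

# Synthetic symmetric stand-in for the metric from the failing calculation
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
M = (A + A.T) / 2.0

# Raw LAPACK symmetric eigensolver (routine choice assumed here)
w, v, info = lapack.dsyev(M)
assert info == 0

# Reconstruction residual should sit at machine precision regardless of
# thread count; a large residual would implicate the LAPACK/MKL layer.
print(np.max(np.abs(v @ np.diag(w) @ v.T - M)))
```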

@susilehtola (Member) commented

FWIW I think all DF codes should just call the orthogonalization routines in orthog.cc. You can use them either in the symmetric mode or the asymmetric mode. In the symmetric mode you contract with the matrix X that is produced by the orthogonalization; for the asymmetric one you can build the inverse matrix with S^-1 = X X^T. As I explain in arXiv:2106.11081, this should eliminate any numerical instabilities from the density fitting routines.
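A sketch of that recipe in NumPy (the function names are illustrative, not the orthog.cc API): the canonical orthogonalizer X satisfies X^T S X = 1 on the retained subspace, and the screened inverse then follows as S^{-1} = X X^T:

```python
import numpy as np

def canonical_X(S, cutoff=1e-10):
    """Canonical orthogonalizer X; columns span the non-singular subspace."""
    w, V = np.linalg.eigh(S)
    keep = w > cutoff * w.max()
    return V[:, keep] / np.sqrt(w[keep])

def inverse_from_X(S, cutoff=1e-10):
    """S^{-1} built as X X^T, with near-linear dependencies projected out."""
    X = canonical_X(S, cutoff)
    return X @ X.T
```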

@loriab added the backport and correctness-error (for issues where Psi4 gives answers that are wrong) labels Aug 30, 2021
@loriab added this to the Psi4 1.5 milestone Aug 30, 2021
@JonathonMisiewicz (Contributor) commented

Just to be clear: all other code that uses general_invert is still going to have numerical instabilities, even after this PR?

@andysim (Member, Author) commented Sep 24, 2021

Sorry, I missed your last question, @JonathonMisiewicz. Correct: there is a bug in MKL that is being fixed, so all general_invert calls are potentially vulnerable. I will note that many of the general_invert calls in the code still give the correct answer (including the one fixed in this PR, for many system sizes), as does the standalone code I used to report the issue. It looks like some kind of uninitialized variable somewhere in Intel's kernels, so whether the right answer is returned is likely context-dependent. I'm pretty sure the bug has been present for at least the last few versions and we only just noticed it, so I suspect it's not that pervasive. Regardless, we need to update our MKL dependency as soon as the fix is published.

@JonathonMisiewicz (Contributor) commented

That note was exactly what I was worried about. Thanks much, and approved.

@JonathonMisiewicz (Contributor) commented

As of Friday, Andy said this "should be merged ASAP," so without further delay...


Development

Successfully merging this pull request may close the following issue:

Bad wB97X-D3BJ energies with MKL & in-core DFT