
Conversation

@andysim (Member) commented Aug 24, 2021

Description

This addresses at least one of the problems (see details below) associated with #2279.

Status

  • Ready for review
  • Ready for merge

@andysim (Member, Author) commented Aug 24, 2021

There's a lot going on here, so bear with me. In DF algorithms we approximate (ab|cd) ≈ (ab|P) M_PQ (Q|cd), where M = (P|Q)^{-1} is the inverse of the metric tensor. For calculations involving symmetric densities (such as SCF energies), we can instead write M as a product of two inverse square roots; multiplying one into the left factor and the other into the right gives two identical intermediates, saving memory. That symmetric approach is used in the conventional algorithms that existed pre-1.4. The inverse square root is computed by the power() function, which respects a cutoff that eliminates (near) linear dependencies and increases stability; this cutoff handling was implemented throughout the code by @susilehtola as part of an effort to improve its numerical stability.
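For concreteness, here is a minimal NumPy sketch of an eigenvalue-screened inverse square root. The function name, the relative-threshold convention, and the default cutoff are illustrative assumptions, not Psi4's actual power() implementation:

```python
import numpy as np

def inverse_sqrt(metric, cutoff=1e-10):
    """M^{-1/2} with (near) linear dependencies discarded (sketch only)."""
    # Symmetric eigendecomposition: metric = V diag(w) V^T
    w, V = np.linalg.eigh(metric)
    # Drop eigenvalues below the cutoff, relative to the largest one
    keep = w > cutoff * w.max()
    # Assemble M^{-1/2} from the retained spectrum
    return (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T
```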

In cases where the density involved is not totally symmetric, such as excited states, the symmetric DF intermediates cannot be used and the decomposition is instead asymmetric: (ab|cd) ≈ [(ab|P)] [M_PQ (Q|cd)]. This asymmetric decomposition is also used in the recently introduced wcombine algorithm, which combines Coulomb-attenuated integrals with conventional ERIs to speed up exchange evaluation. Generating the second intermediate in this approach requires a full inverse of the metric rather than the inverse square root, and this is obtained by calling general_invert(). That matrix inversion does not eliminate (near) linear dependencies, and the system reported in #2279 is susceptible to them (see comments below for a demo). The wcombine algorithm was implemented after @susilehtola's cleanup efforts and inadvertently omitted the removal of linear dependencies during metric inversion; this PR corrects that oversight.
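As a sketch of what the fix amounts to on the asymmetric path, the full inverse can be built from the same screened eigendecomposition rather than a plain LU inverse (again illustrative NumPy, not the actual patch):

```python
import numpy as np

def screened_inverse(metric, cutoff=1e-10):
    """Full M^{-1} via a screened eigendecomposition (sketch only)."""
    w, V = np.linalg.eigh(metric)
    keep = w > cutoff * w.max()
    # A plain inverse (as from general_invert()) would retain the
    # near-null space; discarding it removes the instability.
    return (V[:, keep] / w[keep]) @ V[:, keep].T
```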

@andysim (Member, Author) commented Aug 24, 2021

Some tests from the original issue, run on a 48-core Intel Xeon Gold 5120 Linux box with the OneAPI compilers (version 2021.2.0) and with GCC 7.1, with and without the bug fixed:

| Compiler | #Threads | wcombine | Bug fixed? | Energy (Eh)          | Still bad? |
|----------|---------:|----------|------------|----------------------|------------|
| OneAPI   | 1        | True     | False      | -1963.757782061252*  | yes        |
| OneAPI   | 8        | True     | False      | did not converge     | yes        |
| OneAPI   | 1        | False    | False      | -1963.761590098066   |            |
| OneAPI   | 8        | False    | False      | -1962.686430821442   | yes        |
| OneAPI   | 1        | True     | True       | -1963.761522346426   |            |
| OneAPI   | 8        | True     | True       | -1963.761524178940   |            |
| OneAPI   | 1        | False    | True       | -1963.761558994243   |            |
| OneAPI   | 8        | False    | True       | -1963.761558983752   |            |
| GCC      | 1        | True     | False      | -1963.758770380348   | yes        |
| GCC      | 8        | True     | False      | -1963.758770380374   | yes        |
| GCC      | 1        | False    | False      | -1963.761582943193   |            |
| GCC      | 8        | False    | False      | -1963.761582943197   |            |
| GCC      | 1        | True     | True       | -1963.761584603790   |            |
| GCC      | 8        | True     | True       | -1963.761584603791   |            |
| GCC      | 1        | False    | True       | -1963.761558840375   |            |
| GCC      | 8        | False    | True       | -1963.761558840376   |            |

@andysim (Member, Author) commented Aug 24, 2021

Discarding the near-singular values during the eigendecomposition cleans things up a lot, so this PR should be merged. However, there's possibly more going on that I'm still investigating. The full inversion using OneAPI is pretty bad with 1 thread, but completely disastrous (like, billions of Hartrees bad) with 8. I tried to reproduce this by writing a standalone code that inverts the exact same metric tensor using LAPACK and OneAPI's MKL implementation, but the two gave the same results, both with and without threads. This strongly hints at a memory bug, as @susilehtola suggested on the original post.
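The standalone reproducer isn't included inline; as a rough stand-in, a cross-check of that kind could use the raw LAPACK dgetrf/dgetri pair that backs a general inverse (the synthetic matrix here is an assumption; the real test inverted the actual metric tensor):

```python
import numpy as np
from scipy.linalg import lapack

# Synthetic SPD stand-in for the DF metric tensor
rng = np.random.default_rng(0)
A = rng.standard_normal((400, 400))
M = A @ A.T + 400.0 * np.eye(400)

# LU factorize, then invert, as a general (non-symmetry-aware) inverse does
lu, piv, info = lapack.dgetrf(M)
assert info == 0
Minv, info = lapack.dgetri(lu, piv)
assert info == 0

# Compare against a reference inverse; agreement regardless of thread
# count is what the standalone test described above observed.
print(np.max(np.abs(Minv - np.linalg.inv(M))))
```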

Valgrind takes a long time to run, but gives the following backtrace when run with the setup described above:

```
==2802==
==2802== Thread 4:
==2802== Invalid read of size 32
==2802==    at 0x224ADCAD: __intel_avx_memmove (in /v/apps/intel/oneapi-2021.2.0.2883/compiler/2021.2.0/linux/compiler/lib/intel64_lin/libintlc.so.5)
==2802==    by 0x1CA81F58: __copy_m (stl_algobase.h:384)
==2802==    by 0x1CA81F58: __copy_move_a (stl_algobase.h:401)
==2802==    by 0x1CA81F58: __copy_move_a2 (stl_algobase.h:438)
==2802==    by 0x1CA81F58: copy (stl_algobase.h:470)
==2802==    by 0x1CA81F58: __uninit_copy (stl_uninitialized.h:93)
==2802==    by 0x1CA81F58: uninitialized_copy (stl_uninitialized.h:123)
==2802==    by 0x1CA81F58: __uninitialized_copy_a (stl_uninitialized.h:281)
==2802==    by 0x1CA81F58: _M_allocate_and_copy (stl_vector.h:1227)
==2802==    by 0x1CA81F58: operator= (vector.tcc:194)
==2802==    by 0x1CA81F58: psi::TwoBodyAOInt::TwoBodyAOInt(psi::TwoBodyAOInt const&) (twobody.cc:107)
==2802==    by 0x1CABEB22: Libint2TwoElectronInt (eribase.cc:3091)
==2802==    by 0x1CABEB22: psi::Libint2TwoElectronInt::Libint2TwoElectronInt(psi::Libint2TwoElectronInt const&) (eri.h:255)
==2802==    by 0x1CC3A0EE: Libint2ErfERI (eri.h:300)
==2802==    by 0x1CC3A0EE: psi::Libint2ErfERI::clone() const (eri.h:305)
==2802==    by 0x1C740620: psi::DFHelper::prepare_AO_wK_core() (dfhelper.cc:571)
==2802==    by 0xD265ED2: __kmp_invoke_microtask (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0xD228725: __kmp_invoke_task_func (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0xD22771B: __kmp_launch_thread (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0xD26630A: _INTERNAL_26_______src_z_Linux_util_cpp_20354e55::__kmp_launch_worker(void*) (in /u/andysim/anaconda3/envs/psi4dev/lib/libiomp5.so)
==2802==    by 0x4E3EEA4: start_thread (in /usr/lib64/libpthread-2.17.so)
==2802==    by 0x51519FC: clone (in /usr/lib64/libc-2.17.so)
==2802==  Address 0x2866b120 is 0 bytes after a block of size 73,440 alloc'd
==2802==    at 0x4C2A593: operator new(unsigned long) (vg_replace_malloc.c:344)
==2802==    by 0x1CA85F62: allocate (new_allocator.h:104)
==2802==    by 0x1CA85F62: allocate (alloc_traits.h:491)
==2802==    by 0x1CA85F62: _M_allocate (stl_vector.h:170)
==2802==    by 0x1CA85F62: _M_fill_insert (vector.tcc:491)
==2802==    by 0x1CA85F62: std::vector<long, std::allocator<long> >::insert(__gnu_cxx::__normal_iterator<long const*, std::vector<long, std::allocator<long> > >, unsigned long, long const&) (stl_vector.h:1054)
==2802==    by 0x1CA7E37F: resize (stl_vector.h:696)
==2802==    by 0x1CA7E37F: psi::TwoBodyAOInt::create_sieve_pair_info(std::shared_ptr<psi::BasisSet>, std::vector<std::pair<int, int>, std::allocator<std::pair<int, int> > >&, bool) (twobody.cc:244)
```

This is a harmless copy construction of a well-defined vector. I tried to remove the AVX2 code path using the MKL options, but the problem persisted. Finding an old pre-AVX box and running there did the trick, and Valgrind came back clean; it was also clean for the GCC build. It therefore appears that there is a memory problem, but it's likely either an Intel compiler bug (I think OneAPI is technically still in beta) or a problem with my mixing libiomp5 from Conda with the natively installed OneAPI compiler suite. I'll try the commercial Intel compilers next...

@loriab (Member) left a comment

Formidable detective work.

@loriab (Member) commented Aug 25, 2021

FWIW, there is no difference in test-system behavior between the two originally reported environments: both fail to converge with nthreads=8 and show the same small error with nthreads=1. Psi4 is the 1.4 conda package, so the bugfix in this PR is not applied. The MKLs and iomp5s are in the 2019–2021 range. The defaults-based env has gomp present (it has been since the gcc 7.3 → 7.5 transition a couple of months ago), but it's not linked into psi4 or showing trouble.

```
(trial140def) psilocaluser@bash:psinet:/psi/gits/trial: conda list -n trial140def | grep -e gcc -e "\-ng" -e gomp -e mkl -e intel -e "1\.4+94"
_libgcc_mutex             0.1                        main
blas                      1.0                         mkl
intel-openmp              2021.3.0          h06a4308_3350
libgcc-ng                 9.3.0               h5101ec6_17
libgfortran-ng            7.5.0               ha8ba4b0_17
libgomp                   9.3.0               h5101ec6_17
libstdcxx-ng              9.3.0               hd4cf53a_17
mkl                       2020.2                      256
mkl-service               2.3.0            py38he904b0f_0
mkl_fft                   1.3.0            py38h54f3939_0
mkl_random                1.1.1            py38h0573a6f_0
psi4                      1.4+9485035      py38hd63c0ca_0    psi4
(trial140def) psilocaluser@bash:psinet:/psi/gits/trial: conda list -n trial140ana | grep -e gcc -e "\-ng" -e gomp -e mkl -e intel -e "1\.4+94"
blas                      1.0                         mkl    anaconda
intel-openmp              2020.2                      254    anaconda
libgcc-ng                 9.1.0                hdf63c60_0    anaconda
libgfortran-ng            7.3.0                hdf63c60_0    anaconda
libstdcxx-ng              9.1.0                hdf63c60_0    anaconda
mkl                       2019.4                      243    anaconda
mkl-service               2.3.0            py38he904b0f_0    anaconda
mkl_fft                   1.2.0            py38h23d657b_0    anaconda
mkl_random                1.1.0            py38h962f231_0    anaconda
psi4                      1.4+9485035      py38hd63c0ca_0    psi4
```

@andysim (Member, Author) commented Aug 25, 2021

Thanks, @loriab. I have tried a number of things, including ensuring that MKL and libiomp5 from the native compilers are used instead of the conda versions of those libs, and nothing seems to change the outcome of the unpatched code with multiple threads. It's worth noting that running on a non-AVX platform removes the Valgrind errors but doesn't change the answer, so I think those errors are probably just the AVX memmove somehow confusing Valgrind. At this point, my only guess is that some weird nested-threading issue is causing diagonalization problems, but I have no idea why that would affect the unpatched code and not the patched code. In case anybody wants to check the results, here's a standalone code that diagonalizes the exact same matrix, using the exact same LAPACK calls as the unpatched code. So far things look to be working in Psi4 after the fix, but I wanted to document all of the suspicious behavior on this PR in case we see similar problems in the future.
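The attached standalone code isn't reproduced here. As a rough stand-in, a diagonalization check along these lines could be run; both the DSYEV routine choice and the synthetic matrix are assumptions (the real test used the actual metric and the unpatched code's exact LAPACK calls):

```python
import numpy as np
from scipy.linalg import lapack

# Synthetic symmetric stand-in for the metric from the failing calculation
rng = np.random.default_rng(0)
A = rng.standard_normal((500, 500))
M = (A + A.T) / 2.0

# Raw LAPACK symmetric eigensolver (routine choice assumed here)
w, v, info = lapack.dsyev(M)
assert info == 0

# Reconstruction residual should sit at machine precision regardless of
# thread count; a large residual would implicate the LAPACK/MKL layer.
print(np.max(np.abs(v @ np.diag(w) @ v.T - M)))
```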

@susilehtola (Member) commented

FWIW I think all DF codes should just call the orthogonalization routines in orthog.cc. You can use them either in the symmetric mode or the asymmetric mode. In the symmetric mode you contract with the matrix X that is produced by the orthogonalization; for the asymmetric one you can build the inverse matrix with S^-1 = X X^T. As I explain in arXiv:2106.11081, this should eliminate any numerical instabilities from the density fitting routines.
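A sketch of that recipe in NumPy (the function names are illustrative, not the orthog.cc API): the canonical orthogonalizer X satisfies X^T S X = 1 on the retained subspace, and the screened inverse then follows as S^{-1} = X X^T:

```python
import numpy as np

def canonical_X(S, cutoff=1e-10):
    """Canonical orthogonalizer X; columns span the non-singular subspace."""
    w, V = np.linalg.eigh(S)
    keep = w > cutoff * w.max()
    return V[:, keep] / np.sqrt(w[keep])

def inverse_from_X(S, cutoff=1e-10):
    """S^{-1} built as X X^T, with near-linear dependencies projected out."""
    X = canonical_X(S, cutoff)
    return X @ X.T
```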

@loriab added the backport and correctness-error (for issues where Psi4 gives answers that are wrong) labels Aug 30, 2021
@loriab added this to the Psi4 1.5 milestone Aug 30, 2021
@JonathonMisiewicz (Contributor) commented

Just to be clear: all other code that uses general_invert is still going to have numerical instabilities, even after this PR?

@andysim (Member, Author) commented Sep 24, 2021

Sorry, I missed your last question, @JonathonMisiewicz. Correct: there is a bug in MKL that is being fixed, so all general_invert calls are potentially vulnerable. I will note that many of the general_invert calls in the code still give the correct answer (including the one fixed in this PR, for many system sizes), as does the standalone code I used to report the issue. It looks like some kind of uninitialized variable somewhere in Intel's kernels, so whether the right answer is returned is likely context-dependent. I'm pretty sure the bug has been present for at least the last few versions and we only just noticed it, so I suspect it's not that pervasive. Regardless, we need to update our MKL dependency as soon as the fix is published.

@JonathonMisiewicz (Contributor) commented

That note was exactly what I was worried about. Thanks much, and approved.

@JonathonMisiewicz (Contributor) commented

As of Friday, Andy said this "should be merged ASAP," so without further delay...


Development

Successfully merging this pull request may close the following issue:

Bad wB97X-D3BJ energies with MKL & in-core DFT