Published May 16, 2025 | Version v1
Dataset Open

Vector-QM24 (VQM24) dataset

  • 1. University of Toronto
  • 2. Vector Institute
  • 3. Argonne National Laboratory
  • 4. University of Kassel

Description

Quantum chemistry dataset of ~836 thousand small organic and inorganic molecules.


Density Functional Theory (DFT) properties are provided in three files: DFT_all.npz covers all 784,875 conformers in local minima, DFT_uniques.npz covers the 258,242 constitutional isomers (most stable conformer of each), and DFT_saddles.npz covers the 51,072 saddle point structures.
Diffusion quantum Monte Carlo (DMC) data for 10,793 constitutional isomers are available in the DMC.npz file.

Molecules are ordered identically across all arrays, so a given index refers to the same molecule in every array.

Keys for accessing each property are tabulated in the paper.

Usage example:

import numpy as np

data = np.load('DFT_all.npz', allow_pickle=True)
print(data.files)  # list the keys of all available properties

key = 'freqs'

freqs = data[key]  # DFT vibrational frequencies of all molecules
print(freqs[42])  # frequencies of molecule number 42 in the array (HSCl, thiohypochlorous acid)
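Because every array shares the same molecule ordering, several properties of one molecule can be gathered with a single index. The following is a minimal sketch of that lookup; the helper name `molecule_record` is hypothetical and not part of the dataset tooling, and it only assumes that each array supports `len()` and integer indexing.

```python
def molecule_record(arrays, index, keys):
    """Collect the entries stored at `index` under each of `keys`.

    `arrays` is any mapping from key to an indexable sequence, for example
    the object returned by np.load on one of the .npz files.
    """
    record = {}
    for key in keys:
        values = arrays[key]
        # Guard against silently wrapping around with negative indices
        # or reading past the end of a shorter array.
        if not 0 <= index < len(values):
            raise IndexError(
                f"index {index} out of range for '{key}' (length {len(values)})"
            )
        record[key] = values[index]
    return record
```

With the `data` object from the usage example, `molecule_record(data, 42, ['compounds', 'freqs'])` would return the compound label and the vibrational frequencies of molecule number 42 together.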

 

Input file samples, tools, and kernel ridge regression and graph neural network models: https://github.com/dkhan42/VQM24

Atomic energies (in Hartree) used to calculate the atomization energies:

#atomic energies wB97X-D3/cc-pVDZ (PSI4 v1.7)
eatomic = {'Hydrogen' : -0.5012728848846926,
'Carbon' : -37.83859584856468,
'Nitrogen' : -54.5760607136932450,
'Oxygen' : -75.0474818911551438,
'Fluorine' : -99.7031524437270917,
'Bromine' : -2574.01253635198464,
'Chlorine' : -460.13960793480203,
'Phosphorous' : -341.2510291850040858,
'Sulfur' : -398.1021030909759020,
'Silicon' : -289.3578409507016431}
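As a rough illustration of how these reference values enter an atomization energy, the sketch below subtracts the summed atomic energies from a molecular total energy. Several things here are assumptions rather than facts from the dataset: the sign convention (total energy minus the sum of atomic energies, so bound molecules come out negative), the helper name `atomization_energy`, and the use of element names as input (the arrays in the .npz files may encode composition differently; see the keys tabulated in the paper).

```python
from collections import Counter

# Atomic energies in Hartree, wB97X-D3/cc-pVDZ, copied from the listing above.
EATOMIC = {
    'Hydrogen': -0.5012728848846926,
    'Carbon': -37.83859584856468,
    'Nitrogen': -54.5760607136932450,
    'Oxygen': -75.0474818911551438,
    'Fluorine': -99.7031524437270917,
    'Bromine': -2574.01253635198464,
    'Chlorine': -460.13960793480203,
    'Phosphorous': -341.2510291850040858,
    'Sulfur': -398.1021030909759020,
    'Silicon': -289.3578409507016431,
}

# Standard conversion factor from Hartree to kcal/mol.
HARTREE_TO_KCAL_PER_MOL = 627.5094740631


def atomization_energy(total_energy, elements, atomic_energies=EATOMIC):
    """Return total_energy minus the summed atomic energies, in Hartree.

    `elements` lists one element name per atom, e.g.
    ['Hydrogen', 'Sulfur', 'Chlorine'] for HSCl.
    The sign convention is an assumption of this sketch.
    """
    counts = Counter(elements)
    reference = sum(atomic_energies[name] * n for name, n in counts.items())
    return total_energy - reference


# Sanity check: an isolated hydrogen atom has zero atomization energy.
assert atomization_energy(EATOMIC['Hydrogen'], ['Hydrogen']) == 0.0
```

Multiplying the result by `HARTREE_TO_KCAL_PER_MOL` converts it to kcal/mol.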
 
 
Wavefunctions of all 836 thousand molecules in the dataset are available as .molden files in wavefunctions.tar.gz.
The .molden file for a specific molecule can be located using the 'compounds' array in the DFT_all.npz file.
For instance, the 0-th entry in the 'compounds' array of DFT_all.npz is 'SH2_0/conformer_1'.
The wavefunction file for this molecule is found at 'wavefunctions/SH2_0/conformer_1.molden' after untarring wavefunctions.tar.gz.
Multiwfn (http://sobereva.com/multiwfn/) can be used to read the .molden wavefunction files.
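The mapping from a 'compounds' entry to its wavefunction file can be sketched as a small helper. The function name `molden_path` is hypothetical, and the path layout is assumed to follow the single example given above for every molecule; entries are accepted as either text or bytes, since the storage type of the array is not stated here.

```python
import posixpath


def molden_path(compound, root="wavefunctions"):
    """Build the expected .molden path for one 'compounds' entry.

    Assumes the layout '<root>/<compound>.molden' shown in the example,
    e.g. 'SH2_0/conformer_1' -> 'wavefunctions/SH2_0/conformer_1.molden'.
    """
    if isinstance(compound, bytes):
        compound = compound.decode("utf-8")
    # posixpath keeps forward slashes regardless of the operating system.
    return posixpath.join(root, str(compound) + ".molden")


print(molden_path("SH2_0/conformer_1"))
```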
 

The dataset is described in the paper: https://www.nature.com/articles/s41597-025-05428-4

Files

Files (108.2 GB)

MD5 checksum                      Size
fa6dcc8571fc5a9b627f6d53d6098155  1.1 GB
82bfaf515f720d45cc5fe03e401b73f4  111.8 MB
f2e32ff43232445e063795d5c6abbaed  337.6 MB
565e295d845662d7df8e0dcca6db0d21  2.9 MB
10dc7ba545af85c7d4aaeda2472daace  106.7 GB
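Given the size of the downloads, it may be worth checking each file against its published MD5 checksum before use. The sketch below hashes a file in chunks so that even the largest archive does not need to fit in memory; the function names are hypothetical, and the pairing of each checksum with its file has to be taken from the record page.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of the file at `path`, read in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path, expected_md5):
    """Compare a file's MD5 with an expected value, with or without 'md5:' prefix."""
    expected = expected_md5.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

For example, `matches_checksum(path_to_download, 'md5:...')` with the checksum string listed for that file returns True when the download is intact.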

Additional details

Software