Table 1. Data Sets for Prospective Prediction

One of the major differences between DNNs today and the classical artificial neural networks widely used for chemical applications in the 1990s is that DNNs have more than one intermediate (i.e., hidden) layer and more neurons in each layer, and are thus both "deeper" and "wider."

…data sets of various sizes (2000–50 000 molecules) using a common descriptor type. Each data set was divided into a training set and a test set. Kaggle contestants were given descriptors and activities for the training set and descriptors only for the test set. Contestants were allowed to generate models using any machine learning method, or combinations thereof, and to predict the activities of the test set molecules. Contestants could submit as many separate sets of predictions as they wished within a certain time period. The winning entry (submitted by one of the authors, George Dahl) improved the mean R² averaged over the 15 data sets from 0.42 (for RF) to 0.49. While the improvement might not seem large, we have seldom seen any method in the past 10 years that could consistently outperform RF by such a margin, so we felt this was an interesting result.

Figure 1. Architecture of deep neural nets.

Figure 3. Activation function in the output layer.

Table 2. Comparing Test R²'s of Different Models

…QSAR task, the average R² would be degraded only from 0.423 to 0.412, merely a 2.6% reduction. These results suggest that DNNs can generally outperform RF.

Figure 5. Impacts of network architecture. Each marker in the plot represents a choice of DNN network architecture. Markers that share the same number of hidden layers are connected with a line. The measurement (i.e., y-axis) is the difference in mean R² between DNNs and RF. The mean R² of DNNs is obtained by averaging over all DNNs sharing the same network architecture and over the 15 data sets. The horizontal dotted green line indicates 0, where the mean R² of DNNs is the same as that of RF.

Figure 6.
Choice of activation functions. Each column represents a QSAR data set, and each circle represents the difference, measured in R², of a pair of DNNs trained with ReLU and Sigmoid, respectively. The horizontal dashed red line indicates 0. A positive value indicates the case where ReLU outperforms Sigmoid. The horizontal dotted green line indicates the overall difference between ReLU and Sigmoid, measured in mean R². The data sets where ReLU is significantly better than Sigmoid are marked with "+"s at the bottom of the plot and colored blue: "+" indicates p-value < 0.05, "++" indicates p-value < 0.01, and "+++" indicates p-value < 0.001. In contrast, the data set where Sigmoid is significantly better than ReLU is marked with "−"s at the bottom of the plot and is colored black; "−" indicates p-value < 0.05. The remaining data sets are not marked and are colored gray.

Figure 7. Difference between joint DNNs trained with multiple data sets and individual DNNs trained with single data sets. Each column represents a scenario for comparing joint DNNs with single-task DNNs. Each circle represents the difference, measured in R², of a pair of DNNs trained from multiple data sets and from a single data set, respectively. The horizontal dashed red line indicates 0. A positive value indicates the case where a joint DNN outperforms an individual DNN. The p-value of a two-sided paired-sample t test conducted for each scenario is also provided at the bottom of each column.

Figure 8. Impacts of unsupervised pretraining. Each column represents a QSAR data set, and each circle represents the difference, measured in R², of a pair of DNNs trained without and with pretraining, respectively. The horizontal dashed red line indicates 0. A positive value indicates that a DNN without pretraining outperforms the corresponding DNN with pretraining. The horizontal dotted green line indicates the overall difference between DNNs without and with pretraining, measured in mean R².

Figure 9.
DNN vs RF with refined parameter settings. Each column represents a QSAR data set, and each circle represents the improvement, measured in R², of a DNN over RF. The horizontal dashed red line indicates 0. A positive value means that the corresponding DNN outperforms RF. The horizontal dotted green line indicates the overall improvement of DNNs over RF, measured in mean R². The data sets in which DNNs dominate RF for all arbitrary parameter settings are colored blue; the data set in which RF dominates DNNs for all parameter settings is colored black; the other data sets are colored gray.

Table 3. Comparing RF with DNN Trained Using Recommended Parameter Settings on 15 Additional Data Sets
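The comparisons in Figures 6–9 all rest on the same computation: per-data-set differences in test R² between two methods, their mean, and a paired-sample t test on those differences. As a minimal sketch of that summary statistic (the R² values below are made up for illustration and are not the paper's numbers):

```python
import math
from statistics import mean, stdev

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for per-data-set scores of two methods.

    t = mean(d) / (s_d / sqrt(n)), where d are the paired differences
    and s_d is their sample standard deviation.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    return mean(diffs) / (stdev(diffs) / math.sqrt(n))

# Hypothetical per-data-set test R^2 values for two methods.
dnn_r2 = [0.50, 0.44, 0.61, 0.38, 0.55]
rf_r2  = [0.45, 0.40, 0.58, 0.41, 0.47]

# Overall improvement, analogous to the dotted green lines in the figures.
print(round(mean(dnn_r2) - mean(rf_r2), 3))
# Paired t statistic over the per-data-set differences.
print(round(paired_t_statistic(dnn_r2, rf_r2), 3))
```

The corresponding two-sided p-value (the quantity behind the "+"/"−" markers) would come from the t distribution with n − 1 degrees of freedom, e.g., via `scipy.stats.ttest_rel`.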