2010, IFIP Advances in Information and Communication Technology
8 pages
Conformal predictors represent a new flexible framework that outputs region predictions with a guaranteed error rate. The efficiency of such predictions depends on the nonconformity measure that underlies the predictor. In this work we designed new nonconformity measures based on a random forest classifier. Experiments demonstrate that the proposed conformal predictors are more efficient than current benchmarks on noisy mass spectrometry data (and at least as efficient on other types of data) while maintaining the property of validity: they output fewer multiple predictions, and the ratio of mistakes does not exceed the preset level. When forced to produce singleton predictions, the designed conformal predictors are at least as accurate as the benchmarks and sometimes significantly outperform them.
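To make "region prediction" and "preset level" concrete, here is a minimal sketch of how an inductive conformal predictor can turn nonconformity scores into a label set. It is not the authors' method: the function names, the calibration scores and the candidate scores are all hypothetical, and the scores are hard-coded rather than derived from a random forest.

```python
def conformal_p_value(calibration_scores, test_score):
    """Fraction of scores (calibration plus the test one) at least as nonconforming."""
    n_at_least = sum(1 for a in calibration_scores if a >= test_score)
    return (n_at_least + 1) / (len(calibration_scores) + 1)


def prediction_region(calibration_scores, candidate_scores, significance):
    """Keep every candidate label whose p-value exceeds the significance level."""
    return {
        label
        for label, score in candidate_scores.items()
        if conformal_p_value(calibration_scores, score) > significance
    }


# Hypothetical nonconformity scores of held-out calibration examples.
calibration = [0.05, 0.10, 0.20, 0.30, 0.40, 0.55, 0.60, 0.75, 0.90]
# Hypothetical scores of one test object under each candidate label.
candidates = {"healthy": 0.15, "diseased": 0.80}

singleton = prediction_region(calibration, candidates, significance=0.25)
multiple = prediction_region(calibration, candidates, significance=0.10)
print(singleton)  # {'healthy'}
print(sorted(multiple))  # ['diseased', 'healthy']
```

Lowering the significance level enlarges the region: at 0.10 both labels survive, which is the kind of "multiple prediction" that a more efficient nonconformity measure would produce less often.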
Journal of Chemical Information and Computer Sciences, 2003
A new classification and regression tool, Random Forest, is introduced and investigated for predicting a compound's quantitative or categorical biological activity based on a quantitative description of the compound's molecular structure. Random Forest is an ensemble of unpruned classification or regression trees created by using bootstrap samples of the training data and random feature selection in tree induction. Prediction is made by aggregating (majority vote or averaging) the predictions of the ensemble. We built predictive models for six cheminformatics data sets. Our analysis demonstrates that Random Forest is a powerful tool capable of delivering performance that is among the most accurate methods to date. We also present three additional features of Random Forest: built-in performance assessment, a measure of relative importance of descriptors, and a measure of compound similarity that is weighted by the relative importance of descriptors. It is the combination of relatively high prediction accuracy and its collection of desired features that makes Random Forest uniquely suited for modeling in cheminformatics.
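The ensemble idea described above (bootstrap samples, random feature selection, majority vote) can be sketched with the standard library alone. This is an illustration under loud assumptions: one-level decision stumps stand in for unpruned trees, each stump sees a single randomly chosen feature, and the toy data and all function names are hypothetical.

```python
import random
from collections import Counter


def majority(labels):
    """Most frequent label in a non-empty list."""
    return Counter(labels).most_common(1)[0][0]


def fit_stump(rows, labels, feature):
    """Split one feature at its mean; each side predicts its majority label."""
    threshold = sum(r[feature] for r in rows) / len(rows)
    left = [y for r, y in zip(rows, labels) if r[feature] <= threshold]
    right = [y for r, y in zip(rows, labels) if r[feature] > threshold]
    fallback = majority(labels)
    return (feature, threshold,
            majority(left) if left else fallback,
            majority(right) if right else fallback)


def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label


def fit_forest(rows, labels, n_trees, rng):
    """Train each stump on a bootstrap sample and one randomly chosen feature."""
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        feature = rng.randrange(len(rows[0]))       # random feature selection
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], feature))
    return forest


def predict_forest(forest, row):
    """Aggregate the ensemble by majority vote."""
    return majority([predict_stump(s, row) for s in forest])


# Hypothetical two-descriptor toy data.
rows = [(1.0, 1.0), (2.0, 1.5), (1.5, 2.0), (2.0, 2.0),
        (8.0, 9.0), (9.0, 8.0), (8.5, 9.0), (9.0, 9.5)]
labels = ["inactive"] * 4 + ["active"] * 4

forest = fit_forest(rows, labels, n_trees=25, rng=random.Random(0))
low = predict_forest(forest, (1.2, 1.3))
high = predict_forest(forest, (9.0, 9.0))
```

For regression the vote would be replaced by an average of the individual predictions, as the abstract notes.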
2016
In the current study we examine an application of machine learning methods to modelling retention constants in thin-layer chromatography (TLC). This problem can be described with hundreds or even thousands of descriptors relevant to various molecular properties, most of them redundant and irrelevant to retention constant prediction. Hence we employed feature selection to significantly reduce the number of attributes. Additionally, we tested the application of a bagging procedure to feature selection. Random forest regression models were built using the selected variables. The resulting models correlate better with the experimental data than the reference models obtained with linear regression. Cross-validation confirms the robustness of the models.
Croatica Chemica Acta
Shortcomings of the correlation coefficient (Pearson's) as a measure for estimating and calculating the accuracy of predictive model properties are analysed. Here we discuss two such cases that can often occur in the application of the model in predicting properties of a new external set of compounds. The first problem in using the correlation coefficient is its insensitivity to the systemic error that must be expected in predicting properties of a novel external set of compounds, which is not a random sample selected from the training set. The second problem is that an external set can be arbitrarily large or small and have an arbitrary and uneven distribution of the measured value of the target variable, whose values are not known in advance. In these conditions, the correlation coefficient can be an overoptimistic measure of agreement of predicted values with the corresponding experimental values and can lead to a highly optimistic conclusion about the predictive ability of t...
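The first problem can be illustrated with a small numerical sketch. It assumes the simplest form of such an error, a constant offset added to every prediction; the data values and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def rmse(x, y):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


observed = [1.0, 2.0, 3.0, 4.0, 5.0]
# Hypothetical predictions that are all too high by the same amount.
predicted = [v + 2.0 for v in observed]

r = pearson_r(observed, predicted)
error = rmse(observed, predicted)
print(round(r, 6), round(error, 6))  # 1.0 2.0
```

The correlation is perfect even though every prediction is off by two units, so an error measure such as RMSE is needed alongside it to expose the offset.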
Lecture Notes in Computer Science, 2016
The paper presents an application of Conformal Predictors to a chemoinformatics problem of identifying activities of chemical compounds. The paper addresses some specific challenges of this domain: a large number of compounds (training examples), high-dimensionality of feature space, sparseness and a strong class imbalance. A variant of conformal predictors called Inductive Mondrian Conformal Predictor is applied to deal with these challenges. Results are presented for several non-conformity measures (NCM) extracted from underlying algorithms and different kernels. A number of performance measures are used in order to demonstrate the flexibility of Inductive Mondrian Conformal Predictors in dealing with such a complex set of data.
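As a rough sketch of what the Mondrian variant changes, the p-value for a candidate label can be computed against calibration examples of that class only, so the error guarantee holds per class even under strong imbalance. The scores, labels and function name below are hypothetical and not taken from the paper.

```python
def mondrian_p_value(calibration, label, test_score):
    """Class-conditional p-value: compare only with calibration scores of `label`."""
    same_class = [score for y, score in calibration if y == label]
    n_at_least = sum(1 for s in same_class if s >= test_score)
    return (n_at_least + 1) / (len(same_class) + 1)


# Hypothetical imbalanced calibration set of (label, nonconformity score) pairs.
calibration = (
    [("inactive", s) for s in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)]
    + [("active", s) for s in (0.2, 0.6, 0.7)]
)

# Hypothetical scores of one test compound under each candidate label.
p_inactive = mondrian_p_value(calibration, "inactive", 0.85)
p_active = mondrian_p_value(calibration, "active", 0.5)
print(p_inactive, p_active)  # 0.2 0.75
```

Because the minority class is compared only with its own three calibration scores, the majority class cannot swamp its p-value.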
JUITA: Jurnal Informatika
A conformational epitope is a part of a protein-based vaccine. It is challenging to identify experimentally, so computational models are developed to support identification. However, class imbalance is one of the constraints on achieving optimal performance in conformational B-cell epitope prediction. In this paper, we compare several conformational B-cell epitope prediction models from non-ensemble and ensemble approaches. A sampling method (random undersampling, SMOTE, or cluster-based undersampling) is combined with a decision tree or SVM to build a non-ensemble model. A random forest model and several variants of the bagging method are used to construct the ensemble models. 10-fold cross-validation is used to validate the models. The experimental results show that the combination of cluster-based undersampling and a decision tree outperformed the other sampling methods when combined with the non-ensemble and the ensemble method. This study provides a baselin...
Journal of Chemical Information and Modeling, 2017
The ability to interpret the predictions made by quantitative structure activity relationships (QSARs) offers a number of advantages. Whilst QSARs built using non-linear modelling approaches, such as the popular Random Forest algorithm, might sometimes be more predictive than those built using linear modelling approaches, their predictions have been perceived as difficult to interpret. However, a growing number of approaches have been proposed for interpreting non-linear QSAR models in general and Random Forest in particular. In the current work, we compare the performance of Random Forest to two widely used linear modelling approaches: linear Support Vector Machines (SVM), or Support Vector Regression (SVR), and Partial Least Squares (PLS). We compare their performance in terms of their predictivity as well as the chemical interpretability of the predictions, using novel scoring schemes for assessing Heat Map images of substructural contributions. We critically assess different approaches to interpreting Random Forest models as well as for obtaining predictions from the forest. We assess the models on a large number of widely employed, public domain benchmark datasets corresponding to regression and binary classification problems of relevance to hit identification and toxicology. We conclude that Random Forest typically yields comparable or possibly better predictive performance than the linear modelling approaches and that its predictions may also be interpreted in a chemically and biologically meaningful way. In contrast to earlier work looking at interpreting non-linear QSAR models, we directly compare two methodologically distinct approaches for interpreting Random Forest models. The approaches for interpreting Random
Proceedings of the 3rd Skövde Workshop on …, 2009
Journal of Chemometrics, 2020
Nuclear magnetic resonance (NMR) can provide a large amount of information about an analyzed sample; however, its spectra contain above 6000 variables, making it difficult for random forest (RF) applications. Reducing the size of the original dataset can minimize this problem. In this paper, we compared RF classification models obtained with the full NMR spectral range and from the reduction of NMR variables, using principal component analysis (PCA) and the Fisher discriminant (FD). Then, the variables used in the construction of RF trees were analyzed and identified. Here, we used ¹H and ¹³C NMR spectra obtained from 126 petroleum samples and values of their total acid number (TAN), as measured by ASTM D664, ranging from 0.03 to 4.96 mg KOH·g⁻¹, to classify the oil samples by their TAN values. Of the two classes that resulted, the first contained 78 samples with TAN values less than, or equal to, 0.3 mg KOH·g⁻¹, while the second contained 48 samples with TAN values higher than 0.3 mg KOH·g⁻¹. The ¹H NMR results showed that the combination of FD and RF techniques provided the best accuracy (88%). For ¹³C NMR data, the most accurate model was obtained by the association of PCA and RF (84%). The identification of variables used in RF allowed a better understanding of the important chemical data contained in the spectra and the relationship to TAN in petroleum.
KEYWORDS: Fisher discriminant, NMR, PCA, random forest, reduction of variables
1 | INTRODUCTION
Random forest (RF) is a machine learning method recognized for its efficiency in supervised classification of linear and nonlinear data, especially in the areas of genetic analysis, metabolomics, medical image analysis, food quality control, and crude oil [1-4]. The potential of the RF technique in classification is confirmed when associated with analytical methods that allow the acquisition of a large amount of information, such as chromatography, mass spectrometry, infrared spectroscopy, and nuclear magnetic resonance of hydrogen (¹H NMR) and carbon (¹³C NMR) [1-4]. The NMR technique provides detailed information on functional groups, characterizing the sample at the molecular level; however, the number of variables generated may exceed 65 000, depending on the spectral range analyzed. In
2015
The report summarises some preliminary findings of WP1.4: Confidence Estimation and feature significance. It presents an application of conformal predictors in transductive and inductive modes to the large, high-dimensional, sparse and imbalanced data sets found in Compound Activity Prediction from the PubChem public repository. The report describes a version of conformal predictors called Mondrian Predictor that keeps validity guarantees for each class. The experiments were conducted using several non-conformity measures extracted from underlying algorithms such as SVM, Nearest Neighbours and Naïve Bayes. The results show (1) that the Inductive Conformal Mondrian Prediction framework is quick and effective for large imbalanced data and (2) that its less strict i.i.d. requirements combine well with training set editing algorithms such as Cascade SVM. Among the algorithms tested with the Mondrian ICP framework, Cascade SVM with Tanimoto+RBF kernel appeared to be the best performing one, if the q...
2019
In real-world scenarios, interpretable models are often required to explain predictions, and to allow for inspection and analysis of the model. The overall purpose of oracle coaching is to produce highly accurate, but interpretable, models optimized for a specific test set. Oracle coaching is applicable to the very common scenario where explanations and insights are needed for a specific batch of predictions, and the input vectors for this test set are available when building the predictive model. In this paper, oracle coaching is used for generating underlying classifiers for conformal prediction. The resulting conformal classifiers output valid label sets, i.e., the error rate on the test data is bounded by a preset significance level, as long as the labeled data used for calibration is exchangeable with the test set. Since validity is guaranteed for all conformal predictors, the key performance metric is efficiency, i.e., the size of the label sets, where smaller sets are more in...