2008
Abstract: In this paper we propose an energy-based model (EBM) for selecting subsets of features that are both causally and predictively relevant for classification tasks. The proposed method is tested in the causality challenge, a competition that promotes research on strengthening feature selection by taking into account causal information about features.
2013
Editor: Isabelle Guyon et al. In this paper we propose an energy-based model (EBM) for selecting subsets of features that are both causally and predictively relevant for classification tasks. The proposed method is tested in the causality challenge, a competition that promotes research on strengthening feature selection by taking into account causal information about features. Under the proposed approach, an energy value is assigned to every configuration of features, and the problem is reduced to finding the configuration that minimizes an energy function. We propose an energy function that takes into account causal, predictive, and relevance/correlation information about features. In particular, we introduce potentials that combine the rankings of individual feature selection methods, Markov blanket information, and predictive performance estimates. The configuration with the lowest energy is the one offering the best tradeoff between these sources of information. Experimental results ...
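The idea of assigning an energy to every feature configuration and picking the minimizer can be illustrated with a small sketch. This is not the paper's energy function: the `relevance` and `redundancy` potentials, the `alpha` weight, and the exhaustive search are all assumptions made for illustration, and the search is only feasible for a handful of features.

```python
from itertools import product

def energy(config, relevance, redundancy, alpha=1.0):
    """Toy energy of a 0/1 feature configuration (hypothetical potentials)."""
    selected = [i for i, bit in enumerate(config) if bit]
    # Unary potentials: each selected feature lowers the energy by its relevance.
    unary = -sum(relevance[i] for i in selected)
    # Pairwise potentials: selecting two redundant features raises the energy.
    pairwise = sum(redundancy[i][j] for i in selected for j in selected if i < j)
    return unary + alpha * pairwise

def min_energy_config(relevance, redundancy, alpha=1.0):
    """Exhaustively search all 2^n configurations for the lowest energy."""
    n = len(relevance)
    return min(product((0, 1), repeat=n),
               key=lambda c: energy(c, relevance, redundancy, alpha))

# Made-up scores: features 0 and 1 are both relevant but nearly redundant.
relevance = [0.9, 0.8, 0.1]
redundancy = [[0.0, 0.95, 0.0],
              [0.95, 0.0, 0.0],
              [0.0, 0.0, 0.0]]
best = min_energy_config(relevance, redundancy)
print(best)
```

With these made-up numbers, the minimizer keeps feature 0, drops its near-duplicate feature 1, and keeps the weakly relevant but non-redundant feature 2.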
2006
Energy-Based Models (EBMs) capture dependencies between variables by associating a scalar energy to each configuration of the variables. Inference consists in clamping the value of observed variables and finding configurations of the remaining variables that minimize the energy. Learning consists in finding an energy function in which observed configurations of the variables are given lower energies than unobserved ones. The EBM approach provides a common theoretical framework for many learning models, including traditional discriminative and generative approaches, as well as graph-transformer networks, conditional random fields, maximum margin Markov networks, and several manifold learning methods.
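Inference by clamping can be sketched in a few lines. The energy function below is a made-up example (it is not taken from the tutorial), and the search over a finite candidate set is an assumption that only works when the unobserved variable is discrete and small.

```python
def infer(energy, x, candidates):
    """Clamp the observed value x and return the candidate y with the lowest energy."""
    return min(candidates, key=lambda y: energy(x, y))

def toy_energy(x, y):
    # Hypothetical energy: low when y is close to 2 * x.
    return (y - 2 * x) ** 2

y_hat = infer(toy_energy, 3, range(10))
print(y_hat)
```

Learning, in this picture, would mean shaping `toy_energy` so that the observed (x, y) pairs receive lower energy than the alternatives.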
Pattern Recognition, 2010
Selecting relevant features for support vector machine (SVM) classifiers is important for a variety of reasons such as generalization performance, computational efficiency, and feature interpretability. Traditional SVM approaches to feature selection typically extract features and learn SVM parameters independently. Independently performing these two steps might result in a loss of information related to the classification process. This paper proposes a convex energy-based framework to jointly perform feature selection and SVM parameter learning for linear and non-linear kernels. Experiments on various databases show significant reduction of features used while maintaining classification performance.
Neural processing letters, 2010
In pattern classification, input pattern features usually contribute differently, according to their relevance for a specific classification task. In a previous paper, we introduced the Energy Supervised Relevance Neural Gas classifier, a kernel method that uses the maximization of Onicescu's informational energy to compute the relevances of input features. Relevances were used to improve classification accuracy. In the present work, we focus on the feature ranking capability of this approach. We compare our algorithm to standard feature ranking methods.
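For a discrete variable, Onicescu's informational energy is the sum of the squared probabilities of its outcomes. The sketch below only estimates that quantity from a sample using empirical frequencies; it does not reproduce the neural gas classifier or its relevance updates, and the function name is an assumption.

```python
from collections import Counter

def informational_energy(values):
    """Empirical Onicescu informational energy: sum of squared outcome frequencies."""
    n = len(values)
    return sum((count / n) ** 2 for count in Counter(values).values())

print(informational_energy(["a", "a", "a", "a"]))  # concentrated on one outcome
print(informational_energy(["a", "b", "c", "d"]))  # uniform over four outcomes
```

The value is 1 when all mass sits on one outcome and falls to 1/n for a uniform distribution over n outcomes, so it behaves as a measure of certainty rather than uncertainty.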
International Journal of Reconfigurable and Embedded Systems (IJRES), 2025
This study focuses on the development of electrical power forecasting based on electricity usage in Wuzhou, China. To develop a forecasting model, the important features need to be identified. Therefore, this study investigates the performance of feature selection methods, focusing on mutual information as a filter and random forest as a wrapper-based feature selection method. From the experiment, six features were chosen, with both feature selection methods choosing almost identical features. The selected features were then used to train and test common machine learning models, namely random forest regressor, support vector regression (SVR), k-nearest neighbor (KNN) regressor, and extreme gradient boosting (XGBoost) regressor. The performance of each feature selection method on each model is measured in terms of mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R²). The experiment revealed that XGBoost outperformed the other machine learning models, with an RMSE of 0.9566 and an R² of 0.2561. However, SVR outperformed XGBoost and the other models on MAE, obtaining 0.6028. It can be concluded that the filter-based feature selection outperformed the embedded feature selection.
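The three evaluation metrics named in this abstract have standard definitions, which can be sketched as follows. The function name and the sample values are assumptions for illustration and have no connection to the study's data.

```python
import math

def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of numbers."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)            # residual sum of squares
    rmse = math.sqrt(ss_res / n)
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2

mae, rmse, r2 = regression_metrics([1, 2, 3, 4], [1, 2, 3, 6])
print(mae, rmse, r2)
```

Because RMSE squares the errors before averaging, a single large miss affects it more than MAE, which is one reason two models can rank differently under the two metrics.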
PloS one, 2018
Feature selection is considered to be one of the most critical methods for choosing appropriate features from a larger set of items. This task requires two basic steps: ranking and filtering. Of these, the former necessitates the ranking of all features, while the latter involves filtering out all irrelevant features based on some threshold value. In this regard, several feature selection methods with well-documented capabilities and limitations have already been proposed. Similarly, feature ranking is also nontrivial, as it requires the designation of an optimal cutoff value so as to properly select important features from a list of candidate features. However, the availability of a comprehensive feature ranking and a filtering approach, which alleviates the existing limitations and provides an efficient mechanism for achieving optimal results, is a major problem. Keeping in view these facts, we present an efficient and comprehensive univariate ensemble-based feature selection (uEF...
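The two basic steps described here, ranking and then filtering by a threshold, can be sketched as below. This is not the ensemble method the abstract goes on to present: the scores, the feature names, and the fixed threshold are all hypothetical.

```python
def rank_and_filter(scores, threshold):
    """Rank features by score (highest first), then keep those at or above the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]

# Made-up relevance scores for three features.
scores = {"age": 0.2, "income": 0.9, "tenure": 0.5}
kept = rank_and_filter(scores, threshold=0.5)
print(kept)
```

The choice of `threshold` is exactly the cutoff problem the abstract calls nontrivial; here it is simply passed in by hand.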
Expert Systems with Applications, 2009
Feature selection has become an increasingly important field of research. It aims at finding optimal feature subsets that can achieve better generalization on unseen data. However, this can be a very challenging task, especially when dealing with large feature sets. Hence, a search strategy is needed to explore a relatively small portion of the search space in order to find "semi-optimal" subsets. Many search strategies have been proposed in the literature; however, most of them do not take into consideration relationships between features. Because features usually have different degrees of dependency on each other, we propose in this paper a new search strategy that uses the dependency between feature pairs to guide the search in the feature space. When compared to other well-known search strategies, the proposed method prevailed.
Lecture Notes in Computer Science, 2000
Feature selection aims to find the most important feature subset from a given feature set without degradation of discriminative information. In general, we wish to select a feature subset that is effective for any kind of classifier. Such studies are called Classifier-Independent Feature Selection, and Novovičová et al.'s method is one of them. Their method estimates the densities of classes with Gaussian mixture models and selects a feature subset using the Kullback-Leibler divergence between the estimated densities, but gives no indication of how to choose the number of features to be selected. Kudo and Sklansky (1997) suggested selecting a minimal feature subset such that the degree of degradation of performance is guaranteed. In this study, based on their suggestion, we try to find a feature subset that is minimal while maintaining a given Kullback-Leibler divergence.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
Several authors have studied the problem of dimensionality reduction or feature selection using statistical distance measures, e.g., the Chernoff coefficient, Bhattacharyya distance, I-divergence, and J-divergence, because they generally felt that direct use of the probability of classification error expression was either computationally or mathematically intractable. We show that for the difficult problem of testing one weakly stationary Gaussian stochastic process against another when the mean vectors are similar and the covariance matrices (patterns) differ, the probability of error expression may be dealt with directly using a combination of classical methods and distribution function theory. The results offer a new and accurate finite-dimensionality information-theoretic strategy for feature selection, and are shown, by use of examples, to be superior to the well-known Kadota-Shepp approach, which employs distance measures and asymptotics in its formulation. The present Part I deals with the theory; Part II deals with the implementation of a computer-based real-time pattern classifier which takes into account a realistic quasi-stationarity of the patterns. Index Terms: Bayesian error probability, covariance matrix discrimination, data compression, empirical distribution function, feature selection, pattern classification, statistical distance measures.
2001
Feature selection is used to improve the efficiency of learning algorithms by finding an optimal subset of features. However, most feature selection techniques can handle only certain types of data. Additional limitations of existing methods include intensive computational requirements and an inability to identify redundant variables. In this paper, we present a novel, information-theoretic algorithm for feature selection, which finds an optimal set of attributes by removing both irrelevant and redundant features. The algorithm has polynomial computational complexity and is applicable to datasets of mixed nature. The method's performance is evaluated on several benchmark datasets using a standard classifier (C4.5).
2016
From a machine learning point of view, identifying a subset of relevant features from a real data set can be useful to improve the results achieved by classification methods and to reduce their time and space complexity. To achieve this goal, feature selection methods are usually employed. These approaches assume that the data contains redundant or irrelevant attributes that can be eliminated. In this work, we propose a novel algorithm to manage the optimization problem that is at the foundation of the Mutual Information feature selection methods. Furthermore, our novel approach is able to estimate automatically the number of dimensions to retain. The quality of our method is confirmed by the promising results achieved on standard real data sets.
The Annals of Applied Statistics, 2010
In generalized linear regression problems with an abundant number of features, lasso-type regularization, which imposes an ℓ1-constraint on the regression coefficients, has become a widely established technique. Deficiencies of the lasso in certain scenarios, notably strongly correlated design, were unmasked when Zou and Hastie [J. Roy. Statist. Soc. Ser. B 67 (2005) 301-320] introduced the elastic net. In this paper we propose to extend the elastic net by admitting general nonnegative quadratic constraints as a second form of regularization. The generalized ridge-type constraint will typically make use of the known association structure of features, for example, by using temporal or spatial closeness.
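One way to write such an estimator, given here as a sketch in penalized (Lagrangian) form rather than the constrained form the abstract mentions, is shown below. The symbols are assumptions introduced for illustration: \ell(\beta) denotes the log-likelihood, \lambda_1 and \lambda_2 are nonnegative tuning parameters, and \Lambda is a symmetric positive semidefinite matrix encoding the association structure of the features.

```latex
\hat{\beta} \;=\; \arg\min_{\beta} \; -\ell(\beta)
  \;+\; \lambda_1 \lVert \beta \rVert_1
  \;+\; \lambda_2 \, \beta^{\top} \Lambda \, \beta ,
  \qquad \Lambda \succeq 0 .
```

Setting \Lambda to the identity matrix reduces the quadratic term to \lambda_2 \lVert \beta \rVert_2^2, which is the ridge part of the ordinary elastic net penalty.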
arXiv (Cornell University), 2022
Choosing which properties of the data to use as input to multivariate decision algorithms (a.k.a. feature selection) is an important step in solving any problem with machine learning. While there is a clear trend towards training sophisticated deep networks on large numbers of relatively unprocessed inputs (so-called automated feature engineering), for many tasks in physics, sets of theoretically well-motivated and well-understood features already exist. Working with such features can bring many benefits, including greater interpretability, reduced training and run time, and enhanced stability and robustness. We develop a new feature selection method based on Distance Correlation (DisCo), and demonstrate its effectiveness on the tasks of boosted top- and W-tagging. Using our method to select features from a set of over 7,000 energy flow polynomials, we show that we can match the performance of much deeper architectures by using only ten features and two orders of magnitude fewer model parameters.
European Symposium on Artificial Neural Networks ( …
An Informational Energy LVQ Approach for Feature Ranking. Răzvan Andonie (1) and Angel Cataron (2). (1) Computer Science Department, Central Washington University, USA ... Denote the set of all codebook vectors by {w1,..., wK}. The components of a vector wj are [wj1,...,wjn]. ...
Feature selection is one of the important issues in the domains of system modelling, data mining, and pattern recognition. Subset selection evaluates a subset of features as a group for suitability prior to applying a learning algorithm. Subset selection algorithms can be divided into wrapper, filter, and hybrid categories. The related literature surveyed is given as follows.
Entropy, 2019
Feature selection aims to select the smallest feature subset that yields the minimum generalization error. In the rich literature in feature selection, information theory-based approaches seek a subset of features such that the mutual information between the selected features and the class labels is maximized. Despite the simplicity of this objective, there still remain several open problems in optimization. These include, for example, the automatic determination of the optimal subset size (i.e., the number of features) or a stopping criterion if the greedy searching strategy is adopted. In this paper, we suggest two stopping criteria by just monitoring the conditional mutual information (CMI) among groups of variables. Using the recently developed multivariate matrix-based Rényi’s α-entropy functional, which can be directly estimated from data samples, we showed that the CMI among groups of variables can be easily computed without any decomposition or approximation, hence making o...
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2004
This paper adopts a Bayesian approach to simultaneously learn both an optimal nonlinear classifier and a subset of predictor variables (or features) that are most relevant to the classification task. The approach uses heavy-tailed priors to promote sparsity in the utilization of both basis functions and features; these priors act as regularizers for the likelihood function that rewards good classification on the training data. We derive an expectation-maximization (EM) algorithm to efficiently compute a maximum a posteriori (MAP) point estimate of the various parameters. The algorithm is an extension of recent state-of-the-art sparse Bayesian classifiers, which in turn can be seen as Bayesian counterparts of support vector machines. Experimental comparisons using kernel classifiers demonstrate both parsimonious feature selection and excellent classification accuracy on a range of synthetic and benchmark data sets.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2005
Feature selection is an important problem for pattern classification systems. We study how to select good features according to the maximal statistical dependency criterion based on mutual information. Because of the difficulty in directly implementing the maximal dependency condition, we first derive an equivalent form, called the minimal-redundancy-maximal-relevance criterion (mRMR), for first-order incremental feature selection. Then, we present a two-stage feature selection algorithm by combining mRMR and other more sophisticated feature selectors (e.g., wrappers). This allows us to select a compact set of superior features at very low cost. We perform extensive experimental comparison of our algorithm and other methods using three different classifiers (naive Bayes, support vector machine, and linear discriminant analysis) and four different data sets (handwritten digits, arrhythmia, NCI cancer cell lines, and lymphoma tissues). The results confirm that mRMR leads to promising improvement on feature selection and classification accuracy.
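The incremental mRMR idea can be sketched for discrete features as a greedy loop: at each step, pick the feature whose mutual information with the labels, minus its average mutual information with the features already selected, is largest. The sketch below uses plug-in (frequency-count) mutual information estimates and the difference form of the criterion; the function names, the alphabetical tie-breaking, and the tiny data set are assumptions for illustration, and the paper's second wrapper stage is not shown.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in nats for two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def mrmr_score(name, features, relevance, selected):
    """Relevance minus mean redundancy with the already selected features."""
    if not selected:
        return relevance[name]
    redundancy = sum(mutual_information(features[name], features[s])
                     for s in selected) / len(selected)
    return relevance[name] - redundancy

def mrmr(features, labels, k):
    """Greedily select up to k feature names; ties go to the alphabetically first name."""
    relevance = {name: mutual_information(col, labels)
                 for name, col in features.items()}
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        best = max(sorted(remaining),
                   key=lambda name: mrmr_score(name, features, relevance, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

# Made-up data: "a_copy" duplicates "a"; "b" is weaker but carries different information.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a":      [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b":      [0, 0, 1, 0, 1, 1, 0, 1],
}
chosen = mrmr(features, labels, k=2)
print(chosen)
```

A pure relevance ranking would pick "a" and "a_copy" here; the redundancy term is what steers the second choice to "b".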
Pattern Recognition, 2009
Feature selection plays an important role in data mining and pattern recognition, especially for large-scale data. In recent years, various metrics have been proposed to measure the relevance between different features. Since mutual information is nonlinear and can effectively represent the dependencies between features, it is one of the most widely used measures in feature selection. For these reasons, many promising feature selection algorithms based on mutual information with different parameters have been developed. In this paper, we first introduce a general criterion function for mutual information in feature selectors, which brings together most of the information measures used in previous algorithms. In traditional selectors, mutual information is estimated on the whole sampling space. This, however, cannot exactly represent the relevance among features. To cope with this problem, the second purpose of this paper is to propose a new feature selection algorithm based on dynamic mutual information, which is estimated only on unlabeled instances. To verify the effectiveness of our method, several experiments are carried out on sixteen UCI datasets using four typical classifiers. The experimental results indicate that our algorithm achieved better results than other methods in most cases.
We introduce a view of unsupervised learning that integrates probabilistic and nonprobabilistic methods for clustering, dimensionality reduction, and feature extraction in a unified framework. In this framework, an energy function associates low energies to input points that are similar to training samples, and high energies to unobserved points. Learning consists in minimizing the energies of training samples while ensuring that the energies of unobserved ones are higher. Some traditional methods construct the architecture so that only a small number of points can have low energy, while other methods explicitly "pull up" on the energies of unobserved points. In probabilistic methods the energy of unobserved points is pulled up by minimizing the log partition function, an expensive and sometimes intractable process. We explore different and more efficient methods using an energy-based approach. In particular, we show that a simple solution is to restrict the amount of information contained in codes that represent the data. We demonstrate such a method by training it on natural image patches and applying it to image denoising.