Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (... more Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (kNN) for high-dimensional data sets, such as microarrays. The effect of the choice of dimensionality reduction method on the predictive performance of kNN for classifying microarray data is an open issue, and four common dimensionality reduction methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain(IG), are compared on eight microarray data sets. It is observed that all dimensionality reduction methods result in more accurate classifiers than what is obtained from using the raw attributes. Furthermore, it is observed that both PCA and PLS reach their best accuracies with fewer components than the other two methods, and that RP needs far more components than the others to outperform kNN on the non-reduced dataset. None of the dimensionality reduction methods can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks, but the main conclusion from the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.
International Journal of Imaging Systems and Technology, 2009
We propose an approach for data clustering based on optimum-path forest. The samples are taken as... more We propose an approach for data clustering based on optimum-path forest. The samples are taken as nodes of a graph, whose arcs are defined by an adjacency relation. The nodes are weighted by their probability density values (pdf) and a connectivity function is maximized, such that each maximum of the pdf becomes root of an optimum-path tree (cluster), composed by samples ''more strongly connected'' to that maximum than to any other root. We discuss the advantages over other pdf-based approaches and present extensions to large datasets with results for interactive image segmentation and for fast, accurate, and automatic brain tissue classification in magnetic resonance (MR) images. We also include experimental comparisons with other clustering approaches. Figure 4. The boxes show an 1D pdf q with four maxima. The map V (white) indicates the removal of two irrelevant domes (gray) when h(t) 5 q(t) 2 2. The 1D optimum-path forest P (vectors) shows the influence zones of the two remaining maxima.
The paper proposes a heuristic instance reduction algorithm as an approach to machine learning an... more The paper proposes a heuristic instance reduction algorithm as an approach to machine learning and knowledge discovery in centralized and distributed databases. The proposed algorithm is based on an original method for a selection of reference instances and creates a reduced training dataset. The reduced training set consisting of selected instances can be used as an input for the machine learning algorithms used for data mining tasks. The algorithm calculates for each instance in the data set the value of its similarity coefficient. Values of the coefficient are used to group instances into clusters. The number of clusters depends on the value of the so called representation level set by the user. Out of each cluster only a limited number of instances is selected to form a reduced training set. The proposed algorithm uses population learning algorithm for selection of instances. The paper includes a description of the proposed approach and results of the validating experiment.
Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (... more Dimensionality reduction can often improve the performance of the k-nearest neighbor classifier (kNN) for high-dimensional data sets, such as microarrays. The effect of the choice of dimensionality reduction method on the predictive performance of kNN for classifying microarray data is an open issue, and four common dimensionality reduction methods, Principal Component Analysis (PCA), Random Projection (RP), Partial Least Squares (PLS) and Information Gain(IG), are compared on eight microarray data sets. It is observed that all dimensionality reduction methods result in more accurate classifiers than what is obtained from using the raw attributes. Furthermore, it is observed that both PCA and PLS reach their best accuracies with fewer components than the other two methods, and that RP needs far more components than the others to outperform kNN on the non-reduced dataset. None of the dimensionality reduction methods can be concluded to generally outperform the others, although PLS is shown to be superior on all four binary classification tasks, but the main conclusion from the study is that the choice of dimensionality reduction method can be of major importance when classifying microarrays using kNN.
International Journal of Imaging Systems and Technology, 2009
We propose an approach for data clustering based on optimum-path forest. The samples are taken as... more We propose an approach for data clustering based on optimum-path forest. The samples are taken as nodes of a graph, whose arcs are defined by an adjacency relation. The nodes are weighted by their probability density values (pdf) and a connectivity function is maximized, such that each maximum of the pdf becomes root of an optimum-path tree (cluster), composed by samples ''more strongly connected'' to that maximum than to any other root. We discuss the advantages over other pdf-based approaches and present extensions to large datasets with results for interactive image segmentation and for fast, accurate, and automatic brain tissue classification in magnetic resonance (MR) images. We also include experimental comparisons with other clustering approaches. Figure 4. The boxes show an 1D pdf q with four maxima. The map V (white) indicates the removal of two irrelevant domes (gray) when h(t) 5 q(t) 2 2. The 1D optimum-path forest P (vectors) shows the influence zones of the two remaining maxima.
The paper proposes a heuristic instance reduction algorithm as an approach to machine learning an... more The paper proposes a heuristic instance reduction algorithm as an approach to machine learning and knowledge discovery in centralized and distributed databases. The proposed algorithm is based on an original method for a selection of reference instances and creates a reduced training dataset. The reduced training set consisting of selected instances can be used as an input for the machine learning algorithms used for data mining tasks. The algorithm calculates for each instance in the data set the value of its similarity coefficient. Values of the coefficient are used to group instances into clusters. The number of clusters depends on the value of the so called representation level set by the user. Out of each cluster only a limited number of instances is selected to form a reduced training set. The proposed algorithm uses population learning algorithm for selection of instances. The paper includes a description of the proposed approach and results of the validating experiment.
Uploads
Papers by Fábio Falcão