Background: High-dimensional biomedical data are frequently clustered to identify subgroup structures pointing at distinct disease subtypes. It is crucial that the clustering algorithm used works correctly. However, by imposing a predefined shape on the clusters, classical algorithms occasionally suggest a cluster structure in homogeneously distributed data or assign data points to incorrect clusters. We analyzed whether this can be avoided by using emergent self-organizing feature maps (ESOM).
Methods: Data sets with different degrees of complexity were submitted to ESOM analysis with large numbers of neurons, using an interactive R-based bioinformatics tool. On top of the trained ESOM, the distance structure in the high-dimensional feature space was visualized in the form of a so-called U-matrix. Clustering results were compared with those provided by classical clustering algorithms, including single linkage, Ward and k-means.
Results: Ward clustering imposed cluster structures on structureless "golf ball", "cuboid" and "S-shaped" data sets containing only random data, and likewise on permuted real-world data sets. By contrast, the ESOM/U-matrix approach correctly found that these data contain no cluster structure. Moreover, ESOM/U-matrix correctly identified clusters in biomedical data truly containing subgroups, and was always correct in cluster structure identification in further canonical artificial data. Using intentionally simple data sets, it is shown that popular clustering algorithms typically used for biomedical data sets may fail to cluster data correctly, suggesting that they are also likely to perform erroneously on high-dimensional biomedical data.
Conclusions: The present analyses emphasized that generally established classical hierarchical clustering algorithms carry a considerable tendency to produce erroneous results. By contrast, unsupervised machine-learned analysis of cluster structures, applied using the ESOM/U-matrix method, is a viable, unbiased method to identify true clusters in the high-dimensional space of complex data.
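The ESOM/U-matrix idea above can be sketched in a few lines: train a map with far more neurons than expected clusters, then display, for every neuron, the mean distance of its weight vector to its grid neighbours; ridges of high distance mark cluster borders. The following is a minimal NumPy sketch under our own naming and parameters; this toy trainer stands in for the authors' interactive R-based tool, not a reproduction of it.

```python
import numpy as np

def train_som(data, rows=40, cols=60, epochs=10, seed=0):
    """Train a simple rectangular SOM. An 'emergent' map (ESOM) uses
    many more neurons than expected clusters. Minimal sketch only."""
    rng = np.random.default_rng(seed)
    n, d = data.shape
    weights = rng.normal(size=(rows, cols, d))
    grid = np.stack(np.meshgrid(np.arange(rows), np.arange(cols),
                                indexing="ij"), axis=-1)
    steps = epochs * n
    for t in range(steps):
        x = data[rng.integers(n)]
        # best-matching unit: neuron whose weight vector is closest to x
        dists = np.linalg.norm(weights - x, axis=-1)
        bmu = np.unravel_index(np.argmin(dists), dists.shape)
        # neighborhood radius and learning rate decay over training
        frac = t / steps
        sigma = max(rows, cols) / 2 * (1 - frac) + 1
        lr = 0.5 * (1 - frac) + 0.01
        h = np.exp(-np.sum((grid - bmu) ** 2, axis=-1) / (2 * sigma ** 2))
        weights += lr * h[..., None] * (x - weights)
    return weights

def u_matrix(weights):
    """U-matrix: mean distance of each neuron's weight vector to its
    grid neighbours; high values mark cluster borders."""
    rows, cols, _ = weights.shape
    u = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            neigh = [(i + di, j + dj) for di, dj in
                     ((-1, 0), (1, 0), (0, -1), (0, 1))
                     if 0 <= i + di < rows and 0 <= j + dj < cols]
            u[i, j] = np.mean([np.linalg.norm(weights[i, j] - weights[a, b])
                               for a, b in neigh])
    return u
```

On clustered input the resulting U-matrix shows low "basins" separated by high "walls"; on homogeneous random data it stays flat, which is exactly the property the abstract exploits to avoid imposing structure where none exists.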
International Journal of Neural Systems, 1999
Determining the structure of data without prior knowledge of the number of clusters or any information about their composition is a problem of interest in many fields, such as image analysis, astrophysics and biology. Partitioning a set of n patterns in a p-dimensional feature space must be done such that those in a given cluster are more similar to each other than to the rest. As there are approximately K^n/K! possible ways of partitioning the patterns among K clusters, finding the best solution is very hard when n is large. The search space increases further when the number of partitions is not known a priori. Although the self-organizing feature map (SOM) can be used to visualize clusters, the automation of knowledge discovery by SOM is a difficult task. This paper proposes region-based image processing methods to post-process the U-matrix obtained after the unsupervised learning performed by SOM. Mathematical morphology is applied to identify regions of neurons that are similar…
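The morphological post-processing described here can be approximated with standard image-processing primitives: treat the U-matrix as a grayscale image, keep the low-valued interiors, clean them with a binary opening, and label connected regions. A hedged SciPy sketch follows; the threshold quantile and function name are our assumptions, and the paper's actual pipeline uses fuller mathematical morphology than a simple opening.

```python
import numpy as np
from scipy import ndimage

def segment_u_matrix(u, quantile=0.5):
    """Sketch of region-based post-processing of a U-matrix: neurons
    whose U-height is below a threshold (i.e. close to their grid
    neighbours in feature space) are cluster interiors; connected
    components of that mask are taken as the clusters."""
    mask = u < np.quantile(u, quantile)      # low 'walls' = interiors
    mask = ndimage.binary_opening(mask)      # remove speckle noise
    labels, n_regions = ndimage.label(mask)  # label connected regions
    return labels, n_regions
```

Each nonzero label in the output identifies one region of mutually similar neurons; mapping data points to their best-matching neurons then yields the final clustering.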
With the advancement of high-throughput biotechnologies, biological data describing DNA, RNA, protein, and metabolite biomolecules are generated faster than ever, and huge amounts of information are being produced and collected. Bioinformatics uses information technology to facilitate the discovery of new knowledge from large sets of various biological data at the molecular level. Among the various applications of information technology, clustering has long played an important role.
Peerj Computer Science, 2024
This survey rigorously explores contemporary clustering algorithms within the machine learning paradigm, focusing on five primary methodologies: centroid-based, hierarchical, density-based, distribution-based, and graph-based clustering. Through the lens of recent innovations such as deep embedded clustering and spectral clustering, we analyze the strengths, limitations, and the breadth of application domains—ranging from bioinformatics to social network analysis. Notably, the survey introduces novel contributions by integrating clustering techniques with dimensionality reduction and proposing advanced ensemble methods to enhance stability and accuracy across varied data structures. This work uniquely synthesizes the latest advancements and offers new perspectives on overcoming traditional challenges like scalability and noise sensitivity, thus providing a comprehensive roadmap for future research and practical applications in data-intensive environments.
Menemui Matematik (Discovering Mathematics), 2016
The self-organizing map is among the most widely accepted unsupervised learning algorithms for cluster analysis. It is an important tool used to map high-dimensional data sets onto a low-dimensional discrete lattice of neurons, a feature exploited for clustering and classifying data. Clustering is the process of grouping data elements into classes or clusters so that items in each class or cluster are as similar to each other as possible. In this paper, we present an overview of the self-organizing map, its architecture, its applications and its training algorithm. Computer simulations have been analyzed based on samples of data for clustering problems.
2010
This work presents a neural network model for the clustering analysis of data based on Self-Organizing Maps (SOM). During the training stage, the model evolves towards a hierarchical structure according to the input requirements. The hierarchical structure acts as a specialization tool that provides refinements of the classification process. The structure behaves like a single map with different resolutions depending on the region to analyze. The benefits and performance of the algorithm are discussed in application to the Iris dataset, a classical example for pattern recognition.
IEEE Access
Clustering is a challenging problem in machine learning in which one attempts to group N objects into K₀ groups based on P features measured on each object. In this article, we examine the case where N ≪ P and K₀ is not known. Clustering in such high-dimensional, small-sample-size settings has numerous applications in biology, medicine, the social sciences, clinical trials, and other scientific and experimental fields. Whereas most existing clustering algorithms either require the number of clusters to be known a priori or are sensitive to the choice of tuning parameters, our method does not require the prior specification of K₀ or any tuning parameters. This represents an important advantage for our method because training data are not available in the applications we consider (i.e., in unsupervised learning problems). Without training data, estimating K₀ and other hyperparameters (and thus applying alternative clustering algorithms) can be difficult and lead to inaccurate results. Our method is based on a simple transformation of the Gram matrix and application of the strong law of large numbers to the transformed matrix. If the correlation between features decays as the number of features grows, we show that the transformed feature vectors concentrate tightly around their respective cluster expectations in a low-dimensional space. This result simplifies the detection and visualization of the unknown cluster configuration. We illustrate the algorithm by applying it to 32 benchmarked microarray datasets, each containing thousands of genomic features measured on a relatively small number of tissue samples. Compared to 21 other commonly used clustering methods, we find that the proposed algorithm is faster and twice as accurate in determining the "best" cluster configuration.
INDEX TERMS: Clustering, Gram matrix, high-dimensional features, hyperparameter-free.
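The concentration phenomenon this abstract relies on is easy to illustrate. The sketch below is only a loose illustration of the general idea, not the paper's exact transformation: rows of the scaled Gram matrix G = XXᵀ/P act as fixed-length profiles of the N samples, and by the law of large numbers their within-cluster spread shrinks as the number of features P grows.

```python
import numpy as np

rng = np.random.default_rng(0)

def gram_rows(X):
    """Rows of the scaled Gram matrix G = X X^T / P serve as
    P-independent 'profiles' of the N samples."""
    return X @ X.T / X.shape[1]

def within_cluster_spread(P):
    """Two clusters of 10 samples each, differing in mean. We measure
    how tightly the Gram-row profiles of cluster 1 agree on the
    columns belonging to cluster 2 (off-diagonal entries only, so the
    inflated diagonal does not distort the spread)."""
    X = np.vstack([rng.normal(loc=+1.0, size=(10, P)),
                   rng.normal(loc=-1.0, size=(10, P))])
    G = gram_rows(X)
    return np.std(G[:10, 10:], axis=0).mean()

# The spread shrinks roughly like 1/sqrt(P) as features accumulate,
# so the profiles concentrate around their cluster expectations.
```

Because the profiles live in an N-dimensional space regardless of P, the unknown cluster configuration becomes visible in a low-dimensional plot once P is large enough.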
2005
In this paper, we propose a new clustering method consisting in automated "flood-fill segmentation" of the U*-matrix of a Self-Organizing Map after training. Using several artificial datasets as a benchmark, we find that the clustering results of our U*F method are good over a wide range of critical dataset types. Furthermore, comparison to standard clustering algorithms (k-means, single-linkage and Ward) applied directly on the same datasets shows that each of the latter performs very badly on at least one kind of dataset, contrary to our U*F clustering method: while not always the best, U*F clustering has the great advantage of exhibiting consistently good results. Another advantage of U*F is that the computation cost of the SOM segmentation phase is negligible, contrary to other SOM-based clustering approaches which apply O(n² log n) standard clustering algorithms to the SOM prototypes. Finally, it should be emphasized that U*F clustering does not require a priori knowledge of the number of clusters.
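A toy version of the flood-fill idea reads as follows: start from the lowest unlabelled neuron of the U*-matrix, flood all 4-connected neighbours that lie below a height threshold, then repeat from the next unlabelled minimum. The fixed `height` parameter and the function name are our simplifications; the actual U*F method derives its stopping criterion from the U*-matrix itself.

```python
from collections import deque

import numpy as np

def flood_fill_segments(u, height):
    """Toy flood-fill segmentation of a U*-matrix: each basin of
    neurons below `height`, reachable by 4-connected moves from a
    local minimum, becomes one cluster label."""
    rows, cols = u.shape
    labels = np.zeros((rows, cols), dtype=int)
    next_label = 0
    # visit neurons from lowest to highest U*-value
    order = np.dstack(np.unravel_index(np.argsort(u, axis=None), u.shape))[0]
    for i, j in order:
        if labels[i, j] or u[i, j] >= height:
            continue
        next_label += 1
        labels[i, j] = next_label
        queue = deque([(i, j)])
        while queue:  # breadth-first flood fill of the basin
            a, b = queue.popleft()
            for da, db in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                x, y = a + da, b + db
                if (0 <= x < rows and 0 <= y < cols
                        and not labels[x, y] and u[x, y] < height):
                    labels[x, y] = next_label
                    queue.append((x, y))
    return labels, next_label
```

Because this pass touches each neuron a constant number of times, its cost is linear in the map size, consistent with the abstract's claim that the segmentation phase is negligible next to the SOM training itself.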
Clustering is a branch of multivariate analysis that is used to create groups of data. While a variety of techniques are currently used for creating clusters, many require additional information, such as the actual number of clusters, to be defined before they can be carried out. As a case study, this research presents a novel neural network that is capable of creating groups by combining hierarchical clustering and self-organizing maps, without requiring the number of existing clusters to be specified beforehand.
2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541)
The Self-Organizing Map (SOM) has emerged as one of the popular choices for clustering data; however, when it comes to the point-density accuracy of codebooks or the reliability and interpretability of the map, the SOM leaves much to be desired. In this paper, we compare the newly developed K-Means Hierarchical (KMH) clustering algorithm to the SOM. We also introduce a new initialization scheme for k-means that improves codebook placement, and we propose a novel visualization scheme that combines Principal Component Analysis (PCA) and a Minimal Spanning Tree (MST) in an arrangement that, unlike the SOM, ensures the reliability of the visualization. A practical application of the algorithm is demonstrated on a challenging bioinformatics problem.
PLoS ONE, 2008
Background: A hierarchy, characterized by tree-like relationships, is a natural method of organizing data in various domains. When considering an unsupervised machine learning routine, such as clustering, a bottom-up hierarchical (BU, agglomerative) algorithm is used as a default and is often the only method applied.
Studies in health technology and informatics, 2002
This work deals with multidimensional data analysis, specifically cluster analysis applied to a very well known dataset, the Wisconsin Breast Cancer dataset. After an introduction to the topics of the paper, the concept of cluster analysis is briefly explained and different methods of cluster analysis are compared. Further, the Kohonen model of self-organizing maps is briefly described, together with an example and with explanations of how cluster analysis can be performed using the maps. After describing the data set and the methodology used for the analysis, we present the findings using textual as well as visual descriptions and conclude that the approach is a useful complement for assessing multidimensional data, and that this dataset has been overused for automated decision benchmarking purposes without a thorough analysis of the data it contains.
Self-Organizing Maps, 2010
Applications of clustering algorithms in biomedical research are ubiquitous, with typical examples including gene expression data analysis, genomic sequence analysis, biomedical document mining, and MRI image analysis. However, due to the diversity of cluster analysis, the differing terminologies, goals, and assumptions underlying different clustering algorithms can be daunting. Thus, determining the right match between clustering algorithms and biomedical applications has become particularly important. This paper is presented to provide biomedical researchers with an overview of the status quo of clustering algorithms, to illustrate examples of biomedical applications based on cluster analysis, and to help biomedical researchers select the most suitable clustering algorithms for their own applications.
Fig. 1. Number of papers on cluster analysis in subject areas of Life and Health Sciences from 2000 to 2009. Searches were performed using Scopus.
As of April 15, 2008 (GenBank Release 165.0), 89,172,350,468 bases from 85,500,730 sequences were in the GenBank database, constituting an increase of about 25 billion bases from two years earlier.
Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2011
We survey agglomerative hierarchical clustering algorithms and discuss efficient implementations that are available in R and other software environments. We look at hierarchical self-organizing maps, and mixture models. We review grid-based clustering, focusing on hierarchical density-based approaches. Finally, we describe a recently developed very efficient (linear time) hierarchical clustering algorithm, which can also be viewed as a hierarchical grid-based algorithm.
Lecture Notes in Computer Science, 2016
One main challenge in modern medicine is the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival. However, clustering high-dimensional expression data is challenging due to noise and the curse of high-dimensionality. This article describes a disease subtyping pipeline that is able to exploit the important information available in pathway databases and clinical variables. The pipeline consists of a new feature selection procedure and existing clustering methods. Our procedure partitions a set of patients using the set of genes in each pathway as clustering features. To select the best features, this procedure estimates the relevance of each pathway and fuses relevant pathways. We show that our pipeline finds subtypes of patients with more distinctive survival profiles than traditional subtyping methods by analyzing a TCGA colon cancer gene expression dataset. Here we demonstrate that our pipeline improves three different clustering methods: k-means, SNF, and hierarchical clustering.
International Journal of Scientific Research in Computer Science, Engineering and Information Technology, 2021
Artificial Intelligence (AI) and Machine Learning (ML) are rapidly becoming areas of interest for many researchers. ML is a field of computer science that gives computers the capability to learn without being explicitly programmed. This work focuses on the standard k-means clustering algorithm and analyses its shortcomings. Our modified k-means algorithm does not calculate the distance between each data object and all cluster centres in every iteration, which makes the clustering highly efficient. In this work, we improve the k-means algorithm by storing some information in each iteration that is reused in the next iteration. This method avoids repeatedly computing the distance of each data object to every cluster centre, saving running time. Experimental results show enhanced clustering speed and accuracy and a reduced computational complexity of k-means. We evaluated this work on the Iris dataset obtained from Kaggle.
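The caching idea described in this abstract can be sketched as follows: each point stores its distance to its assigned centre, and after the centres move, a point that did not get farther from its own centre keeps its cluster without scanning the remaining centres. This is a heuristic sketch under our own naming, not the paper's exact algorithm.

```python
import numpy as np

def kmeans_cached(X, k, iters=50, seed=0):
    """k-means with per-point distance caching: a point whose distance
    to its current centre did not increase after the centre update
    skips the full scan over all k centres."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].copy()
    # initial full assignment
    d_all = np.linalg.norm(X[:, None] - centers[None], axis=2)
    assign = d_all.argmin(axis=1)
    cached = d_all[np.arange(len(X)), assign]
    for _ in range(iters):
        for c in range(k):  # recompute centres as cluster means
            pts = X[assign == c]
            if len(pts):
                centers[c] = pts.mean(axis=0)
        moved = False
        for i, x in enumerate(X):
            d_own = np.linalg.norm(x - centers[assign[i]])
            if d_own <= cached[i]:
                cached[i] = d_own  # centre got closer: keep cluster, skip scan
                continue
            d = np.linalg.norm(centers - x, axis=1)  # full scan fallback
            j = int(d.argmin())
            if j != assign[i]:
                assign[i], moved = j, True
            cached[i] = d[j]
        if not moved:
            break
    return assign, centers
```

The skip condition is a heuristic (a different centre could in principle have moved even closer), which matches the abstract's framing of a speed/accuracy trade-off rather than an exact equivalence to standard k-means.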
2004
The development of microarray technologies gives scientists the ability to examine, discover and monitor the mRNA transcript levels of thousands of genes in a single experiment. Nevertheless, the tremendous amount of data that can be obtained from microarray studies presents a challenge for data analysis. The most commonly used computational approach for analyzing microarray data is cluster analysis. In this paper, we investigate the application of an unsupervised extension of the recently proposed k-windows clustering algorithm to gene expression microarray data. Apart from identifying the clusters present in a dataset, this algorithm also determines their number, so no special knowledge about the data is required. To improve the quality of the clustering, we selected the genes most highly correlated with the class distinction. The results obtained by the application of the algorithm exhibit high classification success.
Journal of Neuroscience Methods, 2005
Cluster analysis is an important tool for classifying data. Established techniques include k-means and k-median cluster analysis. However, these methods require the user to provide a priori estimates of the number of clusters and their approximate locations in the parameter space. Often these estimates can be made based on some prior understanding of the nature of the data. Alternatively, the user makes these estimates based on visualization of the data; however, the latter is problematic in data sets with large numbers of dimensions. Presented here is an algorithm that can automatically provide these estimates, without human intervention, based on the inherent structure of the data set. It is not limited by the number of dimensions.
2004
In this paper, we investigate the application of an unsupervised extension of the recently proposed k-windows clustering algorithm on gene expression microarray data. The k-windows algorithm is used both to identify sets of genes according to their expression in a set of samples, and to cluster samples into homogeneous groups. Experimental results and comparisons indicate that this is a promising approach.