2004, Journal of Biomedical Informatics
In microarray gene expression data, clusters may hide in subspaces. Traditional clustering algorithms that make use of similarity measurements in the full input space may fail to detect the clusters. In recent years a number of algorithms have been proposed to identify this kind of projected clusters, but many of them rely on some critical parameters whose proper
2006
Abstract: Motivation: Unsupervised learning or clustering is frequently used to explore gene expression profiles for insight into both regulation and function. However, the quality of clustering results is often difficult to assess and each algorithm has tunable parameters with often no obvious way to choose appropriate values. Most algorithms also require the number of clusters to be predetermined yet this value is rarely known and, thus, is arrived at by subjective criteria.
Journal of Computational Biology, 2002
There are many algorithms to cluster sample data points based on nearness or a similarity measure. Often the implication is that points in different clusters come from different underlying classes, whereas those in the same cluster come from the same class. Stochastically, the underlying classes represent different random processes. The inference is that clusters represent a partition of the sample points according to which process they belong. This paper discusses a model-based clustering toolbox that evaluates cluster accuracy. Each random process is modeled as its mean plus independent noise, sample points are generated, the points are clustered, and the clustering error is the number of points clustered incorrectly according to the generating random processes. Various clustering algorithms are evaluated based on process variance and the key issue of the rate at which algorithmic performance improves with increasing numbers of experimental replications. The model means can be selected by hand to test the separability of expected types of biological expression patterns. Alternatively, the model can be seeded by real data to test the expected precision of that output or the extent of improvement in precision that replication could provide. In the latter case, a clustering algorithm is used to form clusters, and the model is seeded with the means and variances of these clusters. Other algorithms are then tested relative to the seeding algorithm. Results are averaged over various seeds. Output includes error tables and graphs, confusion matrices, principal-component plots, and validation measures. Five algorithms are studied in detail: K-means, fuzzy C-means, self-organizing maps, hierarchical Euclidean-distance-based and correlation-based clustering. The toolbox is applied to gene-expression clustering based on cDNA microarrays using real data. Expression profile graphics are generated and error analysis is displayed within the context of these profile graphics.
A large amount of generated output is available over the web.
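The clustering error described in this abstract (points clustered incorrectly relative to the generating processes) can be illustrated with a minimal sketch. The function name `clustering_error` and the brute-force matching of cluster ids to classes are assumptions made for illustration; the abstract does not say how the toolbox aligns cluster labels with processes, and trying every matching is only practical for a handful of clusters.

```python
from itertools import permutations


def clustering_error(true_labels, cluster_labels):
    """Count misclustered points, minimised over all cluster-to-class matchings.

    Hypothetical helper: cluster ids are arbitrary, so every one-to-one
    assignment of cluster ids to generating classes is tried and the
    smallest error count is returned.
    """
    classes = sorted(set(true_labels))
    clusters = sorted(set(cluster_labels))
    # If there are more clusters than classes, pad with None so that the
    # surplus clusters match no class (all their points count as errors).
    if len(clusters) > len(classes):
        classes = classes + [None] * (len(clusters) - len(classes))
    best = len(true_labels)
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        errors = sum(
            1 for t, c in zip(true_labels, cluster_labels) if mapping[c] != t
        )
        best = min(best, errors)
    return best
```

For example, with generating classes `[0, 0, 0, 1, 1, 1]` and cluster output `[1, 1, 0, 0, 0, 0]`, the best matching leaves one point misclustered, so the error is 1.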
2003
Projected clustering has become a hot research topic due to its ability to cluster high-dimensional data. However, most existing projected clustering algorithms depend on some critical user parameters in determining the relevant attributes of each cluster. In case wrong parameter values are used, the clustering performance will be seriously degraded. Unfortunately, correct parameter values are rarely known in real datasets. In this paper, we propose a projected clustering algorithm that does not depend on user inputs in determining relevant attributes. It responds to the clustering status and adjusts the internal thresholds dynamically. From experimental results, our algorithm shows a much higher usability than the other projected clustering algorithms used in our comparison study. It also works well with a gene expression dataset for studying lymphoma. The high usability of the algorithm and the encouraging results suggest that projected clustering can be a practical tool for analyzing gene expression profiles.
Computers in Biology and Medicine, 2006
The development of microarray technologies gives scientists the ability to examine, discover and monitor the mRNA transcript levels of thousands of genes in a single experiment. Nonetheless, the tremendous amount of data that can be obtained from microarray studies presents a challenge for data analysis. The most commonly used computational approach for analyzing microarray data is cluster analysis, since the number of genes is usually very high compared to the number of samples. In this paper, we investigate the application of the recently proposed k-windows clustering algorithm on gene expression microarray data. This algorithm, apart from identifying the clusters present in a data set, also calculates their number and thus requires no special knowledge about the data.
Molecules and Cells
The analysis of microarray data is essential for large amounts of gene expression data. In this review we focus on clustering techniques. The biological rationale for this approach is the fact that many co-expressed genes are co-regulated, and identifying co-expressed genes could aid in functional annotation of novel genes, de novo identification of transcription factor binding sites and elucidation of complex biological pathways. Co-expressed genes are usually identified in microarray experiments by clustering techniques. There are many such methods, and the results obtained even for the same datasets may vary considerably depending on the algorithms and metrics for dissimilarity measures used, as well as on user-selectable parameters such as desired number of clusters and initial values. Therefore, biologists who want to interpret microarray data should be aware of the weakness and strengths of the clustering methods used. In this review, we survey the basic principles of clusteri...
International Journal of Bioinformatics Research, 2011
DNA microarray technology is a fundamental tool in gene expression data analysis. The collection of datasets from the technology has underscored the need for quantitative analytical tools to examine such data. Due to the large number of genes and complex gene regulation networks, clustering is a useful exploratory technique for analyzing these data. Many clustering algorithms have been proposed to analyze microarray gene expression data, but very few of them evaluate the quality of the clusters. In this paper, a novel cluster analysis technique is proposed that does not require the number of clusters a priori. The method computes a similarity measurement function on the basis of which clusters are merged, and subsequently splits a cluster by computing its degree of separation. Splitting and merging are performed iteratively until the cluster validity index (i.e. the DB index) degrades. The experimental results show that the proposed technique gives results on a cancer gene dataset comparable with those of existing methods. This study may help raise relevant issues in the extraction of meaningful biological information from microarray expression data.
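The DB (Davies-Bouldin) index used as the stopping criterion in this abstract can be sketched as follows. This is the standard textbook definition with Euclidean distance, not the authors' code; the function names are illustrative. Lower values indicate more compact, better separated clusters, so "degrades" corresponds to the index increasing.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length points."""
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(len(points[0]))]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    For each cluster, take the worst-case ratio of combined within-cluster
    scatter to between-centroid distance, then average over clusters.
    Assumes at least two clusters with distinct centroids; otherwise the
    ratio is undefined.
    """
    cents = [centroid(c) for c in clusters]
    # Scatter: mean Euclidean distance from each point to its centroid.
    scatter = [
        sum(math.dist(p, m) for p in c) / len(c) for c, m in zip(clusters, cents)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        total += max(
            (scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
            for j in range(k)
            if j != i
        )
    return total / k
```

As a check of the direction of the index: two clusters `[(0, 0), (0, 2)]` and `[(10, 0), (10, 2)]` give 0.2, while moving the second cluster closer, to `[(4, 0), (4, 2)]`, raises the index to 0.5.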
Bonfring
Cluster analysis in microarray gene expression studies is used to find groups of correlated and co-regulated genes. Several clustering algorithms are available in the literature. However, no single algorithm is optimal for data generated under different technological platforms and experimental conditions. It is possible to combine several clustering methods and solutions using an ensemble approach. This method, also known as consensus clustering, is used here to examine the robustness of cluster solutions from several different algorithms. The proposed method is also useful for estimating the number of clusters in a dataset. Here we examine the properties of consensus clustering using real and simulated datasets.
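One common building block of consensus clustering is the consensus (co-association) matrix, which records how often each pair of items lands in the same cluster across runs. The sketch below assumes that formulation; the abstract does not state which consensus scheme the authors use, and `consensus_matrix` is a name chosen for illustration.

```python
def consensus_matrix(labelings):
    """Fraction of clusterings in which each pair of items shares a cluster.

    `labelings` is a list of label lists, one per clustering run, all of the
    same length. Label values only need to be comparable within a run, so
    runs from different algorithms can be mixed freely.
    """
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    runs = len(labelings)
    return [[c / runs for c in row] for row in counts]
```

Entries near 1 or 0 indicate pairs that the runs agree on (always together or always apart), whereas intermediate values flag unstable assignments, which is what makes the matrix useful for judging robustness.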
Communications in Statistics - Theory and Methods
The analysis of microarray data is a widespread functional genomics approach that allows for the monitoring of the expression of thousands of genes at once. The analysis of the great amount of data generated in a microarray experiment requires powerful statistical techniques. One of the first tasks of the analysis of microarray data is to cluster data into biologically meaningful groups according to their expression patterns. In this article, we discuss classical as well as recent clustering techniques for microarray data. We pay particular attention to both theoretical and practical issues and give some general indications that might be useful to practitioners.
International Journal of Computational Models and Algorithms in Medicine, 2014
Identification of biologically significant subspace clusters (biclusters and triclusters) of genes from microarray experimental data is a daunting task that has emerged especially with the development of high-throughput technologies. Several methods and applications of subspace clustering (biclustering and triclustering) in DNA microarray data analysis have been developed in recent years. Various computational and evaluation methods based on diverse principles were introduced to identify new similarities among genes. This review discusses and compares these methods, highlights their mathematical principles, and provides insight into the applications to solve biological problems.
Bioinformatics, 2004
Motivation: A measurement of cluster quality is needed to choose potential clusters of genes that contain biologically relevant patterns of gene expression. This is strongly desirable when a large number of gene expression profiles have to be analyzed and proper clusters of genes need to be identified for further analysis, such as the search for meaningful patterns, identification of gene functions or gene response analysis. Results: We propose a new cluster quality method, called stability, by which unsupervised learning of gene expression data can be performed efficiently. The method takes into account a cluster's stability on partition. We evaluate this method and demonstrate its performance using four independent, real gene expression and three simulated datasets. We demonstrate that our method outperforms other techniques listed in the literature. The method has applications in evaluating clustering validity as well as identifying stable clusters.
BMC Bioinformatics, 2008
Background: Clustering is a popular data exploration technique widely used in microarray data analysis. Most conventional clustering algorithms, however, generate only one set of clusters independent of the biological context of the analysis. This is often inadequate to explore data from different biological perspectives and gain new insights. We propose a new clustering model that can generate multiple versions of different clusters from a single dataset, each of which highlights a different aspect of the given dataset.
Lecture Notes in Computer Science, 2006
We propose a mining framework that supports the identification of useful knowledge based on data clustering. With the recent advancement of microarray technologies, we focus our attention on mining gene expression datasets. In particular, given that genes are often coexpressed under subsets of experimental conditions, we present a novel subspace clustering algorithm. In contrast to previous approaches, our method is based on the observation that the number of subspace clusters is related to the number of maximal subspace clusters to which any gene pair can belong. By discretizing gene expression profiles, the similarity between two genes is transformed into a sequence of symbols that represents the maximal subspace cluster for the gene pair. This domain transformation (from genes into gene-gene relations) allows us to make the number of possible subspace clusters dependent on the number of genes. Based on the symbolic representations of genes, we present an efficient subspace clustering algorithm that scales with the number of dimensions. In addition, the running time can be drastically reduced by utilizing an inverted index and pruning non-interesting subspaces. Experimental results indicate that the proposed method efficiently identifies co-expressed gene subspace clusters for a yeast cell cycle dataset.
BMC Bioinformatics, 2006
Background: DNA Microarray technology is an innovative methodology in experimental molecular biology, which has produced huge amounts of valuable data in the profile of gene expression. Many clustering algorithms have been proposed to analyze gene expression data, but little guidance is available to help choose among them. The evaluation of feasible and applicable clustering algorithms is becoming an important issue in today's bioinformatics research.
Bioinformatics, 2003
Motivation: With the advent of microarray chip technology, large data sets are emerging containing the simultaneous expression levels of thousands of genes at various time points during a biological process. Biologists are attempting to group genes based on the temporal pattern of their expression levels. While the use of hierarchical clustering (UPGMA) with correlation 'distance' has been the most common in the microarray studies, there are many more choices of clustering algorithms in pattern recognition and statistics literature. At the moment there do not seem to be any clear-cut guidelines regarding the choice of a clustering algorithm to be used for grouping genes based on their expression profiles. Results: In this paper, we consider six clustering algorithms (of various flavors!) and evaluate their performances on a well-known publicly available microarray data set on sporulation of budding yeast and on two simulated data sets. Among other things, we formulate three reasonable validation strategies that can be used with any clustering algorithm when temporal observations or replications are present. We evaluate each of these six clustering methods with these validation measures. While the 'best' method is dependent on the exact validation strategy and the number of clusters to be used, overall Diana appears to be a solid performer. Interestingly, the performance of correlation-based hierarchical clustering and model-based clustering (another method that has been advocated by a number of researchers) appear to be on opposite extremes, depending on what validation measure one employs. Next it is shown that the group means produced by Diana are the closest and those produced by UPGMA are the farthest from a model profile based on a set of hand-picked genes. Availability: S+ codes for the partial least squares based clustering are available from the authors upon request. All
2012 5th International Symposium on Communications, Control and Signal Processing, 2012
Clustering is one of the most useful tools for microarray gene expression data analysis. Although there have been many reviews and surveys in the literature, many good and effective clustering ideas have not been collected in a systematic way. In this paper, we review five clustering families representing five clustering concepts rather than five algorithms. We also review some clustering validations and collect a list of benchmark gene expression datasets.
Southeast Europe journal of soft computing, 2013
Gene expression analysis is becoming very important for understanding complex living organisms. Rather than analyzing genes individually, a more powerful approach, microarray technology, analyzes gene expression in high throughput. This new approach brings new analysis problems that make interpretation difficult. To make correlated gene expression easier to understand, clustering methods are applied to gene expression analysis. In this paper, a different approach is presented for starting to cluster using some computational strategies.
Computers in Biology and Medicine, 2008
Many clustering techniques have been proposed for the analysis of gene expression data obtained from microarray experiments. However, the choice of suitable method(s) for a given experimental dataset is not straightforward. Common approaches do not translate well and fail to take account of the data profile. This review paper surveys state-of-the-art applications which recognise these limitations and address them. As such, it provides a framework for the evaluation of clustering in gene expression analyses. The nature of microarray data is discussed briefly. Selected examples are presented for the clustering methods considered.
BMC Bioinformatics, 2014
Background: Clustering is crucial for gene expression data analysis. As an unsupervised exploratory procedure, its results can help researchers to gain insights and formulate new hypotheses about biological data from microarrays. Given different settings of microarray experiments, clustering proves itself as a versatile exploratory tool. It can help to unveil new cancer subtypes or to identify groups of genes that respond similarly to a specific experimental condition. In order to obtain useful clustering results, however, different parameters of the clustering procedure must be properly tuned. Besides the selection of the clustering method itself, determining which distance is going to be employed between data objects is probably one of the most difficult decisions.
IEEE/ACM Transactions on Computational Biology and Bioinformatics, 2000
Cluster analysis is usually the first step adopted to unveil information from gene expression microarray data. Besides selecting a clustering algorithm, choosing an appropriate proximity measure (similarity or distance) is of great importance to achieve satisfactory clustering results. Nevertheless, up to date, there are no comprehensive guidelines concerning how to choose proximity measures for clustering microarray data. Pearson is the most used proximity measure, whereas characteristics of other ones remain unexplored. In this paper we investigate the choice of proximity measures for the clustering of microarray data by evaluating the performance of 16 proximity measures in 52 datasets from time-course and cancer experiments. Our results support that measures rarely employed in the gene expression literature can provide better results than commonly employed ones, such as Pearson, Spearman, and Euclidean distance. Given that different measures stood out for time-course and cancer data evaluations, their choice should be specific to each scenario. To evaluate measures on time-course data we preprocessed and compiled 17 datasets from the microarray literature in a benchmark along with a new methodology, called Intrinsic Biological Separation Ability (IBSA). Both can be employed in future research to assess the effectiveness of new measures for gene time-course data.
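To make concrete why the choice of proximity measure matters, the sketch below contrasts two of the measures named in this abstract, Pearson (as a distance, 1 minus the correlation) and Euclidean distance, on toy expression profiles. The profiles and function names are invented for illustration and are not taken from the paper's 52 datasets.

```python
import math


def pearson_distance(x, y):
    """1 - Pearson correlation between two equal-length profiles.

    Ranges from 0 (perfectly correlated) to 2 (perfectly anti-correlated).
    Undefined for constant profiles, whose standard deviation is zero.
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)


def euclidean_distance(x, y):
    """Straight-line distance between two equal-length profiles."""
    return math.dist(x, y)


# Hypothetical profiles: gene_b has the same shape as gene_a at ten times
# the magnitude; gene_c has a similar magnitude but the opposite trend.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
```

Under Pearson distance `gene_a` is closest to `gene_b` (same shape), while under Euclidean distance it is closest to `gene_c` (similar magnitude), so the two measures would group these genes differently.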
Bioinformatics, 2001
Motivation: There is a great need to develop analytical methodology to analyze and to exploit the information contained in gene expression data. Because of the large number of genes and the complexity of biological networks, clustering is a useful exploratory technique for analysis of gene expression data. Other classical techniques, such as principal component analysis (PCA), have also been applied to analyze gene expression data. Using different data analysis techniques and different clustering algorithms to analyze the same data set can lead to very different conclusions. Our goal is to study the effectiveness of principal components (PCs) in capturing cluster structure. Specifically, using both real and synthetic gene expression data sets, we compared the quality of clusters obtained from the original data to the quality of clusters obtained after projecting onto subsets of the principal component axes. Results: Our empirical study showed that clustering with the PCs instead of the original variables does not necessarily improve, and often degrades, cluster quality. In particular, the first few PCs (which contain most of the variation in the data) do not necessarily capture most of the cluster structure. We also showed that clustering with PCs has different impact on different algorithms and different similarity metrics. Overall, we would not recommend PCA before clustering except in special circumstances.