1997
We consider a model-based approach to clustering, whereby each observation is assumed to have arisen from an underlying mixture of a finite number of distributions. The number of components in this mixture model corresponds to the number of clusters to be imposed on the data. A common assumption is to take the component distributions to be multivariate normal with perhaps some restrictions on the component covariance matrices. The model can be fitted to the data using maximum likelihood implemented via the EM algorithm. There are a number of computational issues associated with the fitting, including the specification of initial starting points for the EM algorithm and the carrying out of tests for the number of components in the final version of the model. We shall discuss some of these problems and describe an algorithm that attempts to handle them automatically.
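To make the fitting pipeline described here concrete, the following is a minimal Python sketch using scikit-learn's GaussianMixture: EM is restarted from several initial points and the number of components is chosen by an information criterion. The toy data, variable names, and the use of BIC are assumptions for illustration only, not the abstract's own algorithm or its tests.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters (illustrative only).
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(5, 1, size=(100, 2))])

best_bic, best_model = np.inf, None
for k in range(1, 6):
    # n_init > 1 restarts EM from several random starting points.
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          n_init=10, random_state=0).fit(X)
    bic = gmm.bic(X)  # lower BIC is better
    if bic < best_bic:
        best_bic, best_model = bic, gmm

print("chosen number of components:", best_model.n_components)
labels = best_model.predict(X)  # hard cluster assignments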
1996
We present the approach to clustering whereby a normal mixture model is fitted to the data by maximum likelihood. The general case of normal component densities with unrestricted covariance matrices is considered and so it extends the work of Abbas and Fahmy (1994), who imposed the restriction of diagonal component covariance matrices. Attention is also focussed on the problem of testing for the number of clusters within this mixture framework, using the likelihood ratio test.
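Because the likelihood ratio statistic for the number of components does not follow its usual chi-square asymptotics in mixture models, it is commonly calibrated by a parametric bootstrap. The sketch below illustrates that generic idea with scikit-learn; it is an assumed illustration of bootstrap calibration, not the test procedure of this paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def total_loglik(model, X):
    return model.score(X) * len(X)  # score() returns the mean log-likelihood

def bootstrap_lrt(X, k0, k1, n_boot=50, seed=0):
    # Parametric-bootstrap p-value for H0: k0 components vs H1: k1 components.
    g0 = GaussianMixture(k0, n_init=5, random_state=seed).fit(X)
    g1 = GaussianMixture(k1, n_init=5, random_state=seed).fit(X)
    lr_obs = 2 * (total_loglik(g1, X) - total_loglik(g0, X))
    exceed = 0
    for b in range(n_boot):
        Xb, _ = g0.sample(len(X))  # simulate a data set under the null model
        b0 = GaussianMixture(k0, n_init=5, random_state=b).fit(Xb)
        b1 = GaussianMixture(k1, n_init=5, random_state=b).fit(Xb)
        if 2 * (total_loglik(b1, Xb) - total_loglik(b0, Xb)) >= lr_obs:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)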
Entropy, 2019
This paper presents an integrated approach for the estimation of the parameters of a mixture model in the context of data clustering. The method is designed to estimate the unknown number of clusters from observed data. To this end, we marginalize out the weights to obtain allocation probabilities that depend on the number of clusters but not on the number of components of the mixture model. As an alternative to the stochastic expectation maximization (SEM) algorithm, we propose the integrated stochastic expectation maximization (ISEM) algorithm, which, in contrast to SEM, does not need the specification, a priori, of the number of components of the mixture. Using this algorithm, one estimates the parameters associated with clusters that contain at least two observations via local maximization of the likelihood function. In addition, at each iteration of the algorithm, there exists a positive probability of a new cluster being created by a single observation. Using simulated datasets, w...
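The stochastic E-step that SEM-type algorithms rely on replaces the usual soft responsibilities by a random draw of the allocations. The fragment below sketches such a step for a Gaussian mixture; it is a generic SEM allocation step, not the ISEM algorithm of the paper, and the parameter layout (lists of means and covariances) is assumed for illustration.

import numpy as np
from scipy.stats import multivariate_normal

def stochastic_e_step(X, means, covs, weights, rng):
    # Draw a hard allocation z_i for each observation from its posterior
    # component probabilities, instead of keeping the full posterior.
    n, K = len(X), len(weights)
    resp = np.empty((n, K))
    for k in range(K):
        resp[:, k] = weights[k] * multivariate_normal.pdf(X, means[k], covs[k])
    resp /= resp.sum(axis=1, keepdims=True)
    return np.array([rng.choice(K, p=resp[i]) for i in range(n)])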
Determining the number of component clusters for a multivariate normal mixture model is the most important problem in model-based clustering, and determining the number of candidate mixture models is the most interesting problem when model selection criteria are used for multivariate normal mixture model-based clustering. In this study, first, the concept of the total number of candidate component cluster centers is introduced and an interval is constructed using the number of partitions in each variable of the multivariate data. Second, an equation is given for the total number of candidate mixture models in multivariate normal mixture model-based clustering. The number of candidate mixture models is defined as the sum of the numbers of possible mixture models with different numbers of component clusters.
Australian & New Zealand Journal of Statistics, 1999
The finite mixture model approach to clustering has been implemented in a program called MULTIMIX. The program is designed to cluster multivariate data that have categorical and continuous variables and that possibly contain missing values. This paper describes the approach taken in designing MULTIMIX and how some of the statistical problems were dealt with. As an example, the program is used to cluster a large medical dataset.
Machine Learning, 42, 2001
We compare the three basic algorithms for model-based clustering on high-dimensional discrete-variable datasets. All three algorithms use the same underlying model: a naive-Bayes model with a hidden root node, also known as a multinomial-mixture model. In the first part of the paper, we perform an experimental comparison between three batch algorithms that learn the parameters of this model: the Expectation-Maximization (EM) algorithm, a "winner take all" version of the EM algorithm reminiscent of the K-means algorithm, and model-based agglomerative clustering. We find that the EM algorithm significantly outperforms the other methods, and proceed to investigate the effect of various initialization methods on the final solution produced by the EM algorithm. The initializations that we consider are (1) parameters sampled from an uninformative prior, (2) random perturbations of the marginal distribution of the data, and (3) the output of agglomerative clustering. Although the methods are substantially different, they lead to learned models that are similar in quality.
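The third initialization strategy (starting EM from the output of agglomerative clustering) can be sketched as follows. The example uses Gaussian components from scikit-learn rather than the naive-Bayes/multinomial-mixture model studied in the paper, so it should be read only as an illustration of the idea.

import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def em_with_agglomerative_init(X, k):
    # Step 1: agglomerative clustering supplies an initial hard partition.
    labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
    means_init = np.vstack([X[labels == j].mean(axis=0) for j in range(k)])
    # Step 2: EM is started from the component means of that partition.
    return GaussianMixture(n_components=k, means_init=means_init,
                           n_init=1).fit(X)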
Econometrics and Statistics
As in other estimation scenarios, likelihood-based estimation in the normal mixture setup is highly non-robust against model misspecification and the presence of outliers (apart from being an ill-posed optimization problem). A robust alternative to the ordinary likelihood approach for this estimation problem is proposed, which performs simultaneous estimation and data clustering and leads to subsequent anomaly detection. To invoke robustness, the methodology based on the minimization of the density power divergence (or alternatively, the maximization of the β-likelihood) is utilized under suitable constraints. An iteratively reweighted least squares approach has been followed in order to compute the proposed estimators for the component means (or equivalently cluster centers) and component dispersion matrices, which leads to simultaneous data clustering. Some exploratory techniques are also suggested for anomaly detection, a problem of great importance in the domain of statistics and machine learning. The proposed method is validated with simulation studies under different setups; it performs competitively or better compared to popular existing methods like K-medoids, TCLUST, trimmed K-means and MCLUST, especially when the mixture components (i.e., the clusters) share regions with significant overlap or outlying clusters exist with small but non-negligible weights (particularly in higher dimensions). Two real datasets are also used to illustrate the performance of the newly proposed method in comparison with others, along with an application in image processing. The proposed method detects the clusters with lower misclassification rates and successfully points out the outlying (anomalous) observations from these datasets.
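The robustness mechanism behind the density power divergence (β-likelihood) approach can be pictured as weighting each observation by its model density raised to the power β, so that points in low-density regions contribute little to the estimating equations. The fragment below sketches such weights for a single Gaussian component under that assumption; it does not attempt the constrained, iteratively reweighted scheme proposed in the paper.

import numpy as np
from scipy.stats import multivariate_normal

def dpd_weights(X, mean, cov, beta=0.3):
    # Density-power weights w_i proportional to f(x_i; mean, cov)**beta.
    # beta = 0 gives equal weights (ordinary likelihood); larger beta
    # downweights observations lying in low-density regions.
    logpdf = multivariate_normal.logpdf(X, mean, cov)
    return np.exp(beta * (logpdf - logpdf.max()))  # scaled for stability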
Journal of Multivariate Analysis, 2009
The analysis of finite mixture models for exponential repeated data is considered. The mixture components correspond to different unknown groups of the statistical units. Dependency and variability of repeated data are taken into account through random effects. For each component, an exponential mixed model is thus defined. When considering parameter estimation in this mixture of exponential mixed models, the EM-algorithm cannot be directly used since the marginal distribution of each mixture component cannot be analytically derived. In this paper, we propose two parameter estimation methods. The first one uses a linearisation specific to the exponential distribution hypothesis within each component. The second approach uses a Metropolis-Hastings algorithm as a building block of a general MCEM-algorithm.
i-manager’s Journal on Pattern Recognition
Pattern recognition is the science of making inferences from data and lies at the heart of all scientific inquiry, including understanding ourselves and the real world around us. Growing numbers of applications are starting to use pattern recognition as the initial step towards interpreting human actions, intentions, and behavior, and as a central part of next-generation smart environments.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002
In this paper we develop a dynamic continuous solution to the clustering problem of data characterized by a mixture of K distributions, where K is given a priori. The proposed solution resorts to game-theoretic tools, in particular mean field games, and can be interpreted as the continuous version of a generalized Expectation-Maximization (GEM) algorithm. The main contributions of this paper are twofold: first, we prove that the proposed solution is a GEM algorithm; second, we derive a closed-form solution for a Gaussian mixture model and show that the proposed algorithm converges exponentially fast to a maximum of the log-likelihood function, improving significantly over the state of the art. We conclude the paper by presenting simulation results for the Gaussian case that indicate better performance of the proposed algorithm in terms of speed of convergence and with respect to the overlap problem.
Statistics and Computing
A weighted likelihood approach for robust fitting of a mixture of multivariate Gaussian components is developed in this work. Two approaches have been proposed that are driven by a suitable modification of the standard EM and CEM algorithms, respectively. In both techniques, the M-step is enhanced by the computation of weights aimed at downweighting outliers. The weights are based on Pearson residuals stemming from robust Mahalanobis-type distances. Formal rules for robust clustering and outlier detection can be also defined based on the fitted mixture model. The behavior of the proposed methodologies has been investigated by some numerical studies and real data examples in terms of both fitting and classification accuracy and outlier detection.
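To convey the flavour of downweighting observations that lie far from a component in Mahalanobis distance, here is a deliberately simplified weight computation. The actual weights in the paper are built from Pearson residuals; the plain chi-square cutoff rule below is only an assumed stand-in.

import numpy as np
from scipy.stats import chi2

def downweight(X, mean, cov, alpha=0.01):
    # Weights shrink below 1 once the squared Mahalanobis distance exceeds
    # the chi-square(1 - alpha) quantile; an illustrative rule only.
    d = X - mean
    md2 = np.einsum("ij,jk,ik->i", d, np.linalg.inv(cov), d)
    cutoff = chi2.ppf(1 - alpha, df=X.shape[1])
    return np.minimum(1.0, cutoff / np.maximum(md2, 1e-12))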
Journal of Computational and Graphical Statistics, 2016
The use of a finite mixture of normal mixtures model in model-based clustering makes it possible to capture non-Gaussian data clusters. However, identifying the clusters from the normal components is challenging and in general is achieved either by imposing constraints on the model or by using post-processing procedures.
1996
We consider the approach to unsupervised learning whereby a normal mixture model is fitted to the data by maximum likelihood. An algorithm called NMM is presented that enables the normal mixture model with either restricted or unrestricted component covariance matrices to be fitted to a given data set. The algorithm automatically handles the problem of the specification of initial values for the parameters in the iterative fitting of the model within the framework of the EM algorithm. The algorithm also has the provision to carry out a test for the number of components on the basis of the likelihood ratio statistic.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000
We propose assessing a mixture model in a cluster analysis setting with the integrated completed likelihood. With this purpose, the observed data are assigned to unknown clusters using a maximum a posteriori operator. Then, the Integrated Completed Likelihood (ICL) is approximated in the manner of the Bayesian information criterion (BIC). Numerical experiments with the resulting ICL criterion on simulated and real data show that it performs well both for choosing a mixture model and a relevant number of clusters. In particular, ICL appears to be more robust than BIC to violation of some of the mixture model assumptions, and it can select a number of clusters leading to a sensible partitioning of the data.
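In practice the ICL is often computed by adding an entropy penalty to the BIC of the fitted mixture. The snippet below follows that common soft-entropy approximation on scikit-learn's scale (where smaller BIC values are better, so the entropy term is added); details may differ from the exact criterion derived in the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

def icl_bic(gmm, X):
    # BIC plus twice the entropy of the posterior responsibilities,
    # a common approximation to the ICL (smaller is better here).
    resp = gmm.predict_proba(X)
    entropy = -np.sum(resp * np.log(np.clip(resp, 1e-300, None)))
    return gmm.bic(X) + 2.0 * entropy

# Usage: choose the number of clusters minimising the ICL, e.g.
# best_k = min(range(1, 8),
#              key=lambda k: icl_bic(GaussianMixture(k, n_init=5).fit(X), X))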
Computational Statistics & Data Analysis, 2014
A novel family of twelve mixture models with random covariates, nested in the linear t cluster-weighted model (CWM), is introduced for model-based clustering. The linear t CWM was recently presented as a robust alternative to the better known linear Gaussian CWM. The proposed family of models provides a unified framework that also includes the linear Gaussian CWM as a special case. Maximum likelihood parameter estimation is carried out within the EM framework, and both the BIC and the ICL are used for model selection. A simple and effective hierarchical random initialization is also proposed for the EM algorithm. The novel model-based clustering technique is illustrated in some applications to real data. Finally, a simulation study for evaluating the performance of the BIC and the ICL is presented.
Advances in Data Analysis and Classification, 2020
Finite mixtures present a powerful tool for modeling complex heterogeneous data. One of their most important applications is model-based clustering. It assumes that each data group can be reasonably described by one mixture model component. This establishes a one-to-one relationship between mixture components and clusters. In some cases, however, this relationship can be broken due to the presence of observations from the same class recorded in different ways. This effect can occur because of recording inconsistencies due to the use of different scales, operator errors, or simply various recording styles. The idea presented in this paper aims to alleviate this issue through modifications incorporated into mixture models. While the proposed methodology is applicable to a broad class of mixture models, in this paper it is illustrated on Gaussian mixtures. Several simulation studies and an application to a real-life data set are considered, yielding promising results.
Keywords: Mixture modeling • K-means • Cluster analysis • Measurement inconsistency • EM algorithm • Hand-written digits
Advances in Data Analysis and Classification, 2014
Parameter estimation for model-based clustering using a finite mixture of normal inverse Gaussian (NIG) distributions is achieved through variational Bayes approximations. Univariate NIG mixtures and multivariate NIG mixtures are considered. The use of variational Bayes approximations here is a substantial departure from the traditional EM approach and alleviates some of the associated computational complexities and uncertainties. Our variational algorithm is applied to simulated and real data. The paper concludes with discussion and suggestions for future work.
The estimation of mixture models has been proposed for quite some time as an approach to cluster analysis. Several variants of the Expectation-Maximization algorithm are currently available for this purpose. Estimation of mixture models simultaneously allows the determination of the number of clusters and yields distributional parameters for the clustering base variables. There are several information criteria that help to support the selection of a particular model or clustering structure. However, a question remains concerning which criteria may be more suitable for particular applications. In the present work we analyze the relationship between the performance of information criteria and the type of measurement of the clustering variables. In order to study this relationship we analyze forty-two data sets with known clustering structure and with clustering variables that are categorical, continuous and of mixed type. We then compare eleven information-based criteria in their ability to recover the data sets' clustering structures. As a result, we select the AIC3, BIC and ICL-BIC criteria as the best candidates for model selection with categorical, continuous and mixed-type clustering variables, respectively.
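For a fitted mixture, the criteria compared in the study differ only in the penalty attached to the maximised log-likelihood. A minimal sketch of AIC3 (penalty 3 per free parameter) and BIC for a full-covariance Gaussian mixture fitted with scikit-learn is given below; the ICL-BIC variant adds an entropy term as in the earlier snippet. The explicit parameter count applies to the full-covariance Gaussian case only and is an illustrative assumption, not the study's exact setup.

import numpy as np

def aic3_and_bic(gmm, X):
    n, p = X.shape
    K = gmm.n_components
    # Free parameters of a full-covariance Gaussian mixture:
    # K means of length p, K symmetric p x p covariances, K - 1 weights.
    d = K * p + K * p * (p + 1) // 2 + (K - 1)
    loglik = gmm.score(X) * n          # total log-likelihood
    aic3 = -2.0 * loglik + 3.0 * d
    bic = -2.0 * loglik + d * np.log(n)
    return aic3, bic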
Austrian Journal of Statistics, 2006
Finite mixture models are being increasingly used to model the distributions of a wide variety of random phenomena and to cluster data sets. In this paper, we focus on the use of normal mixture models to cluster data sets of continuous multivariate data. As normality based methods of estimation are not robust, we review the use of t component distributions. With the t mixture model-based approach, the normal distribution for each component in the mixture model is embedded in a wider class of elliptically symmetric distributions with an additional parameter called the degrees of freedom. The advantage of the t mixture model is that, although the number of outliers needed for breakdown is almost the same as with the normal mixture model, the outliers have to be much larger. We also consider the use of the t distribution for the robust clustering of high-dimensional data via mixtures of factor analyzers. The latter enable a mixture model to be fitted to data which have high dimension relative to the number of data points to be clustered.
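A key quantity when fitting a t mixture by EM is the observation weight u = (ν + p) / (ν + δ), where δ is the squared Mahalanobis distance of an observation from a component; large distances shrink the weight and hence the influence of outliers on the updated means and covariances. The sketch below computes these weights under the simplifying assumption of a common, known ν and full covariance matrices.

import numpy as np

def t_mixture_weights(X, means, covs, nu):
    # u[k, i] = (nu + p) / (nu + squared Mahalanobis distance of x_i from
    # component k); outlying observations receive small weights, which is
    # what makes the t mixture more robust than the normal mixture.
    n, p = X.shape
    K = len(means)
    u = np.empty((K, n))
    for k in range(K):
        d = X - means[k]
        md2 = np.einsum("ij,jk,ik->i", d, np.linalg.inv(covs[k]), d)
        u[k] = (nu + p) / (nu + md2)
    return u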
arXiv (Cornell University), 2016
Training the parameters of statistical models to describe a given data set is a central task in the field of data mining and machine learning. A very popular and powerful way of parameter estimation is the method of maximum likelihood estimation (MLE). Among the most widely used families of statistical models are mixture models, especially mixtures of Gaussian distributions. A popular hard-clustering variant of the MLE problem is the so-called complete-data maximum likelihood estimation (CMLE) method. The standard approach to solving the CMLE problem is the Classification-Expectation-Maximization (CEM) algorithm. Unfortunately, it is only guaranteed that the algorithm converges to some (possibly arbitrarily poor) stationary point of the objective function. In this paper, we present two algorithms for a restricted version of the CMLE problem. That is, our algorithms approximate reasonable solutions to the CMLE problem which satisfy certain natural properties. Moreover, they compute solutions whose costs (i.e., complete-data log-likelihood values) are at most a factor (1 + ε) worse than the costs of the solutions that we search for. Note that the CMLE problem in its most general, i.e. unrestricted, form is not well defined and allows for trivial optimal solutions that can be thought of as degenerate solutions.
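The CEM algorithm referred to above alternates a hard classification step with a maximisation step over the component parameters. The following compact sketch uses spherical, unit-variance Gaussian components; it is an illustrative special case of CEM, not one of the approximation algorithms proposed in the paper.

import numpy as np

def cem_spherical(X, k, n_iter=50, seed=0):
    # Classification-EM for a mixture of unit-variance spherical Gaussians.
    rng = np.random.default_rng(seed)
    means = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    weights = np.full(k, 1.0 / k)
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # C-step: assign each point to the component maximising its
        # complete-data contribution, log(weight) - 0.5 * squared distance.
        d2 = ((X[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        z = np.argmax(np.log(weights)[None, :] - 0.5 * d2, axis=1)
        # M-step: re-estimate weights and means from the hard partition.
        for j in range(k):
            members = X[z == j]
            if len(members) > 0:
                weights[j] = len(members) / len(X)
                means[j] = members.mean(axis=0)
        weights = weights / weights.sum()
    return z, means, weights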