Papers by Emilie Lebarbier

Statistics and Computing, 2019
This work is motivated by an application to the homogenization of GNSS-derived IWV (Integrated Water Vapour) series. These GPS series are affected by abrupt changes due to equipment changes or environmental effects. Detecting these changes and correcting the series for them is a crucial step before any use in climate studies. In addition to these abrupt changes, the variability of the series has been observed to be non-stationary. We propose in this paper a new segmentation model for breakpoint detection in the mean of a Gaussian process with heterogeneous variance on known time intervals. In this setting, the dynamic programming (DP) algorithm classically used to infer the breakpoints can no longer be applied directly. We propose a two-step procedure: we first estimate the variances robustly and then apply the classical inference, plugging in these estimators. The performance of the proposed procedure is assessed through simulation experiments. An application to real GNSS data is presented.
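The two-step idea can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the robust variance estimator (MAD of first differences), the toy data, and the names `robust_sd` and `segment_weighted` are all hypothetical. Once the variances are fixed, the criterion becomes a weighted least-squares contrast that is additive over segments, so the classical DP recursion applies again.

```python
import random
import statistics


def robust_sd(values):
    """Robust standard deviation from the MAD of first differences (assumed estimator).

    Differencing cancels a piecewise-constant mean except at the few breakpoints,
    and the MAD is insensitive to those few large differences.
    """
    diffs = [b - a for a, b in zip(values, values[1:])]
    med = statistics.median(diffs)
    mad = statistics.median([abs(d - med) for d in diffs])
    # 1.4826 rescales the MAD to a Gaussian sd; a difference of two independent
    # points has variance 2 * sigma^2, hence the division by sqrt(2).
    return 1.4826 * mad / 2 ** 0.5


def segment_weighted(y, weights, n_segments):
    """Exact DP for weighted least-squares segmentation into n_segments.

    Returns the sorted indices at which a new segment starts.
    """
    n = len(y)
    sw, swy, swyy = [0.0], [0.0], [0.0]  # prefix sums of w, w*y, w*y^2
    for yi, wi in zip(y, weights):
        sw.append(sw[-1] + wi)
        swy.append(swy[-1] + wi * yi)
        swyy.append(swyy[-1] + wi * yi * yi)

    def cost(i, j):
        # Weighted residual sum of squares of y[i:j] around its weighted mean.
        w, s, q = sw[j] - sw[i], swy[j] - swy[i], swyy[j] - swyy[i]
        return q - s * s / w

    inf = float("inf")
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    arg = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):
                c = best[k - 1][i] + cost(i, j)
                if c < best[k][j]:
                    best[k][j], arg[k][j] = c, i

    # Backtrack from the last segment to recover the segment starts.
    breaks, j = [], n
    for k in range(n_segments, 1, -1):
        j = arg[k][j]
        breaks.append(j)
    return sorted(breaks)


# Toy series (hypothetical): four known intervals of 30 points with
# alternating noise levels, and one jump in the mean at index 75.
random.seed(0)
interval_len = 30
true_sd = [0.3, 1.0, 0.3, 1.0]
y = []
for k, sd in enumerate(true_sd):
    for t in range(interval_len):
        mean = 0.0 if k * interval_len + t < 75 else 5.0
        y.append(random.gauss(mean, sd))

# Step 1: one robust variance estimate per known interval.
weights = []
for k in range(len(true_sd)):
    block = y[k * interval_len:(k + 1) * interval_len]
    weights += [1.0 / robust_sd(block) ** 2] * interval_len

# Step 2: classical DP on the contrast weighted by the plugged-in variances.
breakpoints = segment_weighted(y, weights, n_segments=2)
print(breakpoints)
```

Here the number of segments is fixed by hand; choosing it from the data is a separate model-selection step that this sketch does not cover.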

arXiv: Methodology, 2020
Homogenization is a crucial step to improve the usage of observational data for climate analysis. This work is motivated by the analysis of long series of GNSS Integrated Water Vapour (IWV) data, which have not yet been used in this context. This paper proposes a novel segmentation method that integrates a periodic bias and a heterogeneous, monthly varying variance. The method first estimates the variance using a robust estimator and then estimates the segmentation and the periodic bias iteratively. This strategy allows the use of the dynamic programming algorithm, which remains the most efficient exact algorithm for estimating the change-point positions. The statistical performance of the method is assessed through numerical experiments. An application to a real data set of 120 global GNSS stations is presented. The method is implemented in the R package GNSSseg, which will be available on CRAN.
We consider the segmentation problem of Poisson and negative binomial (i.e. overdispersed Poisson) rate distributions. In segmentation, an important issue remains the choice of the number of segments. To this end, we propose a penalized log-likelihood estimator where the penalty function is constructed in a non-asymptotic context following the works of L. Birgé and P. Massart. The resulting estimator is proved to satisfy an oracle inequality. The performance of our criterion is assessed using simulated and real datasets in the RNA-seq data analysis context.
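As a rough illustration of choosing the number of segments by penalized likelihood in the Poisson case, the sketch below computes the best Poisson log-likelihood for each number of segments by dynamic programming and then minimizes a penalized criterion. The penalty used here, `beta * K * log(n)`, is a simple BIC-like stand-in chosen for the example; it is not the non-asymptotic penalty derived in the paper. The function names and the noise-free toy counts are likewise assumptions.

```python
import math


def make_poisson_cost(counts):
    """Return cost(i, j): negative Poisson log-likelihood of counts[i:j]
    at its maximum-likelihood rate, up to terms that do not depend on the segmentation."""
    prefix = [0]
    for c in counts:
        prefix.append(prefix[-1] + c)

    def cost(i, j):
        s = prefix[j] - prefix[i]
        if s == 0:
            return 0.0  # convention 0 * log(0) = 0
        return s - s * math.log(s / (j - i))

    return cost


def select_segments(counts, max_segments, beta=1.0):
    """Pick the number of segments minimizing cost_K + beta * K * log(n)."""
    n = len(counts)
    cost = make_poisson_cost(counts)
    inf = float("inf")
    # best[k][j]: minimal cost of splitting counts[:j] into k segments.
    best = [[inf] * (n + 1) for _ in range(max_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, max_segments + 1):
        for j in range(k, n + 1):
            best[k][j] = min(best[k - 1][i] + cost(i, j) for i in range(k - 1, j))
    crit = {k: best[k][n] + beta * k * math.log(n)
            for k in range(1, max_segments + 1)}
    return min(crit, key=crit.get), crit


# Noise-free toy counts with three rate levels (hypothetical data).
counts = [2] * 20 + [20] * 20 + [5] * 20
selected, criterion = select_segments(counts, max_segments=6)
print(selected)
```

With a larger `beta` the criterion favors fewer segments; how to calibrate such a constant is exactly the kind of question the non-asymptotic theory addresses.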
A procedure is provided to detect multiple change-points in the mean of very large Gaussian signals. From an algorithmic point of view, visiting all possible configurations of change-points cannot be performed on large samples. The proposed procedure runs CART first in order to reduce the number of configurations of change-points by keeping the relevant ones, and then runs an exhaustive search on these change-points in order to obtain a convenient configuration. A simulation study compares the different algorithms in terms of theoretical performance and in terms of computational time.
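A minimal sketch of this hybrid strategy, under simplifying assumptions: the CART stage is reduced to greedy recursive binary splitting (the growing step only, with no pruning), and the exhaustive stage is a dynamic program restricted to the retained candidates. The function names, the number of candidates, and the toy signal are hypothetical and do not come from the paper.

```python
import random


def make_cost(y):
    """Return cost(i, j): residual sum of squares of y[i:j] around its mean."""
    s, q = [0.0], [0.0]
    for v in y:
        s.append(s[-1] + v)
        q.append(q[-1] + v * v)

    def cost(i, j):
        total = s[j] - s[i]
        return (q[j] - q[i]) - total * total / (j - i)

    return cost


def cart_candidates(y, max_candidates):
    """Stage 1: greedy binary splitting, keeping at most max_candidates cut points."""
    cost = make_cost(y)
    segments, cuts = [(0, len(y))], []
    while len(cuts) < max_candidates:
        best = None
        for a, b in segments:
            for t in range(a + 1, b):
                gain = cost(a, b) - cost(a, t) - cost(t, b)
                if best is None or gain > best[0]:
                    best = (gain, a, t, b)
        if best is None:  # every segment has length 1
            break
        _, a, t, b = best
        segments.remove((a, b))
        segments += [(a, t), (t, b)]
        cuts.append(t)
    return sorted(cuts)


def best_subset(y, candidates, n_segments):
    """Stage 2: exact search over the candidates only, by dynamic programming."""
    if n_segments > len(candidates) + 1:
        raise ValueError("not enough candidates for the requested segments")
    cost = make_cost(y)
    grid = [0] + sorted(candidates) + [len(y)]
    m = len(grid)
    inf = float("inf")
    best = [[inf] * m for _ in range(n_segments + 1)]
    arg = [[0] * m for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for k in range(1, n_segments + 1):
        for j in range(1, m):
            for i in range(j):
                c = best[k - 1][i] + cost(grid[i], grid[j])
                if c < best[k][j]:
                    best[k][j], arg[k][j] = c, i
    # Backtrack to recover the selected change-points.
    chosen, j = [], m - 1
    for k in range(n_segments, 1, -1):
        j = arg[k][j]
        chosen.append(grid[j])
    return sorted(chosen)


# Toy Gaussian signal with mean changes at 50 and 100 (hypothetical data).
random.seed(1)
signal = [random.gauss(mu, 0.5) for mu in [0.0] * 50 + [4.0] * 50 + [1.0] * 50]
candidates = cart_candidates(signal, max_candidates=10)
change_points = best_subset(signal, candidates, n_segments=3)
print(candidates, change_points)
```

The second stage costs on the order of K times the squared number of candidates rather than K times the squared signal length, which is where the saving on very large signals comes from.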

Motivation: Detecting local correlations in expression between neighboring genes along the genome has proved to be an effective strategy to identify possible causes of transcriptional deregulation in cancer. It has been successfully used to illustrate the role of mechanisms such as copy number variation (CNV) or epigenetic alterations as factors that may significantly alter expression in large chromosomal regions (gene silencing or gene activation). Results: The identification of correlated regions requires segmenting the gene expression correlation matrix into regions of homogeneously correlated genes and assessing whether the observed local correlation is significantly higher than the background chromosomal correlation. A unified statistical framework is proposed to achieve these two tasks, where optimal segmentation is efficiently performed using a dynamic programming algorithm, and detection of highly correlated regions is then achieved using an exact test procedure. We also propose ...
Appendix file containing a table with all the competing methods, the proof of Lemma 1 and the distribution of the test statistic. (PDF 106 kb)
Sampling variance of species identification in fisheries-acoustic surveys based on automated
We propose a method based on a penalized contrast criterion for estimating the change-points in a discrete distribution of independent variables. The number of change-points and their locations are unknown. We consider two minimum contrast estimators: the maximum likelihood one and the least-squares one. In both contexts, we define the penalty function involved in the corresponding criterion such that the resulting estimator minimizes the associated risk non-asymptotically.
Identifying and characterizing the ontogenetic component in tree development
In this article, we discuss the Bayesian Information Criterion (BIC) for model selection. To understand its behavior, we describe the steps of its construction and the assumptions required for its application, detailing the approximations from which it derives. Relying on the notion of quasi-true model, we restate the consistency property for the dimension defined for BIC. Finally, we highlight the differences between BIC and Akaike's AIC criterion by comparing their properties.

Segmentation methods have been successfully applied to the mapping of chromosomal abnormalities when using CGH microarrays. Most current methods deal with one CGH profile only and do not integrate multiple arrays, whereas CGH microarray technology is becoming widely used to characterize chromosomal defects at the cohort level. We present CGHSeg, an R package devoted to the analysis of CGH profiles at the individual and cohort levels. This package performs segmentation in multiple CGH profiles in the framework of linear models, and multivariate segmentation/clustering for the joint characterization of aberration types (status assignment of regions based on the cohort). Overall, linear models offer a unified framework for the joint analysis of multiple CGH profiles, and we show how they can be used to link the experience acquired in the field of expression arrays (normalization, experimental design) with array CGH data analysis.
For the first time, a multiple segmentation procedure for coordinate series is proposed for geographically close GPS stations. It simultaneously estimates displacement velocities and seasonal signals specific to each series while determining a displacement signal common to all stations. An extension of the model proposed by Picard et al. (2011) and Bertin et al. (2014) is considered in order to account for the specific characteristics of GPS data, together with the corresponding iterative estimation procedure. The results obtained on four sets of real GPS series are highly relevant, all the more so because the method avoids segmenting the physical signal by identifying breaks linked to the actual ground motion.
A procedure is provided to detect multiple change-points in the mean for very large Gaussian signals. From an algorithmic point of view, visiting all possible configurations of change-points cannot be performed on large samples. The proposed procedure runs CART to reduce the number of configurations of change-points by keeping the relevant ones, and then runs an exhaustive search on these change-points in order to obtain a convenient configuration. A simulation study compares the different algorithms in terms of theoretical performance and in terms of computational time.
We propose a Bayesian approach to detect multiple change-points in a piecewise-constant signal corrupted by a functional part corresponding to environmental or experimental disturbances. The piecewise-constant part (also called the segmentation part) is expressed as the product of a lower triangular matrix and a sparse vector. The functional part is a linear combination of functions from a large dictionary. A Stochastic Search Variable Selection approach is used to obtain sparse estimates of the segmentation parameters (the change-points and the means over the segments) and of the functional part. The performance of the proposed method is assessed using simulation experiments. Applications to two real datasets from geodesy and economics are also presented.

Interest in the change-point detection problem has been motivated by its applications in several fields. Among them are genomics, with the problem of detecting chromosomal aberrations that can cause serious diseases; geodesy, where the aim is to detect abrupt changes that may be due to changes of devices or to short earthquakes; and telecommunications, where change-point detection techniques can be used to detect network attacks or network anomalies. The change-point detection problem can be expressed as follows: let y_1, ..., y_n be some observations, from which we want to identify the regions in which the observations can be considered as "stationary" in a sense to be defined. The change-points correspond to the boundaries of these regions. More formally, the (y_t)_{1≤t≤n} can be modeled as realizations of a sequence of n random variables (Y_t)_{1≤t≤n} having a probability distribution depending on a parameter θ_t such as:

Homogenization is an important step to improve the quality of long-term observational data sets and estimate climatic trends. In this work, we use the GNSSseg/GNSSfast segmentation packages developed by Quarello et al., 2020, for the detection of abrupt changes in the mean of Integrated Water Vapour (IWV) data derived from GNSS measurements. The method works on the difference of the IWV time series (GNSS minus reference) in order to cancel out the common climatic variations and enhance the discontinuities due to the inhomogeneities in the GNSS series. This segmentation method accounts for changes in the variance on fixed (monthly) intervals and for a periodic (annual) bias due to representativeness differences between GNSS and the reference (in our case, a global atmospheric reanalysis). The goal of this study is to analyze the sensitivity of the segmentation method to the data properties, particularly the GNSS data processing method. Two reprocessed GNSS solutions are considered: IGS repro1, covering the period 1995-2010, and CODE REPRO2015 + OPER, covering the period 1994-2018. Next, the impact of the length of the time series and of missing data is investigated. Finally, the use of two different reference series is considered (ERA-Interim and ERA5 reanalyses). The segmentation results are screened for outliers (multiple detections occurring within a distance of 80 days) and validated with respect to known equipment changes (from GNSS metadata). The impact of the data properties is analyzed by comparing the number and position of detected change-points and the fraction of validated change-points. The influence of the variance of the IWV difference series and of the magnitude of the periodic bias is examined. Finally, the results are compared in terms of estimated linear trends, taking the detected change-points into account. From the multiple comparisons, we found that about 30 % of change-points are similar when the GNSS processing method is changed, while 60 % are similar when the CODE series is shortened to match the length of the repro1 series. These tests highlight that the segmentation results are processing-dependent and are affected by the length of the series. The impact of the data properties on the IWV trends and associated uncertainties is also quantified. It is also important to note that the best segmentation result is found when the ERA5 reanalysis is used as a reference.