Papers by Emmanuelle Génin

IEEE Access, 2020
This paper proposes a new privacy-preserving framework to perform rare variant case-control association tests with information provided by two parties: a Genomic Research Unit (GRU) with sequencing data from individuals affected by a disease D (cases), and a Genomic Research Center (GRC) with sequencing data from healthy individuals (controls). To identify genes with rare variants involved in D, GRU needs to compare cases against controls using association tests (genome-wide association study). The main originality of our proposal is twofold. First, it positions GRC as a proxy between GRU and the server. Doing so makes it possible to use classical cryptographic tools to securely conduct association tests with no increase in computational complexity, in contrast to current state-of-the-art proposals, which are of very high complexity because they are based, for instance, on homomorphic encryption. In particular, we show how sensitive data confidentiality can be ensured with secret-key-based cryptographic hashing with no need to modify statistical algorithms. In our protocol the server simply conducts statistical analyses on partially hashed data. Second, we introduce a novel privacy constraint: GRU's identity should remain unknown to the server, as this knowledge can give it clues about GRU's data (e.g., diseases and genes of interest). We show how Pretty Good Privacy (PGP) can be used to solve this problem. We illustrate our protocol in the case of one rare variant association test, the Weighted-Sum Statistic (WSS) algorithm, carried out on real genetic data. This secure WSS achieves the same accuracy as its nonsecure version with no increase in complexity. Furthermore, we establish that our protocol can be extended to the different rare variant association tests available in the literature.
INDEX TERMS Data confidentiality, data outsourcing, genome-wide association study (GWAS), privacy, secure GWAS platform, weighted-sum statistic (WSS).
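The idea of running statistics on partially hashed data can be illustrated with a minimal Python sketch. It is not the protocol from the paper: the key name, the variant identifier format and the carrier-list input are all assumptions made for illustration, and the key is assumed to be shared by the two data holders and withheld from the server. The point is only that a keyed hash (HMAC) maps the same variant to the same opaque token for cases and controls, so per-variant counts remain computable without revealing which variant is which.

```python
import hashlib
import hmac
from collections import Counter


def hash_variant(secret_key: bytes, variant_id: str) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a variant identifier."""
    return hmac.new(secret_key, variant_id.encode("utf-8"), hashlib.sha256).hexdigest()


def hashed_carrier_counts(secret_key: bytes, carriers: list) -> Counter:
    """Count carriers per hashed variant.

    `carriers` is a hypothetical input format: one list of variant
    identifiers per individual.
    """
    counts = Counter()
    for carried in carriers:
        for variant_id in set(carried):  # count each individual once per variant
            counts[hash_variant(secret_key, variant_id)] += 1
    return counts


# Hypothetical key and toy data, for illustration only.
key = b"secret-key-withheld-from-the-server"
cases = [["chr7:117559590:A:G", "chr1:12345:C:T"], ["chr1:12345:C:T"]]
controls = [["chr1:12345:C:T"], []]

case_counts = hashed_carrier_counts(key, cases)
control_counts = hashed_carrier_counts(key, controls)
```

A party that sees only `case_counts` and `control_counts` can compare carrier counts token by token, which is the kind of input a burden-style test needs, but it cannot recompute the tokens without the key.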

HAL (Le Centre pour la Communication Scientifique Directe), May 7, 2021
NGS mismapping confounds the clinical interpretation of the PRSS1 p.Ala16Val (c.47C>T) variant in chronic pancreatitis. Gut 71, pp. 841-842. doi: 10.1136/gutjnl-2021-324943. Publisher's page: http://dx.doi.org/10.1136/gutjnl-2021-324943

Human Genomics, Aug 16, 2022
The American College of Medical Genetics and Genomics (ACMG)-recommended five variant classification categories (pathogenic, likely pathogenic, uncertain significance, likely benign, and benign) have been widely used in medical genetics. However, these guidelines are fundamentally constrained in practice owing to their focus upon Mendelian disease genes and their dichotomous classification of variants as being either causal or not. Herein, we attempt to expand the ACMG guidelines into a general variant classification framework that takes into account not only the continuum of clinical phenotypes, but also the continuum of the variants' genetic effects, and the different pathological roles of the implicated genes. As a disease model, we employed chronic pancreatitis (CP), which manifests clinically as a spectrum from monogenic to multifactorial. Bearing in mind that any general conceptual proposal should be based upon sound data, we focused our analysis on the four most extensively studied CP genes: PRSS1, CFTR, SPINK1 and CTRC. Based upon several cross-gene and cross-variant comparisons, we first assigned the different genes to two distinct categories in terms of disease causation: CP-causing (PRSS1 and SPINK1) and CP-predisposing (CFTR and CTRC). We then employed two new classificatory categories, "predisposing" and "likely predisposing", to replace ACMG's "pathogenic" and "likely pathogenic" categories in the context of CP-predisposing genes, thereby classifying all pathologically relevant variants in these genes as "predisposing". In the case of CP-causing genes, the two new classificatory categories served to extend the five ACMG categories, whilst two thresholds (allele frequency and functional) were introduced to discriminate "pathogenic" from "predisposing" variants.
Employing CP as a disease model, we expand ACMG guidelines into a five-category classification system (predisposing, likely predisposing, uncertain significance, likely benign, and benign) and a seven-category classification system (pathogenic, likely pathogenic, predisposing, likely predisposing, uncertain significance, likely benign, and benign) in the context of disease-predisposing and disease-causing genes, respectively. Taken together, the two systems constitute a general variant classification framework that, in principle, should span the entire spectrum of variants in any disease-related gene. The maximal compliance of our five-category and seven-category classification systems with the ACMG guidelines ought to facilitate their practical application.

International Journal of Molecular Sciences
About 8% of the human genome is covered with candidate cis-regulatory elements (cCREs). Disruptions of CREs, described as "cis-ruptions", have been identified as being involved in various genetic diseases. Thanks to the development of chromatin conformation study techniques, several long-range cystic fibrosis transmembrane conductance regulator (CFTR) regulatory elements were identified, but the regulatory mechanisms of the CFTR gene have yet to be fully elucidated. The aim of this work is to improve our knowledge of CFTR gene regulation, and to identify factors that could impact CFTR gene expression and potentially account for the variability of the clinical presentation of cystic fibrosis as well as CFTR-related disorders. Here, we apply the robust GWAS3D score to determine which of the CFTR introns could be involved in gene regulation. This approach highlights four particular CFTR introns of interest. Using reporter gene constructs in intestinal cells, we show that two ne...

Genotype-phenotype association tests are typically adjusted for population stratification using principal components that are estimated genome-wide. This lacks resolution when analysing populations with fine structure and/or individuals with fine levels of admixture. This can affect power and precision, and is a particularly relevant consideration when control individuals are recruited using geographic selection criteria. Such is the case in France, where we have recently created reference panels of individuals anchored to different geographic regions. To make correct comparisons against case groups, who would likely be gathered from large urban areas, new methods are needed. We present SURFBAT (a SURrogate Family Based Association Test), which performs an approximation of the transmission-disequilibrium test. Our method hinges on the application of genotype imputation algorithms to match similar haplotypes between the case and control groups. This permits us to approximate local ances...

Bioinformatics, 2018
Summary: Predicted deleteriousness of coding variants is a frequently used criterion to filter out variants detected in next-generation sequencing projects and to select candidates impacting on the risk of human diseases. Most available dedicated tools implement a base-to-base annotation approach that could be biased in the presence of several variants in the same genetic codon. Here we propose the MACARON program, which, from a standard VCF file, identifies, re-annotates and predicts the amino acid change resulting from multiple single nucleotide variants (SNVs) within the same genetic codon. Applied to the whole exome dataset of 573 individuals, MACARON identifies 114 situations where multiple SNVs within a genetic codon induce an amino acid change that is different from those predicted by standard single-SNV annotation tools. Such events are not uncommon and deserve to be studied in sequencing projects with inconclusive findings. Availability and implementation: MACARON is written in pyt...
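The bias described above can be shown with a small Python sketch. This is not MACARON's code: the function names and the toy codon are hypothetical, and the sketch assumes that both SNVs lie on the same haplotype and that the codon is given on the coding strand. It uses the standard genetic code and compares the amino acid predicted for each SNV alone with the one predicted when both are applied together.

```python
# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def mutate_codon(ref_codon, snvs):
    """Apply SNVs, given as (position_in_codon, alt_base) pairs, to a codon."""
    codon = list(ref_codon)
    for position, alt_base in snvs:
        codon[position] = alt_base
    return "".join(codon)


def amino_acid_change(ref_codon, snvs):
    """Return (reference amino acid, alternate amino acid) for the given SNVs."""
    alt_codon = mutate_codon(ref_codon, snvs)
    return CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]


# Toy example: two SNVs in the same codon, CAG (Gln).
ref = "CAG"
snv_a = (0, "A")  # C>A at the first codon position
snv_b = (2, "C")  # G>C at the third codon position

single_a = amino_acid_change(ref, [snv_a])         # each SNV annotated alone
single_b = amino_acid_change(ref, [snv_b])
combined = amino_acid_change(ref, [snv_a, snv_b])  # both SNVs applied together
```

In this toy case the two single-SNV annotations give Gln>Lys and Gln>His, while the combined codon AAC encodes Asn, a change that neither single-SNV annotation reports.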

Annals of Clinical and Translational Neurology, 2019
Objectives: Blood biomarkers for cerebral tissue ischemia are lacking. The goal was to identify a blood transcriptomic signature jointly identified in the ischemic brain. Methods: A nonhuman primate model with middle cerebral artery (MCA) territory infarction was used to study gene expression by microarray during acute ischemic cerebral stroke in the brain and the blood. Brain samples were collected in the infarcted and contralateral non-infarcted cortex as well as blood samples before and after occlusion. Gene expression was compared between the two brain locations to find differentially expressed genes. The expressions of these genes were then compared in the blood pre- and post-occlusion. Results: Hierarchical clustering of brain expression data revealed strong independent clustering of ischemic and nonischemic brain samples. The top five enriched, up-regulated gene sets in the brain were TNFα signaling, apoptosis, P53 pathway, hypoxia, and UV response up. A comparison of differentially...

PLOS ONE, 2017
Cystic Fibrosis is the most common lethal autosomal recessive disorder in the white population, affecting, among other organs, the lung, the pancreas and the liver. Although Cystic Fibrosis is a monogenic disease, many studies reveal a very complex relationship between genotype and clinical phenotype. Indeed, the broad phenotypic spectrum observed in Cystic Fibrosis is far from being explained by obvious genotype-phenotype correlations, and it is accepted that Cystic Fibrosis is the result of multiple factors, including effects of the environment as well as modifier genes. Our objective was to highlight new modifier genes with potential implications in the lung, pancreatic and liver outcomes of the disease. For this purpose we used a systems biology approach that combined database mining, literature mining, gene expression study and network analysis, as well as pathway enrichment analysis and protein-protein interactions. We found that IFI16, CCNE2 and IGFBP2 are potential modifiers of the altered lung function in Cystic Fibrosis. We also found that EPHX1, HLA-DQA1, HLA-DQB1 and DSP, and SLC33A1, GPNMB, NCF2, RASGRP1, LGALS3 and PTPN13, are potential modifiers in pancreas and liver, respectively. Associated pathways indicate that the immune system is likely involved and that Ubiquitin C is probably a central node, linking Cystic Fibrosis to liver and pancreatic disease. We highlight here new modifier genes with potential implications in Cystic Fibrosis. Nevertheless, our in silico analysis requires functional analysis to give our results physiological relevance.
Journal of Cystic Fibrosis, 2007

Genes & Immunity, 2008
Most published works so far have aimed at finding genes associated with multiple sclerosis (MS) susceptibility. Very few studies have attempted to correlate disease features with DNA variants. In a well-characterized sample (651 patients) representative of multiple sclerosis natural history, we undertook a comprehensive study of the role of human leukocyte antigen (HLA) in the course of the disease. We investigated the role of the HLA-DRB1*15 allele in samples stratified according to severity evaluated by the Multiple Sclerosis Severity Score (MSSS), time to reach EDSS 6.0 and disease type. We found that HLA-DRB1*15 genotype does not influence MS severity, even among patients presenting with a given type of the disease. However, we show for the first time that the HLA-DRB1*15 allele modulates the course of MS for relapsing-remitting (RR) onset patients, likely by precipitating the secondary progressive (SP) phase.

Scientific Reports, Jan 2, 2024
accuracy coming from SSPs has been shown in populations such as the Netherlands 11, Estonia 12, Norway 13, and Japan 14. SSP imputation also improves the power of genome-wide association studies (GWAS) involving both common and rare variants 13,15-17. The benefits of using SSPs have been shown to be particularly evident in the context of isolated populations 17-21. SSPs may often be relatively small, and so the best approach may often be to combine an SSP with a large cosmopolitan reference panel. Though combining public and study-specific reference panels is computationally feasible, it remains problematic for other practical reasons. Panels such as the HRC 2 or TOPMED 8 are only fully available through online servers and hence it is not possible to merge their data with one's in-house sequencing data. Hence, most published results cited above involving a combination of panels have merged an SSP with the freely available (but smaller in comparison) 1000G. It should also be recognised that, as leading imputation servers are located outside of the European Union, General Data Protection Regulations have added significant complications for the use of imputation servers 22. In this study, we elucidate precisely what is to be gained or lost from pursuing the use of such servers compared to in-house imputation using SSPs. Leading population-based imputation software invoke haplotype copying models based on the Li-Stephens model 23. This model uses coalescent theory, capturing the idea that if two chromosomes (at a given position) are followed back in time, they will eventually coalesce, sharing a (most recent) common ancestor, and this will translate into stretches of shared haplotypes between individuals. For two unrelated individuals, any given genomic region would likely contain many differences representing a very long coalescent time between the pair.
But with a large enough sample of a population and in a given genomic region, each observed haplotype can be expected to have a shared lineage (and hence have a relatively recent common ancestor) with at least one other haplotype in the sample. Thus these two haplotypes would likely share a near identical haplotype (allowing for only a few very recent mutations) that would stretch far enough to contain multiple common genetic variants. Extending this idea across regions, a given chromosome from the sample can be described as a mosaic of small haplotype segments present in the pool of all other chromosomes in the sample. This concept is harnessed by imputation software; each target individual chromosome is modelled as a mosaic of reference panel haplotypes using genotyping information for the target individual on a set of common variants. Once a likely chain of copying haplotypes is estimated based on similarities for common genetic variants, missing genotypes can be inferred. Or more often, posterior probabilities of missing genotypes across many potential chains are estimated. Developments in imputation software have been driven by the need to make inference from larger and larger reference panels, but also to operate efficiently to find the best subsets of reference individuals for each chromosomal region. In particular, the PBWT 24 algorithm has allowed for very rapid sub-selections of reference panel individuals to serve as region-specific reference haplotype pools. PBWT can be employed as a phasing and imputation software on its own but the algorithm has also been incorporated into other software such as EAGLE2 25, IMPUTE5 5 and SHAPEIT4 26. With the concepts of the Li-Stephens model in mind, it is intuitive that imputation will be successful if the reference panel contains relevant haplotypes which closely match the target individual but also enough diversity to enable good haplotype matching across the target's whole chromosome, i.e. there are no weak links in the chain.
This can explain potentially counter-intuitive results such as the inclusion of the UK10K 27 imputation panel improving the imputation of Italian 28 and even Chinese 29 genomes. Aside from choice of reference panel, an important consideration is the estimation of haplotypes, referred to herein simply as 'phasing'. The accuracy of phasing has also been widely evaluated, with a parallel rapid development of competing software. Population-based phasing software use broadly the same haplotype copying models as imputation software, except that two chains of mosaics have to be found simultaneously rather than a single one. An important difference is that when phasing, inference is often made between individuals in the study. Conversely, when imputing, each target individual has missing genotypes imputed from their pre-phased data using only the reference panel. Older software versions such as IMPUTE2 3 and MaCH 30 provide the possibility of phasing and imputing simultaneously. Avoiding pre-phasing has been shown to give small increases in imputation accuracy, though this comes at the price of a huge increase in computational complexity 31. Therefore, this approach is unlikely to be considered for imputation involving large target and/or reference panel sample sizes (such as those analysed here), and in particular is not possible on current imputation servers. Imputation accuracy has not been investigated in the French population. The French population has considerable internal diversity 32,33 and does not have direct representations in panels such as 1000G, HRC, or TOPMED. Recently, 856 French individuals were whole-genome sequenced at 30-40×; these make up the FranceGenRef panel (Labex GENMED, http://www.genmed.fr/), an obvious candidate for an imputation panel for French genomes.
However, as FranceGenRef is relatively small, it is unclear as to whether it will be competitive with a panel such as the HRC (38,821 individuals) for imputation. Furthermore, FranceGenRef does not include individuals from all corners of France and so may not be appropriate for imputing missing genotypes for all French genomes. In this study, we will evaluate potential approaches for both phasing and imputation of French data using either the Sanger and Michigan imputation servers or in-house phasing and imputation. We will also analyse the interplay of population structure within France and the impact that this can have on phasing and imputation accuracy.
Results
Evaluating imputation servers
Our study involves two French datasets: FrEx, a panel with exome data on 574 individuals recruited in six French cities, and FranceGenRef (FGR), with whole genome sequence data on 856 individuals with ancestry in different French regions (Fig. 1). The constitutions of both datasets are described fully in the "Methods". To motivate the use of a French SSP for imputation of French genomes, an initial investigation of the performance of imputation servers for French individuals was performed. Our technique was to send sets of common variants extracted from
American Journal of Epidemiology, 2010

France has a population with extensive internal fine-structure; and while public imputation reference panels contain an abundance of European genomes, they include few French genomes. Intuitively, using a 'study specific panel' (SSP) for France would therefore likely be beneficial. To investigate, we imputed 550 French individuals using either the University of Michigan imputation server with the Haplotype Reference Consortium panel, or in-house using an SSP of 850 whole-genome sequenced French individuals. With approximate geo-localization of both our target and SSP individuals, we are able to pinpoint different scenarios where SSP-based imputation would be preferred over server-based imputation or vice versa. We could also show to a high degree of resolution how the proximity of the reference panel to a target individual determined the accuracy of both haplotype phasing and genotype imputation. Previous comparisons of different strategies have shown the benefits of combining public ...

European genetic ancestry originates from three main ancestral populations - Western hunter-gatherers, early European farmers and Yamnaya Eurasian herders - whose edges geographically met in present-day France. Despite its central role to our understanding of how the ancestral populations interacted and gave rise to modern population structure, the population history of France has remained largely understudied. Here, we analysed the high-coverage whole-genome sequences and genome-wide genotype profiles of respectively 856 and 3,234 present-day individuals from the northern half of France, and merged them with publicly available present-day and ancient Europe-wide genotype datasets. We also explored, for the first time, the whole-genome sequences of six mediaeval individuals (300-1100 CE) from Western France to gain insights into the genetic impact of what is commonly known as the Migration Period in Europe. We found extensive fine-scale population structure across Brittany and the d...

Genetica, 2022
In this paper, we explain the concept of heritability and describe the different methods and the genotype-phenotype correspondences used to estimate heritability in the specific field of human genetics. Heritability studies are conducted on extremely diverse human traits: quantitative traits (physical, biological, but also cognitive and behavioral measurements) and binary traits (as is the case for most human diseases). Variables such as education and socioeconomic status, instead of being used as covariates in genetic studies, are now the direct object of genetic analysis. We review the different assumptions underlying heritability estimates and dispute the validity of most of them. Moreover, and maybe more importantly, we show that they are very often misinterpreted. These erroneous interpretations lead to a vision of genetic determinism of human traits. This vision is currently being widely disseminated not only by the mass media and the mainstream press, but also by the scientific press. We caution against the dangerous implications it has, both medically and socially. In contrast to the field of animal and plant genetics, for which the polygenic model and the concept of heritability revolutionized selection methods, we explain why it does not provide answers in human genetics.

Background. Estimating relatedness is an important step for many genetic study designs. A variety of methods for estimating coefficients of pairwise relatedness from genotype data have been proposed. Both the kinship coefficient and the fraternity coefficient for all pairs of individuals are of interest. However, when dealing with low-depth sequencing or imputation data, individual-level genotypes cannot be confidently called. Ignoring such uncertainty is known to result in biased estimates. Accordingly, methods have recently been developed to estimate kinship from uncertain genotypes. Results. We present new method-of-moment estimators of both coefficients, calculated directly from genotype likelihoods. We have simulated low-depth genetic data for a sample of individuals with extensive relatedness by using the complex pedigree of the known genetic isolates of Cilento in South Italy. Through this simulation, we explore the behaviour of our estimators, demonstrate their prope...

The presence of missing data in association studies is an important problem, particularly with high-density SNP maps, since the probability that at least one genotype is missing dramatically increases with the number of markers. A possible strategy is to simply ignore the missing data and only use the complete observations, and, consequently, to accept a significant decrease of the sample size. Using GAW15 simulated data from which we removed some genotypes to generate different levels of missing data, we show that this strategy might lead to an important loss in power to detect association, but may also result in false conclusions regarding the most likely susceptibility site if another marker is in linkage disequilibrium with the disease susceptibility site. We propose a multiple imputation approach to deal with missing data on case-parent trios and evaluate the performance of this approach on the same simulated data. We found that our multiple imputation approach has high power to detect association with the susceptibility site even with a large amount of missing data, and can identify the susceptibility sites among a set of sites in linkage disequilibrium.
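The first claim of this abstract, that the chance of an incomplete observation grows quickly with the number of markers, can be made concrete with a short Python sketch. It rests on an assumption that is not in the abstract: each genotype is missing independently with the same probability p, so the probability that at least one of m genotypes is missing is 1 - (1 - p)^m. The function name and the 2% missing rate are illustrative only.

```python
def prob_any_missing(missing_rate: float, n_markers: int) -> float:
    """Probability that at least one of n_markers genotypes is missing,
    assuming independent missingness at the same rate for every marker."""
    return 1.0 - (1.0 - missing_rate) ** n_markers


# Illustrative values only: a 2% per-genotype missing rate.
for n_markers in (1, 10, 50, 200):
    share = prob_any_missing(0.02, n_markers)
    print(f"{n_markers:>4} markers: {share:.1%} of observations incomplete")
```

Under these assumptions, with 50 markers roughly 64% of observations would have at least one missing genotype, which is why a complete-case analysis can discard most of the sample.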

European Journal of Human Genetics, 2021
Rare genetic variants are expected to play an important role in disease, and several statistical methods have been developed to test for disease association with rare variants, including variance-component tests. These tests, however, deal only with binary or continuous phenotypes, and it is not possible to take advantage of a suspected heterogeneity between subgroups of patients. To address this issue, we extended the popular rare-variant association test SKAT to compare more than two groups of individuals. Simulations under different scenarios showed a gain in power in the presence of genetic heterogeneity and a minor loss of power in the absence of heterogeneity. An application to whole-exome sequencing data from patients with early- or late-onset moyamoya disease also illustrated the advantage of our SKAT extension. Genetic simulations and the SKAT extension are implemented in the R package Ravages, available on GitHub (https://github.com/genostats/Ravages).

Clinical Genetics, 2020
Bardet-Biedl syndrome (BBS) is a ciliopathy characterized by retinitis pigmentosa, obesity, polydactyly, cognitive impairment and renal failure. Pathogenic variants in 24 genes account for the molecular basis of >80% of cases. Toward saturated discovery of the mutational basis of the disorder, we carefully explored our cohorts and identified a hominid-specific SINE-R/VNTR/Alu type F (SVA-F) insertion in exon 13 of BBS1 in eight families. In six families, the repeat insertion was found in trans with c.1169T>G, p.Met390Arg, and in two families the insertion was found in addition to other recessive BBS loci. Whole genome sequencing, de novo assembly and SNP array analysis were performed to characterize the genomic event. This insertion is extremely rare in the general population (found in 8 alleles of 8 BBS cases but not in >10 800 control individuals from gnomAD-SV) and due to a founder effect. Its 2435 bp sequence contains hallmarks of LINE1-mediated retrotransposition. Fu...