Papers by Smita Krishnaswamy

Journal of Cell Biology
Skin homeostasis is maintained by stem cells, which must communicate to balance their regenerative behaviors. Yet, how adult stem cells signal across regenerative tissue remains unknown due to challenges in studying signaling dynamics in live mice. We combined live imaging in the mouse basal stem cell layer with machine learning tools to analyze patterns of Ca2+ signaling. We show that basal cells display dynamic intercellular Ca2+ signaling among local neighborhoods. We find that these Ca2+ signals are coordinated across thousands of cells and that this coordination is an emergent property of the stem cell layer. We demonstrate that G2 cells are required to initiate normal levels of Ca2+ signaling, while connexin43 connects basal cells to orchestrate tissue-wide coordination of Ca2+ signaling. Lastly, we find that Ca2+ signaling drives cell cycle progression, revealing a communication feedback loop. This work provides resolution into how stem cells at different cell cycle stages co...

Nature Communications
Due to commonalities in pathophysiology, age-related macular degeneration (AMD) represents a uniquely accessible model to investigate therapies for neurodegenerative diseases, leading us to examine whether pathways of disease progression are shared across neurodegenerative conditions. Here we use single-nucleus RNA sequencing to profile lesions from 11 postmortem human retinas with age-related macular degeneration and 6 control retinas with no history of retinal disease. We create a machine-learning pipeline based on recent advances in data geometry and topology and identify activated glial populations enriched in the early phase of disease. Examining single-cell data from Alzheimer’s disease and progressive multiple sclerosis with our pipeline, we find a similar glial activation profile enriched in the early phase of these neurodegenerative diseases. In late-stage age-related macular degeneration, we identify a microglia-to-astrocyte signaling axis mediated by interleukin-1β which ...

Nature Machine Intelligence
The development of powerful natural language models has improved the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution and next-generation sequencing have allowed for the accumulation of large amounts of labelled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder, which features a highly structured latent space that is trained to jointly generate sequences as well as predict fitness. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal. Using ReLSO, we explicitly model the sequence-function landscape of large labelled datasets and generate new molecules by optimizing within the latent space using gradient-based methods. We evaluate this approach on several publicly available protein datasets, including variant sets of anti-ranibizumab and green fluorescent protein. We observe a greater sequence optimization efficiency (increase in fitness per optimization step) using ReLSO compared with other approaches, where ReLSO more robustly generates high-fitness sequences. Furthermore, the attention-based relationships learned by the jointly trained ReLSO models provide a potential avenue towards sequence-level fitness attribution information. An alternative to working in the sequence space is to learn a low-dimensional, semantically rich representation of peptides and proteins. These latent representations collectively form the latent space, which is easier to navigate.
With this approach, a therapeutic candidate can be optimized using its latent representation, in a procedure called latent space optimization. Here we propose ReLSO, a deep transformer-based approach to protein design, which combines the powerful encoding ability of a transformer model with a bottleneck that produces information-rich, low-dimensional latent representations. The latent space in ReLSO, besides being low dimensional, is regularized to be (1) smooth with respect to structure and fitness by way of fitness prediction from the latent space, (2) continuous and interpolatable between training data points and (3) pseudoconvex on the basis of negative sampling outside the data. This highly designed latent space enables optimization directly in latent space using gradient ascent on the fitness and converges to an optimum that can then be decoded back into the sequence space. Key contributions of ReLSO include the following.
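The optimization loop described above can be sketched in a few lines. This is a toy illustration, not the paper's trained networks: the quadratic `fitness` function stands in for ReLSO's learned, pseudoconvex fitness-prediction head, and the step size and latent dimension are arbitrary.

```python
import numpy as np

# Illustrative sketch of latent-space optimization: start from a latent
# point z0 and ascend the gradient of a fitness surrogate f(z). The
# concave quadratic peaked at z_star mimics the pseudoconvex landscape
# ReLSO's regularization encourages; z_star and lr are made up here.
def fitness(z, z_star):
    return -np.sum((z - z_star) ** 2)

def grad_fitness(z, z_star):
    return -2.0 * (z - z_star)

def latent_ascent(z0, z_star, lr=0.1, steps=50):
    z = z0.copy()
    for _ in range(steps):
        z = z + lr * grad_fitness(z, z_star)  # gradient ascent on predicted fitness
    return z

z_star = np.array([1.0, -0.5])   # hypothetical high-fitness latent point
z0 = np.zeros(2)                 # latent encoding of the starting sequence
z_opt = latent_ascent(z0, z_star)
print(fitness(z_opt, z_star) > fitness(z0, z_star))  # True: ascent improves fitness
```

In the real model, `z_opt` would then be passed through the decoder to recover an optimized sequence.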

The complexity and intelligence of the brain give the illusion that measurements of brain activity will have intractably high dimensionality, rife with collection and biological noise. Nonlinear dimensionality reduction methods like UMAP and t-SNE have proven useful for high-throughput biomedical data. However, they have not been used extensively for brain imaging data such as from functional magnetic resonance imaging (fMRI), a noninvasive, secondary measure of neural activity over time containing redundancy and co-modulation from neural population activity. Here we introduce a nonlinear manifold learning algorithm for time-series data like fMRI, called temporal potential of heat diffusion for affinity-based transition embedding (T-PHATE). In addition to recovering a lower intrinsic dimensionality from time-series data, T-PHATE exploits autocorrelative structure within the data to faithfully denoise dynamic signals and learn activation manifolds. We empirically validate T-PHATE on thr...
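The dual-view idea, combining an affinity built from signal values with one reflecting the data's temporal structure, can be sketched schematically. This is a loose simplification: the kernels, the elementwise combination rule, and the lag-decay form are stand-ins, not T-PHATE's actual construction.

```python
import numpy as np

# Schematic two-view affinity for timeseries data, loosely in the spirit
# of T-PHATE: one kernel from signal similarity, one from temporal
# proximity, combined and row-normalized into a Markov diffusion operator.
def value_affinity(X, sigma=1.0):
    # Gaussian kernel on pairwise distances between timepoint signals.
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2 * sigma ** 2))

def temporal_affinity(T, tau=2.0):
    # Affinity that decays with temporal lag between timepoints.
    t = np.arange(T)
    return np.exp(-np.abs(t[:, None] - t[None, :]) / tau)

X = np.random.default_rng(1).normal(size=(6, 4))   # 6 timepoints, 4 channels
P = value_affinity(X) * temporal_affinity(len(X))  # combine the two views
P = P / P.sum(axis=1, keepdims=True)               # row-stochastic operator
print(np.allclose(P.sum(axis=1), 1.0))             # True: valid Markov operator
```

Powering such an operator diffuses signal along both similarity and time, which is the intuition behind denoising autocorrelated dynamics.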

Cancer Discovery
Phenotypic plasticity describes the ability of cancer cells to undergo dynamic, nongenetic cell state changes that amplify cancer heterogeneity to promote metastasis and therapy evasion. Thus, cancer cells occupy a continuous spectrum of phenotypic states connected by trajectories defining dynamic transitions upon a cancer cell state landscape. With technologies proliferating to systematically record molecular mechanisms at single-cell resolution, we illuminate manifold learning techniques as emerging computational tools to effectively model cell state dynamics in a way that mimics our understanding of the cell state landscape. We anticipate that “state-gating” therapies targeting phenotypic plasticity will limit cancer heterogeneity, metastasis, and therapy resistance. Significance: Nongenetic mechanisms underlying phenotypic plasticity have emerged as significant drivers of tumor heterogeneity, metastasis, and therapy resistance. Herein, we discuss new experimental and computation...
The development of powerful natural language models has increased the ability to learn meaningful representations of protein sequences. In addition, advances in high-throughput mutagenesis, directed evolution, and next-generation sequencing have allowed for the accumulation of large amounts of labeled fitness data. Leveraging these two trends, we introduce Regularized Latent Space Optimization (ReLSO), a deep transformer-based autoencoder which is trained to jointly generate sequences as well as predict fitness. Using ReLSO, we explicitly model the underlying sequence-function landscape of large labeled datasets and optimize within latent space using gradient-based methods. Through regularized prediction heads, ReLSO introduces a powerful protein sequence encoder and a novel approach for efficient fitness landscape traversal.

Lecture Notes in Computer Science
Recent work has established clear links between the generalization performance of trained neural networks and the geometry of their loss landscape near the local minima to which they converge. This suggests that qualitative and quantitative examination of the loss landscape geometry could yield insights about neural network generalization performance during training. To this end, researchers have proposed visualizing the loss landscape through the use of simple dimensionality reduction techniques. However, such visualization methods have been limited by their linear nature and only capture features in one or two dimensions, thus restricting sampling of the loss landscape to lines or planes. Here, we expand and improve upon these in three ways. First, we present a novel "jump and retrain" procedure for sampling relevant portions of the loss landscape. We show that the resulting sampled data holds more meaningful information about the network's ability to generalize. Next, we show that non-linear dimensionality reduction of the jump and retrain trajectories via PHATE, a trajectory and manifold-preserving method, allows us to visualize differences between networks that are generalizing well vs. poorly. Finally, we combine PHATE trajectories with a computational homology characterization to quantify trajectory differences.

The last decade has witnessed a technological arms race to encode the molecular states of cells into DNA libraries, turning DNA sequencers into scalable single-cell microscopes. Single-cell measurement of chromatin accessibility (DNA), gene expression (RNA), and proteins has revealed rich cellular diversity across tissues, organisms, and disease states. However, single-cell data poses a unique set of challenges. A dataset may comprise millions of cells with tens of thousands of sparse features. Identifying biologically relevant signals from the background sources of technical noise requires innovation in predictive and representational learning. Furthermore, unlike in machine vision or natural language processing, biological ground truth is limited. Here we leverage recent advances in multi-modal single-cell technologies which, by simultaneously measuring two layers of cellular processing in each cell, provide ground truth analogous to language translation. We define three key tasks...

2021 IEEE International Conference on Big Data (Big Data), 2021
A major challenge in embedding or visualizing clinical patient data is the heterogeneity of variable types including continuous lab values, categorical diagnostic codes, as well as missing or incomplete data. In particular, in EHR data, some variables are missing not at random (MNAR) but deliberately not collected and thus are a source of information. For example, lab tests may be deemed necessary for some patients on the basis of suspected diagnosis, but not for others. Here we present the MURAL forest, an unsupervised random forest for representing data with disparate variable types (e.g., categorical, continuous, MNAR). MURAL forests consist of a set of decision trees where node-splitting variables are chosen at random, such that the marginal entropy of all other variables is minimized by the split. This allows us to also split on MNAR variables and discrete variables in a way that is consistent with the continuous variables. The end goal is to learn the MURAL embedding of patients using average tree distances between those patients. These distances can be fed to a nonlinear dimensionality reduction method like PHATE to derive visualizable embeddings. While such methods are ubiquitous in continuous-valued datasets (like single-cell RNA-sequencing), they have not been used extensively in mixed variable data. We showcase the use of our method on one artificial and two clinical datasets. We show that using our approach, we can visualize and classify data more accurately than competing approaches. Finally, we show that MURAL can also be used to
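The "average tree distances" step can be illustrated with a toy forest. Note this is a simplified proxy: MURAL's actual trees use entropy-guided splits over mixed and MNAR variables and tree path distances, whereas here pairwise distance is just the fraction of trees in which two patients land in different leaves.

```python
import numpy as np

# Toy forest-derived distance between samples, in the spirit of averaging
# tree distances across a MURAL forest. leaf_ids[t, i] is the leaf that
# tree t assigns to sample i; distance is averaged leaf disagreement.
def forest_distance(leaf_ids):
    n_trees, n = leaf_ids.shape
    d = np.zeros((n, n))
    for t in range(n_trees):
        # 1 if the pair falls in different leaves of tree t, else 0
        d += (leaf_ids[t][:, None] != leaf_ids[t][None, :]).astype(float)
    return d / n_trees  # average over trees

leaf_ids = np.array([[0, 0, 1],    # tree 1: samples 0,1 share a leaf
                     [0, 1, 1]])   # tree 2: samples 1,2 share a leaf
D = forest_distance(leaf_ids)
print(D[0, 1], D[0, 2])  # 0.5 1.0
```

A distance matrix like `D` is what would then be passed to a method like PHATE for embedding.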

2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP), 2021
We propose a method called integrated diffusion for combining multimodal data, gathered via different sensors on the same system, to create an integrated data diffusion operator. As real world data suffers from both local and global noise, we introduce mechanisms to optimally calculate a diffusion operator that reflects the combined information in data by maintaining low frequency eigenvectors of each modality both globally and locally. We show the utility of this integrated operator in denoising and visualizing multimodal toy data as well as multi-omic data generated from blood cells, measuring both gene expression and chromatin accessibility. Our approach better visualizes the geometry of the integrated data and captures known cross-modality associations. More generally, integrated diffusion is broadly applicable to multimodal datasets generated by noisy sensors collected in a variety of fields.
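The core construction, one diffusion operator per modality composed into a joint operator, can be sketched as follows. This is a bare-bones illustration: the paper chooses modality-specific diffusion powers from each operator's spectrum, which are fixed to 1 here, and the Gaussian kernel bandwidth is arbitrary.

```python
import numpy as np

# Minimal sketch of combining two modalities measured on the same cells:
# build a row-stochastic diffusion operator per modality, then compose
# them by matrix product to diffuse through both data geometries.
def diffusion_operator(X, sigma=1.0):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))          # Gaussian affinity kernel
    return K / K.sum(axis=1, keepdims=True)     # row-normalize to Markov

rng = np.random.default_rng(0)
X1 = rng.normal(size=(5, 3))   # modality 1 (e.g., gene expression)
X2 = rng.normal(size=(5, 4))   # modality 2 (e.g., chromatin accessibility)
P = diffusion_operator(X1) @ diffusion_operator(X2)  # integrated operator
print(np.allclose(P.sum(axis=1), 1.0))  # True: product stays row-stochastic
```

Because each factor is row-stochastic, the composed operator is still a valid diffusion operator and can be powered or eigendecomposed like a single-modality one.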

Skin epidermal homeostasis is maintained via constant regeneration by stem cells, which must communicate to balance their self-renewal and differentiation. A key molecular pathway, Ca2+ signaling, has been implicated as a signal integrator in developing and wounded epithelial tissues[1, 2, 3, 4]. Yet how stem cells carry out this signaling across a regenerative tissue remains unknown due to significant challenges in studying signaling dynamics in live mice, limiting our understanding of the mechanisms of stem cell communication during homeostasis. To interpret high dimensional signals that have complex spatial and temporal patterns, we combined optimized imaging of Ca2+ signaling in thousands of epidermal stem cells in living mice with a new machine learning tool, Geometric Scattering Trajectory Homology (GSTH). Using a combination of signal processing, data geometry, and topology, GSTH captures patterns of signaling at multiple scales, either between direct or distant stem cell neig...

In many important contexts involving measurements of biological entities, there are distinct categories of information: some information is easy-to-obtain information (EI) and can be gathered on virtually every subject of interest, while other information is hard-to-obtain information (HI) and can only be gathered on some of the biological samples. For example, in the context of drug discovery, measurements like the chemical structure of a drug are EI, while measurements of the transcriptome of a cell population perturbed with the drug are HI. In the clinical context, basic health monitoring is EI because it is already being captured as part of other processes, while cellular measurements like flow cytometry or even ultimate patient outcome are HI. We propose building a model to make probabilistic predictions of HI from EI on the samples that have both kinds of measurements, which will allow us to generalize and predict the HI on a large set of samples from just the EI. To accomplish...

Previously, the effect of a drug on a cell population was measured based on simple metrics such as cell viability. However, as single-cell technologies are becoming more advanced, drug screen experiments can now be conducted with more complex readouts such as gene expression profiles of individual cells. The increasing complexity of measurements from these multi-sample experiments calls for more sophisticated analytical approaches than are currently available. We developed a novel method called PhEMD (Phenotypic Earth Mover’s Distance) and show that it can be used to embed the space of drug perturbations on the basis of the drugs’ effects on cell populations. When testing PhEMD on a newly-generated, 300-sample CyTOF kinase inhibition screen experiment, we find that the state space of the perturbation conditions is surprisingly low-dimensional and that the network of drugs demonstrates manifold structure. We show that because of the fairly simple manifold geometry of the 300 samples,...
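The building block here, an Earth Mover's Distance between two samples' cell-state distributions, has a simple closed form in one dimension that illustrates the idea. The toy histograms below are made up; PhEMD itself computes EMD over a learned cell-state manifold, not fixed 1-D bins.

```python
import numpy as np

# 1-D Earth Mover's Distance between two normalized histograms with
# unit-spaced bins: the L1 distance between their cumulative sums.
# Each histogram plays the role of one sample's cell-state distribution.
def emd_1d(p, q):
    p = np.asarray(p, float) / np.sum(p)
    q = np.asarray(q, float) / np.sum(q)
    return np.sum(np.abs(np.cumsum(p) - np.cumsum(q)))

drug_a = [10, 0, 0]   # all cells in state bin 0
drug_b = [0, 0, 10]   # all cells in state bin 2
print(emd_1d(drug_a, drug_b))  # 2.0: every cell's mass moved two bins
```

Computing such a distance for every pair of drug conditions yields the distance matrix that PhEMD-style methods embed to reveal the low-dimensional structure of the perturbation space.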

Journal of immunology (Baltimore, Md. : 1950), Jan 6, 2018
Type 1 diabetes (T1D) is most likely caused by killing of β cells by autoreactive CD8 T cells. Methods to isolate and identify these cells are limited by their low frequency in the peripheral blood. We analyzed CD8 T cells, reactive with diabetes Ags, with T cell libraries and further characterized their phenotype by CyTOF using class I MHC tetramers. In the libraries, the frequency of islet Ag-specific CD45ROIFN-γCD8 T cells was higher in patients with T1D compared with healthy control subjects. Ag-specific cells from the libraries of patients with T1D were reactive with ZnT8, whereas those from healthy controls recognized ZnT8 and other Ags. ZnT8-reactive CD8 cells expressed an activation phenotype in T1D patients. We found TCR sequences that were used in multiple library wells from patients with T1D, but these sequences were private and not shared between individuals. These sequences could identify the Ag-specific T cells on a repeated draw, ex vivo in the IFN-γ CD8 T cell subset....

Single-cell RNA-sequencing is fast becoming a major technology that is revolutionizing biological discovery in fields such as development, immunology and cancer. The ability to simultaneously measure thousands of genes at single cell resolution allows, among other prospects, for the possibility of learning gene regulatory networks at large scales. However, scRNA-seq technologies suffer from many sources of significant technical noise, the most prominent of which is ‘dropout’ due to inefficient mRNA capture. This results in data that has a high degree of sparsity, with typically only ~10% non-zero values. To address this, we developed MAGIC (Markov Affinity-based Graph Imputation of Cells), a method for imputing missing values, and restoring the structure of the data. After MAGIC, we find that two- and three-dimensional gene interactions are restored and that MAGIC is able to impute complex and non-linear shapes of interactions. MAGIC also retains cluster structure, enhances ...
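The diffusion step at the heart of this kind of imputation can be sketched compactly: build a cell-cell Markov affinity operator P and share expression across similar cells via P^t X. The fixed Gaussian kernel and diffusion time below are simplifications; MAGIC itself uses an adaptive kernel and a data-driven choice of t.

```python
import numpy as np

# Toy diffusion-based imputation in the spirit of MAGIC: similar cells
# share expression, so diffusing X through a cell-cell Markov operator
# fills in dropout zeros from each cell's neighborhood.
def magic_like(X, sigma=1.0, t=3):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))       # cell-cell affinity kernel
    P = K / K.sum(axis=1, keepdims=True)     # row-stochastic diffusion operator
    return np.linalg.matrix_power(P, t) @ X  # diffuse expression for t steps

X = np.array([[1.0, 0.0],    # cell with dropout: gene 2 reads zero
              [1.0, 2.0],    # similar cells where gene 2 is expressed
              [1.0, 2.1]])
X_imp = magic_like(X)
print(X_imp[0, 1] > 0.0)  # True: the zero is imputed from similar cells
```

The imputed value is a neighborhood-weighted average, which is why MAGIC recovers smooth gene-gene relationships rather than single-cell point estimates.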

Proceedings of the 46th Annual Design Automation Conference, 2009
State elements are increasingly vulnerable to soft errors due to their decreasing size, and the fact that latched errors cannot be completely eliminated by electrical or timing masking. Most prior methods of reducing the soft-error rate (SER) involve combinational redesign, which tends to add area and decrease testability, the latter a concern due to the prevalence of manufacturing defects. Our work explores the fundamental relations between the SER of sequential circuits and their testability in scan mode, and appears to be the first to improve both through retiming. Our retiming methodology relocates registers so that 1) registers become less observable with respect to primary outputs, thereby decreasing overall SER, and 2) combinational nodes become more observable with respect to registers (but not with respect to primary outputs), thereby increasing scan testability. We present experimental results which show an average decrease of 42% in the SER of latches, and an average improvement of 31% in random-pattern testability.

Proceedings of the 45th annual Design Automation Conference, 2008
Soft errors, once only of concern in memories, are beginning to affect logic as well. Determining the soft error rate (SER) of a combinational circuit involves three main masking mechanisms: logic, timing and electrical. Most previous papers focus on logic and electrical masking. In this paper we develop static and statistical analysis techniques for timing masking that estimate the error-latching window of each gate. Our SER evaluation algorithms incorporating timing masking are orders of magnitude faster than comparable evaluators and can be used in synthesis and layout. We show that 62% of gates identified as error-critical using timing masking would not be identifiable by considering only logic masking. Furthermore, hardening the top 10% of error-critical gates leads to a 43% reduction in the SER. We also propose a more subtle solution, gate relocation, for technologies where wire delay dominates gate delay. We decrease the error-latching window of each gate by relocating it in such a way that path lengths to primary outputs are equalized. Our results show a 14% improvement in SER with no area overhead.

Design, Automation and Test in Europe
Soft errors are an increasingly serious problem for logic circuits. To estimate the effects of soft errors on such circuits, we develop a general computational framework based on probabilistic transfer matrices (PTMs). In particular, we apply them to evaluate circuit reliability in the presence of soft errors, which involves combining the PTMs of gates to form an overall circuit PTM. Information such as output probabilities, the overall probability of error, and signal observability can then be extracted from the circuit PTM. We employ algebraic decision diagrams (ADDs) to improve the efficiency of PTM operations. A particularly challenging technical problem, solved in our work, is to simultaneously extend tensor products and matrix multiplication in terms of ADDs to non-square matrices. Our PTM-based method enables accurate evaluation of reliability for moderately large circuits and can be extended by circuit partitioning. To demonstrate the power of the PTM approach, we apply it to several problems in fault-tolerant design and reliability improvement.
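The PTM composition rules, matrix product for serial stages and tensor (Kronecker) product for parallel gates, can be shown on a tiny example. The error probability and the single-gate circuit below are illustrative; the paper's framework operates on full netlists with ADD-compressed matrices rather than dense numpy arrays.

```python
import numpy as np

# Toy PTM of a faulty 2-input NAND: rows index input patterns 00,01,10,11;
# columns give P(out=0), P(out=1). The ideal truth table is composed with
# an output bit-flip of probability e (an assumed, illustrative fault model).
def nand_ptm(e):
    ideal = np.array([[0, 1], [0, 1], [0, 1], [1, 0]], float)
    flip = np.array([[1 - e, e], [e, 1 - e]])  # output flips with prob. e
    return ideal @ flip  # serial composition = matrix product

e = 0.1
two_nands = np.kron(nand_ptm(e), nand_ptm(e))  # parallel gates = Kronecker product
# Row 0 = both gates see input 00; column 3 = both outputs read 1 (correct).
print(two_nands[0, 3])  # ≈ 0.81, i.e. (1 - e) squared
```

From a circuit PTM like `two_nands`, quantities such as the overall probability of error fall out by summing the columns corresponding to incorrect outputs.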

Proceedings of the 48th Design Automation Conference, 2011
With the growing complexity of synthetic biological circuits, robust and systematic methods are needed for design and test. Leveraging lessons learned from the semiconductor and design automation industries, synthetic biologists are starting to adopt computer-aided design and verification software with some success. However, due to the great challenges associated with designing synthetic biological circuits, this nascent approach has to address many problems not present in electronic circuits. In this session, three leading synthetic biologists will share how they have developed software tools to help design and verify their synthetic circuits, the unique challenges they face, and their insights into the next generation of tools for synthetic biology.