Papers by Nikita Sakhanenko
A systematic, language-independent method of finding a minimal set of paths covering the code of a sequential program is proposed for application in White Box testing. Execution of all paths from the set also ensures statement coverage. An execution fault marks problematic areas of the code. The method starts from a UML activity diagram of a program. The diagram is transformed into a directed graph: the graph's nodes replace decision and action points, and its directed edges replace the action arrows.
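As a rough illustration of the graph computation this abstract describes (the sketch below is not from the paper; the greedy selection strategy and the toy control-flow graph are assumptions, and greedy covering does not guarantee the minimal set the method produces), one can pick start-to-end paths in the derived directed graph until every edge is covered:

```python
# Illustrative sketch only: greedily choose start-to-end paths in a directed
# graph derived from an activity diagram until every edge is covered.
def all_paths(graph, start, end):
    """Enumerate simple start-to-end paths in a graph given as {node: [successors]}."""
    stack = [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node == end:
            yield path
            continue
        for nxt in graph.get(node, []):
            if nxt not in path:          # guard against revisiting nodes
                stack.append((nxt, path + [nxt]))

def greedy_edge_cover(graph, start, end):
    """Repeatedly pick the path covering the most not-yet-covered edges."""
    uncovered = {(u, v) for u, succs in graph.items() for v in succs}
    paths = list(all_paths(graph, start, end))
    chosen = []
    while uncovered and paths:
        best = max(paths, key=lambda p: len(set(zip(p, p[1:])) & uncovered))
        gained = set(zip(best, best[1:])) & uncovered
        if not gained:
            break
        chosen.append(best)
        uncovered -= gained
    return chosen

# Hypothetical control-flow graph with one decision point
g = {"start": ["decision"], "decision": ["a", "b"], "a": ["end"], "b": ["end"]}
print(greedy_edge_cover(g, "start", "end"))   # two paths cover all edges
```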

Proceedings of the Indian International Conference on Artificial Intelligence (IICAI), 2007
We describe a flexible multi-layer architecture for context-sensitive stochastic modeling. The architecture incorporates a high-performance stochastic modeling core based on a recursive form of probabilistic logic. On top of this modeling core, causal representations and reasoning direct a long-term incremental learning process that produces a context-partitioned library of stochastic models. The failure-driven learning procedure for expanding and refining the model library employs a combination of abductive inference together with EM model induction to construct new models when current models no longer perform acceptably. The system uses a causal finite state machine representation to control on-line model switching and model adaptation along with embedded learning. Our system is designed to support operational deployment in real-time monitoring, diagnostic, prognostic, and decision support applications. In this paper we describe the basic multi-layer architecture along with new learning algorithms inspired by developmental learning theory.
We propose a context-sensitive probabilistic modeling system (COSMOS) that reasons about a complex, dynamic environment through a series of applications of smaller, knowledge-focused models representing contextually relevant information. COSMOS uses a failure-driven architecture to determine whether a context is supported, and consequently whether the current model remains applicable. The individual models are specified through sets of structured, ...
Computing Research Repository, 2006
Shock physics experiments are often complicated and expensive. As a result, researchers are unable to conduct as many experiments as they would like, leading to sparse data sets. In this paper, Support Vector Machines for regression are applied to velocimetry data sets for shock damaged and melted tin metal. Some success at interpolating between data sets is achieved. Implications ...
Tin coupons were shock damaged/melted under identical conditions with a diverging high explosive shock wave. Proton Radiography images and velocimetry data from experiments with seven different tin coupons of varying thickness are analyzed. Comparing experiments with identical samples allowed us to distinguish between repeatable and random features. Shapes and velocities of the main fragments are deterministic functions of the coupon ...
Metal melting on release after explosion is a physical system far from equilibrium. A complete physical model of this system does not exist, because many interrelated effects have to be considered. A general methodology needs to be developed to describe and understand the physical phenomena involved.

International Journal of Modern Physics C, 2006
This paper considers a set of shock physics experiments that investigate how materials respond to the extremes of deformation, pressure, and temperature when exposed to shock waves. Due to the complexity and the cost of these tests, the available experimental data set is often very sparse. A support vector machine (SVM) technique for regression is used for data estimation of velocity measurements from the underlying experiments. Because of good generalization performance, the SVM method successfully interpolates the experimental data. The analysis of the resulting velocity surface provides more information on the physical phenomena of the experiment. Additionally, the estimated data can be used to identify outlier data sets, as well as to increase the understanding of the other data from the experiment.
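As a minimal sketch of the kind of SVM regression described here (assuming scikit-learn; the synthetic data, kernel choice, and hyperparameters below are placeholders rather than values from the study):

```python
# Sketch: fit an SVM regressor to sparse measurements, then predict on a
# dense grid along the sparse dimension to obtain an interpolated surface.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
# Hypothetical sparse measurements: (coupon thickness, time) -> surface velocity
X = rng.uniform([1.0, 0.0], [10.0, 5.0], size=(40, 2))
y = 0.3 * X[:, 0] + np.sin(X[:, 1]) + 0.05 * rng.normal(size=40)

model = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(X, y)

# Interpolate densely in time at a fixed thickness of 5.0 (arbitrary choice)
grid = np.column_stack([np.full(50, 5.0), np.linspace(0.0, 5.0, 50)])
velocity_estimate = model.predict(grid)
```

Predicting on such a dense grid along the undersampled dimension is what yields the interpolated velocity surface the abstract refers to.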

International Journal on Artificial Intelligence Tools, 2009
In this paper we present a novel support vector machine (SVM) based framework for prognosis and diagnosis. We apply the framework to sparse physics data sets, although the method can easily be extended to other domains. Experiments in applied fields, such as experimental physics, are often complicated and expensive. As a result, experimentalists are unable to conduct as many experiments as they would like, leading to very unbalanced data sets that can be dense in one dimension and very sparse in others. Our method predicts the data values along the sparse dimension, providing more information to researchers. Often experiments deviate from expectations due to small misalignments in initial parameters. It can be challenging to distinguish these outlier experiments from those where a real underlying process caused the deviation. Our method detects these outlier experiments. We describe our success at prediction and outlier detection and discuss implications for future applications.
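The outlier-detection side of such a framework can be caricatured as residual thresholding; the sketch below is an assumption-laden illustration (the 3-sigma cutoff and the SVR settings are not from the paper):

```python
# Sketch: fit a regressor on all experiments, then flag experiments whose
# measurements deviate most strongly from the fitted surface.
import numpy as np
from sklearn.svm import SVR

def flag_outlier_experiments(X, y, experiment_ids, n_sigma=3.0):
    """Return ids of experiments whose residuals exceed n_sigma standard deviations."""
    model = SVR(kernel="rbf", C=10.0).fit(X, y)
    residuals = np.asarray(y) - model.predict(X)
    cutoff = n_sigma * residuals.std()
    return sorted({eid for eid, r in zip(experiment_ids, residuals) if abs(r) > cutoff})
```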
A New Approach to Model-Based Diagnosis Using Probabilistic Logic
We describe a new approach to model construction using transfer function diagrams that are consequently mapped into generalized loopy logic, a first-order, Turing-complete stochastic language. Transfer function diagrams support representation of dynamic systems with ...

Biological Data Analysis as an Information Theory Problem: Multivariable Dependence Measures and the Shadows Algorithm
Journal of Computational Biology, 2015
Information theory is valuable in multiple-variable analysis for being model-free and nonparametric, and for its modest sensitivity to undersampling. We previously introduced a general approach to finding multiple dependencies that provides accurate measures of levels of dependency for subsets of variables in a data set, which is significantly nonzero only if the subset of variables is collectively dependent. This is useful, however, only if we can avoid a combinatorial explosion of calculations for increasing numbers of variables. The proposed dependence measure for a subset of variables, τ, differential interaction information, Δ(τ), has the property that some of the factors of Δ(τ) for subsets of τ are significantly nonzero when the full dependence includes more variables. We use this property to suppress the combinatorial explosion by following the "shadows" of multivariable dependency on smaller subsets. Rather than calculating the marginal entropies of all subsets at each degree level, we need to consider only calculations for subsets of variables with appropriate "shadows." The number of calculations for n variables at a degree level of d therefore grows at a much smaller rate than the binomial coefficient (n, d), but depends on the parameters of the "shadows" calculation. This approach, avoiding a combinatorial explosion, enables the use of our multivariable measures on very large data sets. We demonstrate this method on simulated data sets, and characterize the effects of noise and sample numbers. In addition, we analyze a data set of a few thousand mutant yeast strains interacting with a few thousand chemical compounds.
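The pruning idea can be illustrated with an Apriori-style sketch; the scoring function, threshold, and retention rule below are placeholders and do not reproduce the paper's Δ(τ) statistic or its significance test:

```python
# Sketch: a candidate subset of size d is scored only if all of its
# (d-1)-element "shadows" survived the previous level, which keeps the
# number of scored subsets far below the full binomial(n, d) count.
from itertools import combinations

def shadow_search(variables, score, threshold, max_degree):
    retained = {frozenset([v]) for v in variables}   # degree-1 level
    hits = []
    for d in range(2, max_degree + 1):
        candidates = set()
        for subset in combinations(variables, d):
            s = frozenset(subset)
            if all(s - {v} in retained for v in s):  # every shadow survived
                candidates.add(s)
        retained = {s for s in candidates if abs(score(s)) > threshold}
        hits.extend(retained)
    return hits
```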

Lecture Notes in Computer Science, 2003
Automatically proving properties of tail-recursive function definitions by induction is known to be challenging. The difficulty arises because a property of a tail-recursive function definition is typically expressed by instantiating the accumulator argument to a constant on only one side of the property, so the application of the induction hypothesis gets blocked in a proof attempt. Following an approach developed by Kapur and Subramaniam, a transformation heuristic is proposed which hypothesizes the other side of the property to also have an occurrence of the same constant. Constraints on the transformation are identified which enable a generalization of the constant on both sides, with the hope that the generalized conjecture is easier to prove. Conditions are generated from which intermediate lemmas necessary to make a proof attempt succeed can be speculated. By considering structural properties of recursive definitions, it is possible to identify properties of the functions used in recursive definitions for the conjecture to be valid. The heuristic is demonstrated on well-known tail-recursive definitions on numbers as well as other recursive data structures, including finite lists, finite sequences, and finite trees, where a definition is expressed using one recursive call or multiple recursive calls. If a given conjecture is not valid because of a possible bug in an implementation (a tail-recursive definition) or a specification (a recursive definition), the heuristic can often be used to generate a counter-example. Conditions under which the heuristic is applicable can be checked easily. The proposed heuristic is likely to be helpful for automatically generating loop invariants as well as in proofs of correctness of properties of programs with respect to their specifications.
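A textbook instance of the situation described above (assumed for illustration, not taken from the paper) is list reversal with an accumulator: the natural property fixes the accumulator to a constant on one side only, and the induction goes through once the accumulator is generalized.

```python
# The property  rev_acc(l, []) == rev(l)  instantiates the accumulator to the
# constant [] on one side only, which blocks the induction hypothesis.
# The generalized conjecture  rev_acc(l, acc) == rev(l) + acc  quantifies over
# the accumulator and is provable by straightforward structural induction.

def rev(l):
    return [] if not l else rev(l[1:]) + [l[0]]

def rev_acc(l, acc):
    return acc if not l else rev_acc(l[1:], [l[0]] + acc)

assert rev_acc([1, 2, 3], []) == rev([1, 2, 3])
assert rev_acc([1, 2, 3], [9]) == rev([1, 2, 3]) + [9]   # generalized form
```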

Measures of dependence among variables, and measures of information content and shared information, have become valuable tools of multi-variable data analysis. Information measures, like marginal entropies, mutual and multi-information, have a number of significant advantages over more standard statistical methods, such as their reduced sensitivity to sampling limitations compared with statistical estimates of probability densities. There are also interesting applications of these measures to the theory of complexity and to statistical mechanics. Their mathematical properties and relationships are therefore of interest at several levels. Of the interesting relationships between common information measures, perhaps none are as intriguing and elegant as the duality relationships based on Möbius inversions. These inversions are directly related to the lattices (posets) that describe these sets of variables and their multi-variable measures. In this paper we describe extensions of the duality p...

PLoS ONE, 2014
Phenotypic variation, including that which underlies health and disease in humans, results in part from multiple interactions among both genetic variation and environmental factors. While diseases or phenotypes caused by single gene variants can be identified by established association methods and family-based approaches, complex phenotypic traits resulting from multi-gene interactions remain very difficult to characterize. Here we describe a new method based on information theory, and demonstrate how it improves on previous approaches to identifying genetic interactions, including both synthetic and modifier kinds of interactions. We apply our measure, called interaction distance, to previously analyzed data sets of yeast sporulation efficiency, lipid-related mouse data, and several human disease models to characterize the method. We show how the interaction distance can reveal novel gene interaction candidates in experimental and simulated data sets, and outperforms other measures in several circumstances. The method also allows us to optimize case/control sample composition for clinical studies.
Journal of Computer Science and Technology, 2010
This paper addresses parameter drift in stochastic models. We define a notion of context that represents invariant, stable-over-time behavior and we then propose an algorithm for detecting context changes in processing a stream of data. A context change is seen as model failure, when a probabilistic model representing current behavior is no longer able to "fit" newly encountered data. We specify our stochastic models using a first-order logic-based probabilistic modeling language called Generalized Loopy Logic (GLL). An important component of GLL is its learning mechanism that can identify context drift. We demonstrate how our algorithm can be incorporated into a failure-driven context-switching probabilistic modeling framework and offer several examples of its application.
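A failure-driven detector of this kind can be caricatured in a few lines; the windowed log-likelihood test below is an illustrative assumption and does not reproduce the GLL learning mechanism:

```python
# Sketch: declare a context change when the average log-likelihood of recent
# observations under the current model falls below a threshold. The model,
# window size, and threshold are placeholders.
from collections import deque

def monitor_stream(stream, model_loglik, window=50, threshold=-5.0):
    """Yield indices at which the current model stops fitting the data."""
    recent = deque(maxlen=window)
    for i, observation in enumerate(stream):
        recent.append(model_loglik(observation))
        if len(recent) == window and sum(recent) / window < threshold:
            yield i            # possible context change / model failure
            recent.clear()     # restart monitoring after switching models
```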
Journal of Computational Biology, 2010
Complex, non-additive genetic interactions are common and can be critical in determining phenotypes.

Probabilistic Logic Methods and Some Applications to Biology and Medicine
Journal of Computational Biology, 2012
For the computational analysis of biological problems (analyzing data, inferring networks and complex models, and estimating model parameters), it is common to use a range of methods based on probabilistic logic constructions, sometimes collectively called machine learning methods. Probabilistic modeling methods such as Bayesian Networks (BN) fall into this class, as do Hierarchical Bayesian Networks (HBN), Probabilistic Boolean Networks (PBN), Hidden Markov Models (HMM), and Markov Logic Networks (MLN). In this review, we describe the most general of these (MLN), and show how the above-mentioned methods are related to MLN and one another by the imposition of constraints and restrictions. This approach allows us to illustrate a broad landscape of constructions and methods, and describe some of the attendant strengths, weaknesses, and constraints of many of these methods. We then provide some examples of their applications to problems in biology and medicine, with an emphasis on genetics. The key concepts needed to picture this landscape of methods are the ideas of probabilistic graphical models, the structures of the graphs, and the scope of the logical language repertoire used (from First-Order Logic [FOL] to Boolean logic). These concepts are interlinked and together define the nature of each of the probabilistic logic methods. Finally, we discuss the initial applications of MLN to genetics, show the relationship to less general methods like BN, and then mention several examples where such methods could be effective in new applications to specific biological and medical problems.
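For reference, the standard MLN joint distribution around which such comparisons revolve (a textbook definition, not quoted from the review) is:

```latex
% w_i is the weight of formula F_i; n_i(x) counts its true groundings in world x.
P(X = x) \;=\; \frac{1}{Z}\exp\!\Big(\sum_i w_i\, n_i(x)\Big),
\qquad
Z \;=\; \sum_{x'} \exp\!\Big(\sum_i w_i\, n_i(x')\Big).
```

Restricting the logical language and the graph structure of this construction is what recovers the less general models (BN, HMM, PBN) discussed in the review.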

Journal of Computational Biology, 2014
Context dependence is central to the description of complexity. Keying on the pairwise definition of "set complexity," we use an information theory approach to formulate general measures of systems complexity. We examine the properties of multivariable dependency starting with the concept of interaction information. We then present a new measure for unbiased detection of multivariable dependency, "differential interaction information." This quantity for two variables reduces to the pairwise "set complexity" previously proposed as a context-dependent measure of information in biological systems. We generalize it here to an arbitrary number of variables. Critical limiting properties of the "differential interaction information" are key to the generalization. This measure extends previous ideas about biological information and provides a more sophisticated basis for the study of complexity. The properties of "differential interaction information" also suggest new approaches to data analysis. Given a data set of system measurements, differential interaction information can provide a measure of collective dependence, which can be represented in hypergraphs describing complex system interaction patterns. We investigate this kind of analysis using simulated data sets. The conjoining of a generalized set complexity measure, multivariable dependency analysis, and hypergraphs is our central result. While our focus is on complex biological systems, our results are applicable to any complex system.
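For orientation, the standard pairwise and three-variable quantities that the abstract starts from are shown below (textbook definitions; sign conventions for interaction information vary in the literature, and the paper's differential interaction information Δ(τ) generalizes these rather than being reproduced here):

```latex
\begin{align}
I(X;Y)   &= H(X) + H(Y) - H(X,Y),\\
I(X;Y;Z) &= I(X;Y) - I(X;Y\mid Z)\\
         &= H(X)+H(Y)+H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z).
\end{align}
```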

Genetics, 2014
Background: The throughput of next-generation sequencing machines has increased dramatically over the last few years; yet the cost and time for library preparation have not changed proportionally, thus representing the main bottleneck for sequencing large numbers of samples. Here we present an economical, high-throughput library preparation method for the Illumina platform, comprising a 96-well-based method for DNA isolation from yeast cells, a low-cost DNA shearing alternative, and adapter ligation using heat inactivation of enzymes instead of bead cleanups. Results: Up to 384 whole-genome libraries can be prepared from yeast cells in one week using this method, for less than 15 euros per sample. We demonstrate the robustness of this protocol by sequencing over 1000 yeast genomes at ~30x coverage. The sequence information from 768 yeast segregants derived from two divergent S. cerevisiae strains was used to generate a meiotic recombination map at unprecedented resolution. Comparisons to other datasets indicate a high conservation of recombination at a chromosome-wide scale, but differences at the local scale. Additionally, we detected a high degree of aneuploidy (3.6%) by examining the sequencing coverage in these segregants. Differences in allele frequency allowed us to attribute instances of aneuploidy to gains of chromosomes during meiosis or mitosis, both of which showed a strong tendency to missegregate specific chromosomes. Conclusions: Here we present a high-throughput workflow to sequence the genomes of large numbers of yeast strains at a low price. We have used this workflow to obtain recombination and aneuploidy data from hundreds of segregants, which can serve as a foundation for future studies of linkage, recombination, and chromosomal aberrations in yeast and higher eukaryotes.
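As a toy illustration of coverage-based aneuploidy calling (the thresholds, chromosome names, and coverage values below are assumptions, not the paper's pipeline):

```python
# Sketch: flag putative chromosome gains/losses from per-chromosome sequencing
# coverage normalized to the genome-wide median.
import statistics

def flag_aneuploidies(per_chrom_coverage, gain=1.5, loss=0.5):
    median = statistics.median(per_chrom_coverage.values())
    calls = {}
    for chrom, cov in per_chrom_coverage.items():
        ratio = cov / median
        if ratio >= gain:
            calls[chrom] = "gain"
        elif ratio <= loss:
            calls[chrom] = "loss"
    return calls

print(flag_aneuploidies({"chrI": 31.0, "chrII": 30.0, "chrIII": 60.0}))  # {'chrIII': 'gain'}
```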