Dmitriy Fradkin

Papers by Dmitriy Fradkin

Experiments with random projections for machine learning

Dimensionality reduction via Random Projections has attracted considerable attention in recent years. The approach has interesting theoretical underpinnings and offers computational advantages. In this paper we report a number of experiments to evaluate Random Projections in the context of inductive supervised learning. In particular, we compare Random Projections and PCA on a number of different datasets and using different machine learning methods. While we find that the random projection approach predictively underperforms PCA, its computational advantages may make it attractive for certain applications.
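
As a rough illustration of the comparison this abstract describes, the sketch below projects the same data with a Gaussian random matrix and with PCA. It is a minimal sketch under stated assumptions, not the paper's experimental setup: the function names, the choice of a Gaussian projection matrix, and the synthetic data are all hypothetical.

```python
import numpy as np


def random_projection(X, k, seed=0):
    """Map the rows of X (n x d) to k dimensions with a Gaussian random matrix.

    Entries are drawn from N(0, 1) and scaled by 1/sqrt(k), so squared
    distances are preserved in expectation. No fitting to the data is needed.
    """
    rng = np.random.default_rng(seed)
    R = rng.normal(0.0, 1.0, size=(X.shape[1], k)) / np.sqrt(k)
    return X @ R


def pca_projection(X, k):
    """Map the rows of X to the top-k principal components (via SVD)."""
    Xc = X - X.mean(axis=0)                       # center the data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                          # coordinates along top-k directions


# Hypothetical synthetic data: 100 points in 1000 dimensions.
X = np.random.default_rng(1).normal(size=(100, 1000))
Z_rp = random_projection(X, k=200)
Z_pca = pca_projection(X, k=20)
print(Z_rp.shape, Z_pca.shape)
```

The random projection needs only a matrix multiplication, whereas PCA requires a decomposition of the data matrix; that difference is one source of the computational advantage the abstract mentions.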

Methods for learning classifier combinations

This work compares two approaches to finding effective topic-independent classifier combinations. We suggest a new federated approach and compare it against the global approach. Our results indicate that the relative effectiveness of these approaches depends on the measure used to evaluate them. We suggest explanations for these results.

An efficient pattern mining approach for event detection in multivariate temporal data

Knowledge and Information Systems, Jan 21, 2015

This work proposes a pattern mining approach to learn event detection models from complex multivariate temporal data, such as electronic health records. We present Recent Temporal Pattern mining, a novel approach for efficiently finding predictive patterns for event detection problems. This approach first converts the time series data into time-interval sequences of temporal abstractions. It then constructs more complex time-interval patterns backward in time using temporal operators. We also present the Minimal Predictive Recent Temporal Patterns framework for selecting a small set of predictive and non-spurious patterns. We apply our methods for predicting adverse medical events in real-world clinical data. The results demonstrate the benefits of our methods in learning accurate event detection models, which is a key step for developing intelligent patient monitoring and decision support systems.
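
The first step named in this abstract, converting a time series into a time-interval sequence of temporal abstractions, can be sketched as follows. This is only an illustration under assumptions: the three-level value abstraction (low, normal, high), the thresholds, and the function names are hypothetical and are not taken from the paper.

```python
from itertools import groupby


def value_abstraction(value, low, high):
    """Label a numeric reading as Low, Normal, or High (hypothetical scheme)."""
    if value < low:
        return "L"
    if value > high:
        return "H"
    return "N"


def to_state_intervals(times, values, low, high):
    """Merge consecutive readings with the same label into (state, start, end)."""
    labeled = [(t, value_abstraction(v, low, high)) for t, v in zip(times, values)]
    intervals = []
    for state, group in groupby(labeled, key=lambda pair: pair[1]):
        points = list(group)
        intervals.append((state, points[0][0], points[-1][0]))
    return intervals


# Hypothetical lab readings taken at times 0..5.
intervals = to_state_intervals(
    times=[0, 1, 2, 3, 4, 5],
    values=[5, 6, 12, 13, 7, 2],
    low=4,
    high=10,
)
print(intervals)
```

Each resulting tuple is one state interval; patterns over such intervals are what the temporal operators mentioned in the abstract would then combine.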

Unleashing the Power of Industrial Big Data through Scalable Manual Labeling

OSTI OAI (U.S. Department of Energy Office of Scientific and Technical Information), Dec 14, 2021

Unsupervised Power System Event Detection and Classification Using Unlabeled PMU Data

2021 IEEE PES Innovative Smart Grid Technologies Europe (ISGT Europe), Oct 18, 2021

This paper proposes a novel data-driven power system event detection and classification method based on 5 TB of actual PMU measurements collected from the US western interconnect. Firstly, a set of comprehensive power quality rules is proposed to pre-filter the raw data and extract the regions of interest (ROI). Six distinct event categories are defined, and corresponding patterns are chosen as references. Meanwhile, detailed characteristics of patterns are summarized to enhance our understanding of the actual events. Then, the time-independent feature vectors are generated by extracting the statistical, temporal, and spectral features from the raw time-series data. Furthermore, an ensemble model is proposed to cluster the events by combining multiple K-means clustering models using a voting strategy. Besides, both system-level and PMU-level clustering models are developed. The accuracy and robustness of the event detection method are further improved through interactive evaluation of the two-level clustering results. This paper summarizes the actual characteristics of each event category and provides a reliable basis for accurate label generation. The experiments demonstrate the effectiveness of the proposed event detection and classification method.

Cybersecurity via Inverter Grid Automatic Reconfiguration (CIGAR) Year 3 Report

Classifying Spend Descriptions with Off-the-Shelf Learning Components

Analyzing spend transactions is essential to organizations for understanding their global procurement. Central to this analysis is the automated classification of these transactions to hierarchical commodity coding systems. Spend classification is challenging due not only to the complexities of the commodity coding systems but also to the sparseness and quality of each individual transaction text description and the volume of such transactions in an organization. In this paper, we demonstrate the application of off-the-shelf machine learning tools to address the challenges in spend classification. We have built a system using off-the-shelf SVM, Logistic Regression, and language processing toolkits and describe the effectiveness of these different learning techniques for spend classification.

Mining compressing sequential problems

Compression-based pattern mining has been successfully applied to many data mining tasks. We propose an approach based on the minimum description length principle to extract sequential patterns that compress a database of sequences well. We show that mining compressing patterns is NP-Hard and belongs to the class of inapproximable problems. We propose two heuristic algorithms for mining compressing patterns. The first uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues with the required candidate generation we propose GoKrimp, an effective greedy algorithm that directly mines compressing patterns. We conduct an empirical study on six real-life datasets to compare the proposed algorithms by run time, compressibility, and classification accuracy using the patterns found as features for SVM classifiers.

Clusters with core-tail hierarchical structure and their applications to machine learning classification

We present a method for analysis of clustering results. This method represents every cluster as a stratified hierarchy of its subsets of objects (strata) ordered along a scale of their internal similarities. The "layered structures" can be described as a tool for interpretation of individual clusters rather than for describing the model of the entire data. It can be used not only for comparisons of different clusters, but also for improving existing methods to get "good" clusters. We show that this approach can also be used for improving supervised machine learning methods, particularly "active machine learning" methods, by specific analysis and pre-processing of the training data.

Deep Reinforcement Learning for DER Cyber-Attack Mitigation

The increasing penetration of DER with smart inverter functionality is set to transform the electrical distribution network from a passive system, with fixed injection/consumption, to an active network with hundreds of distributed controllers dynamically modulating their operating setpoints as a function of system conditions. This transition is being achieved through standardization of functionality through grid codes and/or international standards. DER, however, are unique in that they are typically neither owned nor operated by distribution utilities and, therefore, represent a new emerging attack vector for cyber-physical attacks. Within this work we consider deep reinforcement learning as a tool to learn the optimal parameters for the control logic of a set of uncompromised DER units to actively mitigate the effects of a cyber-attack on a subset of network DER.

Mining Compressing Sequential Patterns

Compression-based pattern mining has been successfully applied to many data mining tasks. We propose an approach based on the minimum description length principle to extract sequential patterns that compress a database of sequences well. We show that mining compressing patterns is NP-Hard and belongs to the class of inapproximable problems. We propose two heuristic algorithms for mining compressing patterns. The first uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues with the required candidate generation we propose GoKrimp, an effective greedy algorithm that directly mines compressing patterns. We conduct an empirical study on six real-life datasets to compare the proposed algorithms by run time, compressibility, and classification accuracy using the patterns found as features for SVM classifiers.

Robust mining of time intervals with semi-interval partial order patterns

We present a new approach to mining patterns from symbolic interval data that extends previous approaches by allowing semi-intervals and partially ordered patterns. The mining algorithm combines and adapts efficient algorithms from sequential pattern and itemset mining for discovery of the new semi-interval patterns. The semi-interval patterns and semi-interval partial order patterns are more flexible than patterns over full intervals, and are empirically demonstrated to be more useful as features in classification settings. We performed an extensive empirical evaluation on seven real-life interval databases totalling over 146k intervals from more than 400 classes, demonstrating the flexibility and usefulness of the patterns.

Image compression in real-time multiprocessor systems using divisive K-means clustering

In recent years, clustering has become one of the fundamental methods of large dataset analysis. In particular, clustering is an important component of real-time image compression and exploitation algorithms, such as vector quantization, segmentation of SAR, EO/IR, and hyperspectral imagery, group tracking, and behavior pattern analysis. Thus, development of fast scalable real-time clustering algorithms is important to enable exploitation of imagery coming from surveillance and reconnaissance airborne platforms. Clustering methods are widely used in pattern recognition, data compression, and data mining, but the problem of using them in real-time systems has not been a focus of most algorithm designers. In this paper, we describe a practical clustering procedure that is designed specifically for compression of 2-D images and can satisfy stringent requirements of real-time onboard processing.

Single pass text classification by direct feature weighting

Knowledge and Information Systems, Jun 25, 2010

Mining recent temporal patterns for event detection in multivariate time series data

Improving the performance of classifiers using pattern mining techniques has been an active topic of data mining research. In this work we introduce the recent temporal pattern mining framework for finding predictive patterns for monitoring and event detection problems in complex multivariate time series data. This framework first converts time series into time-interval sequences of temporal abstractions. It then constructs more complex temporal patterns backwards in time using temporal operators. We apply our framework to health care data of 13,558 diabetic patients and show its benefits by efficiently finding useful patterns for detecting and diagnosing adverse medical conditions that are associated with diabetes.

Validation of epidemiological models: Chicken epidemiology in the UK

American Mathematical Society eBooks, Jun 7, 2007

In epidemiology, a standard way of constructing models is to conduct univariate analysis of independent variables, followed by fitting a multivariate logistic model to the selected features. The main measure for choosing a particular model is a goodness of fit criterion on a given dataset. While this measure indicates how well the model fits the data, it has little relation to the predictive accuracy of the model and therefore may not generalize beyond the given dataset. This aspect is not frequently considered in epidemiology. We suggest using modern machine learning methods for constructing and validating epidemiological models. The resulting models can be used to confirm epidemiologists' models or to suggest possible improvements. These approaches also provide estimates and confidence measures for the parameters and the predictive ability of the model.

We propose a streaming algorithm, based on the minimum description length (MDL) principle, for extracting non-redundant sequential patterns. For static databases, the MDL-based approach, which selects patterns based on their capacity to compress data rather than their frequency, was shown to be remarkably effective for extracting meaningful patterns and solving the redundancy issue in frequent itemset and sequence mining. The existing MDL-based algorithms, however, either start from a seed set of frequent patterns or require multiple passes through the data. As such, the existing approaches scale poorly and are unsuitable for large datasets. Therefore, our main contribution is the proposal of a new streaming algorithm, called Zips, that does not require a seed set of patterns and requires only one scan over the data. For Zips, we extended the Lempel-Ziv (LZ) compression algorithm in three ways. First, whereas LZ assigns codes uniformly as it builds up its dictionary while scanning the input, Zips assigns codewords according to the usage of the dictionary words; more heavily used words get shorter code lengths. Second, Zips also exploits non-consecutive occurrences of dictionary words for compression. Third, the well-known space-saving algorithm is used to evict unpromising words from the dictionary. Experiments on one synthetic and two real-world large-scale datasets show that our approach extracts meaningful compressing patterns with similar quality to the state-of-the-art multi-pass algorithms proposed for static databases of sequences. Moreover, our approach scales linearly with the size of data streams while all the existing algorithms do not.
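
Two of the ingredients named in this abstract can be illustrated in a few lines. The sketch below is not the Zips algorithm: it shows only the general idea of usage-based code lengths (using ideal Shannon lengths as an assumed stand-in for the paper's encoding) and the generic space-saving counter scheme. The function names and example data are hypothetical.

```python
import math


def usage_code_lengths(usage):
    """Ideal (Shannon) code length in bits for each dictionary word.

    A word used a fraction p of the time gets length -log2(p), so more
    heavily used words get shorter codes.
    """
    total = sum(usage.values())
    return {w: -math.log2(c / total) for w, c in usage.items() if c > 0}


def space_saving(stream, k):
    """Generic space-saving counter: track at most k items in one pass.

    When the table is full, the item with the smallest count is evicted and
    the newcomer inherits that count plus one.
    """
    counts = {}
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


# Hypothetical usage counts for four dictionary words.
lengths = usage_code_lengths({"ab": 4, "c": 2, "d": 1, "e": 1})
print(lengths)
print(space_saving("aababcad", k=2))
```

Because the counter table never holds more than k entries, memory stays bounded no matter how long the stream is, which is the property that makes this kind of eviction attractive in a single-pass setting.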

LogVis: Graph-Assisted Visual Analysis of Event Logs from Industrial Equipment

Visual reasoning on a graph is often a challenging task mainly due to the vast number of nodes and edges displayed. It becomes particularly challenging on log graph data, where thousands of events may be logged within minutes. In this study, we focus on three common log analysis tasks, namely Event Overview, Root-Cause Analysis and Pattern Analysis, and propose visualization approaches to overcome challenges particularly associated with these tasks. The proposed approaches are demonstrated on sample use cases on industrial equipment logs.

Clustering Inside Classes Improves Performance of Linear Classifiers
