Skip to main content

Péter Lehotay-Kéry

Eötvös Loránd University, Department of Information Systems, Graduate Student

Followers

3

Following

2

Co-author

1

Public Views

Andrea Farruggia

University of Pisa

National University of Singapore

Andrea Baronchelli

City, University of London

International Journals for Researchers [ER Publication, WOAR Journals, IJEAS and IJEART]

Ghent University

Massachusetts Institute of Technology (MIT)

Uploads

Papers by Péter Lehotay-Kéry

GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species

Intelligent Information and Database Systems, 2019

There can be a data boom in the near future, due to cheaper methods make possible for everyone to... more There can be a data boom in the near future, due to cheaper methods make possible for everyone to keep their own DNA on their own device or on a central medical cloud. With the development of sequencing methods, we are able to get the sequences of more and more species. However the size of the human genome is about 3 GB for each person. And for other species it can be more. The need is growing for the efficient compression of these data and general compressors can not reach a satisfying result. These are not aware of the special structure of these data. There are already some algorithms tried to reach smaller and smaller rates. In this paper, we would like to present our new method to accomplish this task.

Membrane Clustering of Coronavirus Variants Using Document Similarity

Genes

Currently, as an effect of the COVID-19 pandemic, bioinformatics, genomics, and biological comput... more Currently, as an effect of the COVID-19 pandemic, bioinformatics, genomics, and biological computations are gaining increased attention. Genomes of viruses can be represented by character strings based on their nucleobases. Document similarity metrics can be applied to these strings to measure their similarities. Clustering algorithms can be applied to the results of their document similarities to cluster them. P systems or membrane systems are computation models inspired by the flow of information in the membrane cells. These can be used for various purposes, one of them being data clustering. This paper studies a novel and versatile clustering method for genomes and the utilization of such membrane clustering models using document similarity metrics, which is not yet a well-studied use of membrane clustering models.

Experimental Study of Some Properties of Knowledge Distillation

For more complex classification problems it is inevitable that we use increasingly complex and cu... more For more complex classification problems it is inevitable that we use increasingly complex and cumbersome classifying models. However, often we do not have the space or processing power to deploy these models.Knowledge distillation is an effective way to improve the accuracy of an otherwise smaller, simpler model using a more complex teacher network or ensemble of networks. This way we can have a classifier with an accuracy that is comparable to the accuracy of the teacher while small enough to deploy.In this paper we evaluate certain features of this distilling method, while trying to improve its results. These experiments and examinations and the discovered properties may also help to further develop this operation.

A Novel Dictionary-Based Method to Compress Log Files with Different Message Frequency Distributions

Applied Sciences, 2022

In the present day, virtually every application software generates large amounts of log entries d... more In the present day, virtually every application software generates large amounts of log entries during its work. The log files that are made from these entries are a collection of information about what happened while the program was running. This report can be used for multiple purposes such as performance monitoring, maintaining security, or improving business decision making. Log entries are usually generated in a disorganized manner. Using template miners, the different ‘event types’ can be distinguished (each log entry is an event), and the set of all entries is split into disjointed subsets according to the event types. These events consist of two parts. The first is the constant part, which is the same for all occurrences of the same event type. The second is the parameter part, which can be different for each occurrence. Since software mass-produces log files, in our previous paper, we introduced an algorithm that uses the templates mined from the data to create a dictionary...

Performance impact of network encryption on log processing with Spark (13th Joint Conference on Mathematics and Computer Science (the 13th MaCS), on October 1-3, 2020)

by Péter Lehotay-Kéry and Attila Boros

A Comparative Evaluation of Big Data Frameworks for Log Processing

International Conference on Applied Informatics, 2020

Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are ... more Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are complex monitored systems of different institutions where computations require powerful distributed environments to run. Our work aims the specific area of log data obtained from telecommunication operator systems with the goal to identify non-trivially detectable problems, like frequency of node restarts on a given time period or the reason of these events. In order to substitute significant new information from these system logs, it is important to use proper frameworks for analyzing them. This being a comprehensive problem, various frameworks have been proposed. In this paper we evaluate and compare Apache Spark and Elasticsearch (with Logstash) as two prominent frameworks for processing log data. Through our work we perform experiments on different problem solutions with different complexity in order to measure how non-functional features, like processing time and resource consumption vary between them. Additionally, our experimental data shows that how choosing between different frameworks can influence the performance of these computations.

A Novel Methodology for Measuring the Abstraction Capabilities of Image Recognition Algorithms

Journal of imaging, 2021

Creating a widely excepted model on the measure of intelligence became inevitable due to the exis... more Creating a widely excepted model on the measure of intelligence became inevitable due to the existence of an abundance of different intelligent systems. Measuring intelligence would provide feedback for the developers and ultimately lead us to create better artificial systems. In the present paper, we show a solution where learning as a process is examined, aiming to detect pre-written solutions and separate them from the knowledge acquired by the system. In our approach, we examine image recognition software by executing different transformations on objects and detect if the software was resilient to it. A system with the required intelligence is supposed to become resilient to the transformation after experiencing it several times. The method is successfully tested on a simple neural network, which is not able to learn most of the transformations examined. The method can be applied to any image recognition software to test its abstraction capabilities.

GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species

There can be a data boom in the near future, due to cheaper methods make possible for everyone to... more There can be a data boom in the near future, due to cheaper methods make possible for everyone to keep their own DNA on their own device or on a central medical cloud. With the development of sequencing methods, we are able to get the sequences of more and more species. However the size of the human genome is about 3 GB for each person. And for other species it can be more.

A Comparative Evaluation of Big Data Frameworks for Log Processing

by Péter Lehotay-Kéry and Attila Boros

Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are ... more Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are complex monitored systems of different institutions where computations require powerful distributed environments to run. Our work aims the specific area of log data obtained from telecommunication operator systems with the goal to identify non-trivially detectable problems, like frequency of node restarts on a given time period or the reason of these events. In order to substitute significant new information from these system logs, it is important to use proper frameworks for analyzing them. This being a comprehensive problem, various frameworks have been proposed. In this paper we evaluate and compare Apache Spark and Elasticsearch (with Logstash) as two prominent frameworks for processing log data. Through our work we perform experiments on different problem solutions with different complexity in order to measure how non-functional features, like processing time and resource consumptio...

Human Genome Data Protection Using PostgreSQL DBMS

There can be a data boom in the near future, due to cheaper methods make possible for everyone to... more There can be a data boom in the near future, due to cheaper methods make possible for everyone to keep their own DNA on their own device or on a central medical cloud. These are sensitive data. There are a lot of cases, when genomes are contained in text files. The size of these can even be 3 GB on every user. Secured data management is not solved in these files.

The Use of Template Miners and Encryption in Log Message Compression

Computers, 2021

Presently, almost every computer software produces many log messages based on events and activiti... more Presently, almost every computer software produces many log messages based on events and activities during the usage of the software. These files contain valuable runtime information that can be used in a variety of applications such as anomaly detection, error prediction, template mining, and so on. Usually, the generated log messages are raw, which means they have an unstructured format. This indicates that these messages have to be parsed before data mining models can be applied. After parsing, template miners can be applied on the data to retrieve the events occurring in the log file. These events are made from two parts, the template, which is the fixed part and is the same for all instances of the same event type, and the parameter part, which varies for all the instances. To decrease the size of the log messages, we use the mined templates to build a dictionary for the events, and only store the dictionary, the event ID, and the parameter list. We use six template miners to a...

Genome Classification Using Overlap Graph Centralities

Genetics is a fast developing field and lot of its development relies on bioinformatics and solvi... more Genetics is a fast developing field and lot of its development relies on bioinformatics and solving computing problems. The genetic data are huge, for example the human reference genome is about 3 GB and for other species they can be even greater. It is not a trivial task to process them efficiently, recovering useful data for biological and medical sciences. Researchers have already developed different models and representations of genomes to provide deeper knowledge and explore hidden context in these data. Recent years a lot of publications have been made about how to represent genomes in graphs and examining the graph features of genomes like graph centrality. The aim of this paper is comparing and examining the graph centrality of viral genomes that could help in the study of these data. We use a number of concepts of genetics and bioinformatics, mostly in meaningful context. Their exact individual definition would place too much burden on the article; the interested readers ma...

Membrane Clustering Using the PostgreSQL Database Management System

Computational models have been in the center of attention for research over the years, constantly... more Computational models have been in the center of attention for research over the years, constantly evolving to aid countless distinct fields of computer science. The application of P systems (or membrane systems) is a relatively new way to define different calculations. These systems were inspired by the biological processes and communication that take place in living organisms. In recent years—among with other similar models—it has been brought into the focus of many researchers, resulting in numerous ways of utilizing the possibilities provided by this framework. A well-established such use case is solving clustering problems. With the help of evolutionary optimization techniques, the achieved outcome can be desirable over competing solutions. Our goal is to bring these procedures closer to their focal point, the data itself, by implementing a similar algorithm in the PostgreSQL database management system, whilst exploiting its capabilities, and its allowance of accessing the data ...

Process, analyze and visualize telecommunication network configuration data in graph database

Vietnam Journal of Computer Science

In network telemetry systems, nodes produce vast number of configuration files based on how they ... more In network telemetry systems, nodes produce vast number of configuration files based on how they are configured. Steps were taken to process these files into databases to help the work of the developers, testers and customer support to focus on the development and testing and to be able to give advice to the customers about how to configure the nodes. However, the processing of these data in relational database manager system is slow, hard to query and the storage takes huge disk space. In this paper, we are presenting a way to store the data produced by these nodes in graph database, changing from relational database to NoSQL environment. With our approach, one can easily represent and visualize the network of machines. In the end, we are going to compare the inserting, querying time and storage size in different database manager systems. The results could also be used for other types of configuration data too from other kinds of machines to show the connection between them and que...

Building, Visualizing and Executing Deep Learning Models as Dataflow Graphs

Acta Electrotechnica et Informatica

Document similarity for error prediction

Journal of Information and Telecommunication

In today's rushing world, there's an ever-increasing usage of networking equipment. These devices... more In today's rushing world, there's an ever-increasing usage of networking equipment. These devices log their operations; however, there could be errors that result in the restart of the given device. There could be different patterns before different errors. Our main goal is to predict the upcoming error based on the log lines of the actual file. To achieve this, we use document similarity. One of the key concepts of information retrieval is document similarity which is an indicator of how analogous (or different) documents are. In this paper, we are studying the effectiveness of prediction based on cosine similarity, Jaccard similarity, and Euclidean distance of rows before restarts. We use different features like TFIDF, Doc2Vec, LSH, and others in conjunction with these distance measures. Since networking devices produce lots of log files, we use Spark for Big data computing.

P System–Based Clustering Methods Using NoSQL Databases

Computation

Models of computation are fundamental notions in computer science; consequently, they have been t... more Models of computation are fundamental notions in computer science; consequently, they have been the subject of countless research papers, with numerous novel models proposed even in recent years. Amongst a multitude of different approaches, many of these methods draw inspiration from the biological processes observed in nature. P systems, or membrane systems, make an analogy between the communication in computing and the flow of information that can be perceived in living organisms. These systems serve as a basis for various concepts, ranging from the fields of computational economics and robotics to the techniques of data clustering. In this paper, such utilization of these systems—membrane system–based clustering—is taken into focus. Considering the growing number of data stored worldwide, more and more data have to be handled by clustering algorithms too. To solve this issue, bringing these methods closer to the data, their main element provides several benefits. Database systems...

GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species

Intelligent Information and Database Systems, 2019

There can be a data boom in the near future, due to cheaper methods make possible for everyone to... more There can be a data boom in the near future, due to cheaper methods make possible for everyone to keep their own DNA on their own device or on a central medical cloud. With the development of sequencing methods, we are able to get the sequences of more and more species. However the size of the human genome is about 3 GB for each person. And for other species it can be more. The need is growing for the efficient compression of these data and general compressors can not reach a satisfying result. These are not aware of the special structure of these data. There are already some algorithms tried to reach smaller and smaller rates. In this paper, we would like to present our new method to accomplish this task.

Membrane Clustering of Coronavirus Variants Using Document Similarity

Genes

Currently, as an effect of the COVID-19 pandemic, bioinformatics, genomics, and biological comput... more Currently, as an effect of the COVID-19 pandemic, bioinformatics, genomics, and biological computations are gaining increased attention. Genomes of viruses can be represented by character strings based on their nucleobases. Document similarity metrics can be applied to these strings to measure their similarities. Clustering algorithms can be applied to the results of their document similarities to cluster them. P systems or membrane systems are computation models inspired by the flow of information in the membrane cells. These can be used for various purposes, one of them being data clustering. This paper studies a novel and versatile clustering method for genomes and the utilization of such membrane clustering models using document similarity metrics, which is not yet a well-studied use of membrane clustering models.

Experimental Study of Some Properties of Knowledge Distillation

For more complex classification problems it is inevitable that we use increasingly complex and cu... more For more complex classification problems it is inevitable that we use increasingly complex and cumbersome classifying models. However, often we do not have the space or processing power to deploy these models.Knowledge distillation is an effective way to improve the accuracy of an otherwise smaller, simpler model using a more complex teacher network or ensemble of networks. This way we can have a classifier with an accuracy that is comparable to the accuracy of the teacher while small enough to deploy.In this paper we evaluate certain features of this distilling method, while trying to improve its results. These experiments and examinations and the discovered properties may also help to further develop this operation.

A Novel Dictionary-Based Method to Compress Log Files with Different Message Frequency Distributions

Applied Sciences, 2022

In the present day, virtually every application software generates large amounts of log entries d... more In the present day, virtually every application software generates large amounts of log entries during its work. The log files that are made from these entries are a collection of information about what happened while the program was running. This report can be used for multiple purposes such as performance monitoring, maintaining security, or improving business decision making. Log entries are usually generated in a disorganized manner. Using template miners, the different ‘event types’ can be distinguished (each log entry is an event), and the set of all entries is split into disjointed subsets according to the event types. These events consist of two parts. The first is the constant part, which is the same for all occurrences of the same event type. The second is the parameter part, which can be different for each occurrence. Since software mass-produces log files, in our previous paper, we introduced an algorithm that uses the templates mined from the data to create a dictionary...

Performance impact of network encryption on log processing with Spark (13th Joint Conference on Mathematics and Computer Science (the 13th MaCS), on October 1-3, 2020)

by Péter Lehotay-Kéry and Attila Boros

A Comparative Evaluation of Big Data Frameworks for Log Processing

International Conference on Applied Informatics, 2020

Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are ... more Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are complex monitored systems of different institutions where computations require powerful distributed environments to run. Our work aims the specific area of log data obtained from telecommunication operator systems with the goal to identify non-trivially detectable problems, like frequency of node restarts on a given time period or the reason of these events. In order to substitute significant new information from these system logs, it is important to use proper frameworks for analyzing them. This being a comprehensive problem, various frameworks have been proposed. In this paper we evaluate and compare Apache Spark and Elasticsearch (with Logstash) as two prominent frameworks for processing log data. Through our work we perform experiments on different problem solutions with different complexity in order to measure how non-functional features, like processing time and resource consumption vary between them. Additionally, our experimental data shows that how choosing between different frameworks can influence the performance of these computations.

A Novel Methodology for Measuring the Abstraction Capabilities of Image Recognition Algorithms

Journal of imaging, 2021

Creating a widely excepted model on the measure of intelligence became inevitable due to the exis... more Creating a widely excepted model on the measure of intelligence became inevitable due to the existence of an abundance of different intelligent systems. Measuring intelligence would provide feedback for the developers and ultimately lead us to create better artificial systems. In the present paper, we show a solution where learning as a process is examined, aiming to detect pre-written solutions and separate them from the knowledge acquired by the system. In our approach, we examine image recognition software by executing different transformations on objects and detect if the software was resilient to it. A system with the required intelligence is supposed to become resilient to the transformation after experiencing it several times. The method is successfully tested on a simple neural network, which is not able to learn most of the transformations examined. The method can be applied to any image recognition software to test its abstraction capabilities.

GenPress: A Novel Dictionary Based Method to Compress DNA Data of Various Species

There can be a data boom in the near future, due to cheaper methods make possible for everyone to... more There can be a data boom in the near future, due to cheaper methods make possible for everyone to keep their own DNA on their own device or on a central medical cloud. With the development of sequencing methods, we are able to get the sequences of more and more species. However the size of the human genome is about 3 GB for each person. And for other species it can be more.

A Comparative Evaluation of Big Data Frameworks for Log Processing

by Péter Lehotay-Kéry and Attila Boros

Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are ... more Nowadays a huge part of collected data comes from the behaviour of logging systems. Examples are complex monitored systems of different institutions where computations require powerful distributed environments to run. Our work aims the specific area of log data obtained from telecommunication operator systems with the goal to identify non-trivially detectable problems, like frequency of node restarts on a given time period or the reason of these events. In order to substitute significant new information from these system logs, it is important to use proper frameworks for analyzing them. This being a comprehensive problem, various frameworks have been proposed. In this paper we evaluate and compare Apache Spark and Elasticsearch (with Logstash) as two prominent frameworks for processing log data. Through our work we perform experiments on different problem solutions with different complexity in order to measure how non-functional features, like processing time and resource consumptio...

Human Genome Data Protection Using PostgreSQL DBMS

There can be a data boom in the near future, due to cheaper methods make possible for everyone to... more There can be a data boom in the near future, due to cheaper methods make possible for everyone to keep their own DNA on their own device or on a central medical cloud. These are sensitive data. There are a lot of cases, when genomes are contained in text files. The size of these can even be 3 GB on every user. Secured data management is not solved in these files.

The Use of Template Miners and Encryption in Log Message Compression

Computers, 2021

Presently, almost every computer software produces many log messages based on events and activiti... more Presently, almost every computer software produces many log messages based on events and activities during the usage of the software. These files contain valuable runtime information that can be used in a variety of applications such as anomaly detection, error prediction, template mining, and so on. Usually, the generated log messages are raw, which means they have an unstructured format. This indicates that these messages have to be parsed before data mining models can be applied. After parsing, template miners can be applied on the data to retrieve the events occurring in the log file. These events are made from two parts, the template, which is the fixed part and is the same for all instances of the same event type, and the parameter part, which varies for all the instances. To decrease the size of the log messages, we use the mined templates to build a dictionary for the events, and only store the dictionary, the event ID, and the parameter list. We use six template miners to a...

Genome Classification Using Overlap Graph Centralities

Genetics is a fast developing field and lot of its development relies on bioinformatics and solvi... more Genetics is a fast developing field and lot of its development relies on bioinformatics and solving computing problems. The genetic data are huge, for example the human reference genome is about 3 GB and for other species they can be even greater. It is not a trivial task to process them efficiently, recovering useful data for biological and medical sciences. Researchers have already developed different models and representations of genomes to provide deeper knowledge and explore hidden context in these data. Recent years a lot of publications have been made about how to represent genomes in graphs and examining the graph features of genomes like graph centrality. The aim of this paper is comparing and examining the graph centrality of viral genomes that could help in the study of these data. We use a number of concepts of genetics and bioinformatics, mostly in meaningful context. Their exact individual definition would place too much burden on the article; the interested readers ma...

Membrane Clustering Using the PostgreSQL Database Management System

Computational models have been in the center of attention for research over the years, constantly... more Computational models have been in the center of attention for research over the years, constantly evolving to aid countless distinct fields of computer science. The application of P systems (or membrane systems) is a relatively new way to define different calculations. These systems were inspired by the biological processes and communication that take place in living organisms. In recent years—among with other similar models—it has been brought into the focus of many researchers, resulting in numerous ways of utilizing the possibilities provided by this framework. A well-established such use case is solving clustering problems. With the help of evolutionary optimization techniques, the achieved outcome can be desirable over competing solutions. Our goal is to bring these procedures closer to their focal point, the data itself, by implementing a similar algorithm in the PostgreSQL database management system, whilst exploiting its capabilities, and its allowance of accessing the data ...

Process, analyze and visualize telecommunication network configuration data in graph database

Vietnam Journal of Computer Science

In network telemetry systems, nodes produce vast number of configuration files based on how they ... more In network telemetry systems, nodes produce vast number of configuration files based on how they are configured. Steps were taken to process these files into databases to help the work of the developers, testers and customer support to focus on the development and testing and to be able to give advice to the customers about how to configure the nodes. However, the processing of these data in relational database manager system is slow, hard to query and the storage takes huge disk space. In this paper, we are presenting a way to store the data produced by these nodes in graph database, changing from relational database to NoSQL environment. With our approach, one can easily represent and visualize the network of machines. In the end, we are going to compare the inserting, querying time and storage size in different database manager systems. The results could also be used for other types of configuration data too from other kinds of machines to show the connection between them and que...

Building, Visualizing and Executing Deep Learning Models as Dataflow Graphs

Acta Electrotechnica et Informatica

Document similarity for error prediction

Journal of Information and Telecommunication

In today's rushing world, there's an ever-increasing usage of networking equipment. These devices... more In today's rushing world, there's an ever-increasing usage of networking equipment. These devices log their operations; however, there could be errors that result in the restart of the given device. There could be different patterns before different errors. Our main goal is to predict the upcoming error based on the log lines of the actual file. To achieve this, we use document similarity. One of the key concepts of information retrieval is document similarity which is an indicator of how analogous (or different) documents are. In this paper, we are studying the effectiveness of prediction based on cosine similarity, Jaccard similarity, and Euclidean distance of rows before restarts. We use different features like TFIDF, Doc2Vec, LSH, and others in conjunction with these distance measures. Since networking devices produce lots of log files, we use Spark for Big data computing.

P System–Based Clustering Methods Using NoSQL Databases

Computation

Models of computation are fundamental notions in computer science; consequently, they have been t... more Models of computation are fundamental notions in computer science; consequently, they have been the subject of countless research papers, with numerous novel models proposed even in recent years. Amongst a multitude of different approaches, many of these methods draw inspiration from the biological processes observed in nature. P systems, or membrane systems, make an analogy between the communication in computing and the flow of information that can be perceived in living organisms. These systems serve as a basis for various concepts, ranging from the fields of computational economics and robotics to the techniques of data clustering. In this paper, such utilization of these systems—membrane system–based clustering—is taken into focus. Considering the growing number of data stored worldwide, more and more data have to be handled by clustering algorithms too. To solve this issue, bringing these methods closer to the data, their main element provides several benefits. Database systems...