Papers by Svetlozar Nestorov
Expediting analytical databases with columnar approach
Decision Support Systems, 2017

Proceedings of the 1997 ACM SIGMOD International Conference, 1997
In order to access information from a variety of heterogeneous information sources, one has to be able to translate queries and data from one data model into another. This functionality is provided by so-called source wrappers [4,8], which convert queries into one or more commands/queries understandable by the underlying source and transform the native results into a format understood by the application. As part of the TSIMMIS project [1,6] we have developed hard-coded wrappers for a variety of sources (e.g., Sybase DBMS, WWW pages), including legacy systems (e.g., Folio). However, anyone who has built a wrapper before can attest that a lot of effort goes into developing and writing such a wrapper. In situations where it is important or desirable to gain access to new sources quickly, this is a major drawback. Furthermore, we have also observed that only a relatively small part of the code deals with the specific access details of the source. The rest of the code is either common among wrappers or implements query and data transformation that could be expressed in a high-level, declarative fashion.
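The declarative idea above can be illustrated with a small sketch: a wrapper whose source-specific logic is confined to one callback, while query translation is driven by declarative templates. The template syntax, the `TemplateWrapper` class, and the fake source below are hypothetical illustrations, not the actual TSIMMIS design.

```python
# Minimal sketch of a template-driven source wrapper: query translation is
# declarative (templates), and only execute_native touches the source.
class TemplateWrapper:
    """Translates simple attribute-equality queries into native source
    commands via declarative templates, then normalizes the results."""

    def __init__(self, templates, execute_native):
        # templates: list of (attribute_pattern, native_command_format) pairs
        self.templates = templates
        self.execute_native = execute_native  # source-specific access code

    def query(self, attrs):
        # Find the first template whose pattern covers the query attributes.
        for pattern, command_fmt in self.templates:
            if set(pattern) == set(attrs):
                native_cmd = command_fmt.format(**attrs)
                raw = self.execute_native(native_cmd)
                # Transform native rows into a uniform dict format.
                return [dict(row) for row in raw]
        raise ValueError("no template matches query")

# Hypothetical source: a lookup keyed by native command strings.
def fake_source(cmd):
    data = {"SELECT * FROM emp WHERE dept = 'sales'":
            [[("name", "Ann"), ("dept", "sales")]]}
    return data.get(cmd, [])

wrapper = TemplateWrapper(
    templates=[(["dept"], "SELECT * FROM emp WHERE dept = '{dept}'")],
    execute_native=fake_source,
)
print(wrapper.query({"dept": "sales"}))  # [{'name': 'Ann', 'dept': 'sales'}]
```

Adding support for a new source then amounts to supplying new templates and a new `execute_native`, which is the small source-specific fraction of the code the abstract refers to.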
Data Modeling in the Cloud
ABSTRACT

We identify and study an important class of relational queries involving constraints over the sum of multiple attributes (sum constraint queries). Finding all, or a given number of, results of these queries requires expensive join operations; in the absence of any other join conditions, these joins effectively become cartesian products. We develop rewriting techniques that transform a sum constraint query so that it can be processed efficiently by conventional relational database engines. Experimental results show that query rewriting achieves notable performance improvements for sum constraint queries without modifying the database engine. For queries asking for a given number of results, we propose a ranking algorithm that orders tuples by their probability of satisfying all sum constraints in a query. We compare it with traditional ranking algorithms that rank tuples by the value of a single attribute, and show that our method handles sum constraint queries more stably and efficiently. We also study a special but common type of sum constraint query: the self-join of a relation with symmetric sum constraints as join conditions. Despite the large number of possible execution plans, we prove that a left-deep tree is always the best execution plan for this type of query.
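Why a sum constraint need not degenerate into a cartesian product can be seen in a minimal sketch: sorting one side of the join lets the scan stop as soon as the constraint is violated. This illustrates only the pruning intuition, not the paper's actual rewriting rules for relational engines.

```python
# Evaluate a two-relation sum constraint (x + y <= cap) without a full
# cartesian product: sort one side once, then cut off each scan early.
from bisect import bisect_right

def sum_constraint_join(xs, ys, cap):
    """Return all (x, y) pairs with x + y <= cap."""
    ys_sorted = sorted(ys)
    result = []
    for x in xs:
        # Only y values up to (cap - x) can satisfy the constraint.
        hi = bisect_right(ys_sorted, cap - x)
        result.extend((x, y) for y in ys_sorted[:hi])
    return result

pairs = sum_constraint_join([1, 5, 9], [2, 4, 8], cap=9)
# pairs == [(1, 2), (1, 4), (1, 8), (5, 2), (5, 4)]
```

A rewriting along these lines can be expressed to a conventional engine as an ORDER BY plus a range predicate, which is the kind of plan-friendly form the abstract's techniques aim for.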
Inferring Structure in Semistructured Data
Framework for Integrating Process and Data Logic: Connecting UML Use Cases and ER Diagrams
Proceedings of the ITI 2013 35th International Conference on INFORMATION TECHNOLOGY INTERFACES, 2013
ABSTRACT
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but also to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than might be expected. In this paper we see that the same tricks can be extended to a much more general context, allowing efficient mining of very large databases for many different kinds of patterns. The general idea, called "query flocks," is a generate-and-test model for data-mining problems. We show how the idea can be used either in a general-purpose mining system or in a next generation of conventional query optimizers.
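The "a-priori" trick that query flocks generalize can be sketched concretely: an itemset is generated as a candidate only if all of its subsets were already frequent, so most of the generate-and-test space is never tested. This is a minimal classical Apriori sketch, not the query-flocks formalism itself.

```python
# Minimal Apriori itemset counting: candidates of size k are built only
# from frequent (k-1)-subsets, pruning the generate-and-test space.
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    items = {i for b in baskets for i in b}
    frequent = {frozenset([i]) for i in items
                if sum(i in b for b in baskets) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Generate size-k candidates whose (k-1)-subsets are all frequent.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k
                      and all(frozenset(s) in frequent
                              for s in combinations(a | b, k - 1))}
        frequent = {c for c in candidates
                    if sum(c <= b for b in baskets) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = frequent_itemsets(baskets, min_support=2)
```

In the query-flocks view, the same monotonicity argument becomes an optimizer rewrite applicable to a much broader class of parameterized conjunctive queries than itemset counting alone.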

Journal of Computing and Information Technology, 2013
We propose a methodology for providing clear and consistent integration of the process and data logic in the analysis stage of the information systems development lifecycle. While our proposed approach is applicable across a variety of data and process modeling schemas, in this paper we discuss it in the context of UML use cases for process modeling and ER diagrams for data modeling. We illustrate our approach through an example of modeling the execution of a retail transaction, integrating a step-by-step process model and the corresponding data model at the attribute level of detail. We discuss the potential benefits of this approach by illustrating how this methodology, by providing a critical link between process and data models, can enable better conceptual testing early in the analysis process, ensuring better semantic quality of both process and data models.
Augmenting Data Warehouses with Big Data
Information Systems Management, 2015
Cover Stories for Key Attributes—Expanded Database Access Control
Contemporary Issues in Database Design and Information Systems Development, 2007
Query flocks
ACM SIGMOD Record, 1998
Proceedings 13th International Conference on Data Engineering, 1997
In this paper we introduce the representative object, which uncovers the inherent schema(s) in semistructured, hierarchical data sources and provides a concise description of the structure of the data. Semistructured data, unlike data stored in typical relational or object-oriented databases, does not have a fixed schema that is known in advance and stored separately from the data. With the rapid growth of the World Wide Web, semistructured hierarchical data sources are becoming widely available to the casual user. The lack of external schema information currently makes browsing and querying these data sources inefficient at best, and impossible at worst. We show how representative objects make schema discovery efficient and facilitate the generation of meaningful queries over the data.
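A first approximation of this kind of schema discovery can be sketched as collecting the set of label paths that actually occur in a nested object. This captures only the path-set idea; the paper's full representative-object construction (including the degree-k bounding of path context) is richer than this sketch.

```python
# Minimal sketch of schema discovery for semistructured data: collect
# every label path (tuple of keys) occurring in nested dicts/lists.
def label_paths(obj, prefix=()):
    """Return the set of label paths present in a nested structure."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            paths.add(prefix + (key,))
            paths |= label_paths(value, prefix + (key,))
    elif isinstance(obj, list):
        for item in obj:
            paths |= label_paths(item, prefix)
    return paths

doc = {"person": [{"name": "Ann", "phone": "555"},
                  {"name": "Bob"}]}
print(sorted(label_paths(doc)))
# [('person',), ('person', 'name'), ('person', 'phone')]
```

Note that the second person has no `phone`, yet the discovered path set still records `('person', 'phone')` as a possible path: the inferred schema describes what can occur, not what must.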

36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the, 2003
Many organizations underutilize their already constructed data warehouses. In this paper, we suggest a novel way of acquiring more information from corporate data warehouses without the complications and drawbacks of deploying additional software systems. Association-rule mining, which captures co-occurrence patterns within data, has attracted considerable effort from data warehousing researchers and practitioners alike. Unfortunately, most data mining tools are loosely coupled, at best, with the data warehouse repository. Furthermore, these tools can often find association rules only within the main fact table of the data warehouse (thus ignoring the information-rich dimensions of the star schema) and are not easily applied to the non-transaction-level data often found in data warehouses. In this paper, we present a new data-mining framework that is tightly integrated with data warehousing technology. Our framework has several advantages over the use of separate data mining tools. First, the data stays in the data warehouse, which greatly simplifies the management of security and privacy issues. Second, we utilize the query processing power of the data warehouse itself, without using a separate data-mining tool. In addition, this framework allows ad-hoc data mining queries over the whole data warehouse, not just over the transformed portion of the data that a standard data-mining tool requires. Finally, this framework also expands the domain of association-rule mining from transaction-level data to aggregated data as well.
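The "use the warehouse's own query engine" idea can be sketched with a small example: pairwise support counts expressed as a self-join plus GROUP BY, so the data never leaves the DBMS. The transaction fact table below is a hypothetical illustration (using SQLite as a stand-in for a warehouse engine), not the framework's actual schema.

```python
# Minimal sketch: association-rule support counting pushed into the
# database engine itself via a self-join and GROUP BY.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (txn INTEGER, item TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?)",
                 [(1, "bread"), (1, "milk"), (2, "bread"),
                  (2, "milk"), (3, "bread"), (3, "eggs")])

# Item pairs co-occurring in at least 2 transactions.
rows = conn.execute("""
    SELECT a.item, b.item, COUNT(*) AS support
    FROM facts a JOIN facts b
      ON a.txn = b.txn AND a.item < b.item
    GROUP BY a.item, b.item
    HAVING support >= 2
""").fetchall()
print(rows)  # [('bread', 'milk', 2)]
```

Because the counting is ordinary SQL, the same pattern extends to joins against dimension tables or to aggregated (non-transaction-level) data, which is exactly the flexibility the abstract claims over stand-alone mining tools.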

Quantifying the impact and extent of undocumented biomedical synonymy
PLoS computational biology, 2014
Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general Eng...
ACM SIGMOD Record, 2003
There has been an abundance of research within the last couple of decades in the area of multilevel secure (MLS) databases. Recent work in this field deals with the processing of multilevel transactions, expanding the logic of MLS query languages, and utilizing MLS principles within the realm of E-Business. However, there is a basic flaw within the MLS logic, which obstructs the handling of clearance-invariant aggregate queries and physical-entity-related queries where some of the information in the database may be gleaned from the outside world. This flaw stands in the way of a more pervasive adoption of MLS models by the developers of practical applications. This paper clearly identifies the cause of this impediment, namely the dependence of cover stories on the value of a user-defined key, and proposes a practical solution.
Proceedings of the 1998 ACM SIGMOD international conference on Management of data - SIGMOD '98, 1998
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure.
Integrating Data Mining with Relational DBMS: A Tightly-Coupled Approach
Lecture Notes in Computer Science, 1999
Data mining is rapidly finding its way into mainstream computing. The development of generic methods such as itemset counting has opened the area to academic inquiry and has resulted in a large harvest of research results. While the mined datasets are often in relational format, most mining systems do not use relational DBMS. Thus, they miss the opportunity to

Scientific Programming, 2011
Text analysis tools are nowadays required to process increasingly large corpora, which are often organized as many small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user perspective, aiming for a scheduling strategy that is both timely and cost-effective. We rely on the empirical performance of the application of interest on smaller subsets of data to construct an execution plan. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. This also speeds up the retrieval of our application's results, since the output is less segmented. Using predictions of the application's performance based on measurements over small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
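The file-merging step described above can be sketched as a first-fit bin-packing heuristic that groups small files into merged chunks near a target size. The sizes and target below are illustrative; the paper derives the target empirically from application measurements.

```python
# Minimal sketch of first-fit merging: pack files (largest first) into
# bins so that no bin exceeds the target merged-file size.
def first_fit_merge(file_sizes, target):
    """Greedily assign file indices to bins without exceeding target."""
    bins = []  # each bin: [total_size, [file indices]]
    for idx, size in sorted(enumerate(file_sizes), key=lambda p: -p[1]):
        for b in bins:
            if b[0] + size <= target:
                b[0] += size
                b[1].append(idx)
                break
        else:
            # No existing bin fits this file; open a new one.
            bins.append([size, [idx]])
    return bins

sizes = [40, 10, 25, 30, 15]  # e.g. file sizes in MB (illustrative)
print(first_fit_merge(sizes, target=50))
# [[50, [0, 1]], [45, [3, 4]], [25, [2]]]
```

Sorting largest-first is the standard first-fit-decreasing refinement; it tends to fill bins closer to the target, which here means merged files closer to the size the application consumes most efficiently.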
Uncovering Actionable Knowledge in Corporate Data with Qualified Association Rules
Principles and Applications of Business Intelligence Research, 2013