Papers by Svetlozar Nestorov
Expediting analytical databases with columnar approach
Decision Support Systems, 2017

Proceedings of the 1997 ACM SIGMOD International Conference, 1997
In order to access information from a variety of heterogeneous information sources, one has to be able to translate queries and data from one data model into another. This functionality is provided by so-called source wrappers [4,8], which convert queries into one or more commands/queries understandable by the underlying source and transform the native results into a format understood by the application. As part of the TSIMMIS project [1,6] we have developed hard-coded wrappers for a variety of sources (e.g., Sybase DBMS, WWW pages), including legacy systems (e.g., Folio). However, anyone who has built a wrapper before can attest that a lot of effort goes into developing and writing such a wrapper. In situations where it is important or desirable to gain access to new sources quickly, this is a major drawback. Furthermore, we have also observed that only a relatively small part of the code deals with the specific access details of the source. The rest of the code is either common among wrappers or implements query and data transformation that could be expressed in a high-level, declarative fashion.
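The declarative idea above can be illustrated with a small sketch: a wrapper whose source-specific logic is confined to one callback, while query translation is driven by declarative templates. The template syntax, the `TemplateWrapper` class, and the fake source below are hypothetical illustrations, not the actual TSIMMIS design.

```python
# Minimal sketch of a template-driven source wrapper: query translation is
# declarative (templates), and only execute_native touches the source.
class TemplateWrapper:
    """Translates simple attribute-equality queries into native source
    commands via declarative templates, then normalizes the results."""

    def __init__(self, templates, execute_native):
        # templates: list of (attribute_pattern, native_command_format) pairs
        self.templates = templates
        self.execute_native = execute_native  # source-specific access code

    def query(self, attrs):
        # Find the first template whose pattern covers the query attributes.
        for pattern, command_fmt in self.templates:
            if set(pattern) == set(attrs):
                native_cmd = command_fmt.format(**attrs)
                raw = self.execute_native(native_cmd)
                # Transform native rows into a uniform dict format.
                return [dict(row) for row in raw]
        raise ValueError("no template matches query")

# Hypothetical source: a lookup keyed by native command strings.
def fake_source(cmd):
    data = {"SELECT * FROM emp WHERE dept = 'sales'":
            [[("name", "Ann"), ("dept", "sales")]]}
    return data.get(cmd, [])

wrapper = TemplateWrapper(
    templates=[(["dept"], "SELECT * FROM emp WHERE dept = '{dept}'")],
    execute_native=fake_source,
)
print(wrapper.query({"dept": "sales"}))  # [{'name': 'Ann', 'dept': 'sales'}]
```

Adding support for a new source then amounts to supplying new templates and a new `execute_native`, which is the small source-specific fraction of the code the abstract refers to.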
Data Modeling in the Cloud
ABSTRACT

We identify and study an important class of relational queries involving constraints over the sum of multiple attributes (sum constraint queries). Finding all, or a given number of, results of these queries requires expensive join operations; in the absence of any other join conditions, these joins effectively become cartesian products. We develop rewriting techniques that transform a sum constraint query so that it can be processed efficiently by conventional relational database engines. Experimental results show that query rewriting achieves notable performance improvements for sum constraint queries without modifying the database engine. For queries asking for a given number of results, we propose a ranking algorithm that orders tuples by their probability of satisfying all sum constraints in a query. We compare it with traditional ranking algorithms that rank tuples by the value of a single attribute, and show that our method handles sum constraint queries more stably and efficiently. We also study a special but common type of sum constraint query: the self-join of a relation with symmetric sum constraints as join conditions. Despite the large number of possible execution plans, we prove that a left-deep tree is always the best execution plan for this type of query.
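Why a sum constraint need not degenerate into a cartesian product can be seen in a minimal sketch: sorting one side of the join lets the scan stop as soon as the constraint is violated. This illustrates only the pruning intuition, not the paper's actual rewriting rules for relational engines.

```python
# Evaluate a two-relation sum constraint (x + y <= cap) without a full
# cartesian product: sort one side once, then cut off each scan early.
from bisect import bisect_right

def sum_constraint_join(xs, ys, cap):
    """Return all (x, y) pairs with x + y <= cap."""
    ys_sorted = sorted(ys)
    result = []
    for x in xs:
        # Only y values up to (cap - x) can satisfy the constraint.
        hi = bisect_right(ys_sorted, cap - x)
        result.extend((x, y) for y in ys_sorted[:hi])
    return result

pairs = sum_constraint_join([1, 5, 9], [2, 4, 8], cap=9)
# pairs == [(1, 2), (1, 4), (1, 8), (5, 2), (5, 4)]
```

A rewriting along these lines can be expressed to a conventional engine as an ORDER BY plus a range predicate, which is the kind of plan-friendly form the abstract's techniques aim for.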
Inferring Structure in Semistructured Data
Framework for Integrating Process and Data Logic: Connecting UML Use Cases and ER Diagrams
Proceedings of the ITI 2013 35th International Conference on INFORMATION TECHNOLOGY INTERFACES, 2013
ABSTRACT
Association-rule mining has proved a highly successful technique for extracting useful information from very large databases. This success is attributed not only to the appropriateness of the objectives, but also to the fact that a number of new query-optimization ideas, such as the "a-priori" trick, make association-rule mining run much faster than might be expected. In this paper we see that the same tricks can be extended to a much more general context, allowing efficient mining of very large databases for many different kinds of patterns. The general idea, called "query flocks," is a generate-and-test model for data-mining problems. We show how the idea can be used either in a general-purpose mining system or in a next generation of conventional query optimizers.
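The "a-priori" trick that query flocks generalize can be sketched concretely: an itemset is generated as a candidate only if all of its subsets were already frequent, so most of the generate-and-test space is never tested. This is a minimal classical Apriori sketch, not the query-flocks formalism itself.

```python
# Minimal Apriori itemset counting: candidates of size k are built only
# from frequent (k-1)-subsets, pruning the generate-and-test space.
from itertools import combinations

def frequent_itemsets(baskets, min_support):
    items = {i for b in baskets for i in b}
    frequent = {frozenset([i]) for i in items
                if sum(i in b for b in baskets) >= min_support}
    all_frequent, k = set(frequent), 2
    while frequent:
        # Generate size-k candidates whose (k-1)-subsets are all frequent.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k
                      and all(frozenset(s) in frequent
                              for s in combinations(a | b, k - 1))}
        frequent = {c for c in candidates
                    if sum(c <= b for b in baskets) >= min_support}
        all_frequent |= frequent
        k += 1
    return all_frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}]
result = frequent_itemsets(baskets, min_support=2)
```

In the query-flocks view, the same monotonicity argument becomes an optimizer rewrite applicable to a much broader class of parameterized conjunctive queries than itemset counting alone.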

Journal of Computing and Information Technology, 2013
We propose a methodology for providing clear and consistent integration of the process and data logic in the analysis stage of the information systems development lifecycle. While our proposed approach is applicable across a variety of data and process modeling schemas, in this paper we discuss it in the context of UML use cases for process modeling and ER diagrams for data modeling. We illustrate our approach through an example of modeling the execution of a retail transaction, integrating a step-by-step process model and the corresponding data model at the attribute level of detail. We discuss the potential benefits of this approach by illustrating how this methodology, by providing a critical link between process and data models, can enable better conceptual testing early in the analysis process, ensuring better semantic quality of both process and data models.
Augmenting Data Warehouses with Big Data
Information Systems Management, 2015
Cover Stories for Key Attributes—Expanded Database Access Control
Contemporary Issues in Database Design and Information Systems Development, 2007
Query flocks
ACM SIGMOD Record, 1998
Proceedings 13th International Conference on Data Engineering, 1997
In this paper we introduce the representative object, which uncovers the inherent schema(s) in semistructured, hierarchical data sources and provides a concise description of the structure of the data. Semistructured data, unlike data stored in typical relational or object-oriented databases, does not have a fixed schema that is known in advance and stored separately from the data. With the rapid growth of the World Wide Web, semistructured hierarchical data sources are becoming widely available to the casual user. The lack of external schema information currently makes browsing and querying these data sources inefficient at best, and impossible at worst. We show how representative objects make schema discovery efficient and facilitate the generation of meaningful queries over the data.
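A first approximation of this kind of schema discovery can be sketched as collecting the set of label paths that actually occur in a nested object. This captures only the path-set idea; the paper's full representative-object construction (including the degree-k bounding of path context) is richer than this sketch.

```python
# Minimal sketch of schema discovery for semistructured data: collect
# every label path (tuple of keys) occurring in nested dicts/lists.
def label_paths(obj, prefix=()):
    """Return the set of label paths present in a nested structure."""
    paths = set()
    if isinstance(obj, dict):
        for key, value in obj.items():
            paths.add(prefix + (key,))
            paths |= label_paths(value, prefix + (key,))
    elif isinstance(obj, list):
        for item in obj:
            paths |= label_paths(item, prefix)
    return paths

doc = {"person": [{"name": "Ann", "phone": "555"},
                  {"name": "Bob"}]}
print(sorted(label_paths(doc)))
# [('person',), ('person', 'name'), ('person', 'phone')]
```

Note that the second person has no `phone`, yet the discovered path set still records `('person', 'phone')` as a possible path: the inferred schema describes what can occur, not what must.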

36th Annual Hawaii International Conference on System Sciences, 2003. Proceedings of the, 2003
Many organizations underutilize their already constructed data warehouses. In this paper, we suggest a novel way of acquiring more information from corporate data warehouses without the complications and drawbacks of deploying additional software systems. Association-rule mining, which captures co-occurrence patterns within data, has attracted considerable effort from data warehousing researchers and practitioners alike. Unfortunately, most data mining tools are loosely coupled, at best, with the data warehouse repository. Furthermore, these tools can often find association rules only within the main fact table of the data warehouse (thus ignoring the information-rich dimensions of the star schema) and are not easily applied to the non-transaction-level data often found in data warehouses. In this paper, we present a new data-mining framework that is tightly integrated with data warehousing technology. Our framework has several advantages over the use of separate data mining tools. First, the data stays in the data warehouse, which greatly simplifies the management of security and privacy issues. Second, we utilize the query processing power of the data warehouse itself, without using a separate data-mining tool. In addition, this framework allows ad-hoc data mining queries over the whole data warehouse, not just over the transformed portion of the data that a standard data-mining tool requires. Finally, this framework also expands the domain of association-rule mining from transaction-level data to aggregated data as well.
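The "use the warehouse's own query engine" idea can be sketched with a small example: pairwise support counts expressed as a self-join plus GROUP BY, so the data never leaves the DBMS. The transaction fact table below is a hypothetical illustration (using SQLite as a stand-in for a warehouse engine), not the framework's actual schema.

```python
# Minimal sketch: association-rule support counting pushed into the
# database engine itself via a self-join and GROUP BY.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE facts (txn INTEGER, item TEXT)")
conn.executemany("INSERT INTO facts VALUES (?, ?)",
                 [(1, "bread"), (1, "milk"), (2, "bread"),
                  (2, "milk"), (3, "bread"), (3, "eggs")])

# Item pairs co-occurring in at least 2 transactions.
rows = conn.execute("""
    SELECT a.item, b.item, COUNT(*) AS support
    FROM facts a JOIN facts b
      ON a.txn = b.txn AND a.item < b.item
    GROUP BY a.item, b.item
    HAVING support >= 2
""").fetchall()
print(rows)  # [('bread', 'milk', 2)]
```

Because the counting is ordinary SQL, the same pattern extends to joins against dimension tables or to aggregated (non-transaction-level) data, which is exactly the flexibility the abstract claims over stand-alone mining tools.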

Quantifying the impact and extent of undocumented biomedical synonymy
PLoS computational biology, 2014
Synonymous relationships among biomedical terms are extensively annotated within specialized terminologies, implying that synonymy is important for practical computational applications within this field. It remains unclear, however, whether text mining actually benefits from documented synonymy and whether existing biomedical thesauri provide adequate coverage of these linguistic relationships. In this study, we examine the impact and extent of undocumented synonymy within a very large compendium of biomedical thesauri. First, we demonstrate that missing synonymy has a significant negative impact on named entity normalization, an important problem within the field of biomedical text mining. To estimate the amount of synonymy currently missing from thesauri, we develop a probabilistic model for the construction of synonym terminologies that is capable of handling a wide range of potential biases, and we evaluate its performance using the broader domain of near-synonymy among general Eng...
ACM SIGMOD Record, 2003
There has been an abundance of research within the last couple of decades in the area of multilevel secure (MLS) databases. Recent work in this field deals with the processing of multilevel transactions, expanding the logic of MLS query languages, and utilizing MLS principles within the realm of E-Business. However, there is a basic flaw within the MLS logic, which obstructs the handling of clearance-invariant aggregate queries and physical-entity-related queries where some of the information in the database may be gleaned from the outside world. This flaw stands in the way of a more pervasive adoption of MLS models by the developers of practical applications. This paper clearly identifies the cause of this impediment, namely the dependence of cover stories on the value of a user-defined key, and proposes a practical solution.
Proceedings of the 1998 ACM SIGMOD international conference on Management of data - SIGMOD '98, 1998
Semistructured data is characterized by the lack of any fixed and rigid schema, although typically the data has some implicit structure.
Integrating Data Mining with Relational DBMS: A Tightly-Coupled Approach
Lecture Notes in Computer Science, 1999
Data mining is rapidly finding its way into mainstream computing. The development of generic methods such as itemset counting has opened the area to academic inquiry and has resulted in a large harvest of research results. While the mined datasets are often in relational format, most mining systems do not use relational DBMS. Thus, they miss the opportunity to

Scientific Programming, 2011
Text analysis tools are nowadays required to process increasingly large corpora, which are often organized as many small files (abstracts, news articles, etc.). Cloud computing offers a convenient, on-demand, pay-as-you-go environment for solving such problems. We investigate provisioning on the Amazon EC2 cloud from the user perspective, aiming for a scheduling strategy that is both timely and cost-effective. We rely on the empirical performance of the application of interest on smaller subsets of data to construct an execution plan. A first goal of our performance measurements is to determine an optimal file size for our application to consume. Using the subset-sum first-fit heuristic, we reshape the input data by merging files to match the desired file size as closely as possible. This also speeds up the retrieval of our application's results, since the output is less segmented. Using predictions of the application's performance based on measurements over small data sets, we devise an execution plan that meets a user-specified deadline while minimizing cost.
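The file-merging step described above can be sketched as a first-fit bin-packing heuristic that groups small files into merged chunks near a target size. The sizes and target below are illustrative; the paper derives the target empirically from application measurements.

```python
# Minimal sketch of first-fit merging: pack files (largest first) into
# bins so that no bin exceeds the target merged-file size.
def first_fit_merge(file_sizes, target):
    """Greedily assign file indices to bins without exceeding target."""
    bins = []  # each bin: [total_size, [file indices]]
    for idx, size in sorted(enumerate(file_sizes), key=lambda p: -p[1]):
        for b in bins:
            if b[0] + size <= target:
                b[0] += size
                b[1].append(idx)
                break
        else:
            # No existing bin fits this file; open a new one.
            bins.append([size, [idx]])
    return bins

sizes = [40, 10, 25, 30, 15]  # e.g. file sizes in MB (illustrative)
print(first_fit_merge(sizes, target=50))
# [[50, [0, 1]], [45, [3, 4]], [25, [2]]]
```

Sorting largest-first is the standard first-fit-decreasing refinement; it tends to fill bins closer to the target, which here means merged files closer to the size the application consumes most efficiently.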
Uncovering Actionable Knowledge in Corporate Data with Qualified Association Rules
Principles and Applications of Business Intelligence Research, 2013