2015, Knowledge and Information Systems
Given two collections of set objects R and S, the R ⊆ S set containment join returns all object pairs (r, s) ∈ R × S such that r ⊆ s. Besides being a basic operator in all modern data management systems with a wide range of applications, the join can be used to evaluate complex SQL queries based on relational division and as a module of data mining algorithms. The state-of-the-art algorithm for set containment joins (PRETTI) builds an inverted index on the right-hand collection S and a prefix tree on the left-hand collection R that groups set objects with common prefixes and thus avoids redundant processing. In this paper, we present a framework which improves PRETTI in two directions. First, we limit the prefix tree construction by proposing an adaptive methodology based on a cost model; this way, we can greatly reduce the space and time cost of the join. Second, we partition the objects of each collection based on their first contained item, assuming that the set objects are internally sorted. We show that we can process the partitions and evaluate the join while building the prefix tree and the inverted index progressively. This allows us to significantly reduce not only the join cost, but also the maximum memory requirements during the join. An experimental evaluation using both real and synthetic datasets shows that our framework outperforms PRETTI by a wide margin.
Keywords: Set-valued data • containment join • query processing • inverted index • prefix tree
To appear at the Knowledge and Information Systems Journal (KAIS).
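For concreteness, the baseline that PRETTI improves on — probing an inverted index on S once per set of R — can be sketched as follows. Names are illustrative, and the real algorithm additionally shares work across sets of R via the prefix tree; this is only the inverted-index core:

```python
from collections import defaultdict

def containment_join(R, S):
    """Naive inverted-index set containment join: all (i, j) with
    R[i] a subset of S[j]. A sketch of the baseline idea, not the
    PRETTI algorithm itself."""
    # Inverted index on the right-hand collection S:
    # item -> ids of the sets in S that contain it.
    inv = defaultdict(set)
    for j, s in enumerate(S):
        for item in s:
            inv[item].add(j)
    result = []
    for i, r in enumerate(R):
        if not r:  # the empty set is contained in every s
            result.extend((i, j) for j in range(len(S)))
            continue
        # r is contained in exactly those sets whose ids appear in
        # the posting list of every item of r.
        candidates = None
        for item in r:
            postings = inv.get(item, set())
            candidates = postings if candidates is None else candidates & postings
            if not candidates:
                break
        for j in sorted(candidates):
            result.append((i, j))
    return result
```

A prefix tree on R saves the repeated posting-list intersections that this sketch performs from scratch for sets sharing a common prefix.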
The VLDB Journal, 2021
Joins are essential and potentially expensive operations in database management systems. When data is associated with time periods, joins commonly include predicates that require pairs of argument tuples to overlap in order to qualify for the result. Our goal is to enable built-in systems support for such joins. In particular, we present an approach where overlap joins are formulated as unions of range joins, which are more general-purpose than overlap joins, i.e., are useful in their own right, and are supported well by B+-trees. The approach is sufficiently flexible that it also supports joins with additional equality predicates, as well as open, closed, and half-open time periods over discrete and continuous domains, thus offering both generality and simplicity, which is important in a system setting. We provide both a stand-alone solution that performs on par with the state-of-the-art and a DBMS embedded solution that is able to exploit standard indexing and clearly...
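The decomposition described above can be illustrated with a minimal sketch. Assuming closed intervals over a discrete domain (our assumption, for simplicity), an overlap join splits into two disjoint range joins; in a real system each range predicate would be evaluated with a B+-tree rather than the nested loops used here, and all names are illustrative:

```python
def overlap_join(R, S):
    """Overlap join on closed intervals (start, end), expressed as the
    union of two disjoint range joins: either r starts inside s, or s
    starts strictly inside r. Together these cover exactly the
    overlapping pairs, with no duplicates."""
    out = []
    # Range join 1: r starts inside s (s.start <= r.start <= s.end).
    for i, (rs, re) in enumerate(R):
        for j, (ss, se) in enumerate(S):
            if ss <= rs <= se:
                out.append((i, j))
    # Range join 2: s starts strictly inside r (r.start < s.start <= r.end).
    for i, (rs, re) in enumerate(R):
        for j, (ss, se) in enumerate(S):
            if rs < ss <= re:
                out.append((i, j))
    return sorted(out)
```

Each of the two passes is a plain range predicate on a single point, which is exactly the shape a B+-tree index scan supports well.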
2015 IEEE 31st International Conference on Data Engineering, 2015
Computing containment relations between massive collections of sets is a fundamental operation in data management, for example in graph analytics and data mining applications. Motivated by recent hardware trends, in this paper we present two novel solutions for computing set-containment joins over massive sets: the Patricia Trie-based Signature Join (PTSJ) and PRETTI+, a Patricia trie enhanced extension of the state-of-the-art PRETTI join. The compact trie structure not only enables efficient use of main-memory, but also significantly boosts the performance of both approaches. By carefully analyzing the algorithms and conducting extensive experiments with various synthetic and real-world datasets, we show that, in many practical cases, our algorithms are an order of magnitude faster than the state-of-the-art.
We investigate the effect of query rewriting on joins involving set-valued attributes in object-relational database management systems. We show that by unnesting set-valued attributes (that are stored in an internal nested representation) prior to the actual set containment or intersection join we can improve the performance of query evaluation by an order of magnitude. By giving example query evaluation plans we show the increased possibilities for the query optimizer.
Proceedings of the 15th ACM international conference on Information and knowledge management - CIKM '06, 2006
Set-valued attributes frequently occur in contexts like market-basket analysis and stock market trends. Recent research literature has mainly focused on set containment joins and data mining without considering simple queries on set-valued attributes. In this paper we address superset, subset and equality queries and we propose a novel indexing scheme for answering them on set-valued attributes. The proposed index superimposes a trie on top of an inverted file that indexes a relation with set-valued data. We show that we can efficiently answer the aforementioned queries by indexing only a subset of the most frequent items that occur in the indexed relation. Finally, we show through extensive experiments that our approach outperforms the state-of-the-art mechanisms and scales gracefully as database size grows.
Information Systems, 2016
Identifying similarities in large datasets is an essential operation in several applications such as bioinformatics, pattern recognition, and data integration. To make a relational database management system similarity-aware, the core relational operators have to be extended. While similarity-awareness has been introduced in database engines for relational operators such as joins and group-by, little has been achieved for relational set operators, namely Intersection, Difference, and Union. In this paper, we propose to extend the semantics of relational set operators to take into account the similarity of values. We develop efficient query processing algorithms for evaluating them, and implement these operators inside an open-source database system, namely PostgreSQL. By extending several queries from the TPC-H benchmark to include predicates that involve similarity-based set operators, we perform extensive experiments that demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.
Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, 2010
Prior work has identified set based comparisons as a useful primitive for supporting a wide variety of similarity functions in record matching. Accordingly, various techniques have been proposed to improve the performance of set similarity lookups. However, this body of work focuses almost exclusively on symmetric notions of set similarity. In this paper, we study the indexing problem for the asymmetric Jaccard containment similarity function that is an error-tolerant variation of set containment. We enhance this similarity function to also account for string transformations that reflect synonyms such as "Bob" and "Robert" referring to the same first name. We propose an index structure that builds inverted lists on carefully chosen token-sets and a lookup algorithm using our index that is sensitive to the output size of the query. Our experiments over real life data sets show the benefits of our techniques. To our knowledge, this is the first paper that studies the indexing problem for Jaccard containment in the presence of string transformations.
Proceedings of the May 16-19, 1983, national computer conference on - AFIPS '83, 1983
Hardware organizations for processing set-theoretic database query functions are presented. These organizations implement the functions by processing index trees. One advantage of this approach is that the index trees can be merged in a highly parallel fashion. The hardware organizations proposed here use the database machine approach, thus processing the index trees on the fly. Experimental results giving the performances of these organizations are presented. Finally, a slight variation of the index tree representation, requiring much less storage for the index, is given. From the collection of the Computer History Museum (www.computerhistory.org)
Journal of King Saud University - Computer and Information Sciences, 2010
Enhancing the performance of large database systems depends heavily on the cost of performing join operations. When two very large tables are joined, optimizing such an operation is an interesting research topic for many researchers, especially when both tables to be joined are too large to fit in main memory. In such a case, the join is usually performed by methods other than hash join algorithms. In this paper, a novel join algorithm based on quadtrees is introduced. Applying the proposed algorithm to two tables that are too large to fit in main memory is shown to be fast and efficient. In the proposed algorithm, both tables are represented by a storage-efficient quadtree that is designed to handle one-dimensional arrays (1-D arrays). The algorithm works on the two 1-D arrays of the two tables to perform join operations. Time and space complexities of the new algorithm are studied. Experimental studies show the efficiency and superiority of this algorithm. The proposed join algorithm requires a minimum number of I/O operations and operates in main memory with O(n log(n/k)) time complexity, where k is the number of key groups with the same first letter, and n/k is much smaller than n.
1999
Enhancements in data capturing technology have led to exponential growth in the amounts of data being stored in information systems. This growth in turn has motivated researchers to seek new techniques for extraction of knowledge implicit or hidden in the data. In this paper, we motivate the need for an incremental data mining approach based on a data structure called the itemset tree. The proposed approach is shown to be effective for solving problems related to efficiency of handling data updates, accuracy of data mining results, processing input transactions, and answering user queries. We present efficient algorithms to insert transactions into the itemset tree and to count frequencies of itemsets for queries about strength of association among items. We prove that the expected complexity of inserting a transaction is ≈ O(1), and that of frequency counting is O(n), where n is the cardinality of the domain of items.
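A much-simplified sketch of the two operations — inserting a sorted transaction into a prefix tree and counting the frequency of an itemset — might look as follows. The paper's itemset tree additionally merges runs of items into single nodes; this one-item-per-node version (our simplification) only illustrates the traversal logic:

```python
class Node:
    def __init__(self):
        self.count = 0      # transactions ending exactly at this node
        self.children = {}  # item -> Node

def insert(root, transaction):
    """Insert a transaction into the prefix tree, item by item in
    ascending order, so common prefixes share a path."""
    node = root
    for item in sorted(transaction):
        node = node.children.setdefault(item, Node())
    node.count += 1

def frequency(node, query):
    """Count transactions that contain every item of `query`
    (a tuple of items in ascending order)."""
    if not query:
        # Every transaction in this subtree is a superset of the query.
        return node.count + sum(frequency(c, ()) for c in node.children.values())
    total = 0
    for item, child in node.children.items():
        if item < query[0]:
            total += frequency(child, query)      # query item may appear deeper
        elif item == query[0]:
            total += frequency(child, query[1:])  # matched one query item
        # item > query[0]: prune; items only grow along a path
    return total
```

Because items along any root-to-leaf path are strictly increasing, branches whose next item already exceeds the smallest unmatched query item can be pruned, which is the source of the efficiency the abstract claims.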
There is a trend towards operational or Live BI (Business Intelligence) that requires immediate synchronization between the operational source systems and the data warehouse infrastructure in order to achieve high up-to-dateness for analytical query results. The high performance requirements imposed by many ad-hoc queries are typically addressed with read-optimized column stores and in-memory data management. However, Live BI additionally requires transactional and update-friendly in-memory indexing due to high update rates of propagated data. For example, in-memory indexing with prefix trees exhibits a well-balanced read/write performance because no index reorganization is required. The vision of this project is to use the underlying in-memory index structures, in the form of prefix trees, for query processing as well. Prefix trees are used as intermediate results of a plan and thus, all database operations can benefit from the structure of the in-memory index by pruning workin...
2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018
Set similarity join is a fundamental and well-studied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be important; indeed, the exact set similarity join is itself often only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in high-dimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
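For reference, the exact problem being approximated can be stated as a naive all-pairs Jaccard join (illustrative names; both prefix filtering and the randomized method above exist precisely to avoid this quadratic scan):

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b|; assumes non-empty sets."""
    inter = len(a & b)
    return inter / (len(a) + len(b) - inter)

def similarity_join(sets, threshold):
    """Exhaustive exact set similarity join: all pairs (i, j), i < j,
    whose Jaccard similarity meets the threshold."""
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]
```

Prefix filtering prunes pairs by noting that two sets meeting the threshold must share at least one token among the rarest few of each; the randomized approach in the paper instead trades a bounded amount of recall for speed.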
Proc. of the Int'l Conference on Information …
Most join algorithms can be extended to reduce wasted work when several tuples contain the same value of the join attribute. We show that separating detection of duplicates from their exploitation improves modularity and makes it easier to implement whole families of hierarchy-exploiting join algorithms that avoid duplication. The technique is also used to provide an execution technique for star-like patterns of joins around a central relation. It dominates Ingres-like substitution for the central relation, in both performance and ease of including in a conventional optimizer. Its performance dominates a cascade of conventional binary joins, and performance estimates are more accurate. We then argue that such techniques make it undesirable to implement physical-level multiway join operations within a query processor.
Proceedings of the 2004 ACM symposium on Applied computing - SAC '04, 2004
Storing sets and querying them (e.g., subset queries that provide all supersets of a given set) is known to be difficult within relational databases. We consider that being able to query efficiently both transactional data and materialized collections of sets by means of a standard query language is an important step towards practical inductive databases. Indeed, data mining query languages like MINE RULE extract collections of association rules whose components are sets into relational tables. Post-processing phases often make extensive use of subset queries, which cannot be processed efficiently by SQL servers. In this paper, we propose a new way to handle sets from relational databases. It is based on a data structure that partially encodes the inclusion relationship between sets. It is an extension of the hash group bitmap key proposed by Morzy et al. [8]. Our experiments show an interesting improvement for these useful subset queries.
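The bitmap-key idea mentioned above can be illustrated with a minimal superimposed-coding sketch. This is our own illustration of the general signature-filtering principle, not the paper's exact encoding; bit width, hash function, and names are assumptions:

```python
def signature(s, bits=16):
    """Superimposed-coding signature: OR together one hash bit per item.
    Illustrative parameters; real encodings tune width and hashing."""
    sig = 0
    for item in s:
        sig |= 1 << (hash(item) % bits)
    return sig

def maybe_subset(a, b, bits=16):
    """Necessary (but not sufficient) condition for a ⊆ b: every
    signature bit of a must also be set in b. False means definitely
    not a subset; True means a candidate to verify exactly."""
    return signature(a, bits) & ~signature(b, bits) == 0
```

A subset query then filters with `maybe_subset` first and verifies only the surviving candidates with an exact set comparison, which is what makes such encodings useful inside a relational store.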
Lecture Notes in Computer Science, 2014
Identifying similarities in large datasets is an essential operation in many applications such as bioinformatics, pattern recognition, and data integration. To make the underlying database system similarity-aware, the core relational operators have to be extended. Several similarity-aware relational operators have been proposed that introduce similarity processing at the database engine level, e.g., similarity joins and similarity group-by. This paper extends the semantics of the set intersection operator to operate over similar values. The paper describes the semantics of the similarity-based set intersection operator, and develops an efficient query processing algorithm for evaluating it. The proposed operator is implemented inside an open-source database system, namely PostgreSQL. Several queries from the TPC-H benchmark are extended to include similarity-based set intersection predicates. Performance results demonstrate up to three orders of magnitude speedup in performance over equivalent queries that only employ regular operators.
Synthesis Lectures on Data Management, 2013
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. The series publishes 50- to 125-page publications on topics pertaining to data management. The scope will largely follow the purview of premier information and computer science conferences, such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but are not limited to: query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia data management, data mining, and related subjects.
2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010
Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, Ɛ-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of that of the equivalent query that uses only regular operators, while Ɛ-Join's execution time is 20 to 90% of that of its equivalent regular-operator-based query for the useful case of small Ɛ (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of those of the initial query plans.
1992
In this paper we introduce the concept of declustering aware joins (DA-Joins), which is a family of join algorithms with each member being determined by the underlying declustering method and the join technique used. We present and analyze a member of the DA-Join family that is based on the parallel hybrid-hash join and the CMD declustering method, which we call DA-Join_{HH,CMD}. We show that DA-Join_{HH,CMD} has very good overall performance. Its main advantages over non-declustering-aware joins are (i) its ability to partition the problem into a set of smaller independent sub-problems, each of which can utilize memory more efficiently, and (ii) pruning the problem size by not considering portions of the relations that cannot join. It shares with hash joins the desirable property of scalability with respect to the degree of parallelism. However, it can avoid the problems due to data skew that are faced by parallel hash joins. The CMD declustering method has been shown to be optimal for multi-attribute range queries on parallel I/O architectures, and our analysis and experimental evaluation of DA-Join_{HH,CMD} prove that it is also possible to perform joins efficiently on CMD-stored relations, thus providing evidence for the desirability of CMD as a basic technique for multi-attribute declustering of relations in a parallel database.
New Generation Computing, 1988
In recent investigations of reducing the complexity of the relational join operation, several hash-based partitioned-join strategies have been introduced. All of these strategies depend upon the costly operation of data space partitioning before the join can be carried out. We had previously introduced a partitioned-join based on a dynamic and order-preserving multidimensional data organization called DYOP. The present study extends the earlier research on DYOP and constructs a simulation model. The simulation studies on DYOP and subsequent comparisons of all the partitioned-join methodologies including DYOP show that the space utilization of DYOP improves with an increasing number of attributes. Furthermore, the DYOP-based join outperforms all the hash-based methodologies by greatly reducing the total I/O bandwidth required for the entire partitioned-join operation. The comparison model is independent of architectural issues such as multiprocessing, multiple disk usage, and large memory availability, all of which help to further increase the efficiency of the operation.
Proceedings of the 14th International Conference on Extending Database Technology - EDBT/ICDT '11, 2011
In this paper we address the problem of efficiently evaluating containment (i.e., subset, equality, and superset) queries over set-valued data. We propose a novel indexing scheme, the Ordered Inverted File (OIF) which, unlike the state-of-the-art, indexes set-valued attributes in an ordered fashion. We introduce query processing algorithms that practically treat containment queries as range queries over the ordered postings lists of OIF and exploit this ordering to quickly prune unnecessary page accesses. OIF is simple to implement and our experiments on both real and synthetic data show that it greatly outperforms the current state-of-the-art methods for all three classes of containment queries.
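The ordered-postings idea can be illustrated with a small sketch: if every postings list is kept sorted, a superset (containment) query reduces to an intersection that can skip ahead by binary search. This is a simplification of OIF's range-based pruning, and all names are illustrative:

```python
from bisect import bisect_left

def intersect_sorted(lists):
    """Intersect sorted postings lists, probing the longer lists by
    binary search; a simplified version of how ordered postings let
    containment queries prune accesses."""
    if not lists:
        return []
    lists = sorted(lists, key=len)  # drive from the shortest list
    result = []
    for x in lists[0]:
        ok = True
        for other in lists[1:]:
            pos = bisect_left(other, x)
            if pos == len(other) or other[pos] != x:
                ok = False
                break
        if ok:
            result.append(x)
    return result

def superset_query(index, query):
    """Ids of indexed sets containing every item of the (non-empty)
    query; `index` maps item -> sorted list of set ids."""
    return intersect_sorted([index.get(item, []) for item in query])
```

Subset and equality queries add length checks and a verification pass on the candidates, but the ordered lists and the binary-search skipping are the common core.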