Academia.edu no longer supports Internet Explorer.
To browse Academia.edu and the wider internet faster and more securely, please take a few seconds to upgrade your browser.
This paper proposes an efficient similarity join method using unsupervised learning, when no labeled data is available. In our previous work, we showed that the performance of similarity join could improve when long string attributes, such as paper abstracts, movie summaries, product descriptions, and user feedback, are used under supervised learning, where a training set exists. In this work, we adopt using long string attributes during the similarity join under unsupervised learning. Along with its importance when no labeled data exists, unsupervised learning is used when no labeled data is available, it acts also as a quick preprocessing method for huge datasets. Here, we show that using long attributes during the unsupervised learning can further enhance the performance. Moreover, we provide an efficient dynamically expandable algorithm for databases with frequent transactions.
Eighth International Conference on Database Systems for Advanced Applications, 2003. (DASFAA 2003). Proceedings., 2003
The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focussed on the execution of high-dimensional joins over large amounts of diskbased data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and EGO*-join, and study their performance in comparison to the state of the art algorithm, EGO-join, and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending upon the dimensionality of the data as well as the critical ε parameter. We also point out the significance of the choice of this parameter for ensuring that the selectivity achieved is reasonable. The proposed EGO*-join algorithm always, often significantly, outperforms the EGO-join. For low-dimensional data the Grid-join outperform both the EGO-and EGO*-joins. An analysis of the cost of the Grid-join is presented and highly accurate cost estimator functions are developed. These are used to choose an appropriate grid size for optimal performance and can also be used by a query optimizer to compute the estimated cost of the Grid-join.
2011
With the rapid increase of the distributed data sources, and in order to make information integration, there is a need to combine the information that refers to the same entity from different sources. However, there are no global conventions that control the ...
2018 IEEE 34th International Conference on Data Engineering (ICDE), 2018
Set similarity join is a fundamental and wellstudied database operator. It is usually studied in the exact setting where the goal is to compute all pairs of sets that exceed a given similarity threshold (measured e.g. as Jaccard similarity). But set similarity join is often used in settings where 100% recall may not be importantindeed, where the exact set similarity join is itself only an approximation of the desired result set. We present a new randomized algorithm for set similarity join that can achieve any desired recall up to 100%, and show theoretically and empirically that it significantly improves on existing methods. The present state-of-the-art exact methods are based on prefix-filtering, the performance of which depends on the data set having many rare tokens. Our method is robust against the absence of such structure in the data. At 90% recall our algorithm is often more than an order of magnitude faster than state-of-the-art exact methods, depending on how well a data set lends itself to prefix filtering. Our experiments on benchmark data sets also show that the method is several times faster than comparable approximate methods. Our algorithm makes use of recent theoretical advances in highdimensional sketching and indexing that we believe to be of wider relevance to the data engineering community.
IEEE Transactions on Knowledge and Data Engineering, 2000
ÐCurrent data repositories include a variety of data types, including audio, images, and time series. State-of-the-art techniques for indexing such data and doing query processing rely on a transformation of data elements into points in a multidimensional feature space. Indexing and query processing then take place in the feature space. In this paper, we study algorithms for finding relationships among points in multidimensional feature spaces, specifically algorithms for multidimensional joins. Like joins of conventional relations, correlations between multidimensional feature spaces can offer valuable information about the data sets involved. We present several algorithmic paradigms for solving the multidimensional join problem and we discuss their features and limitations. We propose a generalization of the Size Separation Spatial Join algorithm, named Multidimensional Spatial Join (MSJ), to solve the multidimensional join problem. We evaluate MSJ along with several other specific algorithms, comparing their performance for various dimensionalities on both real and synthetic multidimensional data sets. Our experimental results indicate that MSJ, which is based on space filling curves, consistently yields good performance across a wide range of dimensionalities. Index TermsÐSpatial join, sort merge joins, multiple-key indexes, data structures.
2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010
Similarity joins have been studied as key operations in multiple application domains, e.g., record linkage, data cleaning, multimedia and video applications, and phenomena detection on sensor networks. Multiple similarity join algorithms and implementation techniques have been proposed. They range from out-of-database approaches for only in-memory and external memory data to techniques that make use of standard database operators to answer similarity joins. Unfortunately, there has not been much study on the role and implementation of similarity joins as database physical operators. In this paper, we focus on the study of similarity joins as first-class database operators. We present the definition of several similarity join operators and study the way they interact among themselves, with other standard database operators, and with other previously proposed similarity-aware operators. In particular, we present multiple transformation rules that enable similarity query optimization through the generation of equivalent similarity query execution plans. We then describe an efficient implementation of two similarity join operators, Ɛ-Join and Join-Around, as core DBMS operators. The performance evaluation of the implemented operators in PostgreSQL shows that they have good execution time and scalability properties. The execution time of Join-Around is less than 5% of the one of the equivalent query that uses only regular operators while Ɛ-Join's execution time is 20 to 90% of the one of its equivalent regular operators based query for the useful case of small Ɛ (0.01% to 10% of the domain range). We also show experimentally that the proposed transformation rules can generate plans with execution times that are only 10% to 70% of the ones of the initial query plans.
2003
This paper describes an efficient approach to record linkage. Given two lists of records, the record-linkage problem consists of determining all pairs that are similar to each other, where the overall similarity between two records is defined based on domain-specific similarities over individual attributes constituting the record. The record-linkage problem arises naturally in the context of data cleansing that usually precedes data analysis and mining. We explore a novel approach to this problem. For each attribute of records, we first map values to a multidimensional Euclidean space that preserves domain-specific similarity. Many mapping algorithms can be applied, and we use the FastMap approach as an example. Given the merging rule that defines when two records are similar, a set of attributes are chosen along which the merge will proceed. A multidimensional similarity join over the chosen attributes is used to determine similar pairs of records. Our extensive experiments using real data sets show that our solution has very good efficiency and accuracy.
International Journal of Data Science and Analytics
We consider the problem of duplicate detection in noisy and incomplete data: Given a large data set in which each record has multiple entries (attributes), detect which distinct records refer to the same real-world entity. This task is complicated by noise (such as misspellings) and missing data, which can lead to records being different, despite referring to the same entity. Our method consists of three main steps: creating a similarity score between records, grouping records together into "unique entities", and refining the groups. We compare various methods for creating similarity scores between noisy records, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We also discuss two methods to deal with missing data in computing similarity scores. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that the methods that use words as the basic units are preferable to those that use 3-grams. Moreover, in some (but certainly not all) parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method. The results also confirm that our method for automatically determining the number of groups typically works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.
ACM SIGMOD Record, 2014
String similarity search and its variants are fundamental problems with many applications in areas such as data integration, data quality, computational linguistics, or bioinformatics. A plethora of methods have been developed over the last decades. Obtaining an overview of the state-of-the-art in this field is difficult, as results are published in various domains without much cross-talk, papers use different data sets and often study subtle variations of the core problems, and the sheer number of proposed methods exceeds the capacity of a single research group. In this paper, we report on the results of the probably largest benchmark ever performed in this field. To overcome the resource bottleneck, we organized the benchmark as an international competition, a workshop at EDBT/ICDT 2013. Various teams from different fields and from all over the world developed or tuned programs for two crisply defined problems. All algorithms were evaluated by an external group on two machines. Altogether, we compared 14 different programs on two string matching problems (k-approximate search and k-approximate join) using data sets of increasing sizes and with different characteristics from two different domains. We compare programs primarily by wall clock time, but also provide results on memory usage, indexing time, batch query effects and scalability in terms of CPU cores. Results were averaged over several runs and confirmed on a second, different hardware platform. A particularly interesting observation is that disciplines can and should learn more from each other, with the three best teams rooting in computational linguistics, databases, and bioinformatics, respectively.
2012
Many application scenarios, e.g., marketing analysis, sensor networks, and medical and biological applications, require or can significantly benefit from the identification and processing of similarities in the data. Even though some work has been done to extend the semantics of some operators, e.g., join and selection, to be aware of data similarities; there has not been much study on the role, interaction, and implementation of similarity-aware operators as first-class database operators. The focus of the thesis work presented in this paper is the proposal and study of several similarity-aware database operators and a systematic analysis of their role as query operators, interactions, optimizations, and implementation techniques. This paper presents the core research questions that drive our research work and the physical database operators that were studied as part of this thesis work so far, i.e., Similarity Group-by and Similarity Join. We describe multiple optimization techniques for the introduced operators. Specifically, we present: (1) multiple non-trivial equivalence rules that enable similarity query transformations, (2) Eager and Lazy aggregation transformations for Similarity Group-by and Similarity Join to allow pre-aggregation before potentially expensive joins, and (3) techniques to use materialized views to answer similarity-based queries. This paper also presents the main guidelines to implement the presented operators as integral components of a DBMS query engine and some of the key performance evaluation results of this implementation in an open source DBMS. In addition, we present the way the proposed operators are efficiently exploited to answer more useful business questions in a decision support system.
Proceedings of the 4th International Workshop on Privacy and Anonymity in the Information Society - PAIS '11, 2011
During the similarity join process, one or more sources may not allow sharing the whole data with other sources. In this case, privacy preserved similarity join is required. We showed in our previous work [4] that using long attributes, such as paper abstracts, movie summaries, product descriptions, and user feedbacks, could improve the similarity join accuracy under supervised learning. However, the existing secure protocols for similarity join methods can not be used to join tables using these long attributes. Moreover, the majority of the existing privacypreserving protocols did not consider the semantic similarities during the similarity join process. In this paper, we introduce a secure efficient protocol to semantically join tables when the join attributes are long attributes. Furthermore, instead of using machine learning methods, which are not always applicable, we use similarity thresholds to decide matched pairs. Results show that our protocol can efficiently join tables using the long attributes by considering the semantic relationships among the long string values. Therefore, it improves the overall secure similarity join performance.
Information Systems, 2007
The efficient processing of multidimensional similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. In this paper, we develop two new in-memory spatial join algorithms, the Grid-join and EGO*-join, and study their performance. Through evaluation, we explore the domain of applicability of each approach and provide recommendations for the choice of a join algorithm depending upon the dimensionality of the data as well as the expected selectivity of the join. We show that the two new proposed join techniques substantially outperform the state-of-the-art join algorithm, the EGO-join.
Proceedings of the 18th International Conference on Enterprise Information Systems, 2016
A critical task in data cleaning and integration is the identification of duplicate records representing the same real-world entity. A popular approach to duplicate identification employs similarity join to find pairs of similar records followed by a clustering algorithm to group together records that refer to the same entity. However, the clustering algorithm is strictly used as a post-processing step, which slows down the overall performance and only produces results at the end of the whole process. In this paper, we propose SjClust, a framework to integrate similarity join and clustering into a single operation. Our approach allows to smoothly accommodating a variety of cluster representation and merging strategies into set similarity join algorithms, while fully leveraging state-of-the-art optimization techniques.
2015
Similarity Join is an important operation in data integration and cleansing, record linkage, data deduplication and pattern matching. It finds similar sting pairs from two collections of strings. Number of approaches have been proposed as well as compared for string similarity joins. The rising era of big data demands for scalable algorithms to support large scale string similarity joins. In this paper we study the string similarity joins, their use. Further we look at three different techniques for scalable string similarity join using MapReduce, which areParallel set-similarity join, MGJoin and MassJoin. Finally, we try to compare them based on some common characteristics. Keywords— Similarity Join, MapReduce
International Journal of Computer Applications
Similarity Join plays an important role in data integration and cleansing, record linkage and data de-duplication. It finds similar sting pairs from collections of strings. If two strings are similar they share a common token. Number of approaches has been proposed for in-memory string similarity joins. But due to the rising era of big data, demands for scalable algorithms to support large scale string similarity joins arises. The proposed architecture uses the MapReduce concept and is based on inverted index and multiple prefix filtering methods. The prefix filtering is made of different global ordering which reduces the number of candidate pairs significantly, thus improving the pruning power as compared to other approach.
Synthesis Lectures on Data Management, 2013
Synthesis Lectures on Data Management is edited by Tamer Özsu of the University of Waterloo. e series publishes 50-to 125 page publications on topics pertaining to data management. e scope will largely follow the purview of premier information and computer science conferences, such as ACM SIGMOD, VLDB, ICDE, PODS, ICDT, and ACM KDD. Potential topics include, but not are limited to: query languages, database system architectures, transaction management, data warehousing, XML and databases, data stream systems, wide scale data distribution, multimedia data management, data mining, and related subjects.
Lecture Notes in Computer Science, 2002
The clustering problem, which aims at identifying the distribution of patterns and intrinsic correlations in large data sets by partitioning the data points into similarity clusters, has been widely studied. Traditional clustering algorithms use distance functions to measure similarity and are not suitable for high dimensional spaces. In this paper, we propose CoFD algorithm, which is a non-distance based clustering algorithm for high dimensional spaces. Based on the maximum likelihood principle, CoFD is to optimize parameters to maximize the likelihood between data points and the model generated by the parameters. Experimental results on both synthetic data sets and a real data set show the efficiency and effectiveness of CoFD.
Computing and Informatics, 2017
String similarity join is a basic and essential operation in many applications. In this paper, we investigate the problem of string similarity join with edit distance constraints. A trie-based edit similarity join framework has been proposed recently. The main advantage of existing trie-based algorithms is support for similarity join on short strings. The main problem is when joining long and distant strings. These methods generate and maintain lots of similar prefixes called active nodes which need to be further removed in a subsequent pruning phase. With large edit distance, the number of active nodes becomes quite large. In this paper, we propose a new trie-based join algorithm called PreJoin, which improves upon current trie-based join methods. It efficiently finds all similar string pairs using a novel active-node generation method, which minimizes the number of generated active nodes by applying the pruning heuristics early in the generation process. The performance of PreJoin is scaled in two different ways: First, a dynamic reordering of the trie index is used to accelerate the search for similar string pairs. Second, a partitioning method of string space is used to improve performance on large edit distance thresholds. Experiments show that our approach is highly efficient for processing short as well as long strings, and outperforms the state-of-the-art trie-based join approaches by a factor five.
Citeseer
Information integration from multiple heterogeneous sources is one of the major challenges facing enterprises and service providers today, and one of the important problems in this domain is the integration of structured and unstructured (or text) data. In this paper we describe our work on a data-driven approach to integrating various sources of text data, without relying on the availability of schema information. To this end, we have used various existing tools from natural language processing, data mining and related areas in a novel manner. The tools are used at the 'preprocessing' stage to (a) characterise each set of unstructured information (or collection of text data), (b) identify the related sets of unstructured information and (c) relate these sets to various reference data sets. All these steps are based solely on the instance values of the data sets. Subsequently the information compiled in the preprocessing stage may be used at query time to query the structured and text data. We also present our results on applying our techniques for data integration across multiple unstructured data sources, relating to customer comments of a service provider. • Present a view of the various data sets as a single repository, for subsequent querying.
MATTER: International Journal of Science and Technology, 2017
Similarity Join is a data processing and analysis operation that retrieves all data pairs whose their distance is less than a pre-defined threshold. The similarity join algorithms are used in different real world applications such as finding similarity in documents, images, and strings. In this survey we will explain some of the similarity join algorithms which are based on MapReduce approach. These algorithms are: Set-Similarity Join, SSJ-2R, MRSimJoin, Pair-wise similarity, multi-sig-er method, Trie-join, and PreJoin algorithm. We then make a comparison between these algorithms according to some criteria and discuss the results.
Information Systems, 2020
This is a PDF file of an article that has undergone enhancements after acceptance, such as the addition of a cover page and metadata, and formatting for readability, but it is not yet the definitive version of record. This version will undergo additional copyediting, typesetting and review before it is published in its final form, but we are providing this version to give early visibility of the article. Please note that, during the production process, errors may be discovered which could affect the content, and all legal disclaimers that apply to the journal pertain.
Loading Preview
Sorry, preview is currently unavailable. You can download the paper by clicking the button above.