EFFICIENT SIMILARITY JOIN METHOD USING UNSUPERVISED LEARNING

International Journal of Computer Science & Information Technology (IJCSIT)

Abstract

This paper proposes an efficient similarity join method based on unsupervised learning, for settings where no labeled data is available. In our previous work, we showed that the performance of similarity join can improve when long string attributes, such as paper abstracts, movie summaries, product descriptions, and user feedback, are used under supervised learning, where a training set exists. In this work, we use long string attributes during the similarity join under unsupervised learning. Besides being necessary when no labeled data exists, unsupervised learning also serves as a quick preprocessing method for huge datasets. Here, we show that using long attributes during unsupervised learning can further enhance performance. Moreover, we provide an efficient, dynamically expandable algorithm for databases with frequent transactions.
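The abstract does not spell out the algorithm, so the following is only a minimal sketch of what a label-free similarity join over long string attributes could look like. It assumes whitespace tokenization, smoothed TF-IDF weights, cosine similarity, and a fixed threshold. The names `tfidf_vectors` and `similarity_join` are hypothetical and none of these choices are confirmed as the paper's method.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(texts):
    """Build L2-normalised TF-IDF vectors (as dicts) for a list of strings."""
    docs = [Counter(t.lower().split()) for t in texts]
    n = len(docs)
    # Document frequency: in how many records each term occurs.
    df = Counter(term for d in docs for term in d)
    vectors = []
    for d in docs:
        # Smoothed IDF (an assumption of this sketch, not taken from the paper).
        vec = {t: tf * (math.log((1 + n) / (1 + df[t])) + 1) for t, tf in d.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def similarity_join(texts, threshold):
    """Return index pairs (i, j), i < j, with cosine similarity >= threshold."""
    vecs = tfidf_vectors(texts)
    pairs = []
    for i, j in combinations(range(len(vecs)), 2):
        a, b = vecs[i], vecs[j]
        if len(a) > len(b):
            a, b = b, a  # iterate over the shorter vector
        sim = sum(w * b.get(t, 0.0) for t, w in a.items())
        if sim >= threshold:
            pairs.append((i, j))
    return pairs


records = [
    "efficient similarity join over long string attributes",
    "similarity join over long string attributes efficient",
    "grid based spatial join in main memory",
]
matches = similarity_join(records, threshold=0.5)
```

This sketch compares every pair of records, which is quadratic in the number of records. It illustrates the join condition only and makes no attempt at the efficiency the paper targets.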

Sunil Prabhakar

Proceedings of the Eighth International Conference on Database Systems for Advanced Applications (DASFAA 2003), 2003

The efficient processing of similarity joins is important for a large class of applications. The dimensionality of the data for these applications ranges from low to high. Most existing methods have focused on the execution of high-dimensional joins over large amounts of disk-based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins, suggest that spatial joins for a large class of problems can be processed in main memory. In this paper we develop two new spatial join algorithms, the Grid-join and EGO*-join, and study their performance in comparison to the state-of-the-art algorithm, EGO-join, and the RSJ algorithm. Through evaluation we explore the domain of applicability of each algorithm and provide recommendations for the choice of join algorithm depending upon the dimensionality of the data as well as the critical ε parameter. We also point out the significance of the choice of this parameter for ensuring that the selectivity achieved is reasonable. The proposed EGO*-join algorithm always, often significantly, outperforms the EGO-join. For low-dimensional data the Grid-join outperforms both the EGO- and EGO*-joins. An analysis of the cost of the Grid-join is presented and highly accurate cost estimator functions are developed. These are used to choose an appropriate grid size for optimal performance and can also be used by a query optimizer to compute the estimated cost of the Grid-join.
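To make the role of ε and the grid concrete, here is a generic in-memory ε-join over a uniform grid. It is a sketch under stated assumptions, not the Grid-join described above: the cell side is simply set to ε, the distance is Euclidean, and the name `grid_epsilon_join` is hypothetical.

```python
import math
from collections import defaultdict
from itertools import product


def grid_epsilon_join(points, eps):
    """Return index pairs (i, j), i < j, whose Euclidean distance is <= eps."""
    dim = len(points[0]) if points else 0
    # Hash every point into a grid cell of side eps (assumed cell size).
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        grid[tuple(math.floor(c / eps) for c in p)].append(idx)
    # With cell side eps, any match lies in the same or an adjacent cell.
    offsets = list(product((-1, 0, 1), repeat=dim))
    result = []
    for cell, members in grid.items():
        for off in offsets:
            neighbour = tuple(c + o for c, o in zip(cell, off))
            for i in members:
                for j in grid.get(neighbour, ()):
                    # i < j reports each unordered pair exactly once.
                    if i < j and math.dist(points[i], points[j]) <= eps:
                        result.append((i, j))
    return sorted(result)


points = [(0.0, 0.0), (0.5, 0.0), (3.0, 3.0), (3.2, 3.1), (10.0, 10.0)]
close_pairs = grid_epsilon_join(points, eps=1.0)
```

Each cell is checked against 3^d neighbouring cells in d dimensions, so this simple scheme is only practical when d is small, which fits the abstract's remark that the grid approach suits low-dimensional data.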
