2000, Proceedings of the ninth international conference on Information and knowledge management - CIKM '00
In this paper, we propose a subsequence matching algorithm that supports normalization transform in time-series databases. Normalization transform enables finding sequences with similar fluctuation patterns even though they are not close to each other before the normalization transform. Applying the existing whole matching algorithm that supports normalization transform to subsequence matching is feasible, but it requires an index for every possible length of the query sequence, causing serious overhead in both storage space and update time. The proposed algorithm generates indexes only for a small number of different lengths of query sequences. For subsequence matching, it selects the most appropriate index among them. We can obtain better search performance by using more indexes. We call our approach index interpolation. We formally prove that the proposed algorithm does not cause false dismissal. For performance evaluation, we have conducted experiments using the indexes for only five different lengths out of the query sequence lengths 256 to 512. The results show that the proposed algorithm outperforms the sequential scan by up to 14.6 times on the average when the selectivity of the query is 10^-5.
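To make the role of the normalization transform concrete, here is a minimal sketch of the sequential-scan baseline the abstract compares against. It assumes that normalization means subtracting the mean and dividing by the standard deviation of each window, and that similarity is Euclidean distance within a tolerance `eps`; the function names are hypothetical, and the index interpolation technique itself is not shown.

```python
import math

def normalize(seq):
    """Shift to zero mean and scale to unit standard deviation (assumed definition)."""
    mean = sum(seq) / len(seq)
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / len(seq))
    if std == 0:
        return [0.0] * len(seq)  # a flat window has no fluctuation pattern
    return [(x - mean) / std for x in seq]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalized_subsequence_matches(data, query, eps):
    """Sequential scan: compare the normalized query with every normalized window."""
    n = len(query)
    norm_query = normalize(query)
    hits = []
    for start in range(len(data) - n + 1):
        window = data[start:start + n]
        dist = euclidean(normalize(window), norm_query)
        if dist <= eps:
            hits.append((start, dist))
    return hits

data = [0, 0, 10, 20, 10, 0, 0]
query = [1, 2, 1]
hits = normalized_subsequence_matches(data, query, eps=1e-6)
```

In this example the window `[10, 20, 10]` is far from `[1, 2, 1]` in raw Euclidean distance, yet the two coincide after normalization, which is the kind of match the transform is meant to expose.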
2007
Existing work on similar sequence matching has focused on either whole matching or range subsequence matching. In this paper, we present novel methods for ranked subsequence matching under time warping, which finds top-k subsequences most similar to a query sequence from data sequences. To the best of our knowledge, this is the first and most sophisticated subsequence matching solution mentioned in the literature. Specifically, we first provide a new notion of the minimum-distance matching-window pair (MDMWP) and formally define the mdmwp-distance, a lower bound between a data subsequence and a query sequence. The mdmwp-distance can be computed prior to accessing the actual subsequence. Based on the mdmwp-distance, we then develop a ranked subsequence matching algorithm to prune unnecessary subsequence accesses. Next, to reduce random disk I/Os and bad buffer utilization, we develop a method of deferred group subsequence retrieval. We then derive another lower bound, the window-group distance, that can be used to effectively prune unnecessary subsequence accesses during deferred group-subsequence retrieval. Through extensive experiments with many data sets, we showcase the superiority of the proposed methods.
Information Systems, 2004
This paper discusses the effective processing of similarity search that supports time warping in large sequence databases. Time warping enables sequences with similar patterns to be found even when they are of different lengths. Prior methods for similarity search under time warping could not employ multi-dimensional indexes without false dismissal, since the time warping distance does not satisfy the triangle inequality. They have to scan the entire database and therefore suffer serious performance degradation in large databases. Another method, which employs a suffix tree and does not assume any particular distance function, also performs poorly because of the large tree size. In this paper, we propose a novel method for similarity search that supports time warping. Our primary goal is to enhance the search performance in large databases without permitting any false dismissal. To attain this goal, we have devised a new distance function, D_tw-lb, which consistently underestimates the time warping distance and satisfies the triangle inequality. D_tw-lb uses a 4-tuple feature vector that is extracted from each sequence and is invariant to time warping. For the efficient processing of similarity search, we employ a multi-dimensional index that uses the 4-tuple feature vector as indexing attributes and D_tw-lb as a distance function. We prove that our method does not incur false dismissal. To verify the superiority of our method, we have performed extensive experiments. The results reveal that our method is up to 43 times faster on a data set containing real-world S&P 500 stock data sequences, and up to 720 times faster on data sets containing a very large volume of synthetic data sequences. The performance gain increases (1) as the number of data sequences increases, (2) as the average length of data sequences increases, and (3) as the tolerance in a query decreases. Considering the characteristics of real databases, these tendencies imply that our approach is suitable for practical applications.
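The idea of a warping-invariant feature vector that lower-bounds the time warping distance can be illustrated with a short sketch. The abstract does not list the four features, so the choice below (first, last, greatest and smallest element) is an assumption, as is the use of a sum-of-absolute-differences time warping distance; `dtw`, `features` and `lower_bound` are hypothetical names.

```python
def dtw(s, q):
    """Time warping distance with |x - y| as the per-step cost (assumed definition)."""
    inf = float("inf")
    n, m = len(s), len(q)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(s[i - 1] - q[j - 1])
            cost[i][j] = step + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]

def features(seq):
    """4-tuple that does not change when elements are repeated (assumed feature set)."""
    return (seq[0], seq[-1], max(seq), min(seq))

def lower_bound(s, q):
    """Largest absolute difference between corresponding features."""
    return max(abs(a - b) for a, b in zip(features(s), features(q)))

s = [1, 2, 3]
stretched = [1, 1, 2, 3, 3]   # same pattern, different length
q = [2, 4, 7, 5]
```

Under these assumptions the bound never exceeds the warping distance, because every warping path matches the two first elements, the two last elements, and pairs each sequence's extreme value with some element of the other sequence. Since it is a maximum-coordinate distance over fixed-length vectors, it also obeys the triangle inequality, which is what allows a multi-dimensional index to be used.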
Lecture Notes in Computer Science
The purpose of subsequence matching is to find a query sequence in a long data sequence. Due to the abundance of applications, many solutions have been proposed. Virtually all previous solutions use the Euclidean measure as the basis for measuring distance between sequences. Recent studies, however, suggest that the Euclidean distance often fails to produce proper results because of irregularities in the data, which are not uncommon in our problem domain. To address this problem, some non-Euclidean measures, such as Dynamic Time Warping (DTW) and Longest Common Subsequence (LCS), have been proposed. However, most of the previous work in this direction focused on the whole sequence matching problem, where query and data sequences are of the same length. In this paper, we propose a novel subsequence matching framework using a non-Euclidean measure, in particular LCS, and a new index query scheme. The proposed framework is based on the Dual Match framework, where data sequences are divided into a series of disjoint equi-length subsequences and then indexed in an R-tree. We introduce a similarity bound for index matching with LCS. The proposed query matching scheme significantly reduces the number of false positives in the match result. Furthermore, we develop an algorithm that skips expensive LCS computations by observing the warping paths. We validate our framework through extensive experiments using 48 different time series datasets. The results of the experiments suggest that our approach significantly improves the subsequence matching performance in various metrics.
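For readers unfamiliar with LCS on numeric data, the following is a minimal sketch of the measure itself, not of the indexing framework described above. It assumes that two values "match" when they differ by at most a threshold `eps` and that similarity is the LCS length divided by the shorter sequence length; both conventions and the function names are assumptions.

```python
def lcs_length(a, b, eps):
    """Length of the longest common subsequence, matching values within eps."""
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if abs(a[i - 1] - b[j - 1]) <= eps:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m]

def lcs_similarity(a, b, eps):
    """LCS length normalized by the shorter sequence (assumed normalization)."""
    return lcs_length(a, b, eps) / min(len(a), len(b))

clean = [1, 2, 3, 4, 5]
noisy = [1.1, 9, 3.05, 5.0]   # 9 is an outlier that LCS can simply skip
score = lcs_similarity(clean, noisy, eps=0.2)
```

Unlike the Euclidean distance, the outlier does not inflate the result: it is left unmatched, and the remaining three values still align.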
Proceedings 1999 Workshop on Knowledge and Data Engineering Exchange (KDEX'99) (Cat. No.PR00453), 2000
Although the Euclidean distance has been the most popular similarity measure in sequence databases, recent techniques prefer to use high-cost distance functions such as the time warping distance and the editing distance for wider applicability. However, if these distance functions are applied to the retrieval of similar subsequences, the number of subsequences to be inspected during the search is quadratic in the average length L of data sequences. In this paper, we propose a novel subsequence matching scheme, called aligned subsequence matching, in which the number of subsequences to be compared with a query sequence is reduced to linear in L. We also present an indexing technique that speeds up aligned subsequence matching using the modified time warping distance as the similarity measure. Experiments on synthetic data sequences demonstrate the effectiveness of the proposed approach: it consistently outperformed sequential scanning and achieved up to a 6.5 times speed-up.
Journal of Information Science, 2006
This paper discusses how to process time-series subsequence matching under time warping. Time warping enables sequences with similar patterns to be found even when they are of different lengths. The prefix-querying method is the first index-based approach that efficiently performs time-series subsequence matching under time warping without false dismissals. This method employs the L∞ distance metric as a base distance function so as to allow users to issue queries conveniently. In this paper, we extend the prefix-querying method to support L1, which is the most widely used base distance function in time-series subsequence matching under time warping, instead of L∞. We formally prove that the prefix-querying method with the L1 distance metric does not incur any false dismissals in subsequence matching. To show its superiority, we conduct a performance evaluation via a variety of experiments. The results reveal that our method achieves significant performance improvemen...
Proceedings. Eleventh International Conference on Scientific and Statistical Database Management
We address the problem of similarity search in large time series databases. We introduce a novel indexing algorithm that allows faster retrieval. The index is formed by creating bins that contain time series subsequences of approximately the same shape. For each bin, we can quickly calculate a lower-bound on the distance between a given query and the most similar element of the bin. This bound allows us to search the bins in best first order, and to prune some bins from the search space without having to examine the contents. Additional speedup is obtained by optimizing the data within the bins such that we can avoid having to compare the query to every item in the bin. We call our approach STB-indexing and experimentally validate it on space telemetry, medical and synthetic data, demonstrating approximately an order of magnitude speed-up.
Studies in Computational Intelligence, 2009
In terms of a general time theory which addresses time-elements as typed point-based intervals, a formal characterization of time-series and state-sequences is introduced. Based on this framework, the subsequence matching problem is tackled by transforming it into a bipartite graph matching problem. Then a hybrid similarity model with high tolerance of inversion, crossover and noise is proposed for matching the corresponding bipartite graphs, involving both temporal and non-temporal measurements. Experimental results on reconstructed time-series data from the UCI KDD Archive demonstrate that this approach is more effective than algorithms based on traditional similarity models, promising robust techniques for larger time-series databases and real-life applications such as Content-based Video Retrieval (CBVR).
Manuscript, October, 1997
Large sequence databases are becoming increasingly common. They range from protein and gene sequences in biology, to time series data in soil sciences, to MIDI sequences in multimedia applications, to text documents in information retrieval. An important operation on a sequence database is approximate subsequence matching, where all subsequences that are within some distance from a given query string are retrieved. This paper introduces S-Hash, a scheme that enables efficient approximate subsequence ...
Information Systems, 2018
Highlights • Top-Index, a novel data structure for multi-dimensional top-k subsequence matching. • Efficient Top-Index construction algorithm. • Space-efficient Delta-Top-Index with compression rates of 10300 and microsecond-fast query latency.
Information and Computation, 2004
We define the problem of bounded similarity querying in time-series databases, which generalizes earlier notions of similarity querying. Given a (sub)sequence S, a query sequence Q, lower and upper bounds on shifting and scaling parameters, and a tolerance ε, S is considered boundedly similar to Q if S can be shifted and scaled within the specified bounds to produce a modified sequence S' whose distance from Q is within ε. We use similarity transformation to formalize the notion of bounded similarity. We then describe a framework that supports the resulting set of queries; it is based on a fingerprint method that normalizes the data and saves the normalization parameters. For off-line data, we provide an indexing method with a single index structure and search technique for handling all the special cases of bounded similarity querying. Experimental investigations find the performance of our method to be competitive with earlier, less general approaches.
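The definition of bounded similarity can be checked directly for a single pair of sequences, as in the sketch below. It is not the fingerprint index described in the abstract; it assumes equal-length sequences, Euclidean distance, and a transformation of the form a*S + b with a scale a and shift b restricted to given intervals. All names are hypothetical. Because the squared distance is a convex quadratic in (a, b), the best feasible pair is either the unconstrained least-squares fit or a point on one of the four edges of the bounding box.

```python
import math

def _clamp(x, lo, hi):
    return max(lo, min(hi, x))

def _dist(s, q, a, b):
    return math.sqrt(sum((a * x + b - y) ** 2 for x, y in zip(s, q)))

def best_bounded_transform(s, q, a_bounds, b_bounds):
    """Return (distance, a, b) minimizing ||a*s + b - q|| within the bounds."""
    n = len(s)
    sx, sy = sum(s), sum(q)
    sxx = sum(x * x for x in s)
    sxy = sum(x * y for x, y in zip(s, q))
    a_lo, a_hi = a_bounds
    b_lo, b_hi = b_bounds

    def best_b(a):  # optimal shift for a fixed scale, clamped to its bounds
        return _clamp((sy - a * sx) / n, b_lo, b_hi)

    def best_a(b):  # optimal scale for a fixed shift, clamped to its bounds
        if sxx == 0:
            return a_lo  # s is all zeros, so the scale has no effect
        return _clamp((sxy - b * sx) / sxx, a_lo, a_hi)

    # Candidates on the four edges of the (a, b) box.
    candidates = [(a, best_b(a)) for a in (a_lo, a_hi)]
    candidates += [(best_a(b), b) for b in (b_lo, b_hi)]
    # Unconstrained least-squares fit, kept only if it is feasible.
    denom = n * sxx - sx * sx
    if denom != 0:
        a = (n * sxy - sx * sy) / denom
        b = (sy - a * sx) / n
        if a_lo <= a <= a_hi and b_lo <= b <= b_hi:
            candidates.append((a, b))
    return min((_dist(s, q, a, b), a, b) for a, b in candidates)

def is_boundedly_similar(s, q, a_bounds, b_bounds, eps):
    return best_bounded_transform(s, q, a_bounds, b_bounds)[0] <= eps

s = [1, 2, 3]
q = [12, 14, 16]   # exactly 2*s + 10
loose = best_bounded_transform(s, q, (0.5, 3.0), (0.0, 20.0))
tight = best_bounded_transform(s, q, (0.5, 1.5), (0.0, 20.0))
```

With the loose bounds the exact transformation is allowed and the distance is essentially zero; with the scale capped at 1.5 the best achievable distance becomes positive, so whether S counts as boundedly similar to Q then depends on the tolerance.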