Let D = {d_1, d_2, ..., d_D} be a given collection of D string documents of total length n. Our task is to index D such that, whenever a pattern P (of length p) and an integer k come as a query, the k documents in which P appears the largest number of times can be listed efficiently. In this paper, we propose a compressed index taking 2|CSA| + D log(n/D) + O(D) + o(n) bits of space, which answers a query with O(t_SA log k log n) time per reported document. This improves the O(t_SA log k log^{1+ε} n) per-document report time of the previously best-known index with (asymptotically) the same space requirement [Belazzougui and Navarro, SPIRE 2011]. Here, |CSA| represents the size (in bits) of the compressed suffix array (CSA) of the text obtained by concatenating all documents in D, and t_SA is the time for decoding a suffix array value using the CSA.
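To make the query concrete, the following Python sketch solves the same top-k most-frequent-documents problem with plain, uncompressed structures: a suffix array built by naive sorting and a Counter over the matching positions. It is only an illustrative baseline, not the compressed 2|CSA|-bit index described above; all function names are invented for the example.

```python
from bisect import bisect_right
from collections import Counter

def build_index(docs, sep="\x01"):
    """Concatenate the documents and build a plain (uncompressed) suffix array.
    Returns the text, its suffix array, and each document's starting offset."""
    starts, pieces, pos = [], [], 0
    for d in docs:
        starts.append(pos)
        pieces.append(d + sep)          # sep is assumed not to occur in any document
        pos += len(d) + 1
    text = "".join(pieces)
    sa = sorted(range(len(text)), key=lambda i: text[i:])   # naive O(n^2 log n) build
    return text, sa, starts

def sa_interval(text, sa, pattern):
    """Binary search for the suffix-array interval [lo, hi) of suffixes
    that start with `pattern`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                       # leftmost suffix with prefix >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:                       # leftmost suffix with prefix > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return left, lo

def top_k_frequent(text, sa, starts, pattern, k):
    """Report the k documents in which `pattern` occurs most often, with counts."""
    left, right = sa_interval(text, sa, pattern)
    counts = Counter()
    for p in sa[left:right]:
        counts[bisect_right(starts, p) - 1] += 1   # map text position -> document id
    return counts.most_common(k)

text, sa, starts = build_index(["abracadabra", "banana cabana", "abraham"])
print(top_k_frequent(text, sa, starts, "ab", 2))
# [(0, 2), (2, 1)] or [(0, 2), (1, 1)]: document 0 wins with 2 occurrences,
# documents 1 and 2 tie with one occurrence each.
```

The compressed indexes discussed in these abstracts replace the n log n-bit suffix array and the linear scan of the matching interval with CSA-based machinery, which is where the t_SA and log k factors in the stated bounds come from.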
An optimal index solving top-k document retrieval [Navarro and Nekrich, SODA'12] takes O(m + k) time for a pattern of length m, but its space is at least 80n bytes for a collection of n symbols. We reduce it to 1.5n to 3n bytes, with O(m + (k + log log n) log log n) time, on typical texts. The index is up to 25 times faster than the best previous compressed solutions, and requires at most 5% more space in practice (and in some cases as little as one half). Apart from replacing classical by compressed data structures, our main idea is to replace suffix tree sampling by frequency thresholding to achieve compression.
Lecture Notes in Computer Science, 2012
Let D = {d_1, d_2, ..., d_D} be a given set of D = |D| string documents of total length ∑_{i=1}^{D} |d_i| = n, whose characters are taken from an alphabet set Σ of size σ. Our task is to index D in order to efficiently retrieve the k most relevant documents for an online query pattern P of length p. Since these are string documents, where there are no word demarcations, traditional inverted indexes will not be space efficient. When we consider frequency as the relevance metric (where the relevance of a document is proportional to the number of occurrences of P within it), the best known theoretical index [Hon et al., FOCS 2009] can answer the query in near-optimal O(p + k log k) time using O(n log n) bits of index space. On the other hand, the practically most space-efficient implementation [Culpepper et al., ESA 2010] takes only nH_h + n log D + o(n log D) bits of space, where H_h ≤ log σ represents the h-th order empirical entropy (h = o(log_σ n)) of the concatenated text of all documents in D (separated by a special symbol); however, it does not guarantee any theoretical bounds on the query time. For D < n^ε, ε < 1 and σ = O(polylog(n)), we revisit this problem and propose indexes with the following space-time trade-offs:
Journal of Discrete Algorithms, 2010
In the document retrieval problem (Muthukrishnan, SODA 2002), we are given a collection of documents (strings) of total length D in advance, and our target is to create an index for these documents such that for any subsequent input pattern P, we can identify which documents in the collection contain P. In this paper, we study a natural extension of the above document retrieval problem, which we call top-k frequent document retrieval: instead of listing all documents containing P, our focus is to identify the top k documents having the most occurrences of P. This problem forms a basis for search engine tasks of retrieving documents ranked with the TF-IDF (Term Frequency-Inverse Document Frequency) metric.
The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string, which can be a partial word, multiword phrase, or more generally any sequence of characters, then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d_1, d_2, d_3, ..., d_D} of D strings with n characters in total, taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the "most relevant" strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r is to the pattern P. Some example score functions are the frequency of pattern occurrences, the proximity between pattern occurrences, or the pattern-independent PageRank of the document. The first formal framework for studying such retrieval problems was given by Muthukrishnan [SODA 2002]. He considered two metrics for relevance, frequency and proximity, took a threshold-based approach on these metrics, and gave data structures that use O(n log n) words of space. We study this problem in a somewhat more natural top-k framework: here, k is part of the query, and the top k most relevant (highest-scoring) documents are to be reported in sorted order of score. We present the first linear-space framework (i.e., using O(n) words of space) that can handle arbitrary score functions with near-optimal O(p + k log k) query time. The query time can be made optimal, O(p + k), if sorted order is not necessary. Further, we derive compact-space and succinct-space indexes (for some specific score functions); this space compression comes at the cost of higher query time. Finally, we extend our framework to handle the case of multiple patterns. Apart from providing a robust framework, our results also improve many earlier results in index space, query time, or both.
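Since the framework above treats score(P, d_r) as a black box, a tiny sketch helps fix the interface: any function of the pattern and the document, such as occurrence frequency or a pattern-independent static weight standing in for PageRank, can be plugged into a top-k selection. The corpus, weights, and function names below are made up for illustration; this is not the linear-space structure of the abstract.

```python
import heapq
import re

# Toy corpus and a pattern-independent static weight per document
# (a stand-in for a PageRank-like score; all values are made up).
docs = {
    "d1": "to be or not to be",
    "d2": "to do is to be",
    "d3": "do be do be be",
}
static_rank = {"d1": 0.9, "d2": 0.4, "d3": 0.7}

def freq(pattern, doc_id):
    """score(P, d_r) = number of (possibly overlapping) occurrences of P in d_r."""
    return len(re.findall(f"(?={re.escape(pattern)})", docs[doc_id]))

def static_score(pattern, doc_id):
    """score(P, d_r) that ignores P, e.g. a document-level PageRank."""
    return static_rank[doc_id]

def top_k(pattern, k, score):
    """The k highest-scoring documents containing `pattern`, in sorted order."""
    candidates = [d for d, text in docs.items() if pattern in text]
    return heapq.nlargest(k, candidates, key=lambda d: score(pattern, d))

print(top_k("be", 2, freq))          # ['d3', 'd1']  (3 vs. 2 occurrences)
print(top_k("be", 2, static_score))  # ['d1', 'd3']  (weights 0.9 vs. 0.7)
```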
2009 50th Annual IEEE Symposium on Foundations of Computer Science, 2009
Given a set D = {d_1, d_2, ..., d_D} of D strings of total length n, our task is to report the "most relevant" strings for a given query pattern P. This involves somewhat more advanced query functionality than the usual pattern matching, as some notion of "most relevant" is involved. In information retrieval literature, this task is best achieved by using inverted indexes. However, inverted indexes work only for some predefined set of patterns. In the pattern matching community, the most popular pattern-matching data structures are suffix trees and suffix arrays. However, a typical suffix tree search involves going through all the occurrences of the pattern over the entire string collection, which might be a lot more than the required relevant documents.
We engineer a self-index based retrieval system capable of rank-safe evaluation of top-k queries. The framework generalizes the GREEDY approach of Culpepper et al. (ESA 2010) to handle multi-term queries, including phrase queries. We propose two techniques which significantly reduce the ranking time for a wide range of popular Information Retrieval (IR) relevance measures, such as TF×IDF and BM25. First, we reorder elements in the document array according to document weight. Second, we introduce the repetition array, which generalizes Sadakane's (JDA 2007) document frequency structure to document subsets. Combining the document and repetition arrays, we achieve attractive functionality-space trade-offs. We provide an extensive evaluation of our system on terabyte-sized IR collections.
2012
For over forty years the dominant data structure for ranked document retrieval has been the inverted index. Inverted indexes are effective for a variety of document retrieval tasks, and particularly efficient for large data collection scenarios that require disk access and storage. However, many efficiency-bound search tasks can now easily be supported entirely in-memory as a result of recent hardware advances.
Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, 2019
We propose algorithms that, given an input string of length n over an integer alphabet of size σ, construct the Burrows-Wheeler transform (BWT), the permuted longest-common-prefix (PLCP) array, and the LZ77 parsing in O(n / log_σ n + r · polylog(n)) time and working space, where r is the number of runs in the BWT of the input. These are the essential components of many compressed indexes such as compressed suffix trees, the FM-index, and grammar- and LZ77-based indexes, but they also find numerous applications in sequence analysis and data compression. The value of r is a common measure of repetitiveness that is significantly smaller than n if the string is highly repetitive. Since just accessing every symbol of the string requires Ω(n / log_σ n) time, the presented algorithms are time and space optimal for inputs satisfying the assumption n/r ∈ Ω(polylog(n)) on the repetitiveness. For such inputs our result improves upon the currently fastest general algorithms of Belazzougui (STOC 2014) and Munro et al. (SODA 2017), which run in O(n) time and use O(n / log_σ n) working space. We also show how to use our techniques to obtain optimal solutions on highly repetitive data for other fundamental string processing problems such as Lyndon factorization, construction of run-length compressed suffix arrays, and some classical "textbook" problems such as computing the longest substring occurring at least some fixed number of times.
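For reference, the objects named above can be made concrete with a deliberately naive construction: the BWT obtained by sorting all rotations, and its run count r. This quadratic-time sketch has nothing to do with the O(n / log_σ n + r · polylog(n))-time algorithms of the abstract; it only illustrates the definitions.

```python
def bwt_naive(text, sentinel="$"):
    """Burrows-Wheeler transform via sorting all rotations.
    O(n^2 log n) time; only meant to illustrate the definition."""
    s = text + sentinel                       # sentinel must not occur in `text`
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

def run_count(bwt):
    """Number r of maximal runs of equal symbols in the BWT."""
    return 1 + sum(1 for a, b in zip(bwt, bwt[1:]) if a != b)

b = bwt_naive("abracadabra")
print(b)              # 'ard$rcaaaabb'
print(run_count(b))   # 8
```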
2007
Compressed text (self-)indexes have matured up to a point where they can replace a text by a data structure that requires less space and, in addition to giving access to arbitrary text passages, support indexed text searches. At this point those indexes are competitive with traditional text indexes (which are very large) for counting the number of occurrences of a pattern in the text. Yet, they are still hundreds to thousands of times slower when it comes to locating those occurrences in the text. In this paper we introduce a new compression scheme for suffix arrays which permits locating the occurrences extremely fast, while still being much smaller than classical indexes. In addition, our index permits a very efficient secondary memory implementation, where compression permits reducing the amount of I/O needed to answer queries.
Lecture Notes in Computer Science, 2010
We prove that a document collection, represented as a unique sequence T of n terms over a vocabulary Σ, can be represented in nH_0(T) + o(n)(H_0(T) + 1) bits of space, such that a conjunctive query t_1 ∧ ... ∧ t_k can be answered in O(kδ log log |Σ|) adaptive time, where δ is the instance difficulty of the query, as defined by Barbay and Kenyon in their SODA'02 paper, and H_0(T) is the empirical entropy of order 0 of T. As a comparison, using an inverted index plus the adaptive intersection algorithm by Barbay and Kenyon takes O(kδ log(n_M/δ)) time, where n_m and n_M are the lengths of the shortest and longest occurrence lists, respectively, among those of the query terms. Thus, we can replace an inverted index by a more space-efficient in-memory encoding, outperforming the query performance of inverted indices when the ratio n_M/δ is ω(log |Σ|).
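As a point of comparison for the inverted-index baseline mentioned above, here is a sketch of intersecting k sorted posting lists using doubling (galloping) search, in the spirit of adaptive intersection à la Barbay and Kenyon. It is a simplified illustration, not their exact algorithm or its analysis; the helper names are invented for the example.

```python
from bisect import bisect_left

def gallop(lst, target, start):
    """Smallest index i >= start with lst[i] >= target, found by doubling the
    step size and finishing with a binary search (galloping search)."""
    if start >= len(lst) or lst[start] >= target:
        return start
    step = 1
    while start + step < len(lst) and lst[start + step] < target:
        step *= 2
    return bisect_left(lst, target, start + step // 2, min(start + step + 1, len(lst)))

def intersect(lists):
    """Intersect k sorted lists of document ids.

    Repeatedly take a candidate from the first list and gallop for it in the
    others; a mismatch promotes the larger value to be the next candidate."""
    if not lists or any(not l for l in lists):
        return []
    result = []
    pos = [0] * len(lists)
    candidate = lists[0][0]
    while True:
        matched = True
        for j, lst in enumerate(lists):
            pos[j] = gallop(lst, candidate, pos[j])
            if pos[j] == len(lst):
                return result                 # some list is exhausted
            if lst[pos[j]] != candidate:
                candidate = lst[pos[j]]       # restart the round from a larger value
                matched = False
                break
        if matched:
            result.append(candidate)
            pos[0] += 1
            if pos[0] == len(lists[0]):
                return result
            candidate = lists[0][pos[0]]

postings = [[1, 3, 4, 8, 9, 12], [3, 4, 7, 8, 12, 15], [2, 3, 8, 10, 12]]
print(intersect(postings))   # [3, 8, 12]
```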
2010
The field of compressed data structures seeks to achieve fast search time, but using a compressed representation, ideally requiring less space than that occupied by the original input data. The challenge is to construct a compressed representation that provides the same functionality and speed as traditional data structures. In this invited presentation, we discuss some breakthroughs in compressed data structures over the course of the last decade that have significantly reduced the space requirements for fast text and document indexing. One interesting consequence is that, for the first time, we can construct data structures for text indexing that are competitive in time and space with the well-known technique of inverted indexes, but that provide more general search capabilities. Several challenges remain, and we focus in this presentation on two in particular: building I/O-efficient search structures when the input data are so massive that external memory must be used, and incorporating notions of relevance in the reporting of query answers.
The proliferation of online text, such as found on the World Wide Web and in online databases, motivates the need for space-efficient text indexing methods that support fast string searching. We model this scenario as follows: Consider a text T consisting of n symbols drawn from a fixed alphabet Σ. The text T can be represented in n lg |Σ| bits by encoding each symbol with lg |Σ| bits. The goal is to support fast online queries for searching any string pattern P of m symbols, with T being fully scanned only once, namely, when the index is created at preprocessing time. The text indexing schemes published in the literature are greedy in terms of space usage: they require Ω(n lg n) additional bits of space in the worst case. For example, in the standard unit-cost RAM, suffix trees and suffix arrays need Ω(n) memory words, each of Ω(lg n) bits. These indexes are larger than the text itself by a multiplicative factor of Ω(lg_{|Σ|} n), which is significant when Σ is of constant size, such as in ASCII or Unicode. On the other hand, these indexes support fast searching, either in O(m lg |Σ|) time or in O(m + lg n) time, plus an output-sensitive cost O(occ) for listing the occ pattern occurrences. We present a new text index that is based upon compressed representations of suffix arrays and suffix trees. It achieves a fast O(m / lg_{|Σ|} n + lg_{|Σ|}^ε n) search time in the worst case, for any constant 0 < ε ≤ 1, using at most (ε^{-1} + O(1)) n lg |Σ| bits of storage. Our result thus presents for the first time an efficient index whose size is provably linear in the size of the text in the worst case, and for many scenarios, the space is actually sublinear in practice. As a concrete example, the compressed suffix array for a typical 100 MB ASCII file can require 30-40 MB or less, while the raw suffix array requires 500 MB. Our theoretical bounds improve both time and space of previous indexing schemes. Listing the pattern occurrences introduces a sublogarithmic slowdown factor in the output-sensitive cost, giving O(occ · lg_{|Σ|}^ε n) time as a result. When the patterns are sufficiently long, we can use auxiliary data structures in O(n lg |Σ|) bits to obtain a total search bound of O(m / lg_{|Σ|} n + occ) time, which is optimal.
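The abstract's 100 MB example can be checked with a quick back-of-the-envelope computation. The sketch below compares n lg |Σ| bits for the text against n lg n bits of suffix-array pointers, plus a common practical layout of 32-bit entries stored alongside the text, which is roughly where the 500 MB figure comes from; the function name and the 32-bit assumption are mine, not the paper's.

```python
import math

MB = 2**20

def index_sizes(n, sigma):
    """Back-of-the-envelope space usage, in MB, for a text of n symbols
    over an alphabet of size sigma."""
    text = n * math.ceil(math.log2(sigma)) / 8 / MB        # n lg|Sigma| bits
    sa_bits = n * math.ceil(math.log2(n)) / 8 / MB         # n lg n bits of pointers
    sa_practical = (4 * n + n) / MB                        # 32-bit entries + the text itself
    return text, sa_bits, sa_practical

# A 100 MB ASCII file: n = 100 * 2**20 symbols, sigma = 128.
text, sa_bits, sa_practical = index_sizes(100 * MB, 128)
print(f"text: {text:.0f} MB, SA (lg n-bit pointers): {sa_bits:.0f} MB, "
      f"SA in practice: {sa_practical:.0f} MB")
# text: 88 MB, SA (lg n-bit pointers): 338 MB, SA in practice: 500 MB
```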
Lecture Notes in Computer Science, 2012
Supporting top-k document retrieval queries on general text databases, that is, finding the k documents where a given pattern occurs most frequently, has become a topic of interest with practical applications. While the problem has been solved in optimal time and linear space, the actual space usage is a serious concern. In this paper we study various reduced-space structures that support top-k retrieval and propose new alternatives. Our experimental results show that our novel structures and algorithms dominate almost all the space/time tradeoff.
Journal of the ACM, 2005
We design two compressed data structures for the full-text indexing problem that support efficient substring searches using roughly the space required for storing the text in compressed form.
Journal of Algorithms, 2003
New text indexing functionalities of the compressed suffix arrays are proposed. The compressed suffix array proposed by Grossi and Vitter is a space-efficient data structure for text indexing. It occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies n log_2 |A| bits for the alphabet A. In this paper we modify the data structure so that pattern matching can be done without any access to the text. In addition to the original functions of the compressed suffix array, we add new operations search, decompress and inverse to the compressed suffix arrays. We show that the new index can find occ occurrences of any substring P of the text in O(|P| log n + occ log^ε n) time, for any fixed 1 ≥ ε > 0, without access to the text. The index can also decompress a part of the text of length m in O(m + log n) time. For a text of length n on an alphabet A such that |A| = polylog(n), our new index occupies only O(nH_0 + n log log |A|) bits, where H_0 ≤ log |A| is the order-0 entropy of the text. In particular, for ε = 1 the size is nH_0 + O(n log log |A|) bits. Therefore the index will be smaller than the text, which means we can perform fast queries on compressed texts.
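The decompress operation, reading a substring of the text from the index alone, can be illustrated with the uncompressed analogue of the arrays involved: the inverse suffix array ISA, the function Psi[r] = ISA[SA[r] + 1] and the first column F of the sorted suffixes. Following Psi from ISA[p] spells out T[p], T[p+1], and so on. The sketch below stores everything explicitly, whereas the index in the abstract keeps Psi and sampled values in compressed form; the code and names are illustrative only.

```python
def build_csa_arrays(text, sentinel="\x00"):
    """Plain (uncompressed) versions of the arrays behind a compressed suffix
    array: the inverse suffix array ISA, the Psi function, and the first
    column F of the sorted suffix matrix."""
    s = text + sentinel
    n = len(s)
    sa = sorted(range(n), key=lambda i: s[i:])
    isa = [0] * n
    for rank, start in enumerate(sa):
        isa[start] = rank
    psi = [isa[(sa[r] + 1) % n] for r in range(n)]   # Psi[r] = ISA[(SA[r] + 1) mod n]
    first = [s[sa[r]] for r in range(n)]             # F[r] = first char of r-th suffix
    return isa, psi, first

def extract(isa, psi, first, p, m):
    """Read T[p .. p+m-1] using only ISA (at position p), Psi and F,
    i.e., without storing the text itself."""
    r, out = isa[p], []
    for _ in range(m):
        out.append(first[r])
        r = psi[r]
    return "".join(out)

isa, psi, first = build_csa_arrays("abracadabra")
print(extract(isa, psi, first, 3, 5))   # 'acada'
```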
Computing Research Repository, 2007
A compressed full-text self-index represents a text in a compressed form and still answers queries efficiently. This technology represents a breakthrough over the text indexing techniques of the previous decade, whose indexes required several times the size of the text. Although it is relatively new, this technology has matured up to a point where theoretical research is giving way to practical developments. Nonetheless this requires significant programming skills, a deep engineering effort, and a strong algorithmic background to dig into the research results. To date only isolated implementations and focused comparisons of compressed indexes have been reported, and they missed a common API, which prevented their re-use or deployment within other applications.
2000
A compressed text database based on the compressed suffix array is proposed. The compressed suffix array of Grossi and Vitter occupies only O(n) bits for a text of length n; however, it also uses the text itself, which occupies O(n log |Σ|) bits for the alphabet Σ. On the other hand, our data structure does not use the text itself, and supports important operations for text databases: inverse, search and decompress. Our algorithms can find occ occurrences of any substring P of the text in O(|P| log n + occ log^ε n) time and decompress a part of the text of length l in O(l + log^ε n) time, for any given 1 ≥ ε > 0. Our data structure occupies only n(2/ε · (3/2 + H_0 + 2 log H_0) + 2 + 4 log^ε n / (log^ε n - 1)) + o(n) + O(|Σ| log |Σ|) bits, where H_0 ≤ log |Σ| is the order-0 entropy of the text. We also show the relationship with the opportunistic data structure of Ferragina and Manzini.
Journal of the ACM, 2014
The inverted index is the backbone of modern web search engines. For each word in a collection of web documents, the index records the list of documents where this word occurs. Given a set of query words, the job of a search engine is to output a ranked list of the most relevant documents containing the query. However, if the query consists of an arbitrary string, which can be a partial word, multiword phrase, or more generally any sequence of characters, then word boundaries are no longer relevant and we need a different approach. In string retrieval settings, we are given a set D = {d_1, d_2, d_3, ..., d_D} of D strings with n characters in total, taken from an alphabet set Σ = [σ], and the task of the search engine, for a given query pattern P of length p, is to report the "most relevant" strings in D containing P. The query may also consist of two or more patterns. The notion of relevance can be captured by a function score(P, d_r), which indicates how relevant document d_r...
ACM Transactions on Algorithms, 2007
Let T be a string with n characters over an alphabet of constant size. The recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching. Yet the compressed nature of such indexes also makes them difficult to update dynamically.
Lecture Notes in Computer Science, 2004
Let T be a string with n characters over an alphabet of bounded size. The recent breakthrough on compressed indexing allows us to build an index for T in optimal space (i.e., O(n) bits), while supporting very efficient pattern matching [2, 4]. This paper extends the work on optimal-space indexing to a dynamic collection of texts. Precisely, we give a compressed index using O(n) bits, where n is the total length of the texts, such that searching for a pattern P takes O(|P| log n + occ log^2 n) time, where occ is the number of occurrences, and inserting or deleting a text T takes O(|T| log n) time.