0% found this document useful (0 votes)

65 views10 pages

Icde 2007 K-Skyline

This document presents a novel problem called "top-k representative skyline points" (top-k RSP) which aims to select k skyline points from a dataset that maximize the number of points dominated by the k points. The document first defines the problem and discusses related work on computing full skylines. It then presents exact and approximation algorithms for solving top-k RSP in 2D and higher dimensions. The main contributions are an efficient dynamic programming algorithm for 2D, proving NP-hardness for dimensions 3+, and developing a scalable randomized indexing technique with accuracy guarantees to solve the problem.

Uploaded by

Sanchit Gupta

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

65 views10 pages

Icde 2007 K-Skyline

Uploaded by

Sanchit Gupta

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

Selecting Stars: The k Most Representative Skyline Operator

Xuemin Lin1
1

Yidong Yuan1

Qing Zhang2
2

Ying Zhang1

School of Computer Science and Engineering The University of New South Wale & NICTA, Australia {lxue, yyidong, yingz}@cse.unsw.edu.au

E-Health Research Center CSIRO ICT Center, Australia [email protected]

Abstract
Skyline computation has many applications including multi-criteria decision making. In this paper, we study the problem of selecting k skyline points so that the number of points, which are dominated by at least one of these k skyline points, is maximized. We rst present an efcient dynamic programming based exact algorithm in a 2d-space. Then, we show that the problem is NP-hard when the dimensionality is 3 or more and it can be approximately solved by a polynomial time algorithm with the guaranteed approximation ratio 1 1 e . To speed-up the computation, an efcient, scalable, index-based randomized algorithm is developed by applying the FM probabilistic counting technique. A comprehensive performance evaluation demonstrates that our randomized technique is very efcient, highly accurate, and scalable.

1. Introduction
Given a set of d-dimensional points, the skyline consists of the points, called skyline points, which are not dominated by another point. A point p = (p[1], p[2], ..., p[d]) dominates another point q = (q [1], q [2], ..., q [d]) iff p[i] q [i] (for 1 i d) and there is at least one dimension j such that p[j ] < q [j ]. The skyline computation (or the skyline operator) is crucial to many multi-criteria decision making applications. A typical example is a list of hotels, each of which contains two numerical attributes distance (say, to the beach) and price, for on-line booking. Figure 1(a) shows a sample list. In this application, the best choice to a client, who wants to spend holiday in the beach, may be as close as possible to the beach while also cost effective. Consequently, the best choices form the skyline (see Figure 1(b)).
id

p1 p2 p3 p4 p5 p6 p7

dist dist (km) price ($) 4 150 p1 3 110 p 2 2.5 240 p3 2 180 p4 p5 p skyline 7 1.7 270 p6 1 195 price 1.2 210 (b) Skyline (a) Hotels Figure 1. A Skyline Example

Skyline computation has recently received a great deal of attention in the database community. A number of efcient algorithms for computing all skyline points (i.e. full skyline)

have been reported in the literature [4, 9, 13, 22, 29, 33]. It has been shown in [3, 13] that the expected number of skyline points is (lnd1 n/(d 1)!) for a random dataset. With the presence of a possibly large number of skyline points, the full skyline may be less informative. In the above example, it may be hard for users to make a good, quick selection by referencing the full skyline that consists of too many hotels. To resolve this, a system may be required to provide k skyline points. Such k skyline points should be most representative. Consider that the full skyline represents a whole dataset. In this paper, we quantify the concept of representative by the population. Specically, we investigate the problem (called top-k representative skyline points) of computing k skyline points such that the total number of (distinct) data points dominated by one of the k skyline points is maximized. In the above example, p6 is returned if k = 1, while p2 and p6 are returned if k = 2. The computation of top-k representative skyline points (top-k RSP) may facilitate many applications including topk queries when multi-criteria are involved. With respect to the above example, if users only want to see k hotels to make a selection based on the price and the distance, then a solution to the top-k RSP can provide users with the condence that these k hotels are already representing (better than) the maximum possible number of available options (hotels). Moreover, the top-k RSP also provides a novel ranking mechanism for top-k queries. Selecting data points with certain designated dominance properties has been recently investigated in [7, 8, 21, 29]. Nevertheless, to the best of our knowledge our top-k RSP problem is novel and it is inherently different than the problems in [7, 8, 21, 29]; consequently these existing techniques are not applicable to the top-k RSP problem. Motivated by this, in this paper we develop efcient, novel algorithms to compute the top-k representative skyline points. This is the rst work regarding the problem of top-k RSP. Our contributions can be summarized as follows. We propose a novel skyline operator, top-k representative skyline, so that the k skyline points with the maximal number of dominated points can be produced to facilitate on-line user queries. In a 2d-space, we develop an efcient dynamic programming based algorithm to solve the problem. We show that the problem is NP-hard in a d-dimensional space when d is 3 or more. Then, we show that by an immediate transformation to the set cover problem, the

problem can be solved by a greedy heuristic with the approximation ratio 1 1 e. Observe such a greedy heuristic may not be scalable nor efcient. We develop a novel, efcient, scalable, indexbased randomized algorithm, with a theoretical accuracy guarantee, by using a probabilistic counting technique FM algorithm [12]. Besides theoretical analysis, an extensive experimental evaluation demonstrates that our randomized algorithm is both time- and space- efcient, as well as highly accurate. The rest of the paper is organized as follows. Section 2 gives the problem denition and presents the related work. Section 3 reviews two existing techniques that will be employed in our algorithms. Section 4 presents the exact algorithm in 2-dimensional space and our results for a multidimensional space are presented in section 5. Our extensive experimental evaluation is reported in section 6. Section 7 concludes the paper.

2. Background Information
We rst state the problem. Then, we present the related work. Below in Table 1 we summarise the math notation used throughout the paper.
Notation P d n p, q SP m s (S ) D(S ) e S.f m (s.f m) eH Denition a set of data points dimensionality of a space |P | a data point the set of skyline points of P |S P | a (set of) skyline point (points) the set of points dominated by one s S an entry in an R-tree FM sketch set at S (s) the heap top of a heap H

Table 1. Math Notation

2.1. Problem Statement

Suppose that P is a set of d-dimensional data points. For a data point q P , D({q }) denotes the set of points in P that are dominated by q ; for a set Q of data points, D(Q) denotes the set of points each of which is dominated by a q Q. Clearly, D(Q) = qQ D({q }). Example 1. Regarding Figure 1(b), D({p4 }) = {p3 }, D({p6 }) = {p3 , p5 , p7 }, and D({p4 , p6 }) = {p3 , p5 , p7 }. The problem of top-k representative skyline points (topk RSP) is formally dened as follows. Top-k RSP. Given a set P of points and an integer k , compute a set S of k skyline points such that |D(S )| is maximized. Note that when |SP | k , SP is the solution. In this paper, we study the problem of efciently computing top-k RSP.

2.2. Related work

Computing full skyline. Efciently computing skyline is rst investigated by Kung et al. in [23]. Bentley et al. [3] provide an efcient algorithm with an expected linear running time if the data distribution on each dimension is independent.

B orzs onyi et al. [4] rst investigate the skyline computation problem in the context of databases and propose an SQL syntax for the skyline query. They also develop the skyline computation techniques based on block-nested-loop and divide-conquer paradigms, respectively. Chomicki et al. [9] propose another block-nested-loop based computation technique, SFS (sort-lter-skyline), to take the advantages of a pre-sorting. The SFS paradigm is signicantly improved by Godfrey et al. in [13]. Tan et al. [33] propose the rst progressive technique that can output skyline points without having to scan the whole dataset. Two auxiliary data structures are proposed, bitmap and search tree. Kossmann et al. [22] present another progressive technique based on the nearest neighbour search technique on R-tree [32, 15], which adopts a divide-and-conquer paradigm on the dataset indexed by R-tree. Papadias et al. [29] propose a branch and bound search technique (BBS) to progressively output skyline points on datasets indexed by R-tree. One of the most important properties of BBS in [29] is that it guarantees the minimum I/O costs. Kapoor [20] studies the problem of dynamically maintaining an effective data structure for an incremental skyline computation in a 2-dimensional space. Chan et al. [6] investigate the skyline computation problem for partiallyordered value domains. Data points with designated dominance properties. Observe that the number of skyline points may be large; thus the full skyline is not always very informative. The problem of selecting data points with some designated dominance properties has been recently investigated in [8, 7, 21, 29]. Koltun and Papadimitriou [21] aim to nd a minimum set of points to approximately dominate all data points. Specifically, for a given ( 0), nd a subset Q of a given P , with the minimum cardinality, such that every p P is dominated by a (1 )q where q Q. In [21], the maximal vector problem is investigated. Here, we state their problem with our setting. It has been shown that the problem can be solved by a greedy heuristic in a 2d-space, and it is NP-hard for d = 3 or more. Then, it shows the problem can be approximately solved with a poly-logarithmic cardinality (for > 0) of Q regarding the values domain. This problem is inherently different than our problem top-k RSP. First, it does not guarantee that the data points in Q are skyline points unless = 0; when = 0, all skyline points are returned. Second, the size of Q is poly-logarithmic regarding the values domain. Consequently, the results and the techniques in [21] are not applicable to top-k RSP. Chan et al. [8] investigate the problem of computing topk frequent skyline points based on a new metric, skyline frequency. Skyline frequency of a point p is the number of subspaces where p is a skyline point. In [7], Chan et al. develop efcient algorithms to compute skyline points each of which is a skyline point in all subspaces with dimensionality k for a given k . The most related problem to our top-k RSP has been investigated in [29]. Papadias et al. [29] propose a k dominating query. It aims to compute a set Q of k points such that pQ |D({p})| is maximized. The problem of k dominating query is also inherently different than our top-

k RSP. First, a k -dominating query does not always return skyline points; for instance, p6 and p7 are returned regarding Figure 1 when k = 2 (p7 is not a skyline point). Second, in the problem of k -dominating query we only need to record the number of dominated points for each data point and there is no need to consider the situation that a data point may be dominated by many other points. The algorithms for processing a k -dominating query cannot be used to our top-k RSP. Other Related Work. There also have been a number of research results in the literature regarding variations of skyline computation. These include computing skyline in a distributed environment [1, 18], continuously processing skyline queries in data streams [26, 34], skyline cube computation [30, 37] and its dynamic maintenance [36], computing skyline efciently in a subspace [35], effectively materializing dominance relationships [24], and multi-source skyline query processing [10].

P rob{h(p) = i} = 2i1 +1 . In our implementation, we use the public code from Massive Data Analysis Lab [27] to randomly generate such hash functions. An FM sketch on P is a bitmap with length L which is dened as: F m = {B : 0 j L 1, B [j ] = 1 iff p P , h(p) = j }. In order to improve the accuracy of FM algorithm, multiple copies (say ) of FM sketches are constructed; each is constructed against an independently generated hash function. Let f m(P ) represent the set of FM sketches generated over P . That is, f m(P ) = {F m1 , F m2 , . . . , F m }, where each element p P is hashed into these FM sketches, respectively, as described above. Let min(B ) denote the least bit (from left) of a bitmap B with value 0; if no such bit exists then min(B ) = L. The number n of distinct elements in P is estimated by: E (min(F m1 )) 1 P def 2 . (1) A = 2 i=1 min(F mi )/ , where = n Note that in formula (1), E (min(F m1 )) cannot be explicitly represented and n is not known. In our implementation we approximately choose as 0.775351 according to the approximate results in [12]. Each min(F mi ) related to F mi is dened in the same way as min(B ) related to B . As shown in [12], E (min(F mi )) = E (min(F mj )) (1 i < j ); this, together with Theorem 2 in [12] and the Central Limit Theorem (pp 229 in [11]), immediately leads the following theorem by the independence assumption. Theorem 1. Let n be the number of distinct elements in P and A be the estimation of FM algorithm as shown in (1). For a given 0 < < 1 and , if L = O(log n + log + log 1 ), then |A n| < n holds with probability at least 1 , where = O(
log 1

3. Preliminaries
We present briey the skyline computation algorithm, BBS [29], as well as a probabilistic algorithm, FM [12], for counting distinct data elements. They will be employed in our approximate algorithm for top-k RSP.

3.1. BBS Algorithm

Suppose that a dataset P is indexed by an R-tree. To compute skyline, BBS traverses the R-tree in the order, such that it always evaluates and expands the tree node closest to the origin among all un-visited nodes. To do that, a minheap is built against a designated mindist (say, the summation of all coordinate values) of the lower-left corner of the minimum bounding box (MBB) of every entry (node). Initially, BBS inserts all the child entries of the root of the R-tree into the heap. Iteratively, the heap top e of the heap is examined against the already computed skyline points. If e is dominated by an already computed skyline point,1 then e is just simply discarded from the heap. Otherwise, if e is a data point, then the data point is output as a skyline point; if e is not a data point, then discard e and insert the child entries of e, which are not dominated by any current skyline point, into the heap. BBS terminates when the heap is empty. In order to efciently examine the dominance relationship, an in-memory R-tree is maintained on the current skyline points. BBS has the properties that 1) any progressively generated skyline point is guaranteed to be a skyline point against P , 2) it is I/O optimal, 3) a node entry is read by disk I/O only once.

Two sets of f m sketches (say, f m(P ) and f m(Q)) generated by the same L and the same set of hashing functions may be merged by the bitwise-or operator (denoted by ) as follows. Let f m(P ) = {F mi : 1 i }, f m(Q) = {F mi : 1 i }. We dene f m(P ) f m(Q) as {F mi F mi : 1 i }, where each F mi F mi is also a bitmap with subindexes [0, L 1], such that for 1 i : 0 j L 1, (F mi F mi )[j ] = 1 iff F mi [j ] = 1 or F mi [j ] = 1. An important feature of FM algorithm is that the bitwiseor operator provides an equivalent way to generate a set of FM sketches over P Q. The following lemma can be immediately veried. Lemma 1. Given a set of hash functions and two collections, P and Q, of data points, we have f m(P Q) = f m(P ) f m(Q).

3.2. FM Algorithm
FM algorithm proposed by Flajolet and Martin [12] is a bitmap based algorithm that can efciently estimate the number of distinct elements (data points). Let B be a bitmap of length L with subindexes [0, L 1], and all bits are initialized as 0 (i.e. B [j ] = 0 for 0 j L 1). Suppose that h() is a randomly generated hash function which hashes each elementID into an integer in [0, L 1] such that for each data point p in a collection P of data points,
1A

4. Two-dimensional Space
We investigate the problem of the top-k RSP in a 2dspace. We rst show the problem can be solved by a dynamic programming algorithm. Then, we develop a sweepline technique [31] to efciently compute the parameters needed in the dynamic programming algorithm.

point s dominates entry e iff s dominates the lower-left corner of e.

4.1. Dynamic Programming Based Algorithm

Suppose that {s1 , s2 , . . . , sm } is a collection of skyline points in a 2d-space, which are sorted in the ascending order of x-coordinate values; consequently, they are also sorted in the descending order of y -coordinate values. Each si is represented by (si [x], si [y ]). As depicted in Figure 2, for each pair {si , sj } of skyline points, (si , sj ) denotes the set of data points that are dominated by si but not dominated by sj (see Figure 2 for an example).
10 9 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8

|Cj +1,b |
ibm

(5)

For example, |(s4 , s1 )| = |(s4 , s2 )| + |C2,4 | + |C2,5 | = 2, as depicted in Figure 3.

y p1 s1 p7 s2 s3 p5 p6 p8 s4 p9 p1 0 s5 x
9 10

yb = 1 p1 s1

2 p2 p5 p7 p6 s3

3 p4 p3

4 5

D(s4,s1) are the points in the shadowed region, i.e., p9, p10

7 6 5 4 3 2 1

p8 s4

D(s4,s1) are the points in the shadowed cells p9 2

3 4 p1 0 5 s x

a=1

sweep-line
1 2 3 4

C2,4
5 6 7 8

C2,5 5
9 10

D(s4,s2) are the points in the bold border cells

Figure 2. Skyline and (si , sj ) Let opt(si , k ) denote the number of data points dominated by at least one of the k skyline points in an exact solution of the top-k RSP restricted to {s1 , . . . , si }, where the exact solution contains si . We have opt(si , k ) = max {|(si , sj )| + opt(sj , k 1)} (2)
1j<i

Note that opt(si , 1) is the number of points dominated by the skyline point si . Let OP T (k ) denote the the number of points dominated by at least one of the k skyline points in the exact solution of top-k RSP with respect to the whole set of skyline points. Clearly, OP T (k ) = max {opt(si , k )} (3)
1im

Based on formulae (2) and (3), the V-optimal dynamic programming technique [19] can be immediately used to solve our top-k RSP. The algorithm runs in O(km2 ) time if each |(si , sj )| (for 1 j < i m) is pre-computed and the skyline points are also pre-computed. In the following subsection, we present an efcient technique to compute the skyline points and ||. 4.2. Computing Skyline and || As an immediate approach, the skyline and || can be computed separately. First, all skyline points are computed over a given dataset by an existing algorithm. Then, compute the corresponding || values (initially, 0) for each data point. This na ve approach is not efcient because a data point may be counted multiple times if it is contained by several (si , sj ). Below, we present a sweep-line [31] based algorithm to efciently compute || values and skyline points simultaneously by sorting data points rst. As depicted in Figures 3, the region dominated by the skyline points can be partitioned into a number of cells {Ca,b : 1 a b m} where the lower-left corner and the upper-right corner of Ca,b are (sb [x], sa [y ]) and (sb+1 [x], sa1 [y ]), respectively. When a = 1 (or b = m), s0 [y ] (or sm+1 [x]) is dened as the maximum value of y coordinates (or x-coordinates) in the data space. Let |Ca,b | denote the number of data points that are contained by Ca,b but not on the top horizontal line (i.e., y = sa1 [y ]) except a = 1 nor on the right vertical line

Figure 3. Cells and (si , sj ) Our sweep-line algorithm uses a horizontal line to sweep along y -dimension from the bottom to the top to iteratively compute the skyline points and |Ca,b |s. Specically, for data points encountered by the sweep-line, we process those data points from left to right as follows. If a data point p is dominated by a current skyline point, then increase the number of points of the cell, containing p, by 1; otherwise, it is a new skyline point. Iteratively, once a new skyline point sj is found, the computation of every |Cj +1,b | (for b j ) has been completed; consequently, |(si , sj )| (for i > j ) now can be calculated by using formula (5). Then, all |Cj +1,b | (for b j ) can be discarded. Clearly, the total space required is O(m2 ), which is dominated by maintaining all |(si , sj )| values (1 j < i m). To facilitate such a sweep-line technique, we rst sort the data points p lexicographically on (p[y ], p[x]). To speedup the computation, all ibm |Cj +1,b | can be accumulatively computed from i = m to i = j + 1. Note that {Cj +1,b : j + 1 b m} are already sorted according to the x-coordinates of their lower-left corners. Therefore, the computation of the skyline points and all || values can be done in O(m2 + n log m) where n is the number of total data points. Consequently, the total time of our dynamic programming algorithm runs in O(km2 + n log m).

5. Multi-dimensional Space
The dynamic programming algorithm provides an exact solution to top-k RSP in a 2d-space; nevertheless, there are two issues to be addressed. Issue 1: The dynamic programming algorithm cannot be extended to computing the top-k RSP in a multidimensional space when d is 3 or more. Consequently, the problem of top-k RSP is left open generally. Issue 2: The space requirement in our dynamic programming algorithm is quadratic O(m2 ). Therefore, it is not scalable when m is large since additional I/O costs may be required if the space requirement is too large to t in memory. Moreover, if the dataset is indexed (say, by an R-tree), this algorithm does not make the use of such index to reduce the I/O costs.

This section is organized as follows. We rst address Issue 1 by sections 5.1 and 5.2. Then, we develop an efcient, scalable, index-based (R-tree based) randomized algorithm to address Issue 2.

5.1. Complexity
We show that top-k RSP is NP-hard in a multidimensional space when the dimensionality is 3.
z y

Step 1. Then, for each p P we compute {D({s}) : s SP } by a window query per data point p P SP against SP , where the window uses the origin as the lower-left corner and p as the upper-right corner, as suggested in [29]. We apply the sort-merge paradigm in Step 3. The space required to compute and store all D({s}) is O(mn) and may not t in the memory. In fact, we found that when d = 5, 45% of D({s})s have to be written to the disk with 1G memory. This not only requires additional disk I/O in Step 2 but also in Step 3 (I/O costs could be very expensive). We address the issue by our algorithm in the next subsection.

x (a) Plane x + y + z = 1

(b) Grid

5.3. FM-based Algorithm

We modify Algorithm 1 by removing the requirement of computing and storing every D({s}) (s SP ). As with in 2d-space we may divide the space dominated by the skyline points into grid cells and compute the number of points contained by each cell. Unlike 2d-space, in a multidimensional space with d 3 an arbitrary combination of k skyline points may be probed in Step 3. Consequently, the count of each cell should be stored and the space required is O(md ) that may be too larger to t in memory when m is large and d 3. Thus, this is not a good alternative. On the other hand, keeping only |D({s})| (s SP ) does not provide enough information to accurately estimate |D(S )| for a set S of skyline points since there could be many data points dominated by more than one skyline point in S . For instance, in the example of Figure 2, point p3 is dominated by s1 , s2 , and s3 . To overcome the over-counting issue, we apply a duplicate-insensitive counting technique, FM algorithm, to approximately counting the number of points dominated by a skyline point s; that is, we maintain a set s.f m of FM sketches (bitmaps) at s, each of which has L bits and is generated by hashing the data points in D({s}). Then, to compute |D(S )| we only need to apply the bitwise-or operator , as described in section 3.2, on {s.f m : s S } to get a FM sketch set (and then use formula (1)). According to Lemma 1, this is the same as if we use FM algorithm to approximately computing |D(S )|. This is the basic idea of our algorithm. Below, we outline our FM-based algorithm in Algorithm 2. Note that in our algorithm, all FM sketch sets are created by the same set of hash functions randomly generated. Algorithm 2 FMGreedy ( k , , L, P )
Input: k, , L: integers; P : a dataset. Output: k skyline points. Description: Step 1: compute SP ; Step 2: s S , compute FM sketch set s.f m; Step 3: 1: S := ; initialize the bitmaps in S.f m; 2: while |S | < k and SP S = do W 3: choose s SP S such that Est(s.f m S.f m) is maximized; W 4: S := {s} S ; S.f m := s.f m S.f m; 5: return S ;

Figure 4. A Transformation Theorem 2. Given a set of points in a 3d-space, the problem of computing top-k RSP is NP-hard. Proof. As depicted in Figure 4, we can divide one part (see the shaded area in Figure 4(a)) of the plane x + y + z = 1 into grid cells (see Figure 4(b)) such that each grid cell has at most 3 data points. According to the proofs of Theorem 3.3 and Theorem 4.1 in [25], the problem of nding k grid cells to contain the maximum number of points is NP-hard regarding this setting. For a cell containing at least one data point, we choose 3 grid points including all data points in the cell; in case if the number of data points in the cell is less than 3, then we randomly choose non-data grid points to make it 3. Then, we create a new point by using the minimums, among these 3 grid points, in each coordinate, respectively, as its 3 coordinate values. It can be immediately veried those new created data points are the skyline points which only dominate the data points in its corresponding cells. Therefore, the problem of computing top-k RSP is also NP-hard.

5.2. Greedy Algorithm

In fact, top-k RSP can be immediately transformed into the maximum coverage problem [16]; consequently it can be solved approximately by a greedy heuristic. Below in Algorithm 1, we present the greedy heuristic. Note that D(S ) is the set of points each of which is dominated by at least one point in S and SP is the skyline of dataset P . Algorithm 1 Greedy ( k , P )
Input: k: an integer; P : a set of data points. Output: k skyline points. Description: Step 1: compute SP ; Step 2: s SP : compute D({s}); Step 3: 1: S := ; 2: while |S | < k and SP S = do 3: choose s SP S such that |D({s} S )| is maximized; 4: S := {s} S ; 5: return S ;

Lemma 2. [16]: Algorithm 1 returns an approximate solu2 tion to top-k RSP with the approximate ratio 1 1 e. In our implementation, we assume that the dataset P is indexed by R-tree [2, 14]. We use BBS to compute SP in
2 Here,

e is Eulers constant rather than an entry in an R-tree.

Note that Est(s.f m

S.f m) is calculated by the for-

mula (1) in section 3.2. The following Theorem immediately follows from Theorem 1 and Lemma 2. Theorem 3. Algorithm 2 has the approximate ratio (1 + O( log )) with condence 1 if each bitmap has the length L = O(log n + log + log 1 ). Our experiments in section 6 demonstrate that when = 32 and L = 32, Algorithm 2 has very similar accuracy to Algorithm 1. The Step 3 of Algorithm 2 can be executed in time O(k m). Since is up-to 32 in our implementation, the Step 3 runs in time O(km). The space requirement to store all FM sketch sets is O( m) that equals O(m) in our execution. Therefore, we reduce the space requirement O(mn) to store {D({s}) : s SP } to O(m) to store all FM sketch sets with respect to the skyline points. Next, we present an efcient algorithm to conduct Step 1 and Step 2 by one-scan of dataset by extending BBS algorithm. For this purpose, we rst augment R-tree to include FM sketches. 5.3.1. RF M -tree At each intermediate entry e of an RF M -tree, besides the required information in a conventional R-tree (e.g., MBB), we maintain a set f m of FM sketches for estimating the number data points in the subtree of e. Lemma 1 in section 3.2 implies that the sketch set e.f m at entry e can be equivalently obtained by using the bitwise-or operator to merge the sets of FM sketches at all es child entries. Figure 5 shows an RF M -tree when = 1 and L = 4.
1 e )(1
10 9 8 7 6 5 4 3 2 1 1 2 3 4 5 6 7
1

y p1 s1 e1 s2

e6 p2 p5 e2 p7 s3 e3 p4 p6 e4 e7
8

1110

p3
1110

e6 e1 e2 e3
1010

e7
1100

e5
1100

p8 s4 e 5 s5 x
9 10

p9 1100 p 1 0 s1 s2 p 1
1000

p2 p3 p4
0100

s5 p 9 p 1 0

p5 p6 p7

s3 s4 p 8

Figure 5. RF M -tree Note that the FM-based aggregate information practically requires small space only. Our experiments show that if each page has size 4Kbytes, the fanout of an RF M -tree is only 1 less than that of a conventional R-tree when each FM sketch set consists of 8 bitmaps with 32 bits each, and about 3% less when each FM sketch set consists of 32 bitmaps with 32 bits each. It is immediate that such an RF M -tree can be updated in a similar way as a conventional R-tree. 5.3.2. Computing Skyline and FM Sketches Given that a dataset P is indexed by an RF M -tree, in this subsection we present an efcient algorithm, OnePass, to compute the set of skyline points and their FM sketches (to estimate every |D({s})|) simultaneously, so that each entry in RF M -tree is read at most once by disk I/O. I/O-efciency. The algorithm BBS computes the skyline points with the minimum I/O costs; nevertheless, it is not immediately applicable to compute FM sketches simultaneously; see the following example. Example 2. In the example of Figure 5, BBS outputs the skyline points in the order of s2 , s1 , s3 , s4 , and s5 if mindist

is x + y . After s2 is returned as the skyline point, e7 is expanded to e4 and e5 . Then, e2 is the next to be processed. Since e2 is dominated by s2 , e2 is discarded from the heap H1 in BBS. As s3 is not yet returned as the skyline point, removing e2 means that we have to start from the root of the RF M -tree to compute the FM sketches at s3 . To resolve the problem in Example 2, in our algorithm OnePass we use another heap H2 to keep the entries discarded from H1 by BBS. Minimizing the size of H2 . Clearly, there will be a huge memory requirement if we keep every thing in H2 until all skyline points are calculated. We need to minimize the size of H2 so that our algorithm is scalable while pursuing I/O efciency. It seems difcult to nd a necessary and sufcient condition to determine the time to discard an entry e H2 so that the progressively returned skyline points afterwards do not dominate any point in e. In our algorithm, we use a sufcient condition below for this purpose. Suppose that in BBS, the mindist is the summation of the coordinate values. For each entry e H2 , we dene e.maxdist as the summation of the coordinate values of es upper-right corner (e is represented geometrically by MBB); H2 is a min-heap maintained against e.maxdist. Note that BBS has the property that the minimum value of mindist among the entries in the current H1 is not greater than that in the next iteration. Therefore, it can be immediately veried that if eH2 .maxdist eH1 .mindist then any skyline points progressively returned afterwards do not dominate any point in eH2 . Consequently, eH2 should be used to update the FM sketches for the current skyline points; then, eH2 can be discarded from H2 . Example 3. Continuing the example in Figure 5, e2 will be popped up (then discarded) from H2 to update the FM sketches at s1 , s2 , and s3 when BBS encounters s4 in H1 . Algorithm description. Our algorithm to compute the skyline points and their FM sketches is based on the framework of BBS. It is presented below in Algorithm 3. Algorithm 3 OnePass ( P )
Input: P is a dataset indexed by an RF M -tree R. Output: SP : the skyline point set; FM sketches at s (s SP ) Description: 1: S := ; H1 := ; H2 := ; 2: insert all entries of the root of R into H1 ; 3: while H1 = or H2 = do 4: while H2 = do 5: if H1 = or eH2 .maxdist eH1 .mindist then 6: UpdateSketch(eH2 , S ); remove eH2 from H2 ; 7: else break the whileloop; 8: if eH1 is dominated by an s S then 9: if eH1 is a point then 10: UpdateSketch(eH1 , S ); remove eH1 from H1 ; 11: else remove eH1 from H1 to H2 ; 12: else 13: if eH1 is a point then 14: add eH1 into S ; 15: else add child entries dominated by an s S to H2 and others to H1 ; 16: return S and FM sketches at each s S ;

Note that in Algorithm 3, H1 is a heap maintained in the

same way as that in BBS, while H2 is a heap discussed above. When the heap top eH1 in H1 is a point that is dominated by an already obtained skyline point s, instead of putting eH1 into H2 we can immediately use it to update the FM sketches (then discard eH1 ). This is because that eH1 will not be dominated by any skyline point output afterwards due to the following two reasons: An entrys mindist equals its maxdist if it is a point. Any skyline point output afterwards has its mindist not smaller than the current mindist. In fact, it can be immediately veried that Algorithm 3 is correct, based on our discussions in the part Minimizing the size of H2 ; that is, once an entry e removes from H2 , none of the data points included in e is dominated by the skyline points output afterwards. Algorithm 3 also follows exactly BBS for computing the skyline points SP ; thus, all the skyline points can be obtained correctly. In Algorithm 3, there is a key operation, UpdateSketch(), to update FM sketch sets at the current skyline points, respectively. Next, we present algorithm for UpdateSketch(). Update FM sketches. Regarding the example in Figure 5, s2 dominates the entry e2 . Therefore, the update to the FM sketch set at s2 is already materialized in RF M -tree; that is, we only need to use bitwise-or operator on the bitmap (1, 0, 0, 0) and the existing s2 .f m, instead of hashing the points p5 , p6 , and p7 . Moreover, if we had a skyline point s allocated at (0.5, 0.5), then an update to the FM sketch set at s would be done by only reading the FM sketch set at the root in this example. These are the reasons why an RF M -tree is maintained. As with the algorithm BBS, in Algorithm 3 the progressively generated skyline points are also indexed by an inmemory index. Different than BBS, we use an in-memory RF M -tree instead of R-tree to speed-up the computation (update) of FM sketches as well. Such an in-memory RF M tree has the same data structure as that to index the dataset except that we also attach FM sketches at each data point in an in-memory RF M -tree. In an in-memory RF M -tree, the FM sketch set at a tree node is used to estimate the number of data points that are captured by our algorithm to be commonly dominated by all points in its subtree. Example 4. In the example of Figure 5, the points in e3 are dominated by s1 and s2 . Consequently, if one entry e1,2 of the in-memory RF M -tree consists of s1 and s2 , then we can immediately use the FM sketch set at e3 to update the FM sketch set at e1,2 by the bitwise-or operator . Based on Lemma 1, we can use the bitwise-or operator iteratively on the FM sketch sets along the path in the inmemory RF M -tree from the root to a skyline point s to obtain the global FM sketch set at s to estimate the number of distinct points dominated by s. We need the following notation in our algorithm description. An entry e (bounding box) fully dominates another entry e if the upper-right corner of e dominates the lower-left corner of e . An entry e partially dominates another entry e if the lower-left corner of e dominates the upper-right corner of e but e does not fully dominate e . An entry e does not dominate another entry e if e does not fully nor partially

partially dominated
10 9 8

y p2 s1 e1, 2 s2 s3 e3,4,5 s4 s5 x
1 2 3 4 5 6 7 8 9 10

e3 p4

7 6 5 4 3 2 1

e4 e3

fully dominated not dominated x

Figure 6. Dominance Relation Figure 7. Update f m dominate e . For example, in Figure 6 e4 fully dominates e1 , partially dominates e2 , and does not dominate e3 . We treat UpdateSketch (eH2 , S ) as one kind of spatial join [5, 17]. We apply the R-tree traversal paradigms from [5, 17, 28]. To avoid reading a data entry e from disk more than once, we group the entries, at which FM sketch sets may need to be updated by the points or child entries of e, of in-memory RF M -tree for skyline together into a group Ee . We iteratively and synchronously traverse e and Ee , while always group the new expanded entries of the in-memory RF M -tree (for skyline points) together for each new expanded data entry for the next iteration. We outline our algorithm in Algorithm 4 and present the details of the procedure Traversal (e, E ) in Algorithm 5. Algorithm 4 UpdateSketch ( eH2 , S )
eH2 : an entry of RF M -tree of a dataset; S : skyline points indexed by an in-memory RF M -tree; Output: updated FM sketches at the skyline points in S ; Description: 1: if eS fully dominates W eH2 then 2: eS .f m := eS .f m eH2 .f m; 3: else Traversal(eH2 , {eS }); Input:

In Algorithm 4, eS is the root entry (MBB) of the inmemory RF M -tree on S ; according to Algorithm 3, eS either fully or partially dominates eH2 . It reads in the node in RF M -tree corresponding to eH2 to get its FM sketches eH2 .f m. If eH2 is a data point, it has to be hashed by FM algorithm to get eH2 .f m. Example 5. Regarding the example in Figure 5, suppose in the in-memory RF M -tree of skyline points s1 , s2 , . . . , s5 , s1 and s2 form one entry e1,2 and the entry e3,4,5 consists of 3 points s3 , s4 , and s5 (see Figure 7). When e3 is pop-up from H2 to be processed against the in-memory RF M -tree, it enters Algorithm 5. In Algorithm 5, e3 is subsequently decomposed into p2 , p3 , and p4 , while the root of in-memory RF M -tree is decomposed into e1,2 and e3,4,5 . When p2 is processed against e1,2 and e3,4,5 , e1,2 .f m is updated; nevertheless, (p2 , {s3 , s4 , s5 }) has been sent to Algorithm 5 for a further processing. We can immediately verify that in Algorithm 4, a data point p dominated by an s SP must be captured by one entry from the root to s in the in-memory RF M -tree. Consequently, as discussed earlier, computing FM sketch set at each skyline point s by performing bitwise-or operator iteratively on the FM sketch sets along the path from the root of in-memory RF M -tree to the skyline point s ensures the correctness of Algorithm 4; that is, it is equivalent to creating

Algorithm 5 Traversal ( e, E )
e: an entry of R -tree of a dataset; E : a set of entries of RF M -tree of skyline points; Output: updated FM sketches at the skyline points in E ; Description: 1: if e is a data point then ED := {e}; 2: else load in the child entries of e to ED ; 3: for each entry e ED do 4: Ee := ; 5: for each entry ei E do 6: if ei fully dominates e then W 7: GetFM(e ); ei .f m := ei .f m e .f m; 8: else if ei partially dominates e then 9: if ei is a data point entry then 10: Ee := Ee {ei }; 11: else 12: for each child entry ec of ei do 13: if ec fully dominates e then W 14: GetFM(e ); ec .f m = ec .f m e .f m; 15: else if ec partially dominates e then 16: Ee := Ee {ec }; 17: if Ee = then Traversal(e , Ee ); Procedure GetFM (e) if e is a data point, then hash e by FM algorithm to get e.f m; otherwise, read in e.f m by disk I/O if not already in memory. Input:
FM

trated in our experiment, FMG is highly accurate (with relative errors less than 5%) when 32 bitmaps (FM sketches) are used; in this case, the fanout (number of children entries) of RF M -tree is about 97% of that by a conventional R-tree. All algorithms are implemented by C++. The experiments are conducted on the PCs with Intel P4 2.8GHz CPU and 1G memory under the operating system Debian Linux. Table 2 below lists the parameters which may potentially have an impact on our performance study. In our experiments, all parameters use default values unless otherwise specied.
Notation n k d Denition (Default Values) Number of points in the dataset (1M) Number of representative skyline points (30) Dimensionality of the dataset (3) Number of sketches in FMG (32)

Table 2. System Parameters

6.1. Evaluating Accuracy

In this subsection, we experimentally evaluate the accuracy of FMG against different settings; that is, the number of points dominated by one of the k skyline points (comparing with that by EXACT in a 2d-space and that by GDY when d is between 3 to 5, respectively). Note that we did not do the accuracy evaluation against the exact solution for d 3 because the problem is NP-hard and it is too slow to produce exact solution for top-k RSP. The rst set of experiment is conducted against 4 datasets, Anti-correlated datasets (Anti) (in 2d and 5d spaces, respectively), Independent (Indep) (in a 5d space), and Stock. The experiment results are reported in Figure 8. It demonstrates that FMG is quite accurate when = 32, while GDY generates exact solution for Anti in 2d.
EXACT GDY

the FM sketch set at s by hashing each data point dominated by s by FM algorithm. Thus, Theorem 3 still holds. Note that if an entry of RF M -tree of the dataset is no longer needed in Algorithm 4, then it immediately removes from memory.

6. Performance Evaluation
We evaluate our algorithms only. Specically, we focus on evaluating our FM-based algorithm in section 5.3. As there is no existing work, we use the dynamic programming based algorithm in section 4 and the greedy heuristic in section 5.2 as bench-marking algorithms. The following algorithms have been implemented. 1. EXACT : the dynamic programming based algorithm proposed in section 4. 2. GDY : the greedy algorithm, Algorithm 1, in section 5.2. 3. FMG : the FM based greedy algorithm, Algorithm 2. in section 5.3. Following the common methodology in the literature, two synthetic datasets, Anti-correlated and Independent (random), have been employed in our performance evaluation, which are produced by the data generator in [4]. Their dimensionality varies from 2 to 5 and the number of data points varies from 200K to 3M . A real data set, Stock, from NYSE (New York Stock Exchange) is also used in our performance study. It contains 3M stock transaction records of NT (Nortel Networks Corporation,USA) from Dec 1st 2000 to May 22nd 2001. The average price per volume, volume, and time are recorded for each transaction; consequently it is used as a dataset in a 3d-space. All of the datasets are indexed by an RF M -tree with node page size 4Kbytes. In our implementation, each bitmap (FM sketch) has 32 bits (i.e., 4 bytes); a bitmap with 32 bits should be large enough to guarantee the FM accuracy while counting a massive number of distinct elements. As illus-

FMG

# of Points Dominated

1M 0.8M 0.6M 0.4M 2d Anti Stock 5d Indep 5d Anti

Figure 8. Number of Points Dominated In the rest of this subsection, we evaluate the accuracy G N | FMG by relative errors |NF M . N is the number of N data points dominated by at least one of the k skyline points in the exact solution in a 2d-space, N is such a number by GDY when d 3, NF M G is such a number by FMG.
0.05 Relative Error 0.04 0.03 0.02 0.01 10 20 30 40 50 Relative Error 2d Anti Stock 3d Indep 0.05 0.04 0.03 0.02 0.01 200K 400K 600K 800K 1M 2d Anti Stock 3d Indep

Figure 9. Various k Figure 10. Various n The second experiment evaluates possible impacts by different k values. It is conducted against three datasets, Anti in 2d space, Indep in 3d space, and the real datasets, Stock. The experiment results are depicted in Figure 9.

Again, it shows FMG is highly accurate. As anticipated, the accuracy improves when k increases. The third experiment examines an impact of the cardinality of datasets. The results depicted in Figure 10 indicate that the accuracy is not quite relevant to a dataset size. The last experiment examines an impact of the number of FM sketches. The results reported in Figure 11 conrm Theorem 3; that is, the accuracy improves when increases.
0.2 Relative Error 0.15 0.1 0.05 0 2 4 8 16 32 64 2d Anti Relative Error 0.05 0.04 0.03 Stock 3d Indep

sketches are involved in a bitwise-or operator gets increased.

Processing Time (s) Processing Time (s) 60 40 20 0 200K 400K 600K 800K 1M GDY EXACT FMG 150 100 50 0 200K 400K 600K GDY (Stock) GDY (Indep) FMG (Stock) FMG (Indep)

when

800K

(a) 2d Anti
Processing Time (s) Processing Time (s) 10 8 6 4 2 0 2 4 8 16 32 64 2d Anti

(b) 3d Indep & Stock

Stock 3d Indep

Figure 14. Time Efciency vs Cardinality

40 30 20 10 0 2 4 8 16 32 64

0.02 0.01 0 2 4 8 16 32 64

(a) 2d Anti

(b) 3d Indep & Stock

Figure 11. Relative Error vs

6.2. Evaluating Efciency

The rst experiment, as depicted in Figure 12 where the numbers on these bar gure tops give the actual running time, is conducted against the 4 datasets and the 3 algorithms EXACT, GDY, and FMG. It shows that GDY is very slow when d = 5; this is because {D({s}) : s SP } is too large to t in memory. Thus, additional I/O costs in Step 2 and Step 3 are required as stated in section 5.2. This makes Step 3 very slow. The results in Figure 12 demonstrate that FMG is most efcient.

(a) 2d Anti

(b) 3d Indep & Stock

Figure 15. Time Efciency vs

6.3. Space and I/O Efciency

We rst evaluate the space efciency. We illustrate the results for FMG only, since EXACT only works for 2dspace and GDY requires very large memory space at least (but could be much larger than) the size of whole dataset. In our experiment, we record the maximum space usage in the whole computation, including storing H1 , H2 , the space required to execute Algorithm 4. In H1 and H2 , we store ID and its MBB (or data point) for each entry. The rst set of experiments, conducted against Anticorrelated and Independent datasets with d in [2, 5], evaluate effects of d. The results are reported in Figure 16 where linear scale is used in y -coordinate and we show some important marks only on y -coordinate. It demonstrates that the memory space required is signicantly less than the dataset size. Note that we also examined the space requirements of EXACT in a 2d-space and storing {D({s}) : s SP } in GDY, respectively. Our experiment demonstrates that 1) EXACT requires space about 10 times more than that of FMG even the number m of skyline points is less than 67, and 2) GDY requires space about 3.63 times (d = 2) to 82.5 times (d = 5) of the corresponding dataset sizes.
23,437 Space (Kbyte) Dataset Anti Indep Space (Kbyte) 70,312 Dataset Anti Indep

EXACT

GDY

FMG

Processing Time (s)

10 10 10 10 10

4 3 2 1 0

15693

16858 1207

104 39 6 4 Stock 22

267

2d Anti

5d Indep

5d Anti

Figure 12. Processing Time The second experiment tests an effect of k . The experiment results are depicted in Figure 13. They also demonstrate that FMG is most efcient. While FMG and EXACT are not very sensitive to different k values, the time costs of GDY increase when k increases. This is mainly because that Step 3 in GDY involves a sort-merge process.
Processing Time (s) Processing Time (s) 60 40 20 0 10 20 30 40 50 GDY EXACT FMG 200 150 100 50 0 10 20 30 40 50 GDY (Stock) GDY (Indep) FMG (Stock) FMG (Indep)

11,718 8000 2000 2 3

35,156 15000 2000 2 3

(a) 2d Anti

(b) 3d Indep & Stock

(a) n = 1M

(b) n = 3M

Figure 13. Time Efciency vs k The results of our third experiment are reported in Figure 14. They show that EXACT, GDY, FMG are all sensitive to different data sizes. Note that GDY is very sensitive to data sizes due to the same reason as mentioned in the last experiment computation in Step 3. The nal experiment in this part is to evaluate an impact of in FMG. As depicted in Figure 15, the time costs of FMG increase when increases; this is simply because that a data element needs to be hashed more times and more FM

Figure 16. Space vs Dimensionality d The second experiment investigates impacts from different data sizes and . The results are depicted in Figure 17. They demonstrate that the memory space is not sensitive to because we do not keep FM sketches in heaps and we get FM sketches only when update them. It is also interesting to note that the memory space requirement is not very sensitive to a data size. This implies that FMG is scalable. We then evaluate the I/O efciency of FMG. For a 2d independent dataset with size 1M, only 60% of the nodes are

15,625 Space (Kbyte) Space (Kbyte) Dataset 3d Indep Stock 50 25 0 2 4 8 16 32 64

46,875 7,812 Dataset 3d Indep Stock

50 10 500K 1M

(a) Various

(b) Various n

Figure 17. Space vs and Cardinality n accessed by FMG, while one-scan of the whole RF M -tree is required by FMG when d 3. In our experiment, we also consider an alternative to implement FMG by two-scans: rst, the skyline points are computed by BBS and maintained by the in-memory RF M -tree; second, the data pages are read in one by one to perform Algorithm 4. Our experiment indicates the I/O cost ratio of FMG to this alternative decreases signicantly as d increases. For instance, the ratio is around 99% on Anti-correlated data set with d = 2 and n = 1M, while it drops to 63% when d = 5. Moreover, our experiment also shows that I/O costs of GDY are much more expensive than FMG regarding 1G xed memory. The I/O costs of GDY vary from 1.4 (d = 2) to 1772 (d = 5) times of those of FMG.

6.4. Summary
As a short summary, our performance evaluation indicates that FMG is quite accurate, efcient, and scalable regarding data size. When the number of skyline points is small, EXACT is a good choice in 2d-space. Although GDY is slightly more accurate than FMG, it requires huge memory space; thus, it is not scalable.

7. Conclusion
In this paper, we investigate the problem of computing the top-k representative skyline points. This is among the rst attempts to develop efcient and scalable algorithms to solve the problem. After introducing the novel skyline operator: top-k representative skyline points, we present an efcient dynamic programming based algorithm for a 2d-space in which an exact solution can be achieved. As shown in the paper, this problem is NP hard for space with dimensionality d 3 and the greedy heuristic for set cover problem can be immediately applied to provide the approximation ratio 1 1 e . We then develop an efcient, scalable randomised algorithm with a theoretical accuracy guarantee. As our performance study indicated, our randomized algorithm is both time and space efcient, as well as highly accurate. Acknowledgement. The work was partially supported by ARC-DP(DP0666428) and UNSW FRG(FRGP, PS08709).

References
[1] W.-T. Balke, U. G untzer, and J. X. Zheng. Efcient distributed skylining for web information systems. In EDBT 2004. [2] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efcient and robust access method for points and rectangles. In SIGMOD 1990. [3] J. L. Bentley, H. T. Kung, M. Schkolnick, and C. D. Thompson. On the average number of maxima in a set of vectors and applications. JACM, 25(4):536543, 1978. [4] S. B orzs onyi, D. Kossmann, and K. Stocker. The skyline operator. In ICDE 2001. [5] T. Brinkhoff, H.-P. Kriegel, and B. Seeger. Efcient processing of spatial joins using r-trees. In SIGMOD 1993.

[6] C. Y. Chan, P.-K. Eng, and K.-L. Tan. Stratied computation of skylines with partially-ordered domains. In SIGMOD 2005. [7] C. Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. Tung, and Z. Zhang. Finding k-dominant skylines in high dimensional space. In SIGMOD 2006. [8] C. Y. Chan, H. V. Jagadish, K.-L. Tan, A. K. Tung, and Z. Zhang. On high dimensional skylines. In EDBT 2006. [9] J. Chomicki, P. Godfrey, J. Gryz, and D. Liang. Skyline with presorting. In ICDE 2003. [10] K. Deng, X. Zhou, and H. T. Shen. Multi-source skyline query processing in road networks. In ICDE 2007. [11] W. Feller. An Introduction to Probability Theory and Its Applications. John Wiley & Sons, Inc., 1966. [12] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182209, 1985. [13] P. Godfrey, R. Shipley, and J. Gryz. Maximal vector computation in large data sets. In VLDB 2005. [14] A. Guttman. R-trees: A dynamic index structure for spatial searching. In SIGMOD 1984. [15] G. R. Hjaltason and H. Samet. Distance browsing in spatial databases. TODS, 24(2):265318, 1999. [16] D. S. Hochbaum. Approximation algorithms for the set covering and vertex cover problems. SIAM Journal on Computing, 11(3):555556, 1982. [17] Y.-W. Huang, N. Jing, and E. A. Rundensteiner. Spatial joins using R-trees: Breadth-rst traversal with global optimizations. In VLDB 1997. [18] Z. Huang, C. S. Jensen, H. Li, and B. C. Ooi. Skyline queries against mobile lightweight devices in MANETs. In ICDE 2006. [19] H. V. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. C. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In VLDB 1998. [20] S. Kapoor. Dynamic maintenance of maxima of 2-d point sets. SIAM Journal on Computing, 29(6):18581877, 2000. [21] V. Koltun and C. H. Papadimitriou. Approximately dominating representatives. In ICDT 2005. [22] D. Kossmann, F. Ramsak, and S. Rost. Shooting stars in the sky: An online algorithm for skyline queries. In VLDB 2002. [23] H. T. Kung, F. Luccio, and F. P. Preparata. On nding the maxima of a set of vectors. JACM, 22(4):469476, 1975. [24] C. Li, B. C. Ooi, A. K. Tung, and S. Wang. DATA: A data cube for dominant relationship analysis. In SIGMOD 2006. [25] X. Lin, Q. Liu, Y. Yuan, X. Zhou, and H. Lu. Summarizing level-two topological relations in large spatial datasets. TODS, 31(2):584630, 2006. [26] X. Lin, Y. Yuan, W. Wang, and H. Lu. Stabbing the sky: Efcient skyline computation over sliding windows. In ICDE 2005. [27] Massive Data Analysis Lab. http://www.cs. rutgers.edu/muthu/massdal.html. [28] D. Papadias, N. Mamoulis, and Y. Theodoridis. Processing and optimization of multiway spatial joins using R-trees. In PODS 1999. [29] D. Papadias, Y. Tao, G. Fu, and B. Seeger. An optimal and progressive algorithm for skyline queries. In SIGMOD 2003. [30] J. Pei, W. Jin, M. Ester, and Y. Tao. Catching the best views of skyline: A semantic approach based on decisive subspaces. In VLDB 2005. [31] P. Rigaux, M. Scholl, and A. Voisard. Introduction to Spatial Databases: Applications to GIS. 2000. [32] N. Roussopoulos, S. Kelley, and F. Vincent. Nearest neighbor queries. In SIGMOD 1995. [33] K.-L. Tan, P.-K. Eng, and B. C. Ooi. Efcient progressive skyline computation. In VLDB 2001. [34] Y. Tao and D. Papadias. Maintaining sliding window skylines on data streams. TKDE, 18(2):377391, 2006. [35] Y. Tao, X. Xiao, and J. Pei. SUBSKY: Efcient computation of skylines in subspaces. In ICDE 2006. [36] T. Xia and D. Zhang. Refreshing the sky: The compressed skycube with efcient support for frequent updates. In SIGMOD 2006. [37] Y. Yuan, X. Lin, Q. Liu, W. Wang, J. X. Yu, and Q. Zhang. Efcient computation of the skyline cube. In VLDB 2005.

Computing Skyline Points For Decision Making Applications
No ratings yet
Computing Skyline Points For Decision Making Applications
8 pages
An Effective Framework For Skyline Queries Using PCA
No ratings yet
An Effective Framework For Skyline Queries Using PCA
5 pages
The Spatial Nearest Neighbor Skyline Queries
No ratings yet
The Spatial Nearest Neighbor Skyline Queries
15 pages
Weighing The Techniques For Data Optimization in A Database: Anagha Radhakrishnan Politecnico Di Milano Milan, Italy
No ratings yet
Weighing The Techniques For Data Optimization in A Database: Anagha Radhakrishnan Politecnico Di Milano Milan, Italy
9 pages
Skyline Algorithms for Data Queries
No ratings yet
Skyline Algorithms for Data Queries
25 pages
Interactive Skyline Ranking with ML
No ratings yet
Interactive Skyline Ranking with ML
8 pages
Skyline Sets Queries For Incomplete Data
No ratings yet
Skyline Sets Queries For Incomplete Data
14 pages
Presentation For CSG399 by Jian Wen
No ratings yet
Presentation For CSG399 by Jian Wen
32 pages
Efficient Ranking On Road Network by Quality Preferences
No ratings yet
Efficient Ranking On Road Network by Quality Preferences
10 pages
Understanding Skyline Queries in Databases
No ratings yet
Understanding Skyline Queries in Databases
51 pages
ProcessingSkylineQueriesinIncompleteDatabase Issues Challenges and Future Trends Published Version
No ratings yet
ProcessingSkylineQueriesinIncompleteDatabase Issues Challenges and Future Trends Published Version
13 pages
Ranking Spatial Data by Quality Preferences: Abstract
0% (1)
Ranking Spatial Data by Quality Preferences: Abstract
16 pages
An Efficient Skyline
No ratings yet
An Efficient Skyline
14 pages
Speedy Privacy-Preserving Skyline Queries On Outsourced Data
No ratings yet
Speedy Privacy-Preserving Skyline Queries On Outsourced Data
21 pages
Padmapriya - KSR Engineering College
No ratings yet
Padmapriya - KSR Engineering College
6 pages
Efficient Skyline Computation On Massive Incomplet
No ratings yet
Efficient Skyline Computation On Massive Incomplet
19 pages
3 Improving Viability of Electric Taxis by Taxi Service Strategy Optimization
No ratings yet
3 Improving Viability of Electric Taxis by Taxi Service Strategy Optimization
3 pages
Geometric Intersection Query Algorithms
No ratings yet
Geometric Intersection Query Algorithms
4 pages
Multi-Day Itinerary Planning Service
No ratings yet
Multi-Day Itinerary Planning Service
14 pages
Top-K Dominating Queries Using Multidimensional Data
No ratings yet
Top-K Dominating Queries Using Multidimensional Data
5 pages
"K Nearest and Furthest Points in M-Dimensional Space
No ratings yet
"K Nearest and Furthest Points in M-Dimensional Space
3 pages
10 1109@tkde 2018 2857471
No ratings yet
10 1109@tkde 2018 2857471
14 pages
Data Aggregation for Uncertain Queries
No ratings yet
Data Aggregation for Uncertain Queries
4 pages
Efficient Privacy-Preserving Skyline Queries
No ratings yet
Efficient Privacy-Preserving Skyline Queries
16 pages
Database Skyline Join Queries
No ratings yet
Database Skyline Join Queries
12 pages
Report On Nearest Dominating Point Queries: Naman Mishra, Sreeramji K S November 2024
No ratings yet
Report On Nearest Dominating Point Queries: Naman Mishra, Sreeramji K S November 2024
5 pages
Exploratory Data Analysis
No ratings yet
Exploratory Data Analysis
15 pages
Calculating The Longest Dominated Locations: Abstract
No ratings yet
Calculating The Longest Dominated Locations: Abstract
3 pages
Query Profiler for Skyline Optimization
No ratings yet
Query Profiler for Skyline Optimization
4 pages
Answering Metric Skyline Queries by PM-tree
No ratings yet
Answering Metric Skyline Queries by PM-tree
16 pages
Zhang RLGZS22
No ratings yet
Zhang RLGZS22
15 pages
Skyline Queries Project Report
No ratings yet
Skyline Queries Project Report
10 pages
Multi Dim Point Data
No ratings yet
Multi Dim Point Data
143 pages
2012 Clustering Based $K$ Anonymity
No ratings yet
2012 Clustering Based $K$ Anonymity
13 pages
1 Larson Sol
No ratings yet
1 Larson Sol
134 pages
CS412, Fall 2010, Assignment 1: Solution
No ratings yet
CS412, Fall 2010, Assignment 1: Solution
8 pages
20 Cs 112
No ratings yet
20 Cs 112
11 pages
A Framework For Processing K-Best Site Query
No ratings yet
A Framework For Processing K-Best Site Query
12 pages
Spatial Data Management
No ratings yet
Spatial Data Management
7 pages
Multi Dim DC
No ratings yet
Multi Dim DC
5 pages
GCUBE Indexing
No ratings yet
GCUBE Indexing
12 pages
Progressive Top-K Subarray Query Processing in Array Databases
No ratings yet
Progressive Top-K Subarray Query Processing in Array Databases
13 pages
Indexing and Ranking in Spatial Database
No ratings yet
Indexing and Ranking in Spatial Database
10 pages
Aaa 7
No ratings yet
Aaa 7
36 pages
NPHardkgeq 3 Euclid Plane
No ratings yet
NPHardkgeq 3 Euclid Plane
9 pages
Closest Pair of Coordinates
No ratings yet
Closest Pair of Coordinates
24 pages
K-Nearest Neighbor (KNN) : Salwa Muhammad Akhtar
No ratings yet
K-Nearest Neighbor (KNN) : Salwa Muhammad Akhtar
16 pages
Algorithms For Spatial Outlier Detection: Chang-Tien Lu Dechang Chen Yufeng Kou
No ratings yet
Algorithms For Spatial Outlier Detection: Chang-Tien Lu Dechang Chen Yufeng Kou
4 pages
Mondrian K-Anonymity
No ratings yet
Mondrian K-Anonymity
11 pages
Monitoring Reverse Top-K Queries Over Mobile Devices: Akrivi Vlachou Christos Doulkeridis Kjetil Nørvåg
No ratings yet
Monitoring Reverse Top-K Queries Over Mobile Devices: Akrivi Vlachou Christos Doulkeridis Kjetil Nørvåg
8 pages
Lesson 4 - Supervised Learning
No ratings yet
Lesson 4 - Supervised Learning
36 pages
Solutions For HW6-CS 6033 Fall 2023
No ratings yet
Solutions For HW6-CS 6033 Fall 2023
9 pages
Ranking Spatial Data by Quality Preferences
No ratings yet
Ranking Spatial Data by Quality Preferences
14 pages
Ifsa Eusflat 2015 Submission 24
No ratings yet
Ifsa Eusflat 2015 Submission 24
7 pages
Understanding Skyline Queries in Databases
No ratings yet
Understanding Skyline Queries in Databases
22 pages
Tulip Fund
No ratings yet
Tulip Fund
1 page
Readme PDF
No ratings yet
Readme PDF
1 page
Balanced Baskets for Trading & Hedging
100% (1)
Balanced Baskets for Trading & Hedging
44 pages
Crown Burgers Menu: Sr. No. Item Rate
No ratings yet
Crown Burgers Menu: Sr. No. Item Rate
2 pages
Linux Boot Process PDF
100% (2)
Linux Boot Process PDF
17 pages
Pursuing Research in Computer Science
No ratings yet
Pursuing Research in Computer Science
2 pages
CBSE Class 9 Physics Worksheet - Gravitation
No ratings yet
CBSE Class 9 Physics Worksheet - Gravitation
2 pages
Drainage Calculation: As Per IRC SP 50, Table 6.1
No ratings yet
Drainage Calculation: As Per IRC SP 50, Table 6.1
1 page
Scaffolding Inspection Checklist
No ratings yet
Scaffolding Inspection Checklist
5 pages
Implementation of VI Editor With NCURSES
No ratings yet
Implementation of VI Editor With NCURSES
10 pages
CSE Applied Physics Exam Questions Guide
No ratings yet
CSE Applied Physics Exam Questions Guide
2 pages
A Mathematical Simulation Model of A CH-47B Helicopter
No ratings yet
A Mathematical Simulation Model of A CH-47B Helicopter
136 pages
Catalogo Enxerto Pediatrico Gore
No ratings yet
Catalogo Enxerto Pediatrico Gore
12 pages
Multimedia Systems Overview
No ratings yet
Multimedia Systems Overview
24 pages
Machine Learning for ICT Students
No ratings yet
Machine Learning for ICT Students
26 pages
Solving Equations in Social Sciences
No ratings yet
Solving Equations in Social Sciences
29 pages
Periodic Table and Ionic Compounds
No ratings yet
Periodic Table and Ionic Compounds
48 pages
Engineering Mechanics: 3D Equilibrium
50% (2)
Engineering Mechanics: 3D Equilibrium
45 pages
Multiagent Based Reinforcement Learning MA-RL An Automated Designer For Complex Analog Circuits
No ratings yet
Multiagent Based Reinforcement Learning MA-RL An Automated Designer For Complex Analog Circuits
14 pages
Final Proj PDF
No ratings yet
Final Proj PDF
46 pages
Engineering Graphics
No ratings yet
Engineering Graphics
6 pages
Chapter 4
No ratings yet
Chapter 4
14 pages
12th Maths EM Unit Test 2 Model Question Paper English Medium PDF Download
No ratings yet
12th Maths EM Unit Test 2 Model Question Paper English Medium PDF Download
2 pages
Rounding Numbers Questions MME
No ratings yet
Rounding Numbers Questions MME
6 pages
MD1 08B Power Screws May2022 NS
No ratings yet
MD1 08B Power Screws May2022 NS
8 pages
Data Structures: Key Concepts and Comparisons
No ratings yet
Data Structures: Key Concepts and Comparisons
3 pages
Radial Flow Between Parallel Disks
No ratings yet
Radial Flow Between Parallel Disks
29 pages
Wrapped Exponentiated Inverted Weibull Distribution
No ratings yet
Wrapped Exponentiated Inverted Weibull Distribution
8 pages
Degrees of Freedom in Gases Explained
No ratings yet
Degrees of Freedom in Gases Explained
2 pages
DIO1000 Datasheet
No ratings yet
DIO1000 Datasheet
3 pages
Prometheus Part 12 Internals Storage & Security
No ratings yet
Prometheus Part 12 Internals Storage & Security
27 pages
Furnace Efficiency Optimization Guide
No ratings yet
Furnace Efficiency Optimization Guide
27 pages
Bayesian Non and Semi Parametric Methods and Applications Peter Rossi No Waiting Time
No ratings yet
Bayesian Non and Semi Parametric Methods and Applications Peter Rossi No Waiting Time
107 pages
A6-Governing Scheme Write Up
No ratings yet
A6-Governing Scheme Write Up
4 pages
MATH 69: Assessment and Evaluation in Mathematics
No ratings yet
MATH 69: Assessment and Evaluation in Mathematics
29 pages

Icde 2007 K-Skyline

Uploaded by

Icde 2007 K-Skyline

Uploaded by

Selecting Stars: The k Most Representative Skyline Operator

E-Health Research Center CSIRO ICT Center, Australia [email protected]

Table 1. Math Notation

2.1. Problem Statement

2.2. Related work

3.1. BBS Algorithm

point s dominates entry e iff s dominates the lower-left corner of e.

4.1. Dynamic Programming Based Algorithm

For example, |(s4 , s1 )| = |(s4 , s2 )| + |C2,4 | + |C2,5 | = 2, as depicted in Figure 3.

D(s4,s1) are the points in the shadowed cells p9 2

D(s4,s2) are the points in the bold border cells

5.3. FM-based Algorithm

5.2. Greedy Algorithm

e is Eulers constant rather than an entry in an R-tree.

Note that Est(s.f m

S.f m) is calculated by the for-

Note that in Algorithm 3, H1 is a heap maintained in the

fully dominated not dominated x

Table 2. System Parameters

6.1. Evaluating Accuracy

1M 0.8M 0.6M 0.4M 2d Anti Stock 5d Indep 5d Anti

sketches are involved in a bitwise-or operator gets increased.

(b) 3d Indep & Stock

Figure 14. Time Efciency vs Cardinality

(b) 3d Indep & Stock

Figure 11. Relative Error vs

6.2. Evaluating Efciency

(b) 3d Indep & Stock

Figure 15. Time Efciency vs

6.3. Space and I/O Efciency

Processing Time (s)

11,718 8000 2000 2 3

35,156 15000 2000 2 3

(b) 3d Indep & Stock

15,625 Space (Kbyte) Space (Kbyte) Dataset 3d Indep Stock 50 25 0 2 4 8 16 32 64

46,875 7,812 Dataset 3d Indep Stock

You might also like