Ridesharing has become a popular commuting mode. Dynamically arriving riders post their origins and destinations, and the platform assigns drivers to serve them. In ridesharing, different groups of riders can be served by one driver if their trips share common routes. Recently, many ridesharing companies (e.g., Didi and Uber) have proposed a new mode, namely "ridesharing with meeting points". Specifically, in exchange for a short walk and a lower fare, riders can be picked up and dropped off near their origins and destinations, respectively. In addition, meeting points enable more flexible routing for drivers, which can potentially improve the global profit of the system. In this paper, we first formally define the Meeting-Point-based Online Ridesharing Problem (MORP). We prove that MORP is NP-hard and that there is no polynomial-time deterministic algorithm with a constant competitive ratio for it. We notice that a vertex-set structure, the k-skip cover, fits MORP well. A k-skip cover tends to find the vertices (meeting points) that are convenient for riders and drivers to come and go. With meeting points, MORP tends to serve more riders with these convenient vertices. Based on this idea, we introduce a convenience-based meeting point candidate selection algorithm. We further propose a hierarchical meeting-point oriented graph (HMPO graph), which ranks vertices for assignment effectiveness and constructs a k-skip cover to accelerate the whole assignment process. Finally, we utilize the merits of k-skip cover points for ridesharing and propose a novel algorithm, namely SMDB, to solve MORP. Extensive experiments on real and synthetic datasets validate the effectiveness and efficiency of our algorithms.
In recent years, how to prevent the widespread transmission of infectious diseases in communities has been a research hot spot. Tracing close contact with infected individuals is one of the most severe problems. In this work, we present a model called Follower Prediction Graph Network (FPGN) to identify high-risk visitors, which is known as follower prediction. The model is designed to identify visitors who may be infected with a disease by tracking their activities at the exact location of infected visitors. FPGN is inspired by the state-of-the-art temporal graph edge prediction algorithm TGN and draws on the shortcomings of existing algorithms. It utilizes graph structure information based on the (α, β)-core, time interval statistics derived from timestamp information, and a GAT-based prediction module to achieve high accuracy in follower prediction. Extensive experiments are conducted on two real datasets, demonstrating the progress of FPGN. The...
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Along with the rapid technological and commercial innovation on e-commerce platforms, there is an increasing number of frauds that bring great harm to these platforms. Many frauds are conducted by organized groups of fraudsters for higher efficiency and lower costs, which are also known as group-based frauds. Despite the high concealment and strong destructiveness of group-based fraud, there is no existing research work that can thoroughly exploit the information within the transaction networks of e-commerce platforms for group-based fraud detection. In this work, we analyze and summarize the characteristics of group-based frauds, based on which we propose a novel end-to-end semi-supervised Group-based Fraud Detection Network (GFDN) to support such fraud detection in real-world applications. Experimental results on large-scale e-commerce datasets from Taobao and Bitcoin trading datasets show the superior effectiveness and efficiency of our proposed model for group-based fraud detection on bipartite graphs. CCS Concepts: Information systems → Data mining.
With the increasing pervasiveness of geo-positioning technologies, there is an enormous amount of spatio-textual objects available in many applications such as location based services and social networks. Consequently, various types of spatial keyword searches which explore both locations and textual descriptions of the objects have been intensively studied by the research communities and commercial organizations. In many important applications (e.g., location based services), the closeness of two spatial objects is measured by the road network distance. Moreover, result diversification is becoming a common practice to enhance the quality of the search results. Motivated by the above facts, in this paper we study the problem of diversified spatial keyword search on road networks which considers both the relevance and the spatial diversity of the results. An efficient signature-based inverted indexing technique is proposed to facilitate the spatial keyword query processing on r...
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
With the proliferation of graph applications, the problem of efficiently computing all k-edge connected components of a graph G for a user-given k has been recently investigated. In this paper, we study the problem of efficiently computing the Steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC. To accommodate online query processing, we present an efficient algorithm based on a novel index such that the algorithm runs in time linear in the result size; thus, the algorithm is optimal since it needs at least linear time to output the result. Moreover, in this paper we also investigate variations of the above problem. We show that such a problem with the constraint that the size of the SMCC is not smaller than a given size can also be solved in time linear in the result size (thus, optimal). We also show that the problem of computing the connectivity (rather than the graph details) of SMCC can be solved in time linear in the query size (thus, optimal). To build the index, we extend the techniques in [7] to accommodate batch processing and computation sharing. To efficiently support applications with graph updates, we also present novel incremental techniques. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our index-based algorithms outperform baseline algorithms by several orders of magnitude and that our indexing algorithms are efficient.
Proceedings of the 22nd ACM international conference on Multimedia, 2014
Often, a data object described by many features can be decomposed into multiple modalities, which provide complementary information to each other. In this paper, we study subspace clustering for multi-modal data by effectively exploiting data correlation consensus across modalities, while keeping individual modalities well encapsulated. Our technique can yield a more ideal data similarity matrix, which encodes strong data correlations for the cross-modal data objects in the same subspace. To these ends, we propose a novel angular based regularizer coupled with our objective function, which is aided by trace lasso and minimized to yield sparse representation vectors encoding data correlations in multiple modalities. As a result, the sparse code vectors of the same cross-modal data have small angular difference so as to achieve the data correlation consensus simultaneously. This can generate a compatible data similarity matrix for multi-modal data. The final subspace clustering result is obtained by applying spectral clustering to this data similarity matrix.
Often, a data object described by many features can be naturally decomposed into multiple "views", where each view consists of a subset of features. For example, a video clip may have a video view and an audio view. Given a set of training data objects with multiple views, where some objects are labeled and the others are not, semi-supervised learning with graphs from multi-views tries to learn a classifier by treating each view as a similarity graph on all objects, where edges are defined by the similarity on object pairs based on the view attributes. Labels and label relevance ranking scores of labeled objects can be propagated from labeled objects to unlabeled objects on the similarity graphs so that similar objects receive similar labels. The state-of-the-art, one-combo-fits-all methods linearly and independently combine either the metrics or the label propagation results from multi-views and then build a model based on the combined results. However, more often than not, the similarities between various objects may be manifested differently by different views. In such situations, the one-combo-fits-all methods may not perform well. To tackle the problem, we develop an iterative Semi-Supervised Metric Fusion (SSMF) approach in this paper. SSMF fuses metrics and label propagation results from multi-views iteratively until the fused metric and label propagation results converge simultaneously. Views are weighted dynamically during the fusion process so that the adverse effect of irrelevant views, identified at each iteration of the fusion process, can be reduced effectively. To evaluate the effectiveness of SSMF, we apply it to multi-view based and content based image retrieval and to multi-view based multi-label image classification on real-world data sets, which demonstrates that our method outperforms the state-of-the-art methods.
2015 IEEE 31st International Conference on Data Engineering, 2015
Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and returns an enormous number of cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem, which is to find the top-k cliques that cover the largest number of nodes in the graph. Diversified top-k clique search can be widely used in applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover the most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has a limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel lightweight
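The naive baseline described in this abstract (hold every maximal clique in memory, then run greedy max k-cover) can be sketched in a few lines. This is only an illustration of that baseline, not the authors' streaming algorithm; the function name `greedy_max_k_cover` and the integer node ids are assumptions made for the example, and the cliques are taken as given rather than enumerated.

```python
from typing import FrozenSet, Iterable, List, Set


def greedy_max_k_cover(cliques: Iterable[FrozenSet[int]], k: int) -> List[FrozenSet[int]]:
    """Pick up to k cliques, each time taking the one that covers the most new nodes."""
    remaining = list(cliques)          # all cliques are held in memory (the naive part)
    covered: Set[int] = set()
    chosen: List[FrozenSet[int]] = []
    for _ in range(k):
        # Clique with the largest marginal gain; ties go to the earliest clique.
        best = max(remaining, key=lambda c: len(c - covered), default=None)
        if best is None or not (best - covered):
            break                      # nothing left that adds a new node
        chosen.append(best)
        covered |= best
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    cliques = [frozenset({1, 2, 3}), frozenset({3, 4}),
               frozenset({2, 3, 4}), frozenset({5, 6})]
    picked = greedy_max_k_cover(cliques, k=2)
    print(picked, len(set().union(*picked)))
```

Because the marginal gain is recomputed after every pick, heavily overlapping cliques stop being attractive once one of them is chosen, which is the diversification effect the abstract refers to.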
In this paper, we study the problems of nearest neighbor queries (NN) and all nearest neighbor queries (ANN) on location data, which have a wide range of applications such as Geographic Information Systems (GIS) and Location Based Services (LBS). We propose a new structure, termed AVR-Tree, based on the R-tree and Voronoi diagram techniques. Compared with the existing indexing techniques used for NN and ANN queries on location data, AVR-Tree can achieve a better tradeoff between the pruning effectiveness and the index size for NN and ANN queries. We also conduct a comprehensive performance evaluation for the proposed techniques based on both real and synthetic data, which shows that AVR-Tree based NN and ANN algorithms achieve better performance compared with their best competitors in terms of both CPU and I/O costs.
Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014
With the prevalence of geo-position enabled devices and services, a rapidly growing number of tweets are associated with geo-tags. Consequently, real-time search on geo-tagged Twitter streams has attracted great attention. In this paper, we advocate the significance of the co-occurrence of keywords for geo-tagged tweet analytics, which is overlooked by existing studies. In particular, we formally introduce the problem of identifying local frequent keyword co-occurrence patterns over geo-tagged Twitter streams, namely the LFP query. To accommodate the high volume and the rapid updates of the Twitter stream, we develop an inverted KMV sketch (IK sketch for short) structure to capture the co-occurrence of keywords in limited space. Then efficient algorithms are developed based on the IK sketch to support LFP queries as well as their variant. An extensive empirical study on a real Twitter dataset confirms the effectiveness and efficiency of our approaches.
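The abstract does not describe how the IK sketch is organised, so the following is only a sketch of the plain KMV (k minimum values) distinct-count estimator that the name refers to, not the inverted structure or the LFP algorithms themselves. The hash choice (SHA-1 truncated to 64 bits) and the function names `kmv_sketch` and `estimate_distinct` are assumptions made for this illustration.

```python
import hashlib
import heapq
from typing import Iterable, List


def _hash01(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1] (assumed hash choice)."""
    digest = hashlib.sha1(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_sketch(items: Iterable[str], k: int) -> List[float]:
    """Keep the k smallest distinct hash values of the stream, using O(k) memory."""
    heap: List[float] = []        # max-heap via negation: -heap[0] is the largest kept hash
    members = set()               # hash values currently kept, to skip duplicates
    for item in items:
        h = _hash01(item)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)   # drops the current largest hash
            members.discard(evicted)
            members.add(h)
    return sorted(-v for v in heap)


def estimate_distinct(sketch: List[float], k: int) -> float:
    """Exact count if fewer than k values were seen, else the (k - 1) / U_k estimate."""
    if len(sketch) < k:
        return float(len(sketch))
    return (k - 1) / sketch[-1]


if __name__ == "__main__":
    keywords = [f"kw{i % 5000}" for i in range(20000)]   # 5000 distinct keywords
    print(estimate_distinct(kmv_sketch(keywords, 256), 256))
```

The sketch size depends only on k, not on the stream length, which is the property that makes this family of sketches attractive for high-volume streams.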
With the emergence of location-aware mobile device technologies, communication technologies and GPS systems, various location-aware queries have attracted great attention in the database literature. In many user recommendation systems, the spatial preference query is used to suggest objects based on their spatial proximity to the facilities. In this paper, we study the problem of the general spatial skyline, which can provide a minimal set of candidates that contains the optimal solutions for any monotonic distance based spatial preference query. An efficient algorithm is proposed to significantly reduce the number of non-promising objects in the computation. The paper also covers a comprehensive performance study of the proposed techniques based on both real and synthetic data.
2009 IEEE 25th International Conference on Data Engineering, 2009
Uncertain data are inherent in many applications such as environmental surveillance and quantitative economics research. Recently, considerable research effort has been put into the field of analysing uncertain data. In this paper, we study the problem of processing the uncertain location based range aggregate in a multi-dimensional space. We first formally introduce the problem, then propose a general filtering-and-verification framework to solve it. Two filtering techniques, named STF and PCR respectively, are proposed to significantly reduce the verification cost.
2012 IEEE 28th International Conference on Data Engineering, 2012
Top-k pairs queries have received significant attention from the research community. k-closest pairs queries, k-furthest pairs queries and their variants are among the most well-studied special cases of top-k pairs queries. In this paper, we present the first approach to answer a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries, and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic in the number of objects N in the sliding window, we efficiently answer the top-k pairs query by maintaining a small subset of pairs called the K-skyband, which is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for the K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost.
... when the divergence between the two stocks returns to normal. A top-k pairs query can be issued to obtain the pairs of stocks that are correlated (e.g., they belong to the same business sector and have similar fundamentals such as market caps, dividends etc.) and display different trends. Pair-trading can be profitable only if the trader is the first one to capitalize on the opportunity [7]. Hence, the trader may want to continuously monitor the top-k pairs from the most recent data (e.g., a sliding window containing the most recent n items).
Consider another example of an online auction website. A user may be interested in finding the pairs of products that have similar specifications but are sold at very different prices (i.e., different final bids). Such pairs may be used to understand user behavior and market trends, e.g., suitable bidding times for buyers and suitable bid closing times for sellers. An analyst or a user may issue the following query to obtain the top-k pairs of such products sold during the last 7 days.
    Select a.id, b.id from auction a, auction b where a.id < b.id order by dist(a.spec, b.spec) - |a.bid - b.bid| limit k window [7 days]
Here dist(a.spec, b.spec) computes the distance (or difference) between their specifications and |a.bid - b.bid| denotes the absolute difference between the final bids they receive. Note that the query prefers the pairs of products that have a small difference between their specifications but a large difference between their selling prices. The condition a.id < b.id ensures that a pair (a, b) is not repeated as (b, a). While the above example shows a simple scoring function, in real-world applications, the users may specify a more sophisticated scoring function. Our framework allows the users to define arbitrarily complex scoring functions. A query that retrieves the top-k pairs among the most recent n data items (i.e., a sliding window of size n) and uses the scoring function s is denoted as Q(k, n, s).
A. Contributions
Our framework has the following features.
Unified framework. To the best of our knowledge, we are the first to study top-k pairs queries over sliding windows. We present a unified framework that efficiently solves the
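To make the semantics of a query Q(k, n, s) concrete, the auction example can be written as a brute-force reference that scores every pair among the most recent n items. This is a minimal sketch of what the query returns, not the K-skyband technique of the paper; the `Item` record, the use of Manhattan distance for dist, and the function names are assumptions made for illustration.

```python
import heapq
from itertools import combinations
from typing import Callable, List, NamedTuple, Sequence, Tuple


class Item(NamedTuple):
    id: int
    spec: Tuple[float, ...]   # numeric specification vector (assumed representation)
    bid: float                # final bid


def auction_score(a: Item, b: Item) -> float:
    """dist(a.spec, b.spec) - |a.bid - b.bid|, with Manhattan distance assumed for dist."""
    spec_dist = sum(abs(x - y) for x, y in zip(a.spec, b.spec))
    return spec_dist - abs(a.bid - b.bid)


def top_k_pairs(stream: Sequence[Item], k: int, n: int,
                score: Callable[[Item, Item], float]) -> List[Tuple[float, int, int]]:
    """Answer Q(k, n, s) by scoring all pairs in the window: O(n^2) work per query."""
    window = stream[-n:] if n > 0 else []
    scored = (
        (score(a, b), min(a.id, b.id), max(a.id, b.id))   # report each pair once, a.id < b.id
        for a, b in combinations(window, 2)
    )
    return heapq.nsmallest(k, scored)   # smallest scores first, as in "order by ... limit k"


if __name__ == "__main__":
    stream = [Item(1, (1.0,), 10.0), Item(2, (1.0,), 50.0),
              Item(3, (5.0,), 12.0), Item(4, (1.5,), 90.0)]
    print(top_k_pairs(stream, k=2, n=3, score=auction_score))
```

The quadratic number of pairs scored here on every call is exactly the cost that maintaining a small subset of pairs is meant to avoid.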
IEEE Transactions on Knowledge and Data Engineering, 2014
Given a scoring function that computes the score of a pair of objects, a top-k pairs query returns the k pairs with the smallest scores. In this paper, we present a unified framework for answering generic top-k pairs queries including k-closest pairs queries, k-furthest pairs queries and their variants. Note that the k-closest pairs query is a special case of top-k pairs queries where the scoring function is the distance between the two objects in a pair. We are the first to present a unified framework to efficiently answer a broad class of top-k queries including the queries mentioned above. We present efficient algorithms and provide a detailed theoretical analysis that demonstrates that the expected performance of our proposed algorithms is optimal for two-dimensional data sets. Furthermore, our framework does not require pre-built indexes, uses limited main memory and is easy to implement. We also extend our techniques to support top-k pairs queries on multi-valued (or uncertain) objects. We also demonstrate that our framework can handle exclusive top-k pairs queries. Our extensive experimental study demonstrates the effectiveness and efficiency of our proposed techniques. Index Terms: Closest pairs queries, furthest pairs queries, top-k queries, multi-valued objects, uncertain objects.
IEEE Transactions on Knowledge and Data Engineering, 2014
Top-k pairs and top-k objects queries have received significant attention from the research community. In this paper, we present the first approach to answer a broad class of top-k pairs and top-k objects queries over sliding windows. Our framework handles multiple top-k queries, and each query is allowed to use a different scoring function, a different value of k, and a different size of the sliding window. Furthermore, the framework allows the users to define arbitrarily complex scoring functions and supports out-of-order data streams. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for the K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. For top-k pairs queries, we demonstrate the efficiency of our approach by comparing it with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost. For top-k objects queries, our experimental results demonstrate the superiority of our algorithm over the state-of-the-art algorithm.
Given a set of criteria, an object o dominates another object o′ if o is preferable to o′ according to every criterion. A skyline query returns every object that is not dominated by any other object. In this paper, we study the problem of continuously monitoring a moving skyline query where one of the criteria is the distance between the objects and the moving query. We propose a safe zone based approach to address the challenge of efficiently updating the results as the query moves. A safe zone is the area such that the results of a query remain unchanged as long as the query lies inside this area. Hence, the results need to be updated only when the query leaves its safe zone. Although the main focus of this paper is to present the techniques for the Euclidean distance metric, the proposed techniques are applicable to any metric distance (e.g., Manhattan distance, road network distance). We present several non-trivial optimizations and propose an efficient algorithm for safe zone construction. Our experiments demonstrate that the cost of our safe zone based approach is reasonably close to a lower bound cost and is three orders of magnitude lower than the cost of a naïve algorithm.
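The dominance and skyline definitions above translate directly into a quadratic-time reference implementation. The sketch below follows the abstract's wording (o dominates o′ only if it is better on every criterion) and assumes that smaller values are preferable and that each object has a 2-D location plus one static criterion such as price; it recomputes the skyline from scratch for each query position and does not implement the safe zone construction.

```python
from math import dist
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Obj = Tuple[Point, float]          # (location, static criterion such as price)


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is strictly better (smaller) on every criterion."""
    return all(x < y for x, y in zip(a, b))


def skyline(objects: Sequence[Obj], query: Point) -> List[Obj]:
    """Return the objects not dominated by any other, with distance to the query as a criterion."""
    vectors = [(dist(loc, query), price) for loc, price in objects]
    return [
        objects[i]
        for i, v in enumerate(vectors)
        if not any(dominates(w, v) for w in vectors)
    ]


if __name__ == "__main__":
    objs = [((0.0, 0.0), 5.0), ((1.0, 0.0), 3.0), ((3.0, 0.0), 4.0), ((4.0, 0.0), 1.0)]
    print(skyline(objs, (0.0, 0.0)))   # skyline for the query at the origin
    print(skyline(objs, (4.0, 0.0)))   # the skyline changes once the query has moved
```

Running it for two query positions shows why continuous monitoring is non-trivial: the distance criterion changes for every object as the query moves, so the skyline can change even though the objects themselves do not.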
Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012
Given a query graph q and a data graph G, computing all occurrences of q in G, namely exact all-matching, is fundamental in graph data analysis with a wide spectrum of real applications. It is challenging since even finding one occurrence of q in G (the subgraph isomorphism test) is NP-complete. Moreover, in many real applications, exploratory queries from users often express their real demands inaccurately. In this paper, we study the problem of efficiently computing all approximate occurrences of q in G. In particular, we study the problem of efficiently retrieving all matches of q in G with the number of possible missing edges bounded by a given threshold θ, namely similarity all-matching. The problem of similarity all-matching is harder than the problem of exact all-matching since it covers exact all-matching as the special case θ = 0. In this paper, we develop a novel paradigm to conduct similarity all-matching. Specifically, we propose to use a minimal set QT of spanning trees in q to cover all connected subgraphs q′ of q missing at most θ edges; that is, each q′ is spanned by a spanning tree in QT. Then, we conduct exact all-matching for each spanning tree in QT to induce all similarity matches. A rigorous theoretical analysis shows that our new search paradigm significantly reduces the number of times exact all-matching is conducted compared with the existing techniques. To further speed up the computation, we develop new filtering, computation sharing, and search ordering techniques. Our comprehensive experiments on both real and synthetic datasets demonstrate that our techniques outperform the state-of-the-art technique by 7 orders of magnitude.
In this paper, we study the problem of (p, q)-biclique counting and enumeration for large sparse bipartite graphs. Given a bipartite graph G = (U, V, E) and two integer parameters p and q, we aim to efficiently count and enumerate all (p, q)-bicliques in G, where a (p, q)-biclique B(L, R) is a complete subgraph of G with L ⊆ U, R ⊆ V, |L| = p, and |R| = q. The problem of (p, q)-biclique counting and enumeration has many applications, such as graph neural network information aggregation, densest subgraph detection, and cohesive subgroup analysis. Despite the wide range of applications, to the best of our knowledge, there is no efficient and scalable solution to this problem in the literature. This problem is computationally challenging, due to the worst-case exponential number of (p, q)-bicliques. In this paper, we propose a competitive branch-and-bound baseline method, namely BCList, which explores the search space in a depth-first manner, together with a varie...
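The definition of a (p, q)-biclique can be checked on small inputs with a brute-force counter: for each p-subset L of U, every q-subset of the common neighbours of L forms one biclique. This is a minimal sketch meant only to pin down the definition; it is not BCList, it is exponential in p, and the adjacency-dictionary representation and the name `count_pq_bicliques` are assumptions made for the example.

```python
from itertools import combinations
from math import comb
from typing import Dict, Set


def count_pq_bicliques(adj: Dict[str, Set[str]], p: int, q: int) -> int:
    """Count (p, q)-bicliques; adj maps each vertex of U to its neighbours in V."""
    if p < 1 or q < 1:
        raise ValueError("p and q must be positive integers")
    total = 0
    for left in combinations(sorted(adj), p):
        # Vertices of V adjacent to every vertex in the chosen left side L.
        common = set.intersection(*(adj[u] for u in left))
        # Any q of them complete L into a (p, q)-biclique.
        total += comb(len(common), q)
    return total


if __name__ == "__main__":
    graph = {"u1": {"v1", "v2", "v3"}, "u2": {"v1", "v2"}, "u3": {"v2", "v3"}}
    print(count_pq_bicliques(graph, 2, 2))
```

Counting needs only the size of the common neighbourhood, whereas enumeration would have to list each q-subset explicitly, which is where the worst-case exponential output size mentioned in the abstract comes from.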
Nowadays, ridesharing becomes a popular commuting mode. Dynamically arriving riders post their or... more Nowadays, ridesharing becomes a popular commuting mode. Dynamically arriving riders post their origins and destinations, then the platform assigns drivers to serve them. In ridesharing, different groups of riders can be served by one driver if their trips can share common routes. Recently, many ridesharing companies (e.g., Didi and Uber) further propose a new mode, namely "ridesharing with meeting points". Specifically, with a short walking distance but less payment, riders can be picked up and dropped off around their origins and destinations, respectively. In addition, meeting points enables more flexible routing for drivers, which can potentially improve the global profit of the system. In this paper, we first formally define the Meeting-Point-based Online Ridesharing Problem (MORP). We prove that MORP is NP-hard and there is no polynomial-time deterministic algorithm with a constant competitive ratio for it. We notice that a structure of vertex set,-skip cover, fits well to the MORP.-skip cover tends to find the vertices (meeting points) that are convenient for riders and drivers to come and go. With meeting points, MORP tends to serve more riders with these convenient vertices. Based on the idea, we introduce a convenience-based meeting point candidates selection algorithm. We further propose a hierarchical meeting-point oriented graph (HMPO graph), which ranks vertices for assignment effectiveness and constructs-skip cover to accelerate the whole assignment process. Finally, we utilize the merits of-skip cover points for ridesharing and propose a novel algorithm, namely SMDB, to solve MORP. Extensive experiments on real and synthetic datasets validate the effectiveness and efficiency of our algorithms.
In recent years, how to prevent the widespread transmission of infectious diseases in communities... more In recent years, how to prevent the widespread transmission of infectious diseases in communities has been a research hot spot. Tracing close contact with infected individuals is one of the most severe problems. In this work, we present a model called Follower Prediction Graph Network (FPGN) to identify high-risk visitors, which is known as follower prediction. The model is designed to identify visitors who may be infected with a disease by tracking their activities at the exact location of infected visitors. FPGN is inspired by the state-of-the-art temporal graph edge prediction algorithm TGN and draws on the shortcomings of existing algorithms. It utilizes graph structure information based on ($$\alpha $$ α , $$\beta $$ β )-core, time interval statistics by using the statistics of timestamp information, and a GAT-based prediction module to achieve high accuracy in follower prediction. Extensive experiments are conducted on two real datasets, demonstrating the progress of FPGN. The...
Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Along with the rapid technological and commercial innovation on the e-commerce platforms, there a... more Along with the rapid technological and commercial innovation on the e-commerce platforms, there are an increasing number of frauds that bring great harm to these platforms. Many frauds are conducted by organized groups of fraudsters for higher efficiency and lower costs, which are also known as group-based frauds. Despite the high concealment and strong destructiveness of group-based fraud, there is no existing research work that can thoroughly exploit the information within the transaction networks of e-commerce platforms for group-based fraud detection. In this work, we analyze and summarize the characteristics of group-based frauds, based on which we propose a novel end-to-end semi-supervised Group-based Fraud Detection Network (GFDN) to support such fraud detection in real-world applications. Experimental results on large-scale ecommerce datasets from Taobao and Bitcoin trading datasets show the superior effectiveness and efficiency of our proposed model for group-based fraud detection on bipartite graphs. CCS CONCEPTS • Information systems → Data mining.
With the increasing pervasiveness of the geo-positioning technologies, there is an enormous amoun... more With the increasing pervasiveness of the geo-positioning technologies, there is an enormous amount of spatio-textual objects available in many applications such as location based services and social networks. Consequently, various types of spatial keyword searches which explore both locations and textual descriptions of the objects have been intensively studied by the research communities and commercial organizations. In many important applications (e.g., location based services), the closeness of two spatial objects is measured by the road network distance. Moreover, the result diversification is becoming a common practice to enhance the quality of the search results. Motived by the above facts, in this paper we study the problem of diversified spatial keyword search on road networks which considers both the relevance and the spatial diversity of the results. An efficient signature-based inverted indexing technique is proposed to facilitate the spatial keyword query processing on r...
Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, 2015
With the proliferation of graph applications, the problem of efficiently computing all k-edge connected components of a graph G for a user-given k has recently been investigated. In this paper, we study the problem of efficiently computing the Steiner component with the maximum connectivity; that is, given a set q of query vertices in a graph G, we aim to find the maximum induced subgraph g of G such that g contains q and g has the maximum connectivity, where g is denoted as SMCC. To accommodate online query processing, we present an efficient algorithm based on a novel index such that the algorithm runs in time linear in the result size; thus, the algorithm is optimal since it needs at least linear time to output the result. Moreover, we also investigate variations of the above problem. We show that the problem with the constraint that the size of the SMCC is not smaller than a given size can also be solved in time linear in the result size (thus, optimal). We also show that the problem of computing the connectivity (rather than the graph details) of the SMCC can be solved in time linear in the query size (thus, optimal). To build the index, we extend the techniques in [7] to accommodate batch processing and computation sharing. To efficiently support applications with graph updates, we also present novel incremental techniques. Finally, we conduct extensive performance studies on large real and synthetic graphs, which demonstrate that our index-based algorithms outperform baseline algorithms by several orders of magnitude and that our indexing algorithms are efficient.
Proceedings of the 22nd ACM international conference on Multimedia, 2014
Often, a data object described by many features can be decomposed into multiple modalities, which always provide complementary information to each other. In this paper, we study subspace clustering for multi-modal data by effectively exploiting data correlation consensus across modalities, while keeping individual modalities well encapsulated. Our technique can yield a more ideal data similarity matrix, which encodes strong data correlations for the cross-modal data objects in the same subspace. To these ends, we propose a novel angular-based regularizer coupled with our objective function, which is aided by trace lasso and minimized to yield sparse representation vectors encoding data correlations in multiple modalities. As a result, the sparse code vectors of the same cross-modal data have a small angular difference, so as to achieve the data correlation consensus simultaneously. This can generate a compatible data similarity matrix for multi-modal data. The final subspace clustering result is obtained by applying spectral clustering on this data similarity matrix.
Often, a data object described by many features can be naturally decomposed into multiple "views", where each view consists of a subset of features. For example, a video clip may have a video view and an audio view. Given a set of training data objects with multiple views, where some objects are labeled and the others are not, semi-supervised learning with graphs from multi-views tries to learn a classifier by treating each view as a similarity graph on all objects, where edges are defined by the similarity on object pairs based on the view attributes. Labels and label relevance ranking scores can be propagated from labeled objects to unlabeled objects on the similarity graphs so that similar objects receive similar labels. The state-of-the-art, one-combo-fits-all methods linearly and independently combine either the metrics or the label propagation results from multi-views and then build a model based on the combined results. However, more often than not, the similarities between various objects may be manifested differently by different views. In such situations, the one-combo-fits-all methods may not perform well. To tackle the problem, we develop an iterative Semi-Supervised Metric Fusion (SSMF) approach in this paper. SSMF fuses metrics and label propagation results from multi-views iteratively until the fused metric and label propagation results converge simultaneously. Views are weighted dynamically during the fusion process so that the adverse effect of irrelevant views, identified at each iteration of the fusion process, can be reduced effectively. To evaluate the effectiveness of SSMF, we apply it to multi-view based and content-based image retrieval and to multi-view based multi-label image classification on real-world data sets, which demonstrates that our method outperforms the state-of-the-art methods.
2015 IEEE 31st International Conference on Data Engineering, 2015
Maximal clique enumeration is a fundamental problem in graph theory and has been extensively studied. However, maximal clique enumeration is time-consuming in large graphs and always returns an enormous number of cliques with large overlaps. Motivated by this, in this paper, we study the diversified top-k clique search problem, which is to find the top-k cliques that together cover the largest number of nodes in the graph. Diversified top-k clique search can be used in many applications including community search, motif discovery, and anomaly detection in large graphs. A naive solution for diversified top-k clique search is to keep all maximal cliques in memory and then find k of them that cover the most nodes in the graph by using the approximate greedy max k-cover algorithm. However, such a solution is impractical when the graph is large. In this paper, instead of keeping all maximal cliques in memory, we devise an algorithm to maintain k candidates in the process of maximal clique enumeration. Our algorithm has a limited memory footprint and can achieve a guaranteed approximation ratio. We also introduce a novel lightweight
In this paper, we study the problems of nearest neighbor queries (NN) and all nearest neighbor queries (ANN) on location data, which have a wide range of applications such as Geographic Information Systems (GIS) and Location-Based Services (LBS). We propose a new structure, termed AVR-Tree, based on the R-tree and Voronoi diagram techniques. Compared with the existing indexing techniques used for NN and ANN queries on location data, AVR-Tree can achieve a better tradeoff between the pruning effectiveness and the index size for NN and ANN queries. We also conduct a comprehensive performance evaluation of the proposed techniques based on both real and synthetic data, which shows that AVR-Tree based NN and ANN algorithms achieve better performance than their best competitors in terms of both CPU and I/O costs.
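The naive baseline mentioned in this abstract (enumerate all maximal cliques, then run greedy max k-cover over them) can be sketched as follows. This is only an illustration of the greedy selection step, assuming the maximal cliques are already available as node sets; the function name `greedy_max_k_cover` and the toy input are hypothetical, and this is not the memory-bounded algorithm the paper proposes.

```python
from typing import FrozenSet, Iterable, List, Set


def greedy_max_k_cover(cliques: Iterable[FrozenSet[int]], k: int) -> List[FrozenSet[int]]:
    """Pick up to k cliques, each time taking the one that adds the most uncovered nodes."""
    remaining = list(cliques)          # assumption: all maximal cliques fit in memory
    covered: Set[int] = set()
    chosen: List[FrozenSet[int]] = []
    for _ in range(k):
        # Marginal gain of a clique = number of its nodes not yet covered.
        best = max(remaining, key=lambda c: len(c - covered), default=None)
        if best is None or not (best - covered):
            break                      # nothing left that adds new nodes
        chosen.append(best)
        covered |= best
        remaining.remove(best)
    return chosen


if __name__ == "__main__":
    toy_cliques = [
        frozenset({1, 2, 3}),
        frozenset({3, 4}),
        frozenset({4, 5, 6, 7}),
        frozenset({1, 2}),
    ]
    for clique in greedy_max_k_cover(toy_cliques, k=2):
        print(sorted(clique))
```

The sketch makes the memory problem visible: `remaining` holds every maximal clique at once, which is exactly what the abstract calls impractical for large graphs.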
Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval, 2014
With the prevalence of geo-position enabled devices and services, a rapidly growing number of tweets are associated with geo-tags. Consequently, real-time search on geo-tagged Twitter streams has attracted great attention. In this paper, we advocate the significance of the co-occurrence of keywords for geo-tagged tweet data analytics, which is overlooked by existing studies. In particular, we formally introduce the problem of identifying local frequent keyword co-occurrence patterns over geo-tagged Twitter streams, namely the LFP query. To accommodate the high volume and the rapid updates of the Twitter stream, we develop an inverted KMV sketch (IK sketch for short) structure to capture the co-occurrence of keywords in limited space. Efficient algorithms are then developed based on the IK sketch to support LFP queries as well as their variant. An extensive empirical study on a real Twitter dataset confirms the effectiveness and efficiency of our approaches.
With the emergence of location-aware mobile device technologies, communication technologies and GPS systems, various location-aware queries have attracted great attention in the database literature. In many user recommendation systems, the spatial preference query is used to suggest objects based on their spatial proximity to facilities. In this paper, we study the problem of the general spatial skyline, which can provide a minimal set of candidates that contain optimal solutions for any monotonic distance-based spatial preference query. An efficient algorithm is proposed to significantly reduce the number of non-promising objects in the computation. The paper also covers a comprehensive performance study of the proposed techniques based on both real and synthetic data.
2009 IEEE 25th International Conference on Data Engineering, 2009
Uncertain data are inherent in many applications such as environmental surveillance and quantitative economics research. Recently, considerable research effort has been put into the field of analysing uncertain data. In this paper, we study the problem of processing the uncertain location based range aggregate in a multi-dimensional space. We first formally introduce the problem, then propose a general filtering-and-verification framework to solve it. Two filtering techniques, named STF and PCR respectively, are proposed to significantly reduce the verification cost.
2012 IEEE 28th International Conference on Data Engineering, 2012
Top-k pairs queries have received significant attention from the research community. k-closest pairs queries, k-furthest pairs queries and their variants are among the most well-studied special cases of top-k pairs queries. In this paper, we present the first approach to answer a broad class of top-k pairs queries over sliding windows. Our framework handles multiple top-k pairs queries, and each query is allowed to use a different scoring function, a different value of k and a different size of the sliding window. Although the number of possible pairs in the sliding window is quadratic in the number of objects N in the sliding window, we efficiently answer the top-k pairs query by maintaining a small subset of pairs called the K-skyband, which is expected to consist of O(K log(N/K)) pairs. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. We experimentally verify this by comparing our approach with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost.
... when the divergence between the two stocks returns to normal. A top-k pairs query can be issued to obtain the pairs of stocks that are correlated (e.g., they belong to the same business sector and have similar fundamentals such as market caps, dividends etc.) and display different trends. Pair-trading can be profitable only if the trader is the first one to capitalize on the opportunity [7]. Hence, the trader may want to continuously monitor the top-k pairs from the most recent data (e.g., a sliding window containing the most recent n items).
Consider another example of an online auction website. A user may be interested in finding the pairs of products that have similar specifications but are sold at very different prices (i.e., different final bids). Such pairs may be used to understand the users' behavior and market trends, e.g., a suitable bidding time for buyers and a suitable bidding closing time for sellers. An analyst or a user may issue the following query to obtain the top-k pairs of such products sold during the last 7 days.
    Select a.id, b.id from auction a, auction b
    where a.id < b.id
    order by dist(a.spec, b.spec) - |a.bid - b.bid|
    limit k window [7 days]
Here dist(a.spec, b.spec) computes the distance (or difference) between their specifications and |a.bid − b.bid| denotes the absolute difference between the final bids they receive. Note that the query prefers the pairs of products that have a small difference between their specifications but a large difference between their selling prices. The condition a.id < b.id ensures that a pair (a, b) is not repeated as (b, a). While the above example shows a simple scoring function, in real-world applications the users may specify a more sophisticated scoring function. Our framework allows the users to define arbitrarily complex scoring functions. A query that retrieves the top-k pairs among the most recent n data items (i.e., a sliding window of size n) and uses the scoring function s is denoted as Q(k, n, s).
A. Contributions
Our framework has the following features.
Unified framework. To the best of our knowledge, we are the first to study top-k pairs queries over sliding windows. We present a unified framework that efficiently solves the
IEEE Transactions on Knowledge and Data Engineering, 2014
Given a scoring function that computes the score of a pair of objects, a top-k pairs query returns the k pairs with the smallest scores. In this paper, we present a unified framework for answering generic top-k pairs queries including k-closest pairs queries, k-furthest pairs queries and their variants. Note that the k-closest pairs query is a special case of top-k pairs queries where the scoring function is the distance between the two objects in a pair. We are the first to present a unified framework to efficiently answer a broad class of top-k queries including the queries mentioned above. We present efficient algorithms and provide a detailed theoretical analysis that demonstrates that the expected performance of our proposed algorithms is optimal for two-dimensional data sets. Furthermore, our framework does not require pre-built indexes, uses limited main memory and is easy to implement. We also extend our techniques to support top-k pairs queries on multi-valued (or uncertain) objects, and we demonstrate that our framework can handle exclusive top-k pairs queries. Our extensive experimental study demonstrates the effectiveness and efficiency of our proposed techniques. Index Terms: closest pairs queries, furthest pairs queries, top-k queries, multi-valued objects, uncertain objects
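To make the query definition concrete, here is a minimal brute-force sketch of a top-k pairs query: score every pair and keep the k smallest. It assumes 2-D points and a user-supplied scoring function; `top_k_pairs` and the sample data are hypothetical names for illustration, and this quadratic scan is not the framework the paper describes.

```python
import heapq
from itertools import combinations
from math import dist
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
ScoredPair = Tuple[float, Point, Point]


def top_k_pairs(
    objects: Sequence[Point],
    score: Callable[[Point, Point], float],
    k: int,
) -> List[ScoredPair]:
    """Return the k pairs with the smallest scores, examining all O(n^2) pairs."""
    scored = ((score(a, b), a, b) for a, b in combinations(objects, 2))
    # Compare on the score only, so the objects themselves need not be orderable.
    return heapq.nsmallest(k, scored, key=lambda item: item[0])


if __name__ == "__main__":
    points = [(0, 0), (1, 0), (5, 5), (5, 6.5)]
    # k-closest pairs: the scoring function is the distance itself.
    print(top_k_pairs(points, dist, k=2))
    # k-furthest pairs: negate the distance so that larger distances score lower.
    print(top_k_pairs(points, lambda a, b: -dist(a, b), k=1))
```

Passing the scoring function as a parameter mirrors the abstract's point that k-closest and k-furthest pairs are special cases of one generic query.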
IEEE Transactions on Knowledge and Data Engineering, 2014
Top-k pairs and top-k objects queries have received significant attention from the research community. In this paper, we present the first approach to answer a broad class of top-k pairs and top-k objects queries over sliding windows. Our framework handles multiple top-k queries, and each query is allowed to use a different scoring function, a different value of k, and a different size of the sliding window. Furthermore, the framework allows the users to define arbitrarily complex scoring functions and supports out-of-order data streams. For all the queries that use the same scoring function, we need to maintain only one K-skyband. We present efficient techniques for K-skyband maintenance and query answering. We conduct a detailed complexity analysis and show that the expected cost of our approach is reasonably close to the lower bound cost. For top-k pairs queries, we demonstrate the efficiency of our approach by comparing it with a specially designed supreme algorithm that assumes the existence of an oracle and meets the lower bound cost. For top-k objects queries, our experimental results demonstrate the superiority of our algorithm over the state-of-the-art algorithm.
Given a set of criteria, an object o dominates another object o′ if o is more preferable than o′ according to every criterion. A skyline query returns every object that is not dominated by any other object. In this paper, we study the problem of continuously monitoring a moving skyline query where one of the criteria is the distance between the objects and the moving query. We propose a safe zone based approach to address the challenge of efficiently updating the results as the query moves. A safe zone is the area such that the results of a query remain unchanged as long as the query lies inside this area. Hence, the results need to be updated only when the query leaves its safe zone. Although the main focus of this paper is to present the techniques for the Euclidean distance metric, the proposed techniques are applicable to any metric distance (e.g., Manhattan distance, road network distance). We present several non-trivial optimizations and propose an efficient algorithm for safe zone construction. Our experiments demonstrate that the cost of our safe zone based approach is reasonably close to a lower bound cost and is three orders of magnitude lower than the cost of a naïve algorithm.
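The dominance and skyline definitions above can be illustrated with a small sketch. It assumes that smaller values are preferable on every criterion and uses the abstract's definition of dominance (better on every criterion); the names `dominates`, `skyline` and `moving_skyline`, and the (location, price) objects, are hypothetical. The skyline is recomputed from scratch for each query position, which is the kind of repeated work a safe zone is meant to avoid.

```python
from math import dist
from typing import List, Sequence, Tuple

Location = Tuple[float, float]
Obj = Tuple[Location, float]  # (location, price); an assumed object layout


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """a dominates b if a is preferable (here: smaller) on every criterion."""
    return all(x < y for x, y in zip(a, b))


def skyline(criteria: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return every criteria vector that no other vector dominates."""
    return [c for c in criteria if not any(dominates(o, c) for o in criteria)]


def moving_skyline(objects: List[Obj], query: Location) -> List[Obj]:
    """Skyline where one criterion is the Euclidean distance to the query point."""
    keyed = [((dist(loc, query), price), (loc, price)) for loc, price in objects]
    return [obj for crit, obj in keyed
            if not any(dominates(other, crit) for other, _ in keyed)]


if __name__ == "__main__":
    shops = [((0, 0), 10), ((10, 0), 5), ((5, 0), 20)]
    print(moving_skyline(shops, (1, 0)))  # query near the first shop
    print(moving_skyline(shops, (6, 0)))  # result changes after the query moves
```

In the usage example, moving the query from (1, 0) to (6, 0) changes which shops are dominated, which is why the result must be monitored as the query moves.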
Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, 2012
Given a query graph q and a data graph G, computing all occurrences of q in G, namely exact all-matching, is fundamental in graph data analysis with a wide spectrum of real applications. It is challenging since even finding one occurrence of q in G (the subgraph isomorphism test) is NP-complete. In many real applications, however, exploratory queries from users often express their real demands inaccurately. In this paper, we study the problem of efficiently computing all approximate occurrences of q in G. In particular, we study the problem of efficiently retrieving all matches of q in G with the number of possible missing edges bounded by a given threshold θ, namely similarity all-matching. The problem of similarity all-matching is harder than the problem of exact all-matching since it covers exact all-matching as the special case θ = 0. In this paper, we develop a novel paradigm to conduct similarity all-matching. Specifically, we propose to use a minimal set QT of spanning trees in q to cover all connected subgraphs q′ of q missing at most θ edges; that is, each q′ is spanned by a spanning tree in QT. Then, we conduct exact all-matching for each spanning tree in QT to induce all similarity matches. A rigorous theoretical analysis shows that our new search paradigm significantly reduces the number of times exact all-matching is conducted compared with the existing techniques. To further speed up the computation, we develop new filtering, computation sharing, and search ordering techniques. Our comprehensive experiments on both real and synthetic datasets demonstrate that our techniques outperform the state-of-the-art technique by 7 orders of magnitude.
In this paper, we study the problem of (p, q)-biclique counting and enumeration for large sparse bipartite graphs. Given a bipartite graph G = (U , V , E) and two integer parameters p and q, we aim to efficiently count and enumerate all (p, q)-bicliques in G, where a (p, q
In this paper, we study the problem of (p, q)-biclique counting and enumeration for large sparse bipartite graphs. Given a bipartite graph G = (U, V, E) and two integer parameters p and q, we aim to efficiently count and enumerate all (p, q)-bicliques in G, where a (p, q)-biclique B(L, R) is a complete subgraph of G with L ⊆ U, R ⊆ V, |L| = p, and |R| = q. The problem of (p, q)-biclique counting and enumeration has many applications, such as graph neural network information aggregation, densest subgraph detection, and cohesive subgroup analysis. Despite the wide range of applications, to the best of our knowledge there is no efficient and scalable solution to this problem in the literature. The problem is computationally challenging due to the worst-case exponential number of (p, q)-bicliques. In this paper, we propose a competitive branch-and-bound baseline method, namely BCList, which explores the search space in a depth-first manner, together with a varie...
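The (p, q)-biclique definition above can be checked on small inputs with a brute-force counter. The sketch below assumes the bipartite graph is given as a dictionary from each vertex in U to its set of neighbours in V and that p ≥ 1; `count_pq_bicliques` and the toy graph are hypothetical. It enumerates every p-subset of U, so it only illustrates the definition and is not the BCList method named in the abstract.

```python
from itertools import combinations
from math import comb
from typing import Dict, Hashable, Set


def count_pq_bicliques(adj: Dict[Hashable, Set[Hashable]], p: int, q: int) -> int:
    """Count pairs (L, R) with |L| = p, |R| = q and every vertex of L adjacent to every vertex of R."""
    if p < 1 or q < 0:
        raise ValueError("this sketch assumes p >= 1 and q >= 0")
    total = 0
    for left in combinations(adj, p):
        # Any R must lie inside the common neighbourhood of the chosen L.
        common = set.intersection(*(adj[u] for u in left))
        # Every q-subset of the common neighbourhood completes a biclique with L.
        total += comb(len(common), q)
    return total


if __name__ == "__main__":
    toy_graph = {
        "u1": {"v1", "v2", "v3"},
        "u2": {"v1", "v2"},
        "u3": {"v2", "v3"},
    }
    print(count_pq_bicliques(toy_graph, p=2, q=2))
```

Counting q-subsets with a binomial coefficient avoids enumerating R explicitly, but the loop over p-subsets of U is still exponential in p, which reflects the worst-case blow-up the abstract mentions.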
Papers by Wenjie Zhang