TheScientificWorldJournal, 2014
We present a novel approach for accurately computing link-based similarities among objects by utilizing the link information pertaining to the objects involved. We discuss the problems with previous link-based similarity measures and propose a novel approach for computing link-based similarities that does not suffer from these problems. In the proposed approach, each target object is represented by a vector with one element per object in the given data, where the value of each element denotes the weight of the corresponding object. For this weight, we propose to use the probability of reaching the specific object from the target object, computed using the "Random Walk with Restart" strategy. We then define the similarity between two objects as the cosine similarity of their two vectors. In this paper, we provide examples to show that our approach does not suffer from the aforementioned problems. We also evaluate the performance o...
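The scheme described above can be sketched in a few lines. This is a minimal illustration, assuming an unweighted adjacency matrix, a restart probability of 0.15, and fixed-point iteration; the function names and parameter values are ours, not the paper's:

```python
import numpy as np

def rwr_vector(A, start, restart=0.15, iters=100):
    """Random Walk with Restart: probability of being at each node when
    walking from `start` and restarting there with probability `restart`."""
    P = A / A.sum(axis=1, keepdims=True)   # row-stochastic transition matrix
    e = np.zeros(len(A)); e[start] = 1.0
    r = e.copy()
    for _ in range(iters):
        r = restart * e + (1 - restart) * P.T @ r
    return r

def link_similarity(A, u, v, restart=0.15):
    """Cosine similarity of the two RWR vectors, as the abstract proposes."""
    ru, rv = rwr_vector(A, u, restart), rwr_vector(A, v, restart)
    return ru @ rv / (np.linalg.norm(ru) * np.linalg.norm(rv))
```

Two nodes sharing the same neighborhood get nearly identical RWR vectors and hence a cosine similarity close to 1, while distant nodes concentrate their probability mass in different regions of the graph.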
In many real-world domains, a link graph is one of the most effective ways to model the relationships between objects. Measuring the similarity of objects in a link graph has been studied by many researchers, but an effective and efficient method is still needed. Based on our observation of link graphs from real domains, we find that a block structure naturally exists. We propose an algorithm called BlockSimRank, which partitions the link graph into blocks and efficiently obtains the similarity of each node pair in the graph. Our method is based on random walks on a two-layer model, with time complexity as low as O(n^(4/3)) and low memory requirements. Experiments show that the accuracy of BlockSimRank is acceptable while its time cost is the lowest.
Proceedings of the 5th International Conference on PErvasive Technologies Related to Assistive Environments - PETRA '12, 2012
Link analysis ranking methods are widely used for summarizing the connectivity structure of large networks. We explore weighted versions of two common link analysis ranking algorithms, PageRank and HITS, and study their applicability to assistive environment data. Based on these methods, we propose a novel approach for identifying representative objects in large datasets, given their similarity matrix. The novelty of our approach is that it takes into account both the pairwise similarities between the objects and the origin and "evolution path" of these similarities within the dataset. The key step of our method is to define a complete graph in which each object is represented by a node and each edge is given a weight equal to the pairwise similarity of the two adjacent nodes. Nodes with high ranking scores correspond to representative objects. Our experimental evaluation was performed on three data domains: American Sign Language, sensor data, and medical data.
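The core idea — running PageRank over a complete graph weighted by pairwise similarities — can be sketched as follows. This is an illustrative reading, not the authors' implementation; the damping value and function name are assumptions:

```python
import numpy as np

def representative_scores(S, damping=0.85, iters=100):
    """Weighted PageRank over a complete graph whose edge weights are the
    pairwise similarities in S; high-scoring nodes = representative objects."""
    S = S.astype(float).copy()
    np.fill_diagonal(S, 0.0)               # drop self-similarity loops
    P = S / S.sum(axis=1, keepdims=True)   # similarity-proportional transitions
    n = len(S)
    r = np.full(n, 1.0 / n)
    for _ in range(iters):
        r = (1 - damping) / n + damping * P.T @ r
    return r
```

An object that is strongly similar to many others receives transition mass from all of them and ends up with a high score, exactly the "representative" behavior the abstract describes.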
2009 Ninth IEEE International Conference on Data Mining, 2009
Similarity calculation has many applications, such as information retrieval and collaborative filtering, among many others. It has been shown that link-based similarity measures, such as SimRank, are very effective in characterizing object similarities in networks, such as the Web, by exploiting object-to-object relationships. Unfortunately, it is prohibitively expensive to compute link-based similarity in a relatively large graph. In this paper, based on the observation that the link-based similarity scores of real-world graphs follow a power-law distribution, we propose a new approximate algorithm, Power-SimRank, with a guaranteed error bound, to efficiently compute link-based similarity. We also prove the convergence of the proposed algorithm. Extensive experiments conducted on real-world and synthetic datasets show that the proposed algorithm outperforms SimRank by four to five times in terms of efficiency, while the error introduced by the approximation is small.
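For reference, the exact SimRank recursion (Jeh and Widom) that such approximation schemes target looks like this. A naive sketch, deliberately unoptimized to show the per-iteration cost that motivates approximations like Power-SimRank:

```python
import numpy as np

def simrank(adj, C=0.8, iters=10):
    """Naive SimRank: two nodes are similar if their in-neighbors are similar.
    adj[u][v] nonzero means an edge u -> v.  Quadratic in node pairs, with a
    neighbor-pair sum per pair -- the expense approximation schemes avoid."""
    n = len(adj)
    in_nbrs = [[u for u in range(n) if adj[u][v]] for v in range(n)]
    S = np.eye(n)
    for _ in range(iters):
        T = np.eye(n)
        for a in range(n):
            for b in range(n):
                if a == b or not in_nbrs[a] or not in_nbrs[b]:
                    continue
                total = sum(S[i][j] for i in in_nbrs[a] for j in in_nbrs[b])
                T[a][b] = C * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        S = T
    return S
```

Two nodes pointed to by the same parent score C (here 0.8) against each other, which is the canonical SimRank base case.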
Semantic Web, 2018
Large amounts of geo-spatial information have been made available with the growth of the Web of Data. While discovering links between resources on the Web of Data has been shown to be a demanding task, discovering links between geo-spatial resources proves to be even more challenging. This is partly due to the resources being described by means of vector geometry. In particular, discrepancies in granularity and measurement error across data sets make the selection of appropriate distance measures for geo-spatial resources difficult. In this paper, we survey the existing literature for point-set measures that can be used to measure the similarity of vector geometries. We then present and evaluate the ten measures that we derived from the literature. We evaluate these measures with respect to their time-efficiency and their robustness against discrepancies in measurement and in granularity. To this end, we use samples of real data sets of different granularity as input for our evaluation framework. The results obtained on three different data sets suggest that most distance approaches can be made to scale. Moreover, while some distance measures are significantly slower than others, the measures based on means, surjections, and sums of minimal distances are robust against the different types of discrepancies.
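One of the robust families named above, the sum of minimal distances, matches every point to its nearest counterpart in the other set. A minimal symmetric sketch, assuming Euclidean point distance and per-set normalization (normalization conventions vary across the surveyed measures):

```python
import math

def sum_of_min_distances(X, Y):
    """Symmetric sum-of-minimal-distances between two point sets: each point
    is matched to its nearest point in the other set, averaged both ways."""
    d_xy = sum(min(math.dist(x, y) for y in Y) for x in X)  # X -> Y
    d_yx = sum(min(math.dist(x, y) for x in X) for y in Y)  # Y -> X
    return 0.5 * (d_xy / len(X) + d_yx / len(Y))
```

Because every point only needs *some* close counterpart, the measure degrades gracefully when the two geometries have different numbers of vertices — the granularity discrepancy the evaluation focuses on.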
2006
Distance metrics are widely used in similarity estimation. In this paper we find that the most popular metrics, the Euclidean and Manhattan distances, may not be suitable for all data distributions. We propose a general guideline for establishing the relation between a distribution model and its corresponding similarity estimation. Based on maximum-likelihood theory, we propose new distance metrics, such as the harmonic distance and the geometric distance. Because feature elements may come from heterogeneous sources and usually have different influence on similarity estimation, it is inappropriate to model the distribution as isotropic. We propose a novel boosted distance metric that not only finds the best distance metric fitting the distribution of the underlying elements but also selects the most important feature elements with respect to similarity. The boosted distance metric is tested on fifteen benchmark data sets from the UCI repository and in two image retrieval applications. In all the experiments, robust results are obtained with the proposed methods.
Proceeding of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining - KDD 08, 2008
This work introduces a new family of link-based dissimilarity measures between nodes of a weighted, directed graph that generalizes both the shortest-path and the commute-time (or resistance) distances. This measure, called the randomized shortest-path (RSP) dissimilarity, depends on a parameter θ and has the interesting property of reducing, on one end, to the standard shortest-path distance when θ is large and, on the other end, to the commute-time distance when θ is small (near zero). Intuitively, it corresponds to the expected cost incurred by a random walker in order to reach a destination node from a starting node, while maintaining a constant entropy (related to θ) spread in the graph. The parameter θ therefore gradually biases the simple random walk on the graph towards the shortest-path policy. By adopting a statistical physics approach and computing a sum over all the possible paths, it is shown that the RSP dissimilarity from every node to a particular node of interest can be computed efficiently by solving two linear systems of n equations, where n is the number of nodes. On the other hand, the dissimilarity between every pair of nodes is obtained by inverting an n × n matrix. The proposed measure could be used for various graph mining tasks such as computing betweenness centrality, finding dense communities, etc., as shown in the experimental section.
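The all-pairs computation can be sketched compactly. This follows the standard matrix formulation of RSP dissimilarities from the later literature (killed walk W, fundamental matrix Z, elementwise expected-cost ratio); variable names and the inverse-temperature symbol `beta` (the abstract's θ) are ours:

```python
import numpy as np

def rsp_dissimilarity(A, C, beta):
    """Randomized shortest-path dissimilarity matrix for adjacency A and
    edge-cost matrix C.  Large beta -> shortest-path distances; small
    beta -> commute-time-like behavior."""
    deg = A.sum(axis=1)
    P_ref = A / deg[:, None]                  # natural random-walk transitions
    W = P_ref * np.exp(-beta * C)             # walk biased against costly edges
    Z = np.linalg.inv(np.eye(len(A)) - W)     # fundamental matrix: sum over paths
    S = (Z @ (C * W) @ Z) / Z                 # expected path costs (elementwise /)
    C_bar = S - np.ones((len(A), 1)) @ np.diag(S)[None, :]  # zero the diagonal
    return 0.5 * (C_bar + C_bar.T)            # symmetrize directed costs
```

On a 4-cycle with unit edge costs, a large beta recovers the shortest-path distances (1 between neighbors, 2 across the cycle), while a small beta inflates the dissimilarity towards the commute-time regime, as the abstract's limiting behavior predicts.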
International Journal of Advanced Computer Science and Applications, 2015
The problem of detecting the similarity or difference between objects arises regularly in several application domains, such as e-commerce, social networks, expert systems, data mining, and decision support systems. This paper introduces a general model for measuring the similarity between objects based on their attributes. In this model, the similarity on each attribute is defined according to the nature and kind of that attribute. This makes our model general and enables it to be applied in several application domains. We also present the application of the model to two scenarios, in social networks and e-commerce.
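A hypothetical minimal reading of such a model: each attribute contributes a similarity computed by a rule matching its kind (numeric attributes on a normalized scale, categorical ones by equality), and the per-attribute scores are combined. The paper defines richer per-kind measures; the combination rule and names here are assumptions:

```python
def attribute_similarity(x, y, ranges):
    """Per-attribute similarity between two objects given as dicts.
    Numeric attributes: 1 - normalized absolute difference (using `ranges`);
    categorical attributes: exact-match indicator.  Unweighted average."""
    sims = []
    for key, xv in x.items():
        yv = y[key]
        if isinstance(xv, (int, float)):
            lo, hi = ranges[key]
            sims.append(1.0 - abs(xv - yv) / (hi - lo))
        else:
            sims.append(1.0 if xv == yv else 0.0)
    return sum(sims) / len(sims)
```

Swapping in a domain-specific rule per attribute kind (dates, hierarchies, free text) is exactly the extension point that makes this style of model applicable across domains.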
Journal of Systems and Software, 2009
In several applications, data objects move on predefined spatial networks such as road segments, railways, and invisible air routes. Many of these objects exhibit similarity with respect to their traversed paths, and therefore two objects can be correlated based on their motion similarity. Useful information can be retrieved from these correlations, and this knowledge can be used to define similarity classes. In this paper, we study similarity search for moving-object trajectories in spatial networks. The problem poses some important challenges, since it is quite different from the case where objects are allowed to move freely in any direction without motion restrictions. New similarity measures are needed to express similarity between two trajectories that do not necessarily share any common sub-path. We define new similarity measures based on the spatial and temporal characteristics of trajectories, such that the notion of similarity in space and time is well expressed and, moreover, the metric properties are satisfied. In addition, we demonstrate that similarity range queries on trajectories are efficiently supported by utilizing metric-based access methods such as M-trees.
Theoretical Computer Science, 2016
In this paper we define a new similarity measure, LCSk, which aims at finding the maximal number of matching substrings of length k in both input strings while preserving their order of appearance; the traditional LCS is the special case where k = 1. We examine this generalization in both theory and practice. We first describe its basic solution and give experimental evidence, on real data, of its ability to differentiate between sequences that are considered similar according to the LCS measure. We then examine extensions of the LCSk definition to LCS in at-least-k-length substrings (LCS ≥ k) and two-dimensional LCSk, and we also define the complementary EDk and ED ≥ k distances.
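The basic LCSk recurrence extends the classical LCS dynamic program with one extra case: a length-k substring match lets the count advance by one from k characters back in both strings. A straightforward O(nmk) sketch (the paper's own solution may differ in details):

```python
def lcsk(a, b, k):
    """Maximal number of non-overlapping, order-preserving length-k substring
    matches between a and b.  For k = 1 this is the classical LCS length."""
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])     # skip a char
            if i >= k and j >= k and a[i - k:i] == b[j - k:j]:
                dp[i][j] = max(dp[i][j], dp[i - k][j - k] + 1)  # take a k-match
    return dp[n][m]
```

Note how the jump of k in the match case enforces non-overlap: on "abc" vs "abc" with k = 2, the overlapping matches "ab" and "bc" cannot both be counted, so the answer is 1.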