0% found this document useful (0 votes)
22 views9 pages

Fast, Memory Efficient Low-Rank Approximation of SimRank

Uploaded by

yajnik.1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
22 views9 pages

Fast, Memory Efficient Low-Rank Approximation of SimRank

Uploaded by

yajnik.1
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd

Fast, memory efficient low-rank approximation of

SimRank
∗ †
I.V. Oseledets G.V. Ovchinnikov
arXiv:1410.0717v1 [cs.DS] 2 Oct 2014

October 6, 2014

Abstract
SimRank is a well-known similarity measure between graph vertices.
In this paper novel low-rank approximation of SimRank is proposed.
Keywords: SimRank, low-rank approximations, SVD

1 Introduction
SimRank [4] is a topologically induced similarity measure between graph ver-
tices which could be very useful in many applications, such as relation mining
or document-by-document querying but, high computational O(n3 ) and storage
O(n2 ) costs (here n is number of vertexes in the graph) hinder it wider adop-
tion. There have been numerous attempts at solving those issues. [5] proposes
solution of Sylvester equation as an approximation of SimRank. [3] proposes
GPU-based solvers, and [6] describes some optimizations via threshold-sieving
heuristics. While some of those techniques improve computational costs or pro-
pose hardware efficient methods none of the papers takes on improving memory
consumption asymptotic. There is an empirical data from [1] which suggests an
existence of low-dimensional ’core’ of SimRank for many real-world data, yet
authors do not explore this idea. In this paper we propose to use low-rank ma-
trices for efficient storage and computation of SimRank at the price of slightly
lower accuracy of SimRank scores.

2 Definitions
In the foundation of SimRank definition lies idea that ”two objects are similar
if they are referenced by the similar objects”. Next, it is proposed that object
similarity lies between 0 and 1 with object being maximally similar to itself
∗ ...
† The work is supported by RFBR grant 14-07-31297 mol a

1
with similarity 1. Similarity between objects a and b, s(a, b) is defined in the
following way:

1, a = b,


 0, if I(a) = ∅ or I(b) = ∅,


s(a, b) = |I(a)| |I(b)|
c X X
s(Ii (a), Ij (b)), otherwise,


 |I(a)||I(b)|

i=1 j=1

here I(v) is the set of in-neighbours of vertex v, constant c ∈ (0, 1). Writing n2
equations (one equation for each pair of vertices) gives us a system with unique
solution [4].
One can find a solution of such system by using the following iterative pro-
cess:

1, a = b,
R0 (a, b) = (1)
0, otherwise,

 1, a = b,
 0, if I(a) = ∅ or I(b) = ∅,

Rk+1 (a, b) = c X X (2)
 Rk (w, v), otherwise.
 |I(a)||I(b)|

v∈I(a) w∈I(b)

It is shown in [4] that Rk (a, b) converges to s(a, b).


The iterative process (2) can be written in the matrix form as follows:

Sk+1 = cW T Sk W − c diag(W T Sk W ) + I, S0 = I, (3)

here W is the adjacency matrix A normalized by columns W = AD−1 with D


being a diagonal matrix with
n
X
Di,i = Ai,j .
j=1

3 Low-rank approximation
Due to SimRank itself being a symmetric matrix (which can be seen from (3))
we choose symmetric form for the low-rank approximation

S̃ = I + U DU T , (4)

where I is an identity matrix of n-th order, U and D are correspondingly or-


thonormal and diagonal matrices of sizes n × m and m × m.
Instead of the iterative process (3) converging to the precise SimRank matrix
S we consider the following iterative process:

S̃k+1 = W T S̃k W − diag(W T S̃k W ) + I. (5)

2
Substituting (4) into (5) with arbitrary initial matrices U0 and D0 gives us the
following equation for Uk+1 and Dk+1 :
T
I + Uk+1 Dk+1 Uk+1 = W T (I + Uk Dk UkT )W − diag(W T W + W T Uk DUkT W ) + I,

from which we derive


T
Uk+1 Dk+1 Uk+1 = Mk ,
where Mk = W T (I + Uk UkT )W − diag(W T W + W T Uk UkT W ). Calculation of
matrices Uk+1 and Dk+1 is based on the eigenvalue decomposition of matrix M :

Mk = Ũ D̃Ũ T ,

here Ũ , D̃ are matrices of eigenvectors and eigenvalues correspondingly, and


D̃ is ordered from greatest to least along its diagonal. Using a priori chosen
approximation rank r we from matrices Uk+1 and Dk+1 :

Uk+1 = Ũr , Dk+1 = D̃r ,

here Ũr is matrix of the first r eigenvectors and D̃r is a matrix of first r eigen-
values. Convergence analysis is complicated due to k-rank projection operation
not being Lipschitz.

4 Advantages over existing methods


The proposed method has memory requirements O(n max(r, d)) and computa-
tional complexity O(rkn2 ). The iterative algorithm from the original paper [4]
has memory requirements O(n2 ) and computational complexity O(kn2 d2 ) which
is O(kn4 ) in worst-case scenario. A number of attempts was made to speed-
up SimRank computation. [6] proposes three optimization strategies: skipping
computation of similarity between some node pairs, optimizations of SimRank
iterative process and pruning. In the end it improves worst-case computational
complexity of algorithm to min(O(nl), O(n3 / log2 (n)), with l denoting the num-
ber of object-to-object relationships. The memory requirements are the same
as with the original paper, namely O(n2 ).
Another approach, based on the approximation of SimRank by Sylvester
equation is pioneered in [5]. While this approach provides nice and well un-
derstood algebraic equation with already available solvers implemented in all
popular programming languages, it requires computation of W −1 which is a full
matrix, while W is usually sparse. There are number of works extending on this
idea, for example [7] proposes low-rank approximation to Sylvester equation or
[8] proposes low-rank approximation for SimRank equation written in the form
of discrete Lyapunov equation. It should be noted that those equations are in
fact only approximation to SimRank. It can be easily seen from an example

3
graph provided in original paper [4]. The adjacency matrix is
 
0 1.0 0 1.0 0
 0 0 1.0 0 0 
 
W =  1.0 0 0 0 0 .

 0 0 0 0 1.0 
0 0 0 1.0 0
The result for original iterative simrank algorithm gives the following SimRank
matrix:  
1.0 0 0 0.1323 0.03388

 0 1.0 0 0.4136 0.1059  

 0 0 1.0 0.04235 0.3308  
 0.1323 0.4136 0.04235 1.0 0.08822 
0.03388 0.1059 0.3308 0.08822 1.0
The result of the Lyapunov equation based approximation from [5] gives:
3.013 · 10−15 8.797 · 10−16 0.1323 0.03388
 
1.0
 6.25 · 10−16 1.0 2.796 · 10−15 0.4136 0.1059 
 3.2 · 10−15 6.729 · 10−16
 
 1.0 0.04235 0.3308  
 0.1323 0.4136 0.04235 0.5399 0.08822 
0.03388 0.1059 0.3308 0.08822 0.632
As we can see even for such small and simple example the discrepancy not
negligible: the S4,4 and S5,5 significantly differ from 1. On slightly more complex
tests non-diagonal elements will differ, too. For example, we can modify this
graph adding an edge between StudentA and StudentB (see the Figure 1) which
gives us the following adjacency matrix:
 
0 1.0 0 1.0 0
 0 0 1.0 0 0 
 
W =  1.0 0
 0 0 1.0  
 0 0 0 0 1.0 
0 0 1.0 1.0 0
The result for original iterative SimRank algorithm:
 
1.0 0.1809 0.2262 0.1993 0.5523
 0.1809 1.0 0.2933 0.6209 0.1702 
 
 0.2262 0.2933 1.0 0.3807 0.2721 
 ,
 0.1993 0.6209 0.3807 1.0 0.1744 
0.5523 0.1702 0.2721 0.1744 1.0
and result of algorithm from [5]:
 
0.5653 0.08779 0.1097 0.09898 0.2538
 0.08779 0.6522 0.1366 0.3276 0.08348 
 
 0.1097 0.1366 0.4566 0.1778 0.1377 
 .
 0.09898 0.3276 0.1778 0.5073 0.08661 
0.2538 0.08348 0.1377 0.08661 0.4639
As one can see the differences are too large to be ignored.

4
Figure 1: Example graph from original SimRank paper. a) Graph b) Corre-
sponding SimRank scores

5 Speeding up with probabilistic spectral de-


composition
Spectral decomposition of the matrix M could be replaced by its probabilistic
approximation from work [2]. Here is a short description of used algorithm,
provided for readers convenience.
Algorithm 1: Probabilistic spectral decomposition
Data: Symmetric matrix M , approximation rank r, oversampling
parameter p
Result: Approximate eigenvalue decomposition M ≈ U ΛU T
1. Select approximation rank r and oversampling parameter p;
2. Generate n × (r + p) matrix Z with elements drawn from standard
Gaussian distribution;
3. Form n × (r + p) matrix Y = M Z;
4. Construct an n × (r + p) matrix Q whose columns form an
orthonormal basis for the range of Y ;
5. Form (r + p) × (r + p) matrix B = QT M Q;
6. Compute spectral decomposition B = V T ΛV ;
7. Use QV as approximation of eigenvectors of matrix M ;
For more detailed information about probabilistic approaches for low-rank
approximation we direct reader to [2].

5
DIMACS10/delaunay n10 DIMACS10/data
1.0 1.0
Our method Our method
Naive approach Naive approach
0.9
0.9

0.8
0.8
kS−Sapprox k2

kS−Sapprox k2
0.7
kSk2

kSk2
0.7

0.6

0.6
0.5

0.5
0.4

0.3 0.4
10 20 30 40 50 60 70 80 90 100 10 20 30 40 50 60 70 80 90 100
Rank of approximation Rank of approximation

Figure 2: Dependence of relative error on approximation rank

Using this our final algorithm can be formulated as follows:


Algorithm 2: Low-rank SimRank approximation
Data: Column normalized adjacency matrix W , approximation rank r,
oversampling parameter p, number of iterations k
Result: Matrices U and D such S ≈ I + U DU T
set initial values for U , D ;
for i=1 . . . k do
A1 ← W T W Z;
A2 ← W T U DU T W Z;
A3 ← diag(A1 + A2 )Z;
U, D ← probabilistic spectral decomposition of A1 + A2 − A3 ;
end
We can also have a better approximation of SimRank in the form

S ≈ I + W T W + W T U DU T W − W T diag(W T W + W T U DU T W )W, (6)

which is essentially one iterative step done as in original iterative algorithm (3)
with approximative SimRank used as initial guess. While this form will result
in dense matrix this is form can be effectively used for querying.

6 Numerical experiments
The proposed method provides fixed memory and computation complexity, but
the question about the approximation quality is still open. While no amount
of numerical experiments can replace good theoretical estimates it nevertheless
sheds some light on proposed method behaviour. All experiments are done in
MATLAB 2014a.
In this section we present our experimental study of proposed algorithm on
a graphs from DIMACS10 Challenge Collection 1 . First we studied dependence
1 https://www.cise.ufl.edu/research/sparse/matrices/DIMACS10/

6
DIMACS10/delaunay n10 DIMACS10/data
1.0 1.0
Our method Our method
Our method with p=10 Our method with p=10
Naive approach Naive approach
0.8 0.8
kS−Sapprox k2

kS−Sapprox k2
0.6 0.6
kSk2

kSk2
0.4 0.4

0.2 0.2

0.0 0.0
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9
Parameter c Parameter c

Figure 3: Dependence of relative error on c

of relative error of our method on approximation rank. For comparison we used


best rank r approximation of the SimRank matrix. We call this method naive
because it does not take into account structure of the SimRank matrix. Methods
based on Sylvester or Lyapunov equation usually employ some kind of low-
rank approximation and, if those equations would describe SimRank precisely,
it would be their upper limit. As can be seen from Figure 2 the proposed
method provides superior accuracy. For those experiments c = 0.5 was selected
and no oversampling (p = 0) was used. The second part of the experiments
shows dependence of relative error on c with and without oversampling. Based
on estimations from [2] p = 10 is chosen. As we can see from Figure 3 our
method provides better approximation for most part of values of c. As expected
oversampling provides some improvement.
name n rank nnz nnz/n corr err
chesapeake 39 3 340 8.72 0.57 0.12
data 2851 285 30186 10.59 0.82 0.42
delaunay n10 1024 102 6112 5.97 0.67 0.30
delaunay n11 2048 204 12254 5.98 0.67 0.41
delaunay n12 4096 409 24528 5.99 0.68 0.60
delaunay n13 8192 819 49094 5.99 0.68 0.82
uk 4824 482 13674 2.83 0.24 0.66
vsp data and seymourl 9167 916 111732 12.19 0.89 0.39
Here nnz is the number of nonzero elements in the adjacency matrix, nnz/n
is the average degree of vertex, corr is correlation between S and S̃ with diago-
nals removed and
S S̃
err = k − k2 .
kSk2 kS̃k2
From this table is clearly seen that the accuracy of our method depends on the
connectivity of the given graph, the higher the connectivity the greater accuracy
will be.

7
6.1 Experiment with Wikipedia
We used Simple English Wikipedia corpus to find semantic relatedness between
concepts. We had 150495 entities(order of our matrix) which was a slightly
higher than the number of articles in the given wiki at the time of writing,
because some of those entities are redirection stubs. We used undirected graph
representation of inter-wiki links which gave us 4454023 non zero elements in
adjacency matrix. For graph that large direct computation of SimRank is
infeasible. We used the following parameters for the experiment: c = 0.3,
rank = 6000, no-oversampling and did ten iterations. While original paper [4]
suggests c = 0.8 later it was suggested to use c = 0.6 for better results [6] and we
choose c = 0.3 because it gave us subjectively better results. In experiment we
used virtual server (VZ container) with 16 CPU cores and 100GiB RAM avail-
able (host node has 32 cores: 4 CPUs, each is 8-core AMD Opteron Processor
6272, 128GiB RAM). With this setup computations took roughly 40 hours.
Some examples provided in the table below. The first row is the word for
which most similar words were queried, then in each of the columns most similar
words are listed ordered by their SimRank score. The scores themselves would
take too much space (they differ in 4-th or 5-th significant figures) and hence
are omitted.

Those results look very promising, despite small approximation rank (com-
pared to matrix order). Two factors contribute to this: high average vertex
degree of ≈ 29.6 and empirically observed localization of errors in more dissim-
ilar items. Investigation of both factors requires further work.

7 Acknowledgements
We are particularly grateful for the valuable technical assistance given by A.A.
Sadovsky and D.A. Podlesnykh.

8
References
[1] Yuanzhe Cai, Gao Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, and
Xiaoyong Du. Efficient algorithm for computing link-based similarity in real
world networks. In Data Mining, 2009. ICDM ’09. Ninth IEEE International
Conference on, pages 734–739, Dec 2009.
[2] N. Halko, P. Martinsson, and J. Tropp. Finding structure with randomness:
Probabilistic algorithms for constructing approximate matrix decomposi-
tions. SIAM Review, 53(2):217–288, 2011.
[3] Guoming He, Cuiping Li, Hong Chen, Xiaoyong Du, and Haijun Feng. Using
graphics processors for high performance simrank computation. Knowledge
and Data Engineering, IEEE Transactions on, 24(9):1711–1725, Sept 2012.

[4] Glen Jeh and Jennifer Widom. Simrank: A measure of structural-context


similarity. In Proceedings of the Eighth ACM SIGKDD International Con-
ference on Knowledge Discovery and Data Mining, KDD ’02, pages 538–543,
New York, NY, USA, 2002. ACM.
[5] Cuiping Li, Jiawei Han, Guoming He, Xin Jin, Yizhou Sun, Yintao Yu, and
Tianyi Wu. Fast computation of simrank for static and dynamic information
networks. In Proceedings of the 13th International Conference on Extending
Database Technology, EDBT ’10, pages 465–476, New York, NY, USA, 2010.
ACM.
[6] Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, and Denis Turdakov. Ac-
curacy estimate and optimization techniques for simrank computation. The
VLDB Journal, 19(1):45–66, 2010.
[7] Makoto Onizuka, Yasuhiro Fujiwara, Hiroaki Shiokawa, and Makoto Nakat-
suji. Efficient search algorithm for simrank. In Proceedings of the 2013
IEEE International Conference on Data Engineering (ICDE 2013), ICDE
’13, pages 589–600, Washington, DC, USA, 2013. IEEE Computer Society.
[8] Weiren Yu and Julie A. McCann. Sig-sr: Simrank search over singular
graphs. In Proceedings of the 37th International ACM SIGIR Conference on
Research & Development in Information Retrieval, SIGIR ’14, pages
859–862, New York, NY, USA, 2014. ACM.

You might also like