SimRank explained

SimRank is a general similarity measure based on a simple and intuitive graph-theoretic model. It is applicable in any domain with object-to-object relationships, and it measures the similarity of the structural context in which objects occur, based on their relationships with other objects. Effectively, SimRank is a measure that says "two objects are considered to be similar if they are referenced by similar objects." Although SimRank is widely adopted, it may output unreasonable similarity scores, which are influenced by different factors. These problems can be addressed in several ways, such as introducing an evidence weight factor,[1] inserting additional terms that are neglected by SimRank,[2] or using PageRank-based alternatives.[3]

Introduction

Many applications require a measure of "similarity" between objects. One obvious example is the "find-similar-document" query, on traditional text corpora or the World-Wide Web. More generally, a similarity measure can be used to cluster objects, such as for collaborative filtering in a recommender system, in which “similar” users and items are grouped based on the users’ preferences.

Various aspects of objects can be used to determine similarity, usually depending on the domain and the appropriate definition of similarity for that domain. In a document corpus, matching text may be used, and for collaborative filtering, similar users may be identified by common preferences. SimRank is a general approach that exploits the object-to-object relationships found in many domains of interest. On the Web, for example, two pages are related if there are hyperlinks between them. A similar approach can be applied to scientific papers and their citations, or to any other document corpus with cross-reference information. In the case of recommender systems, a user’s preference for an item constitutes a relationship between the user and the item. Such domains are naturally modeled as graphs, with nodes representing objects and edges representing relationships.

The intuition behind the SimRank algorithm is that, in many domains, similar objects are referenced by similar objects. More precisely, objects $a$ and $b$ are considered to be similar if they are pointed to from objects $c$ and $d$, respectively, and $c$ and $d$ are themselves similar. The base case is that objects are maximally similar to themselves.[4]

It is important to note that SimRank is a general algorithm that determines only the similarity of structural context. SimRank applies to any domain where there are enough relevant relationships between objects to base at least some notion of similarity on relationships. Obviously, similarity of other domain-specific aspects is important as well; these can, and should, be combined with relational structural-context similarity for an overall similarity measure. For example, for Web pages SimRank can be combined with traditional textual similarity; the same idea applies to scientific papers or other document corpora. For recommendation systems, there may be built-in known similarities between items (e.g., both computers, both clothing, etc.), as well as similarities between users (e.g., same gender, same spending level). Again, these similarities can be combined with the similarity scores that are computed based on preference patterns, in order to produce an overall similarity measure.

Basic SimRank equation

For a node $v$ in a directed graph, we denote by $I(v)$ and $O(v)$ the sets of in-neighbors and out-neighbors of $v$, respectively. Individual in-neighbors are denoted as $I_i(v)$, for $1 \le i \le \left|I(v)\right|$, and individual out-neighbors are denoted as $O_i(v)$, for $1 \le i \le \left|O(v)\right|$.

Let us denote the similarity between objects $a$ and $b$ by $s(a,b) \in [0,1]$. Following the earlier motivation, a recursive equation is written for $s(a,b)$. If $a = b$ then $s(a,b)$ is defined to be $1$. Otherwise,

$$s(a,b) = \frac{C}{\left|I(a)\right| \left|I(b)\right|} \sum_{i=1}^{\left|I(a)\right|} \sum_{j=1}^{\left|I(b)\right|} s(I_i(a), I_j(b))$$

where $C$ is a constant between $0$ and $1$. A slight technicality here is that either $a$ or $b$ may not have any in-neighbors. Since there is no way to infer any similarity between $a$ and $b$ in this case, similarity is set to $s(a,b) = 0$, so the summation in the above equation is defined to be $0$ when $I(a) = \emptyset$ or $I(b) = \emptyset$.
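As a quick check of the recursion, consider a hypothetical graph (not taken from the original text) in which a single node $c$ points to both $a$ and $b$, so that $I(a) = I(b) = \{c\}$. The double sum then has a single term:

$$s(a,b) = \frac{C}{1 \cdot 1}\, s(c,c) = C .$$

If a further node $d$ also points to $a$, so that $I(a) = \{c, d\}$ and $I(b) = \{c\}$, and neither $c$ nor $d$ has any in-neighbors, the same equation gives

$$s(a,b) = \frac{C}{2 \cdot 1}\bigl(s(c,c) + s(d,c)\bigr) = \frac{C}{2}\,(1 + 0) = \frac{C}{2},$$

because $s(d,c) = 0$ when $c$ and $d$ have no in-neighbors. With $C = 0.8$ the two scores are $0.8$ and $0.4$.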

Matrix representation of SimRank

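Given an arbitrary constant $C$ between $0$ and $1$, let $\mathbf{S}$ be the similarity matrix whose entry $[\mathbf{S}]_{a,b}$ denotes the similarity score $s(a,b)$, and let $\mathbf{A}$ be the column-normalized adjacency matrix whose entry $[\mathbf{A}]_{a,b} = \tfrac{1}{|I(b)|}$ if there is an edge from $a$ to $b$, and $0$ otherwise. Then, in matrix notation, SimRank can be formulated as

$$\mathbf{S} = \max\{ C \cdot (\mathbf{A}^{T} \cdot \mathbf{S} \cdot \mathbf{A}),\ \mathbf{I} \},$$

where $\mathbf{I}$ is an identity matrix and the maximum is taken entry-wise.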
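The matrix form can be iterated directly. The following is a minimal pure-Python sketch, assuming the formulation is read as the entry-wise maximum of $C \cdot \mathbf{A}^{T} \mathbf{S} \mathbf{A}$ and $\mathbf{I}$, that nodes are numbered `0..n-1`, and that the edge list contains no duplicates. The function name `simrank_matrix` and its parameters are illustrative, not part of any library.

```python
def simrank_matrix(edges, n, C=0.8, iterations=10):
    """Iterate S <- max(C * A^T S A, I) entry-wise, starting from S = I."""
    # Column-normalized adjacency: A[a][b] = 1/|I(b)| if there is an edge a -> b.
    indegree = [0] * n
    for _, b in edges:
        indegree[b] += 1
    A = [[0.0] * n for _ in range(n)]
    for a, b in edges:
        A[a][b] = 1.0 / indegree[b]
    A_t = [list(row) for row in zip(*A)]  # transpose of A

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    S = [row[:] for row in identity]
    for _ in range(iterations):
        M = matmul(matmul(A_t, S), A)
        S = [[max(C * M[i][j], identity[i][j]) for j in range(n)]
             for i in range(n)]
    return S


# Hypothetical graph: node 0 points to nodes 1 and 2.
S = simrank_matrix([(0, 1), (0, 2)], n=3)
print(S[1][2])  # nodes 1 and 2 share their only in-neighbor, so the score is about C
```

The dense triple loop is only meant to mirror the formula; it costs on the order of $n^3$ operations per iteration and is not how a large graph would be handled.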

Computing SimRank

A solution to the SimRank equations for a graph $G$ can be reached by iteration to a fixed point. Let $n$ be the number of nodes in $G$. For each iteration $k$, we can keep $n^2$ entries $s_k(*,*)$, where $s_k(a,b)$ gives the score between $a$ and $b$ on iteration $k$. We successively compute $s_{k+1}(*,*)$ based on $s_k(*,*)$. We start with $s_0(*,*)$, where each $s_0(a,b)$ is a lower bound on the actual SimRank score $s(a,b)$:

$$s_0(a,b) = \begin{cases} 1, & \text{if } a = b, \\ 0, & \text{if } a \neq b. \end{cases}$$

To compute $s_{k+1}(a,b)$ from $s_k(*,*)$, we use the basic SimRank equation to get:

$$s_{k+1}(a,b) = \frac{C}{\left|I(a)\right| \left|I(b)\right|} \sum_{i=1}^{\left|I(a)\right|} \sum_{j=1}^{\left|I(b)\right|} s_k(I_i(a), I_j(b))$$

for $a \neq b$, and $s_{k+1}(a,b) = 1$ for $a = b$. That is, on each iteration $k+1$, we update the similarity of $(a,b)$ using the similarity scores of the in-neighbors of $(a,b)$ from the previous iteration $k$, according to the basic SimRank equation. The values $s_k(*,*)$ are nondecreasing as $k$ increases. It was shown in [4] that the values converge to limits satisfying the basic SimRank equation, the SimRank scores $s(*,*)$, i.e., for all $a, b \in V$, $\lim_{k \to \infty} s_k(a,b) = s(a,b)$.
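The iteration described above can be sketched in a few lines of Python. This is a naive illustration rather than an optimized implementation: the function name `simrank`, the dictionary-of-lists graph encoding, and the example graph are all assumptions made for this sketch.

```python
from itertools import product


def simrank(in_neighbors, C=0.8, iterations=5):
    """Naive fixed-point iteration; in_neighbors maps each node to its in-neighbors."""
    nodes = list(in_neighbors)
    # s_0: 1 on the diagonal, 0 everywhere else.
    scores = {(a, b): 1.0 if a == b else 0.0
              for a, b in product(nodes, repeat=2)}
    for _ in range(iterations):
        updated = {}
        for a, b in product(nodes, repeat=2):
            in_a, in_b = in_neighbors[a], in_neighbors[b]
            if a == b:
                updated[(a, b)] = 1.0
            elif not in_a or not in_b:
                # No in-neighbors on one side: the sum is defined to be 0.
                updated[(a, b)] = 0.0
            else:
                total = sum(scores[(i, j)] for i in in_a for j in in_b)
                updated[(a, b)] = C * total / (len(in_a) * len(in_b))
        scores = updated
    return scores


# Hypothetical graph: u -> a, u -> b, a -> x, b -> y.
graph = {"u": [], "a": ["u"], "b": ["u"], "x": ["a"], "y": ["b"]}
one_step = simrank(graph, iterations=1)
two_steps = simrank(graph, iterations=2)
print(one_step[("a", "b")], one_step[("x", "y")])
print(two_steps[("a", "b")], two_steps[("x", "y")])
```

On this graph the first iteration already gives $s_1(a,b) = C$, while $s_1(x,y)$ is still $0$; the second iteration raises $s_2(x,y)$ to $C^2$, which illustrates how the scores grow monotonically as similarity propagates along the edges.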

The original SimRank proposal suggested choosing the decay factor $C = 0.8$ and a fixed number $K = 5$ of iterations to perform. However, later research [5] showed that these values for $C$ and $K$ generally imply relatively low accuracy of the iteratively computed SimRank scores. To guarantee more accurate results, the latter paper suggests either using a smaller decay factor (in particular, $C = 0.6$) or taking more iterations.
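To make the trade-off concrete, the sketch below assumes the error estimate usually attributed to [5], namely $s(a,b) - s_k(a,b) \le C^{k+1}$. That bound is not stated in the text above, so treat it as an assumption of this example; the helper names are likewise illustrative.

```python
def error_bound(C, K):
    """Assumed worst-case gap between s and s_K: C ** (K + 1)."""
    return C ** (K + 1)


def iterations_needed(C, eps):
    """Smallest K for which the assumed bound C ** (K + 1) drops to eps or below."""
    K = 0
    while C ** (K + 1) > eps:
        K += 1
    return K


print(error_bound(0.8, 5))           # bound for the originally suggested C and K
print(error_bound(0.6, 5))           # bound for the smaller decay factor
print(iterations_needed(0.8, 0.05))  # iterations needed with C = 0.8 for eps = 0.05
```

Under this assumption, $C = 0.8$ with $K = 5$ only guarantees an error below roughly $0.26$, whereas $C = 0.6$ with the same $K$ brings the bound down to roughly $0.047$.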

CoSimRank

CoSimRank is a variant of SimRank with the advantage of also having a local formulation, i.e. CoSimRank can be computed for a single node pair.[6] Let $\mathbf{S}$ be the similarity matrix whose entry $[\mathbf{S}]_{a,b}$ denotes the similarity score $s(a,b)$, and let $\mathbf{A}$ be the column-normalized adjacency matrix. Then, in matrix notation, CoSimRank can be formulated as:

$$\mathbf{S} = C \cdot (\mathbf{A}^{T} \cdot \mathbf{S} \cdot \mathbf{A}) + \mathbf{I},$$

where $\mathbf{I}$ is an identity matrix. To compute the similarity score of only a single node pair, let $p^{(0)}(i) = e_i$, with $e_i$ being a vector of the standard basis, i.e., the $i$-th entry is 1 and all other entries are 0. Then, CoSimRank can be computed in two steps:

$$p^{(k)} = \mathbf{A}\, p^{(k-1)}$$

$$s(i,j) = \sum_{k=0}^{\infty} C^{k} \langle p^{(k)}(i), p^{(k)}(j) \rangle$$

Step one can be seen as a simplified version of Personalized PageRank. Step two sums up the vector similarity of each iteration. Both the matrix and the local representation compute the same similarity score. CoSimRank can also be used to compute the similarity of sets of nodes, by modifying $p^{(0)}(i)$.
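The two-step local computation can be sketched as follows. The infinite sum is truncated after a fixed number of terms, and $\mathbf{A}$ is assumed to be the same column-normalized matrix defined for SimRank (entry $1/|I(b)|$ for an edge $a \to b$). The function name `cosimrank_pair`, the edge-list encoding, and the truncation length are assumptions of this sketch.

```python
def cosimrank_pair(edges, n, i, j, C=0.8, terms=20):
    """Truncated CoSimRank score for the single node pair (i, j)."""
    indegree = [0] * n
    for _, b in edges:
        indegree[b] += 1

    def step(p):
        # (A p)[a] = sum over edges a -> b of p[b] / |I(b)|.
        q = [0.0] * n
        for a, b in edges:
            q[a] += p[b] / indegree[b]
        return q

    p_i = [1.0 if v == i else 0.0 for v in range(n)]  # p^(0)(i) = e_i
    p_j = [1.0 if v == j else 0.0 for v in range(n)]  # p^(0)(j) = e_j
    score, weight = 0.0, 1.0
    for _ in range(terms):
        # Add C^k * <p^(k)(i), p^(k)(j)>, then advance both vectors one step.
        score += weight * sum(x * y for x, y in zip(p_i, p_j))
        p_i, p_j = step(p_i), step(p_j)
        weight *= C
    return score


# Hypothetical graph: node 0 points to nodes 1 and 2.
edges = [(0, 1), (0, 2)]
print(cosimrank_pair(edges, 3, 1, 2))
print(cosimrank_pair(edges, 3, 1, 1))
```

Only the two vectors for $i$ and $j$ are propagated, so no $n \times n$ matrix is ever stored. Note that, because of the additive $\mathbf{I}$ term in the matrix formulation, the self-similarity computed this way can exceed $1$, unlike in SimRank.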

Further research on SimRank

Partial Sums Memoization

Lizorkin et al.[5] proposed three optimization techniques for speeding up the computation of SimRank:

  1. Essential nodes selection may eliminate the computation of a fraction of node pairs with a priori zero scores.
  2. Partial sums memoization can effectively reduce repeated calculations of the similarity among different node pairs by caching part of similarity summations for later reuse.
  3. A threshold setting on the similarity enables a further reduction in the number of node pairs to be computed.

In particular, the second technique, partial sums memoization, plays the main role in speeding up the computation of SimRank from $\mathcal{O}(K d^2 n^2)$ to $\mathcal{O}(K d n^2)$, where $K$ is the number of iterations, $d$ is the average degree of the graph, and $n$ is the number of nodes in the graph. The central idea of partial sums memoization consists of two steps:

First, the partial sums over $I(a)$ are memoized as

$$\text{Partial}_{I(a)}^{s_k}(j) = \sum_{i \in I(a)} s_k(i,j), \qquad (\forall j \in I(b))$$

and then $s_{k+1}(a,b)$ is iteratively computed from $\text{Partial}_{I(a)}^{s_k}(j)$ as

$$s_{k+1}(a,b) = \frac{C}{|I(a)||I(b)|} \sum_{j \in I(b)} \text{Partial}_{I(a)}^{s_k}(j).$$

Consequently, the results of $\text{Partial}_{I(a)}^{s_k}(j)$, $\forall j \in I(b)$, can be reused later when we compute the similarities $s_{k+1}(a,*)$ for a given vertex $a$ as the first argument.
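The two steps can be sketched in Python as below. This is an illustration of the memoization idea only, without the essential-node selection or thresholding from [5]. For simplicity it computes the partial sum for every node $j$ rather than only for those appearing in some $I(b)$; the function name and graph encoding are assumptions of this sketch.

```python
def simrank_partial_sums(in_neighbors, C=0.8, iterations=5):
    """SimRank iteration that reuses the partial sums over I(a) for every b."""
    nodes = list(in_neighbors)
    scores = {a: {b: 1.0 if a == b else 0.0 for b in nodes} for a in nodes}
    for _ in range(iterations):
        updated = {a: {} for a in nodes}
        for a in nodes:
            in_a = in_neighbors[a]
            # Step 1: Partial_{I(a)}(j) = sum of s_k(i, j) over i in I(a),
            # computed once for this a and cached for all j.
            partial = {j: sum(scores[i][j] for i in in_a) for j in nodes}
            for b in nodes:
                in_b = in_neighbors[b]
                if a == b:
                    updated[a][b] = 1.0
                elif not in_a or not in_b:
                    updated[a][b] = 0.0
                else:
                    # Step 2: a single sum over I(b) of the cached partial sums.
                    total = sum(partial[j] for j in in_b)
                    updated[a][b] = C * total / (len(in_a) * len(in_b))
        scores = updated
    return scores


# Hypothetical graph: u -> a, u -> b, a -> x, b -> y.
graph = {"u": [], "a": ["u"], "b": ["u"], "x": ["a"], "y": ["b"]}
result = simrank_partial_sums(graph, iterations=2)
print(result["a"]["b"], result["x"]["y"])
```

For each $a$, building the cache costs about $d \cdot n$ additions and each of the $n$ values $s_{k+1}(a,b)$ then costs about $d$ more, which is where the $d\,n^2$ per-iteration cost comes from, compared with $d^2 n^2$ for the nested double sum.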

Notes and References

  1. I. Antonellis, H. Garcia-Molina and C.-C. Chang. Simrank++: Query Rewriting through Link Analysis of the Click Graph. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 408--421. http://dbpubs.stanford.edu/pub/showDoc.Fulltext?lang=en&doc=2008-17&format=pdf&compression=&name=2008-17.pdf
  2. W. Yu, X. Lin, W. Zhang, L. Chang, and J. Pei. More is Simpler: Effectively and Efficiently Assessing Node-Pair Similarities Based on Hyperlinks. In VLDB '13: Proceedings of the 39th International Conference on Very Large Data Bases, pages 13--24. http://www.vldb.org/pvldb/vol7/p13-yu.pdf
  3. H. Chen and C. L. Giles. ASCOS++: An Asymmetric Similarity Measure for Weighted Networks to Address the Problem of SimRank. ACM Transactions on Knowledge Discovery from Data (TKDD) 10.2, 2015. http://clgiles.ist.psu.edu/pubs/TKDD2015.pdf
  4. G. Jeh and J. Widom. SimRank: A Measure of Structural-Context Similarity. In KDD '02: Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 538--543. ACM Press, 2002. Archived 2008-05-12: https://web.archive.org/web/20080512152518/http://www-cs-students.stanford.edu/~glenj/simrank.pdf
  5. D. Lizorkin, P. Velikhov, M. Grinev and D. Turdakov. Accuracy Estimate and Optimization Techniques for SimRank Computation. In VLDB '08: Proceedings of the 34th International Conference on Very Large Data Bases, pages 422--433. Archived 2009-04-07: https://web.archive.org/web/20090407093025/http://modis.ispras.ru/Lizorkin/Publications/simrank_accuracy.pdf
  6. S. Rothe and H. Schütze. CoSimRank: A Flexible & Efficient Graph-Theoretic Similarity Measure. In ACL '14: Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1392--1402. http://acl2014.org/acl2014/P14-1/pdf/P14-1131.pdf
  7. D. Fogaras and B. Racz. Scaling link-based similarity search. In WWW '05: Proceedings of the 14th International Conference on World Wide Web, pages 641--650, New York, NY, USA, 2005. ACM. https://web.archive.org/web/20140311193305/http://www2005.org/docs/p641.pdf
  8. I. Antonellis, H. Garcia-Molina and C.-C. Chang. Simrank++: Query Rewriting through Link Analysis of the Click Graph. Proceedings of the VLDB Endowment 1.1 (2008), pages 408--421.
  9. W. Yu, X. Lin and W. Zhang. Towards Efficient SimRank Computation on Large Networks. In ICDE '13: Proceedings of the 29th IEEE International Conference on Data Engineering, pages 601--612. Archived 2014-05-12: https://web.archive.org/web/20140512220118/http://www.cse.unsw.edu.au/~weirenyu/pubs/icde13.pdf