In network theory, link prediction is the problem of predicting the existence of a link between two entities in a network. Examples of link prediction include predicting friendship links among users in a social network, predicting co-authorship links in a citation network, and predicting interactions between genes and proteins in a biological network. Link prediction can also have a temporal aspect, where, given a snapshot of the set of links at time $t$, the goal is to predict the links at time $t+1$.
Consider a network $G = (V, E)$, where $V$ denotes the set of entity nodes in the network and $E \subseteq |V| \times |V|$ denotes the set of true links between entities in the network. Given the observed links, the goal of link prediction is to identify the true links among a set of potential links $E'$. In the temporal formulation, given the links observed at time $t$, the goal is to predict the links present at time $t+1$.
In the binary classification formulation of the link prediction task, the potential links are classified as either true links or false links. Link prediction approaches for this setting learn a classifier $M_b$ that maps links in $E'$ to positive and negative labels, i.e., $M_b : E' \to \{0, 1\}$. In the probability estimation formulation, the approaches learn a model $M_p$ that maps links in $E'$ to a probability of existence, i.e., $M_p : E' \to [0, 1]$.
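As an illustrative sketch (not any particular published method), the binary formulation can be instantiated with an off-the-shelf classifier trained on simple topological features of candidate node pairs; the example graph, feature set, and negative sampling below are assumptions chosen for brevity:

```python
# Sketch: link prediction as binary classification over candidate node pairs.
# The graph, features, and negative sampling are illustrative assumptions.
import itertools
import networkx as nx
import numpy as np
from sklearn.linear_model import LogisticRegression

G = nx.karate_club_graph()  # small example network

def edge_features(G, u, v):
    # Simple topological features for the candidate pair (u, v).
    common = len(list(nx.common_neighbors(G, u, v)))
    return [common, G.degree(u), G.degree(v)]

# Observed edges serve as positives; node pairs without an edge as negatives.
positives = list(G.edges())
negatives = [p for p in itertools.combinations(G.nodes(), 2) if not G.has_edge(*p)]

X = np.array([edge_features(G, u, v) for u, v in positives + negatives])
y = np.array([1] * len(positives) + [0] * len(negatives))

# The fitted classifier plays the role of M_b; predict_proba acts as M_p.
clf = LogisticRegression(max_iter=1000).fit(X, y)
link_probabilities = clf.predict_proba(X)[:, 1]
```

In practice a held-out subset of edges is usually hidden from the graph before the features are computed, so that the model is evaluated on links it has not seen.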
Single-link approaches learn a model that classifies each potential link independently. Structured prediction approaches capture the correlations between potential links by formulating the task as a collective link prediction task. Collective link prediction approaches learn a model that jointly identifies all of the true links among the set of potential links.
The link prediction task can also be formulated as an instance of a missing value estimation task. Here, the graph is represented as an adjacency matrix with missing values, and the task is to complete the matrix by identifying the missing values. Matrix factorization based methods commonly use this formulation.
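As a minimal sketch of this formulation, a truncated SVD of the observed adjacency matrix can be used to score unobserved entries; the rank below is an arbitrary illustrative choice rather than part of any specific method:

```python
# Sketch: link prediction as adjacency-matrix completion via a low-rank SVD.
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)          # observed adjacency matrix (0 = absent/unobserved)

rank = 8                          # illustrative rank choice
U, s, Vt = np.linalg.svd(A)
A_hat = U[:, :rank] @ np.diag(s[:rank]) @ Vt[:rank, :]   # low-rank reconstruction

# Reconstructed values at zero entries of A act as scores for missing links:
# larger entries of A_hat suggest more likely links.
candidates = np.argwhere((A == 0) & ~np.eye(len(A), dtype=bool))
scores = A_hat[candidates[:, 0], candidates[:, 1]]
```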
The task of link prediction has attracted attention from several research communities, ranging from statistics and network science to machine learning and data mining. In statistics, generative random graph models such as stochastic block models propose an approach to generate links between nodes in a random graph. For social networks, Liben-Nowell and Kleinberg proposed link prediction models based on different graph proximity measures. Several statistical models have been proposed for link prediction by the machine learning and data mining community. For example, Popescul et al. proposed a structured logistic regression model that can make use of relational features. Local conditional probability models based on attribute and structural features were proposed by O'Madadhain et al. Several models based on directed graphical models for collective link prediction have been proposed by Getoor. Other approaches based on random walks and matrix factorization have also been proposed. With the advent of deep learning, several graph embedding based approaches for link prediction have also been proposed. For more information on link prediction, refer to the surveys by Getoor et al. and Yu et al.
Several link prediction approaches have been proposed, including unsupervised approaches such as similarity measures computed on the entity attributes, random walk and matrix factorization based approaches, and supervised approaches based on graphical models and deep learning. Link prediction approaches can be divided into two broad categories based on the type of the underlying network: (1) link prediction approaches for homogeneous networks and (2) link prediction approaches for heterogeneous networks. Based on the type of information used to predict links, approaches can be categorized as topology-based approaches, content-based approaches, and mixed methods.[1]
Topology-based methods broadly make the assumption that nodes with similar network structure are more likely to form a link.
The common neighbors measure is a common approach to link prediction that counts the number of neighbors two entities share. Entities with more neighbors in common are more likely to have a link. It is computed as follows:
$$CN(A,B) = |N(A) \cap N(B)|,$$

where $N(A)$ and $N(B)$ are the neighbor sets of $A$ and $B$.
A weakness of this approach is that it does not take into account the relative number of common neighbors.
The Jaccard Measure addresses the problem of Common Neighbors by computing the relative number of neighbors in common:
$$J(A,B) = \frac{|N(A) \cap N(B)|}{|N(A) \cup N(B)|}$$
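Both neighbor-overlap measures are straightforward to compute; the following is a small sketch assuming an undirected networkx graph (the node pair is arbitrary):

```python
# Sketch: common-neighbors and Jaccard scores for a candidate pair (u, v).
import networkx as nx

def common_neighbors_score(G, u, v):
    return len(set(G[u]) & set(G[v]))                 # |N(u) ∩ N(v)|

def jaccard_score(G, u, v):
    union = set(G[u]) | set(G[v])
    return len(set(G[u]) & set(G[v])) / len(union) if union else 0.0

G = nx.karate_club_graph()
print(common_neighbors_score(G, 0, 33), jaccard_score(G, 0, 33))
```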
The Adamic–Adar measure[2] is the sum, over the common neighbors of two nodes, of the inverse logarithm of each such neighbor's degree. This captures a two-hop similarity, which can yield better results than simple one-hop methods. It is computed as follows:
$$A(x,y) = \sum_{u \in N(x) \cap N(y)} \frac{1}{\log |N(u)|},$$

where $N(u)$ is the set of nodes adjacent to $u$.
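networkx provides this index directly; the snippet below is a usage sketch that also recomputes the value by hand to mirror the formula above (graph and node pair are arbitrary examples):

```python
# Sketch: Adamic–Adar index via networkx, plus the same sum computed by hand.
import math
import networkx as nx

G = nx.karate_club_graph()
u, v = 0, 33                       # arbitrary candidate pair

# Built-in: yields (u, v, score) triples for the requested pairs.
aa_score = next(nx.adamic_adar_index(G, [(u, v)]))[2]

# Manual: sum of 1 / log(degree) over the common neighbors of u and v.
manual = sum(1 / math.log(G.degree(w)) for w in nx.common_neighbors(G, u, v))
print(aa_score, manual)
```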
Neighbor-based methods can be effective when the number of neighbors is large, but this is not the case in sparse graphs. In these situations it is appropriate to use methods that account for longer walks. The Katz measure[3] is one metric that captures this. It is computed by counting the paths of each length $t$ between the two nodes and summing the counts, with longer paths weighted progressively less.
Let $A$ be the adjacency matrix of the network under consideration. Elements $(a_{ij})$ of $A$ are variables that take the value 1 if node $i$ is connected to node $j$ and 0 otherwise. The powers of $A$ indicate the presence (or absence) of links between two nodes through intermediaries. For instance, in the matrix $A^3$, if element $(a_{2,12}) = 1$, then nodes 2 and 12 are connected through some walk of length 3. If $C_{\mathrm{Katz}}(i)$ denotes the Katz centrality of node $i$, then:

$$C_{\mathrm{Katz}}(i) = \sum_{k=1}^{\infty} \sum_{j=1}^{n} \alpha^k (A^k)_{ji},$$

where $\alpha \in (0,1)$ is an attenuation factor. Note that the above definition uses the fact that the element at location $(i,j)$ of $A^k$ reflects the total number of $k$-step connections between nodes $i$ and $j$.
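A small numerical sketch of the Katz sum, using the closed form $\sum_{k \ge 1} \alpha^k A^k = (I - \alpha A)^{-1} - I$, which is valid when $\alpha$ is smaller than the reciprocal of the largest eigenvalue of $A$ (the graph and the choice of $\alpha$ are illustrative):

```python
# Sketch: Katz scores from the closed form (I - alpha*A)^{-1} - I, which equals
# the series sum_{k>=1} alpha^k A^k when alpha < 1 / lambda_max(A).
import numpy as np
import networkx as nx

G = nx.karate_club_graph()
A = nx.to_numpy_array(G)

lambda_max = max(abs(np.linalg.eigvals(A)))
alpha = 0.9 / lambda_max                      # attenuation factor chosen to converge

katz = np.linalg.inv(np.eye(len(A)) - alpha * A) - np.eye(len(A))

# katz[j, i] accumulates alpha^k (A^k)_{ji}; summing over j reproduces the
# Katz centrality C_Katz(i) from the formula above.
katz_centrality = katz.sum(axis=0)
```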
Node-similarity methods predict the existence of a link based on the similarity of the node attributes.
The attribute values are represented as normalized vectors, and the distance between the vectors is used to measure similarity. Small distances indicate higher similarity.
After normalizing the attribute values, computing the cosine between the two vectors is a good measure of similarity, with higher values indicating higher similarity.
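For example, once attribute vectors have been extracted for each node, the cosine measure can be computed directly; the vectors below are made up purely for illustration:

```python
# Sketch: cosine similarity between two nodes' normalized attribute vectors.
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical attribute vectors for two entities (values are made up).
print(cosine_similarity([0.2, 0.5, 0.1], [0.3, 0.4, 0.0]))
```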
Mixed methods combine attribute and topology based methods.
Graph embeddings also offer a convenient way to predict links. Graph embedding algorithms, such as Node2vec, learn an embedding space in which neighboring nodes are represented by nearby vectors, so that vector similarity measures, such as dot product similarity or Euclidean distance, reflect proximity in the graph. These similarities are functions of both topological features and attribute-based similarity. One can then use other machine learning techniques to predict edges on the basis of vector similarity.
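A minimal sketch of this last step, assuming node embeddings have already been learned by some embedding method (the dictionary and node names below are hypothetical):

```python
# Sketch: ranking candidate links by dot-product similarity of node embeddings.
import numpy as np

# Hypothetical precomputed embeddings (e.g. from Node2vec or a similar method).
embedding = {
    "alice": np.array([0.1, 0.8, -0.3]),
    "bob":   np.array([0.2, 0.7, -0.1]),
    "carol": np.array([-0.9, 0.1, 0.4]),
}

def link_score(u, v):
    # Higher dot product in the embedding space suggests a more likely link.
    return float(embedding[u] @ embedding[v])

candidates = [("alice", "bob"), ("alice", "carol")]
ranked = sorted(candidates, key=lambda pair: link_score(*pair), reverse=True)
```

Any downstream classifier can replace the raw dot product; the point is only that candidate links are ranked by similarity in the learned space.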
A probabilistic relational model (PRM) specifies a template for a probability distribution over a database. The template describes the relational schema for the domain and the probabilistic dependencies between attributes in the domain. A PRM, together with a particular database of entities and unobserved links, defines a probability distribution over the unobserved links.
Probabilistic soft logic (PSL) is a probabilistic graphical model over a hinge-loss Markov random field (HL-MRF). HL-MRFs are created by a set of templated first-order logic-like rules, which are then grounded over the data. PSL can combine attribute, or local, information with topological, or relational, information. While PSL can incorporate local predictors, such as cosine similarity, it also supports relational rules, such as triangle completion in a network.
A Markov logic network (MLN) is a probabilistic graphical model defined over Markov networks. These networks are defined by templated first-order logic-like rules, which are then grounded over the training data. MLNs are able to incorporate both local and relational rules for the purpose of link prediction.
R-Models (RMLs) are neural network models created to provide a deep learning approach to the link weight prediction problem. The model uses a node embedding technique that extracts node embeddings (knowledge of nodes) from the known links' weights (relations between nodes) and uses this knowledge to predict the unknown links' weights.
Link prediction has found varied uses, but any domain in which entities interact in a structured way can benefit from link prediction.[4] A common application of link prediction is improving similarity measures for collaborative filtering approaches to recommendation. Link prediction is also frequently used in social networks to suggest friends to users. It has also been used to predict criminal associations.
In biology, link prediction has been used to predict interactions between proteins in protein-protein interaction networks.[5] It has also been used to infer interactions between drugs and targets.[6] Another application is found in collaboration prediction in scientific co-authorship networks.
Entity resolution, also known as deduplication, commonly uses link prediction to predict whether two entities in a network are references to the same physical entity. Some authors have used context information in network structured domains to improve entity resolution.[7]