Hyperlink-Induced Topic Search (HITS; also known as hubs and authorities) is a link analysis algorithm that rates Web pages, developed by Jon Kleinberg. The idea behind hubs and authorities stemmed from a particular insight into the creation of web pages when the Internet was originally forming: certain web pages, known as hubs, served as large directories that were not themselves authoritative in the information they held, but were compilations of a broad catalog of information that led users directly to other, authoritative pages. In other words, a good hub is a page that points to many other pages, while a good authority is a page that is linked to by many different hubs.[1]
The scheme therefore assigns two scores for each page: its authority, which estimates the value of the content of the page, and its hub value, which estimates the value of its links to other pages.
Many methods have been used to rank the importance of scientific journals. One such method is Garfield's impact factor. Journals such as Science and Nature are cited very frequently and therefore have very high impact factors. Thus, when comparing two more obscure journals that have received roughly the same number of citations, but one of them has received many citations from Science and Nature, that journal should be ranked higher. In other words, it is better to receive citations from an important journal than from an unimportant one.[2]
This phenomenon also occurs on the Internet. Counting the number of links to a page gives a general estimate of its prominence on the Web, but a page with very few incoming links may still be prominent if, for example, two of those links come from the home pages of sites like Yahoo!, Google, or MSN. Because these sites are of very high importance but are also search engines, a page can be ranked much higher than its actual relevance would warrant.
In the HITS algorithm, the first step is to retrieve the pages most relevant to the search query. This set is called the root set and can be obtained by taking the top pages returned by a text-based search algorithm. A base set is generated by augmenting the root set with all the web pages that are linked from it and some of the pages that link to it. The web pages in the base set and all hyperlinks among those pages form a focused subgraph. The HITS computation is performed only on this focused subgraph. According to Kleinberg, the reason for constructing a base set is to ensure that most (or many) of the strongest authorities are included.
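As a brief illustration of this step, the Python sketch below assembles a base set from a given root set, assuming the link structure is available as an adjacency list; the helper name, the toy graph, and the cap on incoming links are illustrative assumptions, not part of Kleinberg's description.

def build_base_set(root_set, out_links, max_in_links=50):
    # out_links maps each page to the pages it links to.
    # Invert it once so we can look up which pages link to a given page.
    in_links = {}
    for page, targets in out_links.items():
        for target in targets:
            in_links.setdefault(target, []).append(page)

    base_set = set(root_set)
    for page in root_set:
        base_set.update(out_links.get(page, []))                # all pages it links to
        base_set.update(in_links.get(page, [])[:max_in_links])  # some pages that link to it
    return base_set

# Toy example: a root set of two pages grows to include linked and linking pages.
graph = {"a": ["c"], "b": ["c", "d"], "e": ["a"], "f": ["b"]}
print(sorted(build_base_set({"a", "b"}, graph)))  # ['a', 'b', 'c', 'd', 'e', 'f']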
Authority and hub values are defined in terms of one another in a mutual recursion. An authority value is computed as the sum of the scaled hub values that point to that page. A hub value is the sum of the scaled authority values of the pages it points to. Some implementations also consider the relevance of the linked pages.
The algorithm performs a series of iterations, each consisting of two basic steps: an authority update, in which each node's authority score is set to the sum of the hub scores of the nodes that point to it, and a hub update, in which each node's hub score is set to the sum of the authority scores of the nodes that it points to.
The hub score and authority score for a node are calculated with the following algorithm: start with every node having a hub score and an authority score of 1, apply the authority update rule, apply the hub update rule, normalize the values, and repeat from the second step as many times as necessary.
HITS, like Page and Brin's PageRank, is an iterative algorithm based on the linkage of the documents on the web. However, it does have some major differences: it is query-dependent, being computed at query time on the focused subgraph built from the search results rather than precomputed over the entire Web, and it assigns each document two scores, a hub score and an authority score, rather than a single score.
To begin the ranking, we let $\mathrm{auth}(p) = 1$ and $\mathrm{hub}(p) = 1$ for each page $p$, and then repeatedly apply the two update rules.

Authority update rule: for each page $p$, update $\mathrm{auth}(p)$ to

$$\mathrm{auth}(p) = \sum_{q \in P_{\mathrm{to}}} \mathrm{hub}(q),$$

where $P_{\mathrm{to}}$ is the set of all pages that link to page $p$.

Hub update rule: for each page $p$, update $\mathrm{hub}(p)$ to

$$\mathrm{hub}(p) = \sum_{q \in P_{\mathrm{from}}} \mathrm{auth}(q),$$

where $P_{\mathrm{from}}$ is the set of all pages that page $p$ links to.
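To make the two rules concrete, the short Python sketch below applies a single authority update followed by a single hub update to a hypothetical three-page graph; the adjacency-list representation and the page names are illustrative assumptions, not part of the original formulation.

# links maps each page to the pages it points to (toy example data).
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

# Every score starts at 1, as in the initialization above.
auth = {p: 1.0 for p in links}
hub = {p: 1.0 for p in links}

# Authority update rule: auth(p) is the sum of hub(q) over all q linking to p.
auth = {p: sum(hub[q] for q in links if p in links[q]) for p in links}

# Hub update rule: hub(p) is the sum of auth(q) over all q that p links to.
hub = {p: sum(auth[q] for q in links[p]) for p in links}

print(auth)  # {'a': 1.0, 'b': 1.0, 'c': 2.0}
print(hub)   # {'a': 3.0, 'b': 2.0, 'c': 1.0}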
In the limit, the final hub and authority scores of the nodes are determined by repeating the algorithm indefinitely. However, directly and iteratively applying the Hub Update Rule and Authority Update Rule leads to diverging values, so the scores must be normalized after every iteration; with this normalization the values obtained from the process eventually converge.
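One way to see why normalization yields convergence is to write the updates in matrix form. If $A$ is the adjacency matrix of the focused subgraph, with $A_{ij} = 1$ when page $i$ links to page $j$, the normalized rules read

$$\vec{a} \leftarrow \frac{A^{\mathsf{T}} \vec{h}}{\lVert A^{\mathsf{T}} \vec{h} \rVert}, \qquad \vec{h} \leftarrow \frac{A \vec{a}}{\lVert A \vec{a} \rVert},$$

so the iteration is a power iteration: the authority vector $\vec{a}$ converges to the principal eigenvector of $A^{\mathsf{T}} A$, and the hub vector $\vec{h}$ converges to the principal eigenvector of $A A^{\mathsf{T}}$.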
G := set of pages
for each page p in G do
    p.auth = 1 // p.auth is the authority score of the page p
    p.hub = 1 // p.hub is the hub score of the page p
for step from 1 to k do // run the algorithm for k steps
    norm = 0
    for each page p in G do // update all authority values first
        p.auth = 0
        for each page q in p.incomingNeighbors do // p.incomingNeighbors is the set of pages that link to p
            p.auth += q.hub
        norm += square(p.auth) // calculate the sum of the squared auth values to normalise
    norm = sqrt(norm)
    for each page p in G do // update the auth scores
        p.auth = p.auth / norm // normalise the auth values
    norm = 0
    for each page p in G do // then update all hub values
        p.hub = 0
        for each page r in p.outgoingNeighbors do // p.outgoingNeighbors is the set of pages that p links to
            p.hub += r.auth
        norm += square(p.hub) // calculate the sum of the squared hub values to normalise
    norm = sqrt(norm)
    for each page p in G do // normalise the hub scores
        p.hub = p.hub / norm // normalise the hub values
The hub and authority values converge in the pseudocode above.
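As a concrete illustration, here is one possible Python translation of the normalized pseudocode above; the dictionary-based graph representation and the default number of steps are assumptions made for this sketch rather than part of the algorithm itself.

from math import sqrt

def hits_scores(out_links, k=20):
    # out_links maps each page to the list of pages it links to
    # (an assumed representation; the pseudocode above uses page objects).
    pages = set(out_links) | {q for targets in out_links.values() for q in targets}
    auth = {p: 1.0 for p in pages}
    hub = {p: 1.0 for p in pages}

    # Precompute incoming neighbours once, mirroring p.incomingNeighbors.
    in_links = {p: [] for p in pages}
    for p, targets in out_links.items():
        for q in targets:
            in_links[q].append(p)

    for _ in range(k):  # run the algorithm for k steps
        # Authority update, then normalise by the Euclidean norm.
        auth = {p: sum(hub[q] for q in in_links[p]) for p in pages}
        norm = sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}

        # Hub update, then normalise by the Euclidean norm.
        hub = {p: sum(auth[q] for q in out_links.get(p, [])) for p in pages}
        norm = sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}

    return auth, hub

# Example on a toy graph: page "a" links to "b" and "c", and so on.
auth, hub = hits_scores({"a": ["b", "c"], "b": ["c"], "c": ["a"]})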
The code below, by contrast, does not converge: without normalization the hub and authority values grow without bound, so the algorithm must instead be run for a fixed number of steps k. One way to obtain convergent scores is to normalize the hub and authority values after each step, dividing each authority value by the square root of the sum of the squares of all authority values, and each hub value by the square root of the sum of the squares of all hub values. This is what the pseudocode above does.
G := set of pages
for each page p in G do
    p.auth = 1 // p.auth is the authority score of the page p
    p.hub = 1 // p.hub is the hub score of the page p

function HubsAndAuthorities(G)
    for step from 1 to k do // run the algorithm for k steps
        for each page p in G do // update all authority values first
            p.auth = 0
            for each page q in p.incomingNeighbors do // p.incomingNeighbors is the set of pages that link to p
                p.auth += q.hub
        for each page p in G do // then update all hub values
            p.hub = 0
            for each page r in p.outgoingNeighbors do // p.outgoingNeighbors is the set of pages that p links to
                p.hub += r.auth
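In practice, ready-made implementations are also available; for example, the sketch below uses the NetworkX Python library on a toy graph (the graph itself is only an example).

import networkx as nx

# Toy directed graph; an edge (u, v) means "page u links to page v".
G = nx.DiGraph([("a", "b"), ("a", "c"), ("b", "c"), ("c", "a")])

# nx.hits runs the normalized iteration and returns two dictionaries,
# hub scores and authority scores, keyed by node.
hubs, authorities = nx.hits(G)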