Okapi BM25 explained

In information retrieval, Okapi BM25 (BM is an abbreviation of best matching) is a ranking function used by search engines to estimate the relevance of documents to a given search query. It is based on the probabilistic retrieval framework developed in the 1970s and 1980s by Stephen E. Robertson, Karen Spärck Jones, and others.

The name of the actual ranking function is BM25. The fuller name, Okapi BM25, includes the name of the first system to use it, which was the Okapi information retrieval system, implemented at London's City University[1] in the 1980s and 1990s. BM25 and its newer variants, e.g. BM25F (a version of BM25 that can take document structure and anchor text into account), represent TF-IDF-like retrieval functions used in document retrieval.

The ranking function

BM25 is a bag-of-words retrieval function that ranks a set of documents based on the query terms appearing in each document, regardless of their proximity within the document. It is a family of scoring functions with slightly different components and parameters. One of the most prominent instantiations of the function is as follows.

Given a query, containing keywords

q1,...,qn

, the BM25 score of a document is:

score(D,Q)=

n
\sum
i=1

IDF(qi)

f(qi,D)(k1+1)
f(qD)+k1\left(1-b+b
|D|
avgdl
\right)
i,

where

f(qi,D)

is the number of times that the keyword

qi

occurs in the document,

|D|

is the length of the document in words, and is the average document length in the text collection from which documents are drawn.

k1

and are free parameters, usually chosen, in absence of an advanced optimization, as

k1\in[1.2,2.0]

and

b=0.75

.[2]

IDF(qi)

is the IDF (inverse document frequency) weight of the query term

qi

. It is usually computed as:

IDF(qi)=ln\left(

N-n(qi)+0.5
n(qi)+0.5

+1\right)

where is the total number of documents in the collection, and

n(qi)

is the number of documents containing

qi

.

There are several interpretations for IDF and slight variations on its formula. In the original BM25 derivation, the IDF component is derived from the Binary Independence Model.

IDF information theoretic interpretation

Here is an interpretation from information theory. Suppose a query term

q

appears in

n(q)

documents. Then a randomly picked document

D

will contain the term with probability
n(q)
N
(where

N

is again the cardinality of the set of documents in the collection). Therefore, the information content of the message "

D

contains

q

" is:

-log

n(q)
N

=log

N
n(q)

.

Now suppose we have two query terms

q1

and

q2

. If the two terms occur in documents entirely independently of each other, then the probability of seeing both

q1

and

q2

in a randomly picked document

D

is:
n(q1)
N

n(q2)
N

,

and the information content of such an event is:

2
\sum
i=1

log

N
n(qi)

.

With a small variation, this is exactly what is expressed by the IDF component of BM25.

Modifications

b=1

) and BM15 (for

b=0

).[3]

\delta

(a default value is in absence of a training data) as compared with BM25:

score(D,Q)=

n
\sum
i=1

IDF(qi)\left[

f(qi,D)(k1+1)
f(qD)+k1\left(1-b+b
|D|
avgdl
\right)
i,

+\delta\right]

General references

External links

Notes and References

  1. Web site: OKAPI . 2023-10-16 . smcse.city.ac.uk.
  2. Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze. An Introduction to Information Retrieval, Cambridge University Press, 2009, p. 233.
  3. Web site: The BM25 Weighting Scheme.
  4. Hugo Zaragoza, Nick Craswell, Michael Taylor, Suchi Saria, and Stephen Robertson. Microsoft Cambridge at TREC-13: Web and HARD tracks. In Proceedings of TREC-2004.
  5. Stephen Robertson . Hugo Zaragoza . amp . The Probabilistic Relevance Framework: BM25 and Beyond . Foundations and Trends in Information Retrieval . 2009 . 3 . 4 . 333–389 . 10.1561/1500000019 . 10.1.1.156.5282 . 207178704 .
  6. Book: Robertson . Stephen . Zaragoza . Hugo . Taylor . Michael . Proceedings of the thirteenth ACM international conference on Information and knowledge management . Simple BM25 extension to multiple weighted fields . 2004-11-13 . https://doi.org/10.1145/1031171.1031181 . CIKM '04 . New York, NY, USA . Association for Computing Machinery . 42–49 . 10.1145/1031171.1031181 . 978-1-58113-874-0. 16628332 .
  7. Yuanhua Lv and ChengXiang Zhai. Lower-bounding term frequency normalization. In Proceedings of CIKM'2011, pages 7-16.