Query expansion explained
Query expansion (QE) is the process of reformulating a given query to improve retrieval performance in information retrieval operations, particularly in the context of query understanding.[1] In the context of search engines, query expansion involves evaluating a user's input (what words were typed into the search query area, and sometimes other types of data) and expanding the search query to match additional documents. Query expansion involves techniques such as:
Query expansion is a methodology studied in the field of computer science, particularly within the realm of natural language processing and information retrieval.
Precision and recall trade-offs
Search engines invoke query expansion to increase the quality of user search results. It is assumed that users do not always formulate search queries using the best terms. Best in this case may be because the database does not contain the user entered terms.
By stemming a user-entered term, more documents are matched, as the alternate word forms for a user entered term are matched as well, increasing the total recall. This comes at the expense of reducing the precision. By expanding a search query to search for the synonyms of a user entered term, the recall is also increased at the expense of precision. This is due to the nature of the equation of how precision is calculated, in that a larger recall implicitly causes a decrease in precision, given that factors of recall are part of the denominator. It is also inferred that a larger recall negatively impacts overall search result quality, given that many users do not want more results to comb through, regardless of the precision.
The goal of query expansion in this regard is by increasing recall, precision can potentially increase (rather than decrease as mathematically equated), by including in the result set pages which are more relevant (of higher quality), or at least equally relevant. Pages which would not be included in the result set, which have the potential to be more relevant to the user's desired query, are included, and without query expansion would not have, regardless of relevance. At the same time, many of the current commercial search engines use word frequency (tf-idf) to assist in ranking. By ranking the occurrences of both the user entered words and synonyms and alternate morphological forms, documents with a higher density (high frequency and close proximity) tend to migrate higher up in the search results, leading to a higher quality of the search results near the top of the results, despite the larger recall.
Query expansion methods
Automatic methods for query expansion were proposed in 1960 by Maron and Kuhns.[2] Modern query expansion methods either imply document collection analysis (global or local) [3] or are dictionary- or ontology-based.[4] The global analysis of the document collection is applied for searching for relations between terms. The local analysis refers to the relevance feedback introduced by Rocchio.[5] Rocchio proposed to judge manually some of the retrieved documents and use this feedback information to expand the query. Since collecting users' judgment can be challenging, only the first top retrieved documents are considered as relevant. This is so called pseudo-relevance feedback (PRF).[6] Pseudo-relevance feedback is efficient in average but can damage results for some queries,[7] especially difficult ones since the top retrieved documents are probably non-relevant. Pseudo-relevant documents are used to find expansion candidate terms that co-occur with many query terms.[8] This idea was further developed within the relevance language model formalism in positional relevance [9] and proximity relevance models [10] which consider the distance to query terms in the pseudo-relevant documents. Another direction in query expansion is the representation of index and query terms in a vector space which can be used to find related terms at query time, using semantic vectors or word embeddings. [11] [12]
More generally, query expansion, with its counterpart document expansion, are today implemented in the form of vector databases, using various encoding schemes based on deep learning. [13]
See also
Software libraries
- QueryTermAnalyzer open-source, C#. Machine learning based query term weight and synonym analyzer for query expansion.
- LucQE - open-source, Java. Provides a framework along with several implementations that allow to perform query expansion with the use of Apache Lucene.
- Xapian is an open-source search library which includes support for query expansion
- ReQue open-source, Python. A configurable software framework and a collection of gold standard datasets for training and evaluating supervised query expansion methods.[14] [15]
References
Sources
- D. Abberley, D. Kirby, S. Renals, and T. Robinson, The THISL broadcast news retrieval system. In Proc. ESCA ETRW Workshop Accessing Information in Spoken Audio, (Cambridge), pp. 14–19, 1999. Section on Query Expansion - Concise, mathematical overview.
- R. Navigli, P. Velardi. An Analysis of Ontology-based Query Expansion Strategies. Proc. of Workshop on Adaptive Text Extraction and Mining (ATEM 2003), in the 14th European Conference on Machine Learning (ECML 2003), Cavtat-Dubrovnik, Croatia, September 22-26th, 2003, pp. 42–49 - An analysis of query expansion methods relying on WordNet as the reference ontology.
- Y. Qiu and H.P. Frei. Concept Based Query Expansion. In Proceedings of SIGIR-93, 16th ACM International Conference on Research and Development in Information Retrieval, Pittsburgh, SIGIR Forum, ACM Press, June 1993 - Academic document on a specific method of query expansion
- Efthimis N. Efthimiadis. Query Expansion. In: Martha E. Williams (ed.), Annual Review of Information Systems and Technology (ARIST), v31, pp 121–187, 1996 - An introduction for less-technical viewers.
Notes and References
- Vectomova . Olga . Wang, Ying . 2006 . A study of the effect of term proximity on query expansion . . 32 . 4 . 324 - 333 . 10.1177/0165551506065787 . 10.1.1.552.5987 . 7265523 .
- Maron, M. E. and Kuhns, J. L. 1960. On Relevance, Probabilistic Indexing and Information Retrieval. Journal of the ACM 7, 3, 216–244.
- C. Carpineto and G. Romano. A survey of automatic query expansion in information retrieval. ACM Computing Surveys, 44(1):1-50, Jan. 2012.
- J. Bhogal, A. Macfarlane, and P. Smith. A review of ontology based query expansion. Inf. Process. Manage., 43(4):866-886, July 2007.
- J. Rocchio. Relevance feedback in information retrieval. In The SMART Retrieval System, p. 313-323. 1971.
- C. Buckley. Automatic query expansion using SMART: TREC 3. In Proceedings of The third Text REtrieval Conference (TREC-3). NIST Special Publication, p. 69-80. National Institute of Standards and Technology, 1995.
- G. Amati, C. Carpineto, and G. Romano. Query difficulty, robustness, and selective application of query expansion. Advances in Information Retrieval, p. 127-137, 2004.
- J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval, pages 4-11. ACM, 1996.
- Y. Lv and C. Zhai. Positional relevance model for pseudo-relevance feedback. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval, page 579-586.ACM, 2010.
- L. Ermakova, J. Mothe, and E. Nikitina. 2016. Proximity relevance model for query expansion. In Proceedings of the 31st Annual ACM Symposium on Applied Computing (SAC '16). ACM, New York, NY, USA, 1054-1059. DOI: https://doi.org/10.1145/2851613.2851696
- Sahlgren, Magnus, Jussi Karlgren, Rickard Cöster, and Timo Järvinen. 2003. "Automatic query expansion using random indexing." In Advances in Cross-Language Information Retrieval: Third Workshop of the Cross-Language Evaluation Forum (CLEF). Springer.
- S. Kuzi, A. Shtok, and O. Kurland. 2016. Query Expansion Using Word Embeddings. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 1929-1932. DOI: https://doi.org/10.1145/2983323.2983876
- Lin. Jimmy. Nogueira. Rodrigo. Yates. Andrew. 2020-10-13. Pretrained Transformers for Text Ranking: BERT and Beyond. cs.IR. 2010.06467.
- Mahtab Tamannaee, Hossein Fani, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri:ReQue: A Configurable Workflow and Dataset Collection for Query Refinement. CIKM 2020: 3165-3172
- Hossein Fani, Mahtab Tamannaee, Fattane Zarrinkalam, Jamil Samouh, Samad Paydar, Ebrahim Bagheri; An Extensible Toolkit of Query Refinement Methods and Gold Standard Dataset Generation. In Advances in Information Retrieval: 43rd European Conference on IR Research (ECIR'21), 2021.