Information extraction explained

Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents and other electronically represented sources. Typically, this involves processing human language texts by means of natural language processing (NLP).[1] Recent activities in multimedia document processing, such as automatic annotation and content extraction from images, audio, video and documents, can also be seen as information extraction.

Recent advances in NLP techniques have allowed for significantly improved performance compared to previous years.[2] An example is the extraction of corporate mergers from newswire reports, as denoted by the formal relation:

MergerBetween(company1,company2,date)

extracted from an online news sentence such as:

"Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."

A broad goal of IE is to allow computation to be done on the previously unstructured data. A more specific goal is to allow automated reasoning about the logical form of the input data. Structured data is semantically well-defined data from a chosen target domain, interpreted with respect to category and context.
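The merger example can be made concrete with a small rule-based sketch. This is only an illustration under stated assumptions, not how any particular IE system works: the company pattern, the trigger phrase, the function name extract_merger and the reference_date parameter are all hypothetical, and only the relative expression "yesterday" is resolved to a date.

```python
import datetime as dt
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class MergerBetween:
    """Structured form of the relation MergerBetween(company1, company2, date)."""
    company1: str
    company2: str
    date: Optional[dt.date]


# Assumed pattern: one or more capitalised words followed by a corporate suffix.
COMPANY = r"(?:[A-Z][\w&]*\s+)+(?:Inc|Corp|Ltd|LLC)\.?"

# Assumed trigger phrase: "<acquirer> announced their/its acquisition of <target>".
ACQUISITION = re.compile(
    rf"(?P<acquirer>{COMPANY})\s+announced\s+(?:their|its)\s+acquisition\s+of\s+"
    rf"(?P<target>{COMPANY})"
)


def extract_merger(sentence: str, reference_date: dt.date) -> Optional[MergerBetween]:
    """Return a MergerBetween record if the sentence matches the rule, else None."""
    match = ACQUISITION.search(sentence)
    if match is None:
        return None
    # Only "yesterday" is resolved; any other date expression is left unknown.
    when = None
    if re.search(r"\byesterday\b", sentence, flags=re.IGNORECASE):
        when = reference_date - dt.timedelta(days=1)
    return MergerBetween(match.group("acquirer"), match.group("target"), when)


sentence = "Yesterday, New York based Foo Inc. announced their acquisition of Bar Corp."
record = extract_merger(sentence, reference_date=dt.date(2024, 5, 2))
print(record)
```

Once the sentence has been reduced to a record of this kind, ordinary computation (filtering, joining, counting) can be applied to it, which is the broad goal described above.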

Information extraction is part of a greater puzzle: devising automatic methods for text management beyond transmission, storage and display. The discipline of information retrieval (IR)[3] has developed automatic methods, typically of a statistical flavor, for indexing large document collections and classifying documents. A complementary approach is natural language processing (NLP), which has made considerable progress in modelling human language processing, given the magnitude of the task. In terms of both difficulty and emphasis, IE deals with tasks that lie between IR and NLP.

In terms of input, IE assumes a set of documents in which each document follows a template, i.e. describes one or more entities or events in a manner similar to other documents but differing in the details. For example, consider a group of newswire articles on Latin American terrorism, with each article presumed to be based upon one or more terrorist acts. For any given IE task we also define a template, which is a case frame (or a set of case frames) that holds the information contained in a single document. For the terrorism example, a template would have slots corresponding to the perpetrator, victim, and weapon of the terrorist act, and the date on which the event happened. An IE system for this problem is required to "understand" an attack article only enough to find data corresponding to the slots in this template.
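A template with slots can be sketched as a small data structure plus one rule per slot. The example below is a minimal illustration, assuming a made-up article and hypothetical regular-expression rules (SLOT_PATTERNS, fill_template); real systems would use far more robust linguistic analysis.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class AttackTemplate:
    """Case frame with the four slots named in the text; empty slots stay None."""
    perpetrator: Optional[str] = None
    victim: Optional[str] = None
    weapon: Optional[str] = None
    date: Optional[str] = None


# Hypothetical slot rules, tuned only to the made-up article below.
SLOT_PATTERNS = {
    "date": re.compile(r"\bOn (\d{1,2} \w+ \d{4})"),
    "perpetrator": re.compile(r", (.+?) attacked\b"),
    "victim": re.compile(r"\battacked (.+?) with\b"),
    "weapon": re.compile(r"\bwith (.+?)\."),
}


def fill_template(article: str) -> AttackTemplate:
    """Fill each slot from the first matching span; text outside the rules is ignored."""
    template = AttackTemplate()
    for slot, pattern in SLOT_PATTERNS.items():
        match = pattern.search(article)
        if match:
            setattr(template, slot, match.group(1))
    return template


# Made-up article text, for illustration only.
article = (
    "On 12 March 1989, unidentified gunmen attacked a police station "
    "with automatic rifles. No group claimed responsibility."
)
filled = fill_template(article)
print(filled)
```

The second sentence of the article is never interpreted, which mirrors the point that the system needs to "understand" the text only enough to fill the slots.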

History

Information extraction dates back to the late 1970s, in the early days of NLP.[4] An early commercial system from the mid-1980s was JASPER, built for Reuters by the Carnegie Group Inc with the aim of providing real-time financial news to financial traders.[5]

Beginning in 1987, IE was spurred by a series of Message Understanding Conferences (MUC). MUC was a competition-based conference,[6] each round of which focused on a specific domain.

Considerable support came from the U.S. Defense Advanced Research Projects Agency (DARPA), which wished to automate mundane tasks performed by government analysts, such as scanning newspapers for possible links to terrorism.

Present significance

The present significance of IE pertains to the growing amount of information available in unstructured form. Tim Berners-Lee, inventor of the World Wide Web, refers to the existing Internet as the web of documents[7] and advocates that more of the content be made available as a web of data.[8] Until this transpires, the web largely consists of unstructured documents lacking semantic metadata. Knowledge contained within these documents can be made more accessible for machine processing by transforming it into relational form, or by marking it up with XML tags. An intelligent agent monitoring a news data feed requires IE to transform unstructured data into something that can be reasoned with. A typical application of IE is to scan a set of documents written in a natural language and populate a database with the information extracted.[9]
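Populating a database with extracted information can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table name merger_between, the helper functions and the sample records are assumptions made up for the example, and the records stand in for the output of some extraction step.

```python
import sqlite3
from typing import Iterable, List, Tuple

# One extracted fact: (company1, company2, ISO-formatted date).
Merger = Tuple[str, str, str]


def store_mergers(conn: sqlite3.Connection, mergers: Iterable[Merger]) -> None:
    """Create the relation if needed and insert the extracted facts."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS merger_between "
        "(company1 TEXT, company2 TEXT, date TEXT)"
    )
    conn.executemany("INSERT INTO merger_between VALUES (?, ?, ?)", mergers)
    conn.commit()


def acquired_by(conn: sqlite3.Connection, company: str) -> List[str]:
    """Query the relational form: which companies did `company` acquire?"""
    rows = conn.execute(
        "SELECT company2 FROM merger_between WHERE company1 = ? ORDER BY date",
        (company,),
    )
    return [row[0] for row in rows]


# Hypothetical output of an extraction step, stored in an in-memory database.
conn = sqlite3.connect(":memory:")
store_mergers(
    conn,
    [
        ("Foo Inc.", "Bar Corp.", "2024-05-01"),
        ("Foo Inc.", "Baz Ltd.", "2024-06-15"),
    ],
)
print(acquired_by(conn, "Foo Inc."))
```

Once the facts are in relational form, a question such as "which companies did Foo Inc. acquire?" becomes an ordinary query rather than a text-understanding problem.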

Tasks and subtasks

Applying information extraction to text is linked to the problem of text simplification in order to create a structured view of the information present in free text. The overall goal is to produce text that is easier for machines to read and process. Typical IE tasks and subtasks include:

finding the relevant terms for a given corpus

Note that this list is not exhaustive, that the exact meaning of IE activities is not commonly accepted, and that many approaches combine multiple subtasks of IE in order to achieve a wider goal. Machine learning, statistical analysis and/or natural language processing are often used in IE.

IE on non-text documents is becoming an increasingly interesting research topic, and information extracted from multimedia documents can now be expressed in a high-level structure, as is done for text. This naturally leads to the fusion of extracted information from multiple kinds of documents and sources.

World Wide Web applications

IE has been the focus of the MUC conferences. The proliferation of the Web, however, intensified the need for developing IE systems that help people to cope with the enormous amount of data that are available online. Systems that perform IE from online text should meet the requirements of low cost, flexibility in development and easy adaptation to new domains. MUC systems fail to meet those criteria. Moreover, linguistic analysis performed for unstructured text does not exploit the HTML/XML tags and the layout formats that are available in online texts. As a result, less linguistically intensive approaches have been developed for IE on the Web using wrappers, which are sets of highly accurate rules that extract a particular page's content. Manually developing wrappers has proved to be a time-consuming task, requiring a high level of expertise. Machine learning techniques, either supervised or unsupervised, have been used to induce such rules automatically.

Wrappers typically handle highly structured collections of web pages, such as product catalogs and telephone directories. They fail, however, when the text type is less structured, which is also common on the Web. Recent work on adaptive information extraction motivates the development of IE systems that can handle different types of text, from well-structured to almost free text (where common wrappers fail), including mixed types. Such systems can exploit shallow natural language knowledge and thus can also be applied to less structured texts.
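A hand-written wrapper for a highly structured page can be sketched with Python's standard html.parser module. The page layout, the class names product, name and price, and the CatalogWrapper class are all hypothetical; the point is only that the rules rely on markup rather than on linguistic analysis, and that they break as soon as the page structure changes.

```python
from html.parser import HTMLParser
from typing import Dict, List, Optional


class CatalogWrapper(HTMLParser):
    """Rules for one assumed layout: each <li class="product"> starts a record,
    and <span class="name"> / <span class="price"> hold its fields."""

    def __init__(self) -> None:
        super().__init__()
        self.records: List[Dict[str, str]] = []
        self._field: Optional[str] = None

    def handle_starttag(self, tag, attrs):
        css_class = dict(attrs).get("class")
        if tag == "li" and css_class == "product":
            self.records.append({})
        elif tag == "span" and css_class in ("name", "price") and self.records:
            self._field = css_class

    def handle_data(self, data):
        # Text is kept only while inside one of the recognised field spans.
        if self._field is not None:
            self.records[-1][self._field] = data.strip()

    def handle_endtag(self, tag):
        if tag == "span":
            self._field = None


def extract_products(html: str) -> List[Dict[str, str]]:
    wrapper = CatalogWrapper()
    wrapper.feed(html)
    wrapper.close()
    return wrapper.records


# Made-up catalog page.
page = """
<ul>
  <li class="product"><span class="name">Widget</span> <span class="price">9.99</span></li>
  <li class="product"><span class="name">Gadget</span> <span class="price">24.50</span></li>
</ul>
"""
products = extract_products(page)
print(products)
```

Applied to free text without this markup, the same wrapper returns nothing, which illustrates why wrappers fail on less structured text.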

A recent development is Visual Information Extraction,[16][17] which relies on rendering a webpage in a browser and creating rules based on the proximity of regions in the rendered web page. This helps in extracting entities from complex web pages that may exhibit a visual pattern but lack a discernible pattern in the HTML source code.
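A proximity rule over rendered regions can be sketched as follows. This is an assumption-laden illustration, not the method of the cited systems: the Region type, the coordinates and the max_gap threshold are made up, and the bounding boxes are assumed to have been obtained already from a browser's layout engine.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Region:
    """Bounding box of a rendered text region, in pixels (assumed input)."""
    text: str
    x: float       # left edge
    y: float       # top edge
    width: float
    height: float


def value_right_of(label: str, regions: List[Region], max_gap: float = 50.0) -> Optional[str]:
    """Rule: the value is the nearest region to the right of the label
    that vertically overlaps it, within max_gap pixels."""
    anchor = next((r for r in regions if r.text == label), None)
    if anchor is None:
        return None
    candidates = []
    for region in regions:
        if region is anchor:
            continue
        gap = region.x - (anchor.x + anchor.width)
        same_row = (
            region.y < anchor.y + anchor.height
            and anchor.y < region.y + region.height
        )
        if same_row and 0 <= gap <= max_gap:
            candidates.append((gap, region))
    if not candidates:
        return None
    return min(candidates, key=lambda pair: pair[0])[1].text


# Made-up layout: two label/value pairs on separate visual rows.
layout = [
    Region("Price:", 10, 100, 40, 12),
    Region("$9.99", 60, 100, 35, 12),
    Region("Weight:", 10, 120, 45, 12),
    Region("2 kg", 60, 121, 30, 12),
]
print(value_right_of("Price:", layout), value_right_of("Weight:", layout))
```

The rule depends only on where the regions end up on screen, so it is unaffected by how the label and value are nested in the HTML source.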

Approaches

The following standard approaches are now widely accepted:

Numerous other approaches exist for IE including hybrid approaches that combine some of the standard approaches previously listed.

References

  1. Kariampuzha, William; Alyea, Gioconda; Qu, Sue; Sanjak, Jaleal; Mathé, Ewy; Sid, Eric; Chatelaine, Haley; Yadaw, Arjun; Xu, Yanji; Zhu, Qian (2023). Precision information extraction for rare disease epidemiology at scale. Journal of Translational Medicine. 21 (1): 157. doi:10.1186/s12967-023-04011-y. PMID 36855134. PMC 9972634.
  2. Christina Niklaus, Matthias Cetto, André Freitas, and Siegfried Handschuh. 2018. A Survey on Open Information Extraction. In Proceedings of the 27th International Conference on Computational Linguistics, pages 3866–3878, Santa Fe, New Mexico, USA. Association for Computational Linguistics.
  3. Freitag, Dayne (2000). Machine Learning for Information Extraction in Informal Domains. Kluwer Academic Publishers. Printed in the Netherlands.
  4. Cowie, Jim; Wilks, Yorick (1996). Information Extraction. p. 3. CiteSeerX 10.1.1.61.6480. S2CID 10237124. Archived from the original on 2019-02-20: https://web.archive.org/web/20190220184608/http://pdfs.semanticscholar.org/2c90/fa59c6d9beed8dcb0e844725b872d3f33a35.pdf
  5. Andersen, Peggy M.; Hayes, Philip J.; Huettner, Alison K.; Schmandt, Linda M.; Nirenburg, Irene B.; Weinstein, Steven P. (1992). Automatic Extraction of Facts from Press Releases to Generate News Stories. Proceedings of the third conference on Applied natural language processing. pp. 170–177. doi:10.3115/974499.974531. CiteSeerX 10.1.1.14.7943. S2CID 14746386. https://www.aclweb.org/anthology/A92-1024
  6. Marco Costantino, Paolo Coletti, Information Extraction in Finance, Wit Press, 2008.
  7. Linked Data - The Story So Far.
  8. Tim Berners-Lee on the next Web. Retrieved 2010-03-27. Archived from the original on 2011-04-10: https://web.archive.org/web/20110410204952/http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html
  9. R. K. Srihari
  10. Dat Quoc Nguyen and Karin Verspoor (2019). End-to-end neural relation extraction using deep biaffine attention. Proceedings of the 41st European Conference on Information Retrieval (ECIR). doi:10.1007/978-3-030-15712-8_47. arXiv:1812.11275.
  11. Milosevic N, Gregson C, Hernandez R, Nenadic G (February 2019). A framework for information extraction from tables in biomedical literature. International Journal on Document Analysis and Recognition. 22 (1): 55–78. doi:10.1007/s10032-019-00317-0. arXiv:1902.10031. Bibcode:2019arXiv190210031M. S2CID 62880746.
  12. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PhD thesis). University of Manchester.
  13. Milosevic N, Gregson C, Hernandez R, Nenadic G (June 2016). Disentangling the Structure of Tables in Scientific Literature. Natural Language Processing and Information Systems. Lecture Notes in Computer Science. Vol. 21. pp. 162–174. doi:10.1007/978-3-319-41754-7_14. ISBN 978-3-319-41753-0. S2CID 19538141. https://pure.manchester.ac.uk/ws/files/41051279/Disentangling_the_Structure_of_Tables_in_Scientific_Literature.pdf
  14. Milosevic, Nikola (2018). A multi-layered approach to information extraction from tables in biomedical documents (PhD thesis). University of Manchester.
  15. A.Zils, F.Pachet, O.Delerue and F. Gouyon, Automatic Extraction of Drum Tracks from Polyphonic Music Signals, Proceedings of WedelMusic, Darmstadt, Germany, 2002.
  16. Chenthamarakshan, Vijil; Desphande, Prasad M.; Krishnapuram, Raghu; Varadarajan, Ramakrishnan; Stolze, Knut (2015). WYSIWYE: An Algebra for Expressing Spatial and Textual Rules for Information Extraction. arXiv:1506.08454 [cs.CL].
  17. Baumgartner, Robert; Flesca, Sergio; Gottlob, Georg (2001). Visual Web Information Extraction with Lixto. pp. 119–128. CiteSeerX 10.1.1.21.8236.
  18. Peng, F.; McCallum, A. (2006). Information extraction from research papers using conditional random fields. Information Processing & Management. 42 (4): 963. doi:10.1016/j.ipm.2005.09.002.
  19. Shimizu, Nobuyuki; Hass, Andrew (2006). Extracting Frame-based Knowledge Representation from Route Instructions. Retrieved 2010-03-27. Archived from the original on 2006-09-01: https://web.archive.org/web/20060901085639/http://www.cs.albany.edu/~shimizu/shimizu+haas2006frame.pdf
