PurpleFilm

The Information Within

Information Retrieval

Information retrieval (IR) is the science of searching for information in documents, searching for documents themselves, searching for metadata which describe documents, or searching within databases, whether relational stand-alone databases or hypertextually-networked databases such as the World Wide Web. There is a common confusion, however, between data retrieval, document retrieval, information retrieval, and text retrieval, and each of these has its own bodies of literature, theory, praxis and technologies. IR is, like most nascent fields, interdisciplinary, based on computer science, mathematics, library science, information science, cognitive psychology, linguistics, statistics, physics.

Automated IR systems are used to reduce information overload. Many universities and public libraries use IR systems to provide access to books, journals, and other documents. IR systems are often related to object and query. Queries are formal statements of information needs that are put to an IR system by the user. An object is an entity which keeps or stores information in a database. User queries are matched to objects stored in the database. A document is, therefore, a data object. Often the documents themselves are not kept or stored directly in the IR system, but are instead represented in the system by document surrogates.

In 1992 the US Department of Defense, along with the National Institute of Standards and Technology (NIST), cosponsored the Text Retrieval Conference (TREC) as part of the TIPSTER text program. The aim of this was to look into the information retrieval community by supplying the infrastructure that was needed for such a huge evaluation of text retrieval methodologies.

Web search engines such as Google, Live.com, or Yahoo search are the most visible IR applications

There are several measures on the performance of an information retrieval system. The measures rely on a collection of documents and a query for which the relevancy of the documents is known. All common measures described here assume binary relevancy: the document is either relevant or completely irrelevant. In practice queries may be ill-posed and there may be different shades of relevancy. The formulas for precision, recall and fall-out are translated from the German Wikipedia-article “Recall und Precision”. See also this nice intuitive, graphical depiction.

Precision

The proportion of retrieved and relevant documents to all the documents retrieved:

\mbox{precision}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{retrieved documents}\}|}

In binary classification, precision is analogous to positive predictive value. Precision takes all retrieved documents into account. It can also be evaluated at a given cut-off rank, considering only the topmost results returned by the system. This measure is called precision at n or P@n.

Note that the meaning and usage of “precision” in the field of Information Retrieval differs from the definition of accuracy and precision within other branches of science and technology.

Recall

The proportion of relevant documents that are retrieved, out of all relevant documents available:

\mbox{recall}=\frac{|\{\mbox{relevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{relevant documents}\}|}

In binary classification, recall is called sensitivity.

It is trivial to achieve recall of 100% by returning all documents in response to any query. Therefore recall alone is not enough but one needs to measure the number of irrelevant document also, for example by computing the precision.

Fall-Out

The proportion of irrelevant documents that are retrieved, out of all irrelevant documents available:

\mbox{fall-out}=\frac{|\{\mbox{irrelevant documents}\}\cap\{\mbox{retrieved documents}\}|}{|\{\mbox{irrelevant documents}\}|}

F-measure

The weighted harmonic mean of precision and recall, the traditional F-measure or balanced F-score is:

F = 2 \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\mathrm{precision} + \mathrm{recall}).\,

This is also known as the F1 measure, because recall and precision are evenly weighted.

The general formula for non-negative real α is:

F_\alpha = (1 + \alpha) \cdot (\mathrm{precision} \cdot \mathrm{recall}) / (\alpha \cdot \mathrm{precision} + \mathrm{recall}).\,

Two other commonly used F measures are the F2 measure, which weights recall twice as much as precision, and the F0.5 measure, which weights precision twice as much as recall.

Average precision

The precision and recall are based on the whole list of documents returned by the system. Average precision emphasizes returning more relevant documents earlier. It is average of precisions computed after truncating the list after each of the relevant documents in turn:

\operatorname{Ave}P = \frac{\sum_{r=1}^N (P(r) \times \mathrm{rel}(r))}{\mbox{number of relevant documents}} \!,

where r is the rank, N the number retrieved, rel() a binary function on the relevance of a given rank, and P() precision at a given cut-off rank.

If there are several queries with known relevancies available, the mean average precision is the mean value of the average precisions computed for each of the queries separately.

Model types

For successful IR, it is necessary to represent the documents in some way. There are a number of models for this purpose. They can be categorized according to two dimensions like shown in the figure on the right: the mathematical basis and the properties of the model.

Trying to explain Information Retrieval basic problems

800px-information-retrieval-models1.png

in accordance with Dominik Kuropka’s Categorization of IR-models

(Thanks to Neil Dodgson and the other members of WIKI peer review work group, more info on Wikipedia Free Encyclopedia)