• Information Retrieval (IR) retrieves relevant facts from unstructured data when the facts of interest are NOT specified in advance
• Information Extraction (IE) extracts relevant facts from unstructured data when the facts of interest are specified in advance

Both IE and IR are subtasks of Natural Language Processing (NLP) / Computational Linguistics

IR/IE - Other

IE can be viewed as an implementation of Feature Extraction that extracts specific Features (e.g. entities, relations, events) from text
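As a minimal sketch of extracting features that are specified in advance, the snippet below pulls two predefined fact types out of free text with regular expressions. The fact types (`email`, `date`), the patterns, and the sample sentence are assumptions chosen for illustration; practical IE systems typically rely on trained models rather than hand-written patterns.

```python
import re

# Hypothetical fact types, fixed before any text is seen. The patterns are
# deliberately simplified and will miss many real-world variants.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "date": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def extract_facts(text):
    """Return every match for each predefined fact type."""
    return {name: pattern.findall(text) for name, pattern in PATTERNS.items()}

text = "Contact alice@example.com before 2024-03-15 about the merger."
facts = extract_facts(text)
print(facts)
```

The key point is that the set of extractable fact types is closed: anything not covered by `PATTERNS` is ignored, however relevant it might be to a later question.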

IR/IE - Model Types

To effectively retrieve relevant documents, the documents are typically transformed into a suitable representation. Each retrieval strategy incorporates a specific model for its document representation purposes. The picture on the right illustrates the relationship between some common models.
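One common document representation is a bag of words, which maps each term to how often it occurs. The sketch below assumes a very simple tokenizer (lowercase, ASCII letters only); the helper names `tokenize` and `bag_of_words` are illustrative and not taken from any particular library.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(text):
    """Represent a document as a mapping from term to term frequency."""
    return Counter(tokenize(text))

doc = "The cat sat on the mat."
representation = bag_of_words(doc)
print(representation)
```

Word order is discarded in this representation, which is what makes it cheap to compare documents and queries term by term.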

Models are categorized along two dimensions:

  1. Mathematical Basis
  2. Properties of the Model

Dimension #1 - Mathematical Basis

| Basis Type | Description | Example Models |
|---|---|---|
| Set-Theoretic Models | Represent documents as sets of words or phrases. Similarities are usually derived from set-theoretic operations on those sets. | |
| Algebraic Models | Represent documents and queries usually as vectors, matrices, or tuples. The similarity of the query vector and document vector is represented as a scalar value. | |
| Probabilistic Models | Treat the process of document retrieval as a probabilistic inference. Similarities are computed as probabilities that a document is relevant for a given query. Probabilistic theorems like Bayes' Theorem are often used in these models. | |
| Feature-based Retrieval Models | View documents as vectors of values of feature functions and seek the best way to combine these features into a single relevance score, typically by learning-to-rank methods. Feature functions are arbitrary functions of document and query, and as such can easily incorporate almost any other retrieval model as just another feature. | |
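The four families above can be contrasted with one small scoring function each. This is a sketch under stated assumptions, not a reference implementation: whitespace tokenization, Jaccard overlap as the set-theoretic measure, cosine similarity over raw term counts as the algebraic measure, a single application of Bayes' theorem as the probabilistic step, and fixed hand-picked weights standing in for weights that a learning-to-rank method would normally fit. All function names are illustrative.

```python
import math
from collections import Counter

def tokens(text):
    """Assumed tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def jaccard(query, doc):
    """Set-theoretic: overlap between the two term sets."""
    q, d = set(tokens(query)), set(tokens(doc))
    return len(q & d) / len(q | d) if q | d else 0.0

def cosine(query, doc):
    """Algebraic: cosine of the angle between term-count vectors."""
    q, d = Counter(tokens(query)), Counter(tokens(doc))
    dot = sum(q[t] * d[t] for t in q)
    norm = (math.sqrt(sum(v * v for v in q.values()))
            * math.sqrt(sum(v * v for v in d.values())))
    return dot / norm if norm else 0.0

def posterior_relevance(prior, p_evidence_given_rel, p_evidence_given_nonrel):
    """Probabilistic: Bayes' theorem for P(relevant | evidence)."""
    numerator = p_evidence_given_rel * prior
    evidence = numerator + p_evidence_given_nonrel * (1.0 - prior)
    return numerator / evidence if evidence else 0.0

def feature_score(query, doc, weights=(0.5, 0.5)):
    """Feature-based: weighted sum of other models' scores.

    The weights are hypothetical; in practice they would be learned.
    """
    features = (jaccard(query, doc), cosine(query, doc))
    return sum(w * f for w, f in zip(weights, features))

query = "cheap flights"
for doc in ["cheap flights to rome", "rome travel guide"]:
    print(doc, jaccard(query, doc), cosine(query, doc), feature_score(query, doc))
```

Note how `feature_score` reuses the other scoring functions as inputs, which mirrors the point that a feature-based model can absorb almost any other retrieval model as one more feature.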

Dimension #2 - Properties of the Model

| Properties of the Model | Description |
|---|---|
| Models without Term Interdependencies | Treat different terms/words as independent. This is usually represented in vector space models by the orthogonality assumption for term vectors, or in probabilistic models by an independence assumption for term variables. |
| Models with Immanent Term Interdependencies | Allow a representation of interdependencies between terms. However, the degree of interdependency between two terms is defined by the model itself. It is usually derived, directly or indirectly (e.g. by dimensionality reduction), from the co-occurrence of those terms in the whole set of documents. |
| Models with Transcendent Term Interdependencies | Allow a representation of interdependencies between terms, but do not specify how the interdependency between two terms is defined. They rely on an external source for the degree of interdependency between two terms (for example, a human or a sophisticated algorithm). |
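The three properties can be illustrated by where a term-to-term similarity comes from. In the sketch below, the toy corpus, the thesaurus entry, and its value of 0.9 are all made-up assumptions; the immanent variant uses plain cosine similarity over document-incidence vectors as one simple stand-in for a co-occurrence-derived measure.

```python
import math

# Toy corpus: each document is a set of terms (hypothetical data).
docs = [
    {"car", "engine", "road"},
    {"automobile", "engine", "road"},
    {"banana", "fruit"},
]

def independent_similarity(t1, t2):
    """No interdependencies: distinct terms are treated as orthogonal."""
    return 1.0 if t1 == t2 else 0.0

def immanent_similarity(t1, t2):
    """Immanent: derived from co-occurrence within the corpus itself,
    here as the cosine of the terms' document-incidence vectors."""
    v1 = [1 if t1 in d else 0 for d in docs]
    v2 = [1 if t2 in d else 0 for d in docs]
    dot = sum(a * b for a, b in zip(v1, v2))
    norm = math.sqrt(sum(v1)) * math.sqrt(sum(v2))
    return dot / norm if norm else 0.0

# Transcendent: the degree of interdependency comes from an external
# source, here a hand-written thesaurus with an assumed value.
EXTERNAL_THESAURUS = {frozenset({"car", "automobile"}): 0.9}

def transcendent_similarity(t1, t2):
    """Transcendent: look the similarity up instead of computing it."""
    if t1 == t2:
        return 1.0
    return EXTERNAL_THESAURUS.get(frozenset({t1, t2}), 0.0)

for pair in [("car", "automobile"), ("engine", "road")]:
    print(pair,
          independent_similarity(*pair),
          immanent_similarity(*pair),
          transcendent_similarity(*pair))
```

In this toy corpus "car" and "automobile" never share a document, so only the external thesaurus links them, while "engine" and "road" are linked by the corpus statistics alone.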