# Vector Space for Information Retrieval

## Notes from a lecture in IN2110 at the University of Oslo about document vectors

This article is a set of notes on one of the lecture in IN2110 at the University of Oslo about document vectors. The original title is: “IN2110: Språkteknologiske metoder Vektorrom for IR.” That is: “Vector space models for Information Retrieval (IR).”

In the lecture it was said:

“In very high-level terms, all machine learning models can be described as a learned mapping from inputs to outputs.”

In this case it can be about modelling the similarity of documents, queries, and topics.

1. The first step is to represent our data in a way that the model can take as input.

→ Very often done by describing the data by a set of features.

In this manner the properties of the data.

What is observable?

What is relevant?

The object modelled can be described as tuple of d feature values.

In mathematics, a tuple is a finite ordered list (sequence) of elements. An n-tuple is a sequence (or ordered list) of n elements, where n is a non-negative integer.

We can record features of the data.

Every feature has a numerical value.

To bring in the explanation of tuple.

→ object x

→ can be described as a finite ordered list of elements = d feature values

Features are also numerically indexed.

Vector space model is based on a spatial metaphor.

Features correspond to dimensions or coordinate axes in the ‘space’.

This helpful video can explain in more detail.

Features correspond to dimensions or coordinate axes in this space.

The objects correspond to points in this feature space.

It is possible to measure geometrical distance/closeness.

According to the lecture:

“High-dimensional spaces where d is in the thousands or even millions are not uncommon in ML/NLP.”

You can see why this may complicate matters ever so slightly.

There is often a wish to quantify similarity of different documents.

How relevant is a document to a certain query?

Imagine searching on Google and getting the wrong thing.

This may have happened, but often you find what you were searching for.

How relevant is a certain document to a given query?

It is possible to do a frequency count of all the words (so-called bag-of-words).

Each word type corresponds to a dimension.

“Rose is a rose is a rose is a rose.” Gertrude Stein I How many words? (Assuming we ignore case and punctuation.)

Three types and ten tokens.

Conceptually, a vector space is often thought of as a matrix.

It is often called a co-occurrence matrix.

Here is another useful video to explain this as well.

We can assume that documents that share many of the same words are semantically similar in terms of their content.

This may of course not be right at all — however, these inquiries work by making a set of assumptions.

We can also view the rows as vectors representing words.

Words that co-occur in the same documents may be semantically related.

Distributional hypothesis is important in this regard.

The distributional hypothesis is the basis for statistical semantics.

“The Distributional Hypothesis is that words that occur in the same contexts tend to have similar meanings (Harris, 1954). The underlying idea that “a word is characterized by the company it keeps” was popularized by Firth (1957), and it is implicit in Weaver’s (1955) discussion of word sense disambiguation (originally written as a memorandum, in 1949). The Distributional Hypothesis is the basis for Statistical Semantics. Although the Distributional Hypothesis originated in Linguistics, it is now receiving attention in Cognitive Science (McDonald and Ramscar, 2001). The origin and theoretical basis of the Distributional Hypothesis is discussed by Sahlgren (2008).”

Distributional semantics is a research area that develops and studies theories and methods for quantifying and categorising semantic similarities between linguistic items based on their distributional properties in large samples of language data.

If vectors are defined the one can now compute similarity (of either words or documents) based on distance in the space.

There are several possibilities for this. This is listed in the lecture.

## Euclidean distance

Euclidean distance and length bias.

It is possible to reduce frequency effects.

One can first normalise all vectors to have unit length.

Can be achieved by simply dividing each element by the length.

“Amounts to all vectors pointing to the surface of a unit sphere.”

It is possible to measure (cosine) proximity rather than (Euclidean) distance.

Computes similarity as a function of the angle between the vectors.

Avoids the arbitrary scaling caused by dimensionality, frequency, etc. Cosine similarity — model from IN2110: Språkteknologiske metoder [1ex] Vektorrom for IR

Why do this?

We want to know how to find something and whether it is relevant.

AKA queries and relevance making.

Information retrieval (IR).

“Information retrieval is the activity of obtaining information system resources that are relevant to an information need from a collection of those resources.”

A central task in information retrieval:

• “Identifying relevant documents for a given query (i.e. search terms).
• Treat the query as a short document: I Represent it as a vector and find its nearest neighbors.
• I.e. rank the documents based on the distance between the document vectors and the query vector.”

“Problem: Raw frequency counts not always good indicators of relevance.”

The most frequent words will typically not be very discriminative.

One can therefore apply weighting:

“The process of weighting involves emphasizing the contribution of particular aspects of a phenomenon over others to a final outcome or result; thereby highlighting those aspects in comparison to others in the analysis.”

A weighting function is usually applied to the raw counts.

The most commonly used weighting function is tf-idf:

The term frequency tf(ti , dj ) denotes the number of times the term ti occurs in document dj.

“In information retrieval, tf–idf or TFIDF, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus.”

The document frequency df(ti) denotes the total number of documents in the collection that the term occurs in.

N is the total number of documents in the collection.

The weight given to term ti in document dj is then computed as

A high tf-idf: is obtained if a term has a high frequency in the given document.

A low tf-idf: frequency in the document collection as whole.

The weights hence tend to filter out common terms.

It is possible to use different processes to further define what a word is.

Tokenization: Splitting a text into sentences and words or other units.

To use supervised learning, requiring labeled training data.

“Supervised learning is the machine learning task of learning a function that maps an input to an output based on example input-output pairs.”

One can train a classifier to automatically assign new instances to predefined classes.

This requires labeled training data.

However, one can train a classifier to automatically assign new instances to predefined classes, given some set of training examples.

One can perform clustering. Unsupervised learning from unlabeled data, automatically group similar objects together.

“Classification amounts to computing the boundaries in the space that separate the classes; the decision boundaries.”