Bangkok, Thailand — photo by @joshrh19

Vector Space for Information Retrieval

Notes from a lecture in IN2110 at the University of Oslo about document vectors

This article is a set of notes on one of the lectures in IN2110 at the University of Oslo about document vectors. The original title is “IN2110: Språkteknologiske metoder — Vektorrom for IR,” that is, “Vector space models for Information Retrieval (IR).”

“High-dimensional spaces where d is in the thousands or even millions are not uncommon in ML/NLP.”

You can see why this may complicate matters ever so slightly.

Euclidean distance

Slide from IN2110: Språkteknologiske metoder — Vektorrom for IR
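The distance measure from the slides is straightforward to write down. Here is a minimal sketch in plain Python; the toy count vectors are invented for illustration and are not taken from the lecture:

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two equal-length document vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimensionality")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy term-count vectors over a tiny three-word vocabulary (made-up numbers):
doc1 = [2, 0, 1]
doc2 = [1, 1, 0]
print(euclidean_distance(doc1, doc2))  # sqrt(1 + 1 + 1) ≈ 1.732
```

Note that in the high-dimensional spaces the quote above mentions, this same formula applies unchanged; d just runs into the thousands or millions.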
Model from IN2110: Språkteknologiske metoder — Vektorrom for IR
Cosine similarity — model from IN2110: Språkteknologiske metoder — Vektorrom for IR
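Cosine similarity compares the direction of two vectors rather than their magnitude, which is why it is usually preferred over Euclidean distance for documents of different lengths. A minimal sketch (the example vectors are made up):

```python
import math

def cosine_similarity(u, v):
    """cos(u, v) = (u · v) / (|u| |v|); 1.0 means the vectors point the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for the all-zero vector
    return dot / (norm_u * norm_v)

# Doubling every count (e.g. a document twice as long with the same word
# proportions) leaves the cosine unchanged:
print(cosine_similarity([1, 2, 0], [2, 4, 0]))  # ≈ 1.0
```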
  • Treat the query as a short document: represent it as a vector and find its nearest neighbors.
  • I.e., rank the documents by the distance between each document vector and the query vector.
Model from IN2110: Språkteknologiske metoder — Vektorrom for IR
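The query-as-short-document idea can be sketched directly: vectorize the query, score every document against it, and sort. The document vectors and labels below are invented for illustration:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def rank(query_vec, doc_vecs):
    """Return document ids sorted by decreasing similarity to the query."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]

# Made-up term-count vectors over a three-word vocabulary:
docs = {"d1": [3, 0, 1], "d2": [0, 2, 2], "d3": [1, 1, 0]}
query = [1, 0, 0]  # the query, treated as a (very) short document
print(rank(query, docs))  # d1 shares most mass with the query term
```

In a real system one would use an inverted index instead of scoring every document, but the ranking principle is the same.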

“Problem: Raw frequency counts not always good indicators of relevance.”

The most frequent words will typically not be very discriminative.
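The standard remedy for this is tf-idf weighting: scale each raw count by how rare the term is across the collection, so that words appearing everywhere get weight near zero. A minimal sketch, with invented corpus statistics:

```python
import math

def tf_idf(term_counts, doc_freq, n_docs):
    """Weight each term as raw count × log(N / df).

    A term occurring in every document gets log(1) = 0,
    so the most frequent, least discriminative words vanish.
    """
    return {t: count * math.log(n_docs / doc_freq[t])
            for t, count in term_counts.items()}

# Invented statistics for a 10-document mini-corpus:
# 'the' appears in all 10 documents, 'neural' in only 2.
doc_freq = {"the": 10, "neural": 2}
weights = tf_idf({"the": 12, "neural": 3}, doc_freq, n_docs=10)
print(weights)  # 'the' drops to 0.0 despite being the most frequent term
```

Practical implementations (e.g. scikit-learn's `TfidfVectorizer`) add smoothing terms to the logarithm, but the effect on frequent words is the same.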

Model from IN2110

“Classification amounts to computing the boundaries in the space that separate the classes; the decision boundaries.”

Model from IN2110
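One concrete way to obtain such decision boundaries — assuming a nearest-centroid (Rocchio-style) classifier, which may differ from the model on the slide — is to average each class's training vectors and assign new documents to the closest centroid. The boundary is then the hyperplane midway between centroids:

```python
def centroid(vectors):
    """Mean vector of a class's training documents."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def classify(vec, centroids):
    """Assign vec to the class whose centroid is nearest (squared Euclidean).

    The implicit decision boundary between two classes is the set of
    points equidistant from their centroids.
    """
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return min(centroids, key=lambda label: sq_dist(vec, centroids[label]))

# Made-up 2-D training vectors for two hypothetical classes:
centroids = {
    "sports": centroid([[5, 1], [4, 0]]),
    "politics": centroid([[0, 4], [1, 5]]),
}
print(classify([4, 1], centroids))  # falls on the 'sports' side of the boundary
```

Linear models (perceptron, logistic regression, linear SVM) learn such separating hyperplanes directly instead of deriving them from centroids.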

AI Policy and Ethics at www.nora.ai. Student at University of Copenhagen MSc in Social Data Science. All views are my own. twitter.com/AlexMoltzau