Photo by @dsmacinnes

A Beginner in Word2vec

An introduction to one technique for NLP

Alex Moltzau
3 min read · Aug 14, 2020


I am writing this article to learn more about Word2vec. This is a short description based on the article on Wikipedia and will not contain any extensive technical description of its applications.

There are a variety of techniques for natural language processing (NLP).

One of these techniques is Word2vec.

  1. This uses a neural network model to learn word associations from a large corpus of text.
  2. Once trained, it can detect synonymous words or suggest additional words for a partial sentence.
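
To make this a little more concrete, here is a minimal sketch using the open-source gensim library, which implements Word2vec. The tiny corpus and the parameter values are made up purely for illustration, and the parameter names assume the gensim 4.x API.

```python
# A minimal sketch using gensim (pip install gensim); parameter names follow gensim 4.x.
from gensim.models import Word2Vec

# Each sentence is a list of tokens; a real pipeline would tokenize a large text corpus.
corpus = [
    ["the", "cat", "sat", "on", "the", "mat"],
    ["the", "dog", "sat", "on", "the", "rug"],
    ["cats", "and", "dogs", "are", "pets"],
]

# Toy parameter values chosen for illustration only.
model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, epochs=50)

# Once trained, the model can suggest words that occur in similar contexts.
print(model.wv.most_similar("cat", topn=3))
```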

Word2vec was created and published in 2013 by a team of researchers led by Tomas Mikolov at Google.

Tomáš Mikolov is a Czech computer scientist working in the field of machine learning. He is currently a Research Scientist at the Czech Institute of Informatics, Robotics and Cybernetics.

He and the team he led have been widely cited in the scientific literature, and the algorithm is patented.

Word2vec represents each word with a list of numbers called a vector.

“The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.”

What are cosine similarity and semantic similarity?

  1. “Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It is defined to equal the cosine of the angle between them, which is also the same as the inner product of the same vectors normalized to both have length 1. In linear algebra, an inner product space is a vector space with an additional structure called an inner product. This additional structure associates each pair of vectors in the space with a scalar quantity known as the inner product of the vectors.”
  2. “Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity.”
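
In code, the cosine similarity mentioned in the quote comes down to a few lines of numpy. The three-dimensional vectors below are made-up toy values, not real Word2vec output.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # The inner (dot) product divided by the product of the vectors' lengths,
    # which equals the cosine of the angle between them.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up vectors for illustration; real word vectors typically have hundreds of dimensions.
king = np.array([0.5, 0.8, 0.1])
queen = np.array([0.45, 0.75, 0.2])
car = np.array([-0.7, 0.1, 0.9])

print(cosine_similarity(king, queen))  # close to 1: pointing in a similar direction
print(cosine_similarity(king, car))    # near 0 or negative: dissimilar
```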

The approach is based on a group of related models that are used to produce word embeddings.

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.

Word2vec takes:

  1. A large corpus of text as input and produces a vector space.
  2. Typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. (The dimension of a mathematical space or object is informally defined as the minimum number of coordinates needed to specify any point within it.)
  3. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located close to one another in the space.
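
Continuing with the hypothetical toy model from the first sketch, inspecting that vector space could look roughly like this.

```python
# `model` is the toy gensim model trained in the earlier sketch.
vec = model.wv["cat"]   # the real-valued vector assigned to the word "cat"
print(vec.shape)        # (50,) in the toy setup; several hundred dimensions is more typical

# Words that share contexts in the corpus should end up close together in the space.
print(model.wv.similarity("cat", "dog"))  # cosine similarity between the two word vectors
print(model.wv.similarity("cat", "mat"))
```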

Word2vec can utilize either of two model architectures to produce a distributed representation of words:

  • Continuous bag-of-words (CBOW).
  • Continuous skip-gram.

→ In CBOW the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption).

→ In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words.
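
In gensim's implementation, as far as I understand it, choosing between the two architectures is a single flag. The snippet below reuses the toy corpus from the first sketch.

```python
from gensim.models import Word2Vec

# sg=0 selects CBOW (context words predict the current word),
# sg=1 selects skip-gram (the current word predicts its context words);
# `window` sets how many context words on each side are considered.
cbow_model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=0)
skipgram_model = Word2Vec(corpus, vector_size=50, window=5, min_count=1, sg=1)
```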

A far better and more extensive article was written by Jay Alammar.

This is #500daysofAI and you are reading article 437. I am writing one new article about or related to artificial intelligence every day for 500 days.


Alex Moltzau

AI Policy, Governance, Ethics and International Partnerships at www.nora.ai. All views are my own. twitter.com/AlexMoltzau