Photo by @dsmacinnes

A Beginner in Word2vec

An introduction to one technique for NLP

I am writing this article to learn more about Word2vec. It is a short description based on the Wikipedia article and does not contain extensive technical descriptions of how to apply the technique.

There are a variety of techniques for natural language processing (NLP).

One of these techniques is Word2vec.

Word2vec was created and published in 2013 by a team of researchers led by Tomas Mikolov at Google.

Tomáš Mikolov is a Czech computer scientist working in the field of machine learning. He is currently a Research Scientist at the Czech Institute of Informatics, Robotics and Cybernetics.

He and the team he led have been widely cited in the scientific literature, and the algorithm is patented.

Word2vec represents each word with a list of numbers called a vector.

“The vectors are chosen carefully such that a simple mathematical function (the cosine similarity between the vectors) indicates the level of semantic similarity between the words represented by those vectors.”

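What are cosine similarity and semantic similarity? Cosine similarity is the cosine of the angle between two vectors: it is close to 1 when the vectors point in nearly the same direction and close to 0 when they are nearly perpendicular. Semantic similarity is how close two words are in meaning.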
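To make the cosine similarity from the quote above more concrete, here is a minimal Python sketch using only the standard library. The three vectors are made up for illustration and are not trained Word2vec embeddings; the function name `cosine_similarity` is my own choice.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors (assumption: real embeddings are learned
# from text and usually have hundreds of dimensions).
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))   # high: the vectors point the same way
print(cosine_similarity(king, banana))  # lower: the vectors point apart
```

With these toy numbers the first score is higher than the second, which is the behaviour the quote describes: words with related meanings should end up with vectors that point in similar directions.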

Word2vec is not a single model but a group of related models that are used to produce word embeddings.

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) where words or phrases from the vocabulary are mapped to vectors of real numbers.
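As a rough picture of what "mapped to vectors of real numbers" means, the sketch below builds a lookup table from words to vectors. It is only an illustration: the values are random placeholders, whereas Word2vec learns them from text, and the helper name `build_toy_embeddings` and the tiny vocabulary are hypothetical.

```python
import random


def build_toy_embeddings(vocabulary, dimensions=4, seed=0):
    """Map each word to a vector of random real numbers.

    Assumption: random values stand in for the numbers a real model
    would learn during training.
    """
    rng = random.Random(seed)
    return {
        word: [rng.uniform(-1.0, 1.0) for _ in range(dimensions)]
        for word in vocabulary
    }


vocabulary = ["king", "queen", "apple"]
embeddings = build_toy_embeddings(vocabulary)

# Looking up a word returns its vector.
print(embeddings["king"])
```

The data structure is the same in a trained model: one vector per word in the vocabulary. What training changes is the numbers inside each vector.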

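Word2vec takes a large corpus of text as its input and produces a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a vector.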

Word2vec can utilize either of two model architectures to produce a distributed representation of words:

  • Continuous bag-of-words (CBOW).
  • Continuous skip-gram.

→ In CBOW the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption).

→ In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words.
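The difference between the two architectures is easiest to see in the training examples each one is built from. The sketch below is a simplified illustration in plain Python, not the original implementation: the function names, the example sentence, and the fixed `window` size are my own assumptions, and no neural network is trained here.

```python
def context_window(tokens, index, window):
    """Return the words within `window` positions of tokens[index],
    excluding the word at that position itself."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """CBOW: (context words, current word). The context is the input
    and the current word is what the model tries to predict."""
    examples = []
    for i, target in enumerate(tokens):
        context = context_window(tokens, i, window)
        if context:  # skip words that have no neighbours
            examples.append((context, target))
    return examples


def skipgram_examples(tokens, window=2):
    """Skip-gram: (current word, one context word). The current word
    is the input and each surrounding word is a prediction target."""
    return [
        (center, context_word)
        for i, center in enumerate(tokens)
        for context_word in context_window(tokens, i, window)
    ]


tokens = "the quick brown fox jumps".split()

for example in cbow_examples(tokens, window=1):
    print("CBOW:", example)

for example in skipgram_examples(tokens, window=1):
    print("skip-gram:", example)
```

Note that this sketch treats every word inside the window equally. It does not implement the heavier weighting of nearby context words described above for skip-gram.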

A far better and more extensive article was written by Jay Alammar.

This is #500daysofAI and you are reading article 437. I am writing one new article about or related to artificial intelligence every day for 500 days.

AI Policy and Ethics. Student at the University of Copenhagen, MSc in Social Data Science. All views are my own.