
Word Tokenisation in Python

Different ways of tokenising words

Word tokenisation is often one part of working with words, so I thought it would be worth exploring in more detail. This article is partly about word tokenisation in general and partly a set of examples showing how differently it can be done in Python. I wrote it for my own learning, and it combines material from Geeksforgeeks and Wikipedia.

“Tokenisation is the process of tokenising or splitting a string, text into a list of tokens. … One can think of token as parts like a word is a token in a sentence, and a sentence is a token in a paragraph.”

A lexical token or simply token is a string with an assigned and thus identified meaning.

A token is structured as a pair consisting of a token name and an optional token value. The token name is a category of lexical unit. According to the Wikipedia page on lexical analysis, common token names include:

  • Identifier: names the programmer chooses;
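To make the name and value pairing concrete, here is a minimal sketch of a lexer built on Python's standard `re` module. The token names (`NUMBER`, `IDENTIFIER`, `OPERATOR`), the toy input and the function name `lex` are assumptions chosen for illustration; they do not come from the sources above.

```python
import re

# Hypothetical token categories for a toy language; the names are illustrative.
TOKEN_SPEC = [
    ("NUMBER", r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("OPERATOR", r"[+\-*/=]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))


def lex(source):
    """Return a list of (token name, token value) pairs for a toy expression."""
    tokens = []
    for match in TOKEN_RE.finditer(source):
        name = match.lastgroup  # name of the alternative that matched
        if name != "SKIP":  # whitespace separates tokens but is not one itself
            tokens.append((name, match.group()))
    return tokens


print(lex("total = price + 42"))
```

Each pair carries the category first and the matched text second, which is the shape the definition above describes.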

Geeksforgeeks have shown how this can be done in different ways.

  1. word_tokenize

Each will be explained in turn.

1. word_tokenize

Word Tokenization — Splitting words in a sentence.

from nltk.tokenize import word_tokenize

text = "Hello everyone. Welcome to GeeksforGeeks."

word_tokenize(text)

Output :

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

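How does word_tokenize work?
The word_tokenize() function is a wrapper that calls tokenize() on an instance of the TreebankWordTokenizer class. It first splits the text into sentences and tokenises each sentence separately, which is why the full stop after "everyone" is split off here but not when TreebankWordTokenizer is called directly on the whole text.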

2. TreebankWordTokenizer

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()

tokenizer.tokenize(text)

Output :

['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

These tokenizers work by separating the words using punctuation and spaces. As the code outputs above show, they do not discard the punctuation, which lets the user decide what to do with it during pre-processing.
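As one possible pre-processing decision, the sketch below drops tokens that consist only of punctuation. It uses just the standard library, and the token list is copied from the word_tokenize output shown earlier; the variable name `words_only` is an assumption for this example.

```python
import string

# Token list copied from the word_tokenize output shown earlier.
tokens = ['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

# Keep a token unless every character in it is ASCII punctuation.
words_only = [t for t in tokens if not all(ch in string.punctuation for ch in t)]

print(words_only)
```

Whether to drop punctuation at all depends on the task: sentence boundaries, for instance, are lost once the full stops are gone.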

3. PunktWordTokenizer

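PunktWordTokenizer does not separate all punctuation from the words: as the output below shows, the apostrophe stays attached to the contraction suffix ('s). As far as I can tell, this class is no longer available in recent NLTK releases, so the import below may fail depending on the installed version.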

from nltk.tokenize import PunktWordTokenizer

tokenizer = PunktWordTokenizer()

tokenizer.tokenize("Let's see how it's working.")

Output :

['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']

4. WordPunctTokenizer

WordPunctTokenizer separates all punctuation from the words, so each apostrophe becomes a token of its own.

from nltk.tokenize import WordPunctTokenizer

tokenizer = WordPunctTokenizer()

tokenizer.tokenize("Let's see how it's working.")

Output :

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']
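The same split can be reproduced without NLTK. The sketch below assumes the pattern `\w+|[^\w\s]+` (runs of word characters, or runs of characters that are neither word characters nor whitespace), which is how the NLTK documentation describes WordPunctTokenizer; it illustrates the idea and is not a copy of the library code.

```python
import re

text = "Let's see how it's working."

# Either a run of word characters, or a run of non-word, non-space characters.
tokens = re.findall(r"\w+|[^\w\s]+", text)

print(tokens)
```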

5. RegexpTokenizer

Code #7: Using Regular Expression

from nltk.tokenize import RegexpTokenizer

tokenizer = RegexpTokenizer(r"[\w']+")

text = "Let's see how it's working."

tokenizer.tokenize(text)

Output :

["Let's", 'see', 'how', "it's", 'working']

The same result is available through the regexp_tokenize helper function:

from nltk.tokenize import regexp_tokenize

text = "Let's see how it's working."

regexp_tokenize(text, r"[\w']+")

Output :

["Let's", 'see', 'how', "it's", 'working']
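A regular expression can describe either the tokens themselves or the gaps between them. The sketch below shows both approaches with the standard `re` module rather than NLTK (RegexpTokenizer offers a `gaps=True` option for the second style); the variable names are assumptions for this example.

```python
import re

text = "Let's see how it's working."

# The pattern describes the tokens: runs of word characters and apostrophes.
by_tokens = re.findall(r"[\w']+", text)

# The pattern describes the gaps: split on whitespace and drop empty strings.
by_gaps = [t for t in re.split(r"\s+", text) if t]

print(by_tokens)
print(by_gaps)
```

The two lists differ only in the last item: matching tokens drops the final full stop, while splitting on gaps leaves it attached to "working.".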

Working with XML

Tokenising text is not always that easy. When you collect strings, especially online, you may well have to deal with XML files. XML files are similar to HTML files, and BeautifulSoup, one of the most widely used Python libraries for web scraping, is capable of parsing them as well.

For example, in the text string:

The quick brown fox jumps over the lazy dog

“…the string isn’t implicitly segmented on spaces, as a natural language speaker would do. The raw input, the 43 characters, must be explicitly split into the 9 tokens with a given space delimiter (i.e., matching the string " " or regular expression /\s{1}/).”
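The counts in the quote can be checked with a few lines of plain Python, splitting explicitly on a single space as the quote describes:

```python
sentence = "The quick brown fox jumps over the lazy dog"

# Split on the explicit single-space delimiter mentioned in the quote.
tokens = sentence.split(" ")

print(len(sentence))  # number of raw characters
print(len(tokens))    # number of tokens
print(tokens)
```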

The tokens could be represented in XML.

<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>
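A minimal sketch of producing and reading back this structure is shown below. It uses the standard library's `xml.etree.ElementTree` instead of BeautifulSoup so that it runs without extra installs; the element names `sentence` and `word` follow the example above, while the variable names are assumptions.

```python
import xml.etree.ElementTree as ET

tokens = "The quick brown fox jumps over the lazy dog".split(" ")

# Build <sentence><word>...</word>...</sentence> from the token list.
sentence = ET.Element("sentence")
for token in tokens:
    word = ET.SubElement(sentence, "word")
    word.text = token

xml_text = ET.tostring(sentence, encoding="unicode")
print(xml_text)

# Parse the XML string again and recover the tokens from the <word> elements.
recovered = [w.text for w in ET.fromstring(xml_text).findall("word")]
print(recovered)
```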

You could read more about this in a post on XML by Geeksforgeeks.

Why can it be hard to work with tokenisation?

A token can represent more than one lexeme.

Lexeme: a basic lexical unit of a language consisting of one word or several words, the elements of which do not separately convey the meaning of the whole.

Tokens can be characterised by character content or context within the data stream.

Programming languages: often categorise tokens as identifiers, operators, grouping symbols, or by data type (integer, string, etc.).

Written languages: commonly categorise tokens as nouns, verbs, adjectives, or punctuation.

Categories are used for post-processing of the tokens either by the parser or by other functions in the program.
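As a rough illustration of categorising tokens by character content and then grouping them for post-processing, consider the sketch below. The category names and the `categorise` function are assumptions made for this example; real grammatical categories such as nouns and verbs would need a part-of-speech tagger, which this sketch does not attempt.

```python
import string
from collections import defaultdict


def categorise(token):
    """Assign a coarse category based only on the characters in the token."""
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    if token and all(ch in string.punctuation for ch in token):
        return "punctuation"
    return "other"  # mixed content, such as contractions


tokens = ['Welcome', 'to', 'article', '462', '.', "it's"]

# Group the tokens by category so later steps can treat each group differently.
grouped = defaultdict(list)
for token in tokens:
    grouped[categorise(token)].append(token)

print(dict(grouped))
```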

This is #500daysofAI and you are reading article 462. I am writing one new article about or related to artificial intelligence every day for 500 days.

AI Policy and Ethics at www.nora.ai. Student at University of Copenhagen MSc in Social Data Science. All views are my own. twitter.com/AlexMoltzau