
Word Tokenisation in Python

Different ways of tokenising words

Word tokenisation is often just one part of working with text, so I thought it would be worth exploring in more detail. This article is partly about word tokenisation in general, with a few examples of how differently it can be done in Python. It was written by the author as a learning exercise and combines material from GeeksforGeeks and Wikipedia.

In lexical analysis, the tokens of a programming language are commonly grouped into categories such as the following; a small illustration using Python's own tokenize module comes just after the list:
  • Keyword: names already reserved in the programming language (e.g. if, while);
  • Separator (also known as punctuator): punctuation characters and paired delimiters (e.g. parentheses, commas);
  • Operator: symbols that operate on arguments and produce results (e.g. +, =);
  • Literal: numeric, logical, textual or reference literals (e.g. 42, "hello");
  • Comment: line or block comments.
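As a quick illustration, Python's standard tokenize module splits a line of source code into tokens of roughly these kinds (names, operators, literals, comments). A minimal sketch:

import io
import tokenize

# Tokenise one line of Python source; keywords and identifiers both appear as NAME tokens
code = "x = 42 + y  # a comment"
for tok in tokenize.generate_tokens(io.StringIO(code).readline):
    print(tokenize.tok_name[tok.type], repr(tok.string))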
For natural language, NLTK provides the basic word_tokenize function as well as a number of tokeniser classes, which I will go through below:
  1. TreebankWordTokenizer
  2. PunktWordTokenizer
  3. WordPunctTokenizer
  4. RegexpTokenizer

1. word_tokenize

Word tokenisation splits a sentence into individual words.
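A minimal sketch of how word_tokenize might be called (assuming nltk is installed and its punkt tokenizer data has been downloaded); the example sentence is inferred from the output shown below:

from nltk.tokenize import word_tokenize

# nltk.download('punkt') may be needed the first time
text = "Hello everyone. Welcome to GeeksforGeeks."
print(word_tokenize(text))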

['Hello', 'everyone', '.', 'Welcome', 'to', 'GeeksforGeeks', '.']

2. TreebankWordTokenizer
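The Treebank tokenizer follows the Penn Treebank conventions. A minimal sketch of how it might be called on the same sentence as above; note in the output below that, unlike word_tokenize, it keeps the sentence-internal full stop attached to 'everyone.':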

from nltk.tokenize import TreebankWordTokenizer
print(TreebankWordTokenizer().tokenize("Hello everyone. Welcome to GeeksforGeeks."))

['Hello', 'everyone.', 'Welcome', 'to', 'GeeksforGeeks', '.']

3. PunktWordTokenizer

PunktWordTokenizer does not split the punctuation off from the words: in the output below, the apostrophe stays attached to the following letters ("'s"), in contrast to WordPunctTokenizer in the next section.
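A minimal sketch, with the caveat that PunktWordTokenizer shipped with older NLTK releases and is no longer importable from nltk.tokenize in recent versions; the example sentence is inferred from the output below:

from nltk.tokenize import PunktWordTokenizer  # only available in older NLTK releases

tokenizer = PunktWordTokenizer()
print(tokenizer.tokenize("Let's see how it's working."))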

['Let', "'s", 'see', 'how', 'it', "'s", 'working', '.']

4. WordPunctTokenizer

WordPunctTokenizer separates the punctuation from the words: in the output below, the apostrophe becomes a token of its own.
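A minimal sketch of how it might be called on the same sentence:

from nltk.tokenize import WordPunctTokenizer

print(WordPunctTokenizer().tokenize("Let's see how it's working."))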

['Let', "'", 's', 'see', 'how', 'it', "'", 's', 'working', '.']

5. RegexpTokenizer

Tokenising with a regular expression that defines what counts as a token.
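A minimal sketch using the pattern [\w']+, which keeps runs of word characters and apostrophes together and drops everything else; NLTK offers both the RegexpTokenizer class and the regexp_tokenize helper function, and the two identical lists below correspond to those two calls:

from nltk.tokenize import RegexpTokenizer, regexp_tokenize

text = "Let's see how it's working."
tokenizer = RegexpTokenizer(r"[\w']+")      # keep word characters and apostrophes
print(tokenizer.tokenize(text))
print(regexp_tokenize(text, r"[\w']+"))     # function form, same result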

["Let's", 'see', 'how', "it's", 'working']
["Let's", 'see', 'how', "it's", 'working']

Working with XML

Tokenising text is not always this easy. When working with strings, especially ones found online, you may well have to deal with XML files. XML files are similar to HTML files, and BeautifulSoup, one of the most widely used Python libraries for web scraping, is capable of parsing them as well.

<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>
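A minimal sketch of how the words in this snippet might be extracted with BeautifulSoup (assuming the beautifulsoup4 package is installed; the "xml" parser additionally requires lxml, while "html.parser" would also work here):

from bs4 import BeautifulSoup

xml = """<sentence>
<word>The</word>
<word>quick</word>
<word>brown</word>
<word>fox</word>
<word>jumps</word>
<word>over</word>
<word>the</word>
<word>lazy</word>
<word>dog</word>
</sentence>"""

soup = BeautifulSoup(xml, "xml")
tokens = [w.get_text() for w in soup.find_all("word")]
print(tokens)  # ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']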

Why can it be hard to work with tokenisation?

A single token type can correspond to many different lexemes, and the boundaries between tokens are not always clear-cut: as the examples above show, a contraction such as "it's" can reasonably be kept as one token or split into two or three, depending on the tokeniser.
