NLP with Python

A note on Natural Language Processing with Python

As I have mentioned previously there are a few resources available to help tackle the issues with Natural Language Processing, particularly in Python.

Online there is a book by Steven Bird, Ewan Klein, and Edward Loper. It is called: Analyzing Text with the Natural Language Toolkit.

The summary can be shortened to the following:

  • “Texts are represented in Python using lists: [’Monty’, 'Python’]. We can use indexing, slicing, and the len() function on lists.
  • A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using len(text) and word types using len(set(text)).
  • We obtain the vocabulary of a text t using sorted(set(t)).
  • We operate on each item of a text using [f(x) for x in text].
  • To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set(w.lower() for w in text if w.isalpha()).
  • We process each word in a text using a for statement, such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
  • We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
  • A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
  • A function is a block of code that has been assigned a name and can be reused. Functions are defined using the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for actual data values.
  • A function is called by specifying its name followed by zero or more arguments inside parentheses, like this: texts(), mult(3, 4), len(text1).”

For more on this read the first chapter of the book.

Hope you enjoy it!