NLP with Python
As I have mentioned previously, there are a few resources available to help tackle Natural Language Processing problems, particularly in Python.
There is a book, freely available online, by Steven Bird, Ewan Klein, and Edward Loper. It is called Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit.
The summary of its first chapter can be shortened to the following:
- “Texts are represented in Python using lists: ['Monty', 'Python']. We can use indexing, slicing, and the len() function on lists.
- A word "token" is a particular appearance of a given word in a text; a word "type" is the unique form of the word as a particular sequence of letters. We count word tokens using len(text) and word types using len(set(text)).
- We obtain the vocabulary of a text t using sorted(set(t)).
- We operate on each item of a text using [f(x) for x in text].
- To derive the vocabulary, collapsing case distinctions and ignoring punctuation, we can write set(w.lower() for w in text if w.isalpha()).
- We process each word in a text using a for statement, such as for w in t: or for word in text:. This must be followed by the colon character and an indented block of code, to be executed each time through the loop.
- We test a condition using an if statement: if len(word) < 5:. This must be followed by the colon character and an indented block of code, to be executed only if the condition is true.
- A frequency distribution is a collection of items along with their frequency counts (e.g., the words of a text and their frequency of appearance).
- A function is a block of code that has been assigned a name and can be reused. Functions are defined using the def keyword, as in def mult(x, y); x and y are parameters of the function, and act as placeholders for actual data values.
- A function is called by specifying its name followed by zero or more arguments inside parentheses, like this: texts(), mult(3, 4), len(text1).”
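To make the points above concrete, here is a minimal sketch that runs without NLTK installed. The sample text is made up for illustration, and collections.Counter from the standard library stands in for NLTK's own frequency distribution class (FreqDist), so treat it as an approximation of what the book does rather than the book's code.

```python
from collections import Counter

# Hypothetical sample text, already split into word tokens (a plain list).
text = ["Monty", "Python", "and", "the", "Holy", "Grail", ",",
        "the", "film", "by", "Monty", "Python", "."]

# Indexing, slicing, and len() all work on lists.
first_word = text[0]
first_two = text[:2]

# Tokens are every appearance of a word; types are the distinct forms.
token_count = len(text)
type_count = len(set(text))

# The vocabulary is the sorted set of types.
vocabulary = sorted(set(text))

# Apply a function to every item with a list comprehension.
lengths = [len(w) for w in text]

# Vocabulary with case collapsed and punctuation ignored.
normalized_vocab = set(w.lower() for w in text if w.isalpha())

# A for loop with an if test: collect tokens shorter than five characters.
short_words = []
for word in text:
    if len(word) < 5:
        short_words.append(word)

# A frequency distribution: each word paired with its count.
# Counter is an assumption here, used in place of nltk.FreqDist.
freq = Counter(w.lower() for w in text if w.isalpha())

# A function defined with def; x and y are its parameters.
def mult(x, y):
    return x * y

print(token_count, type_count)
print(vocabulary)
print(freq.most_common(3))
print(mult(3, 4))
```

With NLTK installed, the same ideas apply to the book's sample texts (text1, text2, and so on), and FreqDist can replace Counter.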
For more on this, read the first chapter of the book: https://www.nltk.org/book/ch01.html
Hope you enjoy it!