Building a Text Dataset
Loading text data in TensorFlow — and a note on building datasets with text
Lately I have been trying to build a dataset, so I thought I would write an article about building a text dataset.
In my case the source material is .pdf files, but most of the libraries out there focus on .txt files.
The TensorFlow documentation is one place that covers building text datasets.
“TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.”
Load text | TensorFlow Core
This tutorial provides an example of how to use tf.data.TextLineDataset to load examples from text files…
They share videos as well on the topic:
TensorFlow has an easy setup.
import os

import tensorflow as tf
import tensorflow_datasets as tfds
You can download a few .txt files to play around with (and learn).
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = os.path.dirname(text_dir)
It shows a decent approach: first construct the file names, then download each file inside a for loop.
def labeler(example, index):
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []
for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)
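Stripped of the TensorFlow types, the idea is simply to read every line of each file and attach the file's index as a label. A minimal plain-Python sketch of the same pattern (the file names and contents here are made up for illustration):

```python
import os
import tempfile

def labeled_lines(file_paths):
    """Yield (line, label) pairs, labeling each line with its file's index."""
    for label, path in enumerate(file_paths):
        with open(path, encoding="utf-8") as f:
            for line in f:
                yield line.rstrip("\n"), label

# Tiny demonstration with two throwaway files.
tmp = tempfile.mkdtemp()
paths = []
for name, text in [("a.txt", "first line\nsecond line\n"), ("b.txt", "third line\n")]:
    path = os.path.join(tmp, name)
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
    paths.append(path)

pairs = list(labeled_lines(paths))
print(pairs)  # [('first line', 0), ('second line', 0), ('third line', 1)]
```

The label is just the position of the file in the list, which is exactly what the enumerate call in the TensorFlow snippet above provides.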
Building a vocabulary, tokenising and encoding.
First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are a few ways to do this in both TensorFlow and Python. For this tutorial:
- Iterate over each example’s numpy value.
- Use tfds.features.text.Tokenizer to split it into tokens.
- Collect these tokens into a Python set, to remove duplicates.
- Get the size of the vocabulary for later use.
tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
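Underneath the TensorFlow types, vocabulary building is just tokenizing and collecting the tokens into a set. A plain-Python version of the same idea (the regex tokenizer here is a rough stand-in for tfds.features.text.Tokenizer, and the corpus is invented):

```python
import re

def tokenize(text):
    """Split on runs of non-alphanumeric characters, dropping empty strings."""
    return [t for t in re.split(r"\W+", text) if t]

corpus = ["Sing, O goddess", "the anger of Achilles"]

vocabulary_set = set()
for line in corpus:
    vocabulary_set.update(tokenize(line))

vocab_size = len(vocabulary_set)
print(vocab_size)  # 7 unique tokens
```

The set takes care of duplicates, so vocab_size is the number of unique words, which is what the encoder needs next.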
There are even examples for encoding.
Create an encoder by passing the vocabulary_set to tfds.features.text.TokenTextEncoder. The encoder’s encode method takes in a string of text and returns a list of integers.
encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)
They show an example of this in the video.
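To make the idea concrete, here is a plain-Python stand-in for such an encoder. Note that the sorted ordering and the ids starting at 1 are my own choices for the sketch, not necessarily how TokenTextEncoder assigns ids internally:

```python
import re

def tokenize(text):
    """Split on runs of non-alphanumeric characters, dropping empty strings."""
    return [t for t in re.split(r"\W+", text) if t]

class SimpleTokenEncoder:
    """Maps each vocabulary token to an integer id; unknown tokens are skipped."""
    def __init__(self, vocabulary):
        # Sort for a deterministic mapping; reserve 0 (e.g. for padding).
        self.ids = {token: i for i, token in enumerate(sorted(vocabulary), start=1)}

    def encode(self, text):
        return [self.ids[token] for token in tokenize(text) if token in self.ids]

encoder = SimpleTokenEncoder({"the", "anger", "of", "Achilles"})
print(encoder.encode("the anger of Achilles"))  # [4, 2, 3, 1]
```

The point is the interface: a string goes in, a list of integers comes out, and those integers are what the model actually trains on.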
Building datasets with text
However, things may be messier in practice.
You may not have easily accessible text data.
Alexandre Gonfalonieri wrote an interesting article on the topic of how to build a dataset:
How to Build A Data Set For Your Machine Learning Project
Are you thinking about AI for your organization? Have you identified a use case with a proven ROI? Perfect! But not so…
What is data?
This may seem an easy question.
“Data are characteristics or information, usually numerical, that are collected through observation.”
That is what you will get on Wikipedia.
However, data is so much more: the human experience, senses, smell and so on.
Yet, if you were to build a dataset from .pdf files, this may not be an immediate concern. The immediate concern may simply be loading the data into a readable format so that you can use it in an analysis.
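For .pdf files specifically, one option is to extract the text before feeding it into any of the pipelines above. A minimal sketch: the .pdf branch assumes the third-party pypdf package (its PdfReader and extract_text API), while .txt files are read directly:

```python
import os

def load_text(path):
    """Return the raw text of a file, dispatching on its extension."""
    ext = os.path.splitext(path)[1].lower()
    if ext == ".txt":
        with open(path, encoding="utf-8") as f:
            return f.read()
    if ext == ".pdf":
        # Assumes the third-party pypdf package: pip install pypdf
        from pypdf import PdfReader
        reader = PdfReader(path)
        # extract_text() can return None for pages without a text layer.
        return "\n".join(page.extract_text() or "" for page in reader.pages)
    raise ValueError("Unsupported file type: " + ext)

# Demonstration with a throwaway .txt file.
import tempfile
tmp = os.path.join(tempfile.mkdtemp(), "note.txt")
with open(tmp, "w", encoding="utf-8") as f:
    f.write("hello dataset")
print(load_text(tmp))  # hello dataset
```

Once everything is plain text, the .pdf files can go through the same line-labeling, tokenizing and encoding steps as the .txt examples earlier.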
Considering that data, in its many shapes and forms, is connected to so many professions and fields, this can become a less straightforward journey.
This is #500daysofAI and you are reading article 443. I am writing one new article about or related to artificial intelligence every day for 500 days.