
Building a Text Dataset

Loading text data in TensorFlow — and a note on building datasets with text

Alex Moltzau
3 min read · Aug 20, 2020


Lately I have been trying to build a dataset, so I thought I would write an article about building a text dataset.

In my case it is .pdf files; however, a lot of the libraries around are focused on .txt files.

TensorFlow is one of the projects that publishes guides on building datasets.

“TensorFlow is a free and open-source software library for dataflow and differentiable programming across a range of tasks. It is a symbolic math library, and is also used for machine learning applications such as neural networks.”

They share videos on the topic as well.

TensorFlow has an easy setup.

import tensorflow as tf
import tensorflow_datasets as tfds
import os

You can download a few .txt files to play around with (and learn).

DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# Download the three translations of the Iliad used in the tutorial
for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

# The files land in the Keras cache; keep its directory for later
parent_dir = os.path.dirname(text_dir)

parent_dir

It shows a decent approach: first construct the file names, then bring in the files with a for loop.

def labeler(example, index):
    # Attach the file index to each line as an int64 label
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

# Read each file line by line and label every line with its file's index
for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)
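
The tokenization step further down iterates over all_labeled_data, which has not been defined yet. In the written tutorial, the per-file datasets are concatenated and shuffled roughly like this (BUFFER_SIZE is the tutorial's value):

BUFFER_SIZE = 50000

# Merge the three labeled datasets into one and shuffle the lines
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)

all_labeled_data = all_labeled_data.shuffle(
    BUFFER_SIZE, reshuffle_each_iteration=False)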

Building a vocabulary, tokenising and encoding.

First, build a vocabulary by tokenizing the text into a collection of individual unique words. There are a few ways to do this in both TensorFlow and Python. For this tutorial:

  1. Iterate over each example’s numpy value.
  2. Use tfds.features.text.Tokenizer to split it into tokens.
  3. Collect these tokens into a Python set, to remove duplicates.
  4. Get the size of the vocabulary for later use.

tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)
vocab_size

The tutorial even includes examples for encoding.

Encode examples

Create an encoder by passing the vocabulary_set to tfds.features.text.TokenTextEncoder. The encoder's encode method takes in a string of text and returns a list of integers.

encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)

They show an example of this in the video.
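
Based on the written tutorial, trying the encoder on the first line of the dataset looks roughly like this:

# Take one example line and encode it into a list of integer token ids
example_text = next(iter(all_labeled_data))[0].numpy()
print(example_text)

encoded_example = encoder.encode(example_text)
print(encoded_example)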

Building datasets with text

However, things may be messier.

You may not have easily accessible text data.

There is an interesting article on the topic of how to build a dataset.

What is data?

This may seem an easy question.

“Data are characteristics or information, usually numerical, that are collected through observation.”

That is what you will get on Wikipedia.

However, data is so much more: the human experience, senses, smells, and so on.

Yet, if you were to build a dataset from .pdf files, this may not be an immediate concern. The immediate concern may simply be loading the data into a readable format so that you can use it in an analysis.
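
As a minimal sketch of that loading step, assuming the pypdf library (the article does not name a tool, so this is my assumption), extracting raw text from a folder of PDFs could look like this:

# Extract raw text from a folder of .pdf files so it can feed a text
# pipeline like the one above. pypdf is an assumption, not something
# the article prescribes.
import os
from pypdf import PdfReader

def pdf_to_text(path):
    # Concatenate the extracted text of every page in one PDF
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

pdf_dir = "my_pdfs"  # hypothetical folder of source documents
texts = [pdf_to_text(os.path.join(pdf_dir, name))
         for name in sorted(os.listdir(pdf_dir))
         if name.endswith(".pdf")]

From there, each document's lines could be wrapped in tf.data.Dataset.from_tensor_slices and labeled just like the .txt files above.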

Considering that data in its many shapes and forms is connected to so many professions and fields, this could become a less straightforward journey.

This is #500daysofAI and you are reading article 443. I am writing one new article about or related to artificial intelligence every day for 500 days.


Alex Moltzau

AI Policy, Governance, Ethics and International Partnerships at www.nora.ai. All views are my own. twitter.com/AlexMoltzau