Photo by @weirick

Building a Text Dataset

Loading text data in TensorFlow — and a note on building datasets with text

Lately I have been trying to build a dataset, so I thought I would write an article about building a text dataset.

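import os

import tensorflow as tf
import tensorflow_datasets as tfds

# Not defined in the original post; assumed here to be the base URL
# from the TensorFlow "Load text" tutorial.
DIRECTORY_URL = 'https://storage.googleapis.com/download.tensorflow.org/data/illiad/'
FILE_NAMES = ['cowper.txt', 'derby.txt', 'butler.txt']

# Download each file into the local Keras cache.
for name in FILE_NAMES:
    text_dir = tf.keras.utils.get_file(name, origin=DIRECTORY_URL + name)

parent_dir = os.path.dirname(text_dir)

def labeler(example, index):
    # Pair each line of text with the integer label of its source file.
    return example, tf.cast(index, tf.int64)

labeled_data_sets = []

for i, file_name in enumerate(FILE_NAMES):
    lines_dataset = tf.data.TextLineDataset(os.path.join(parent_dir, file_name))
    labeled_dataset = lines_dataset.map(lambda ex: labeler(ex, i))
    labeled_data_sets.append(labeled_dataset)

# Combine the per-file datasets into the single dataset used for the vocabulary.
all_labeled_data = labeled_data_sets[0]
for labeled_dataset in labeled_data_sets[1:]:
    all_labeled_data = all_labeled_data.concatenate(labeled_dataset)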

Next, build a vocabulary by iterating over the text of each example:

  1. Use tfds.features.text.Tokenizer to split it into tokens.
  2. Collect these tokens into a Python set, to remove duplicates.
  3. Get the size of the vocabulary for later use.

tokenizer = tfds.features.text.Tokenizer()

vocabulary_set = set()
for text_tensor, _ in all_labeled_data:
    # Tokenize each line and add its tokens to the vocabulary.
    some_tokens = tokenizer.tokenize(text_tensor.numpy())
    vocabulary_set.update(some_tokens)

vocab_size = len(vocabulary_set)

Encode examples

Create an encoder by passing the vocabulary_set to tfds.features.text.TokenTextEncoder. The encoder's encode method takes in a string of text and returns a list of integers.

encoder = tfds.features.text.TokenTextEncoder(vocabulary_set)
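
To make the idea concrete without depending on tensorflow_datasets (where, as far as I know, newer releases have moved these text helpers under tfds.deprecated.text), here is a minimal pure-Python sketch of the same pattern: tokenize, collect a vocabulary, then map text to integers. The names tokenize, build_vocabulary and SimpleTokenEncoder are hypothetical and made up for illustration. The ID scheme (IDs from 1, one shared ID for unknown tokens) is an assumption and not necessarily what TokenTextEncoder does internally.

```python
import re


def tokenize(text):
    """Split text into alphanumeric tokens (a simple stand-in tokenizer)."""
    return re.findall(r"\w+", text)


def build_vocabulary(lines):
    """Collect the unique tokens found across all lines."""
    vocabulary = set()
    for line in lines:
        vocabulary.update(tokenize(line))
    return vocabulary


class SimpleTokenEncoder:
    """Hypothetical stand-in encoder: maps each known token to a positive integer ID."""

    def __init__(self, vocabulary):
        # Sort the tokens so the mapping is deterministic; ID 0 is left free for padding.
        self.token_to_id = {
            token: i for i, token in enumerate(sorted(vocabulary), start=1)
        }
        # Assumption: every out-of-vocabulary token shares one extra ID.
        self.unknown_id = len(self.token_to_id) + 1

    def encode(self, text):
        """Return the list of integer IDs for the tokens in text."""
        return [self.token_to_id.get(token, self.unknown_id) for token in tokenize(text)]


lines = [
    "Sing, O goddess, the anger of Achilles",
    "the anger of Achilles son of Peleus",
]
vocabulary = build_vocabulary(lines)
encoder = SimpleTokenEncoder(vocabulary)
encoded = encoder.encode("the anger of Zeus")
print(len(vocabulary), encoded)
```

With the real encoder the call has the same shape: encoder.encode(some_text) returns a list of integers, one per token.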

Building datasets with text

However, text gathered in practice is often messier than the clean, one-example-per-line files used above.

Written by

AI Policy and Ethics at Student at University of Copenhagen MSc in Social Data Science. All views are my own.
