
What is torchtext?

Data processing utilities and popular datasets for natural language

Alex Moltzau
3 min read · Aug 19, 2020


When working with text in PyTorch, torchtext may not always seem like the obvious choice. NLTK, spaCy and other packages spring to mind first. I am not suggesting that torchtext beats any of these, rather that it is interesting to explore the variety of ways we can work with text. Torchtext is really well documented and worth exploring. I will try to summarise part of the documentation in this article, as part of my own exploration prior to using torchtext.

torchtext

If you want to work with NLP, it seems torchtext could be one to check out.

It consists mainly of:

  1. torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
  2. torchtext.datasets: Pre-built loaders for common NLP datasets

However, they are adding more to this mix.

They are redesigning the torchtext library to make it more compatible with core PyTorch (e.g. torch.utils.data).

Several datasets have been written with the new abstractions in the torchtext.experimental folder.
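Since the redesigned datasets are meant to behave like ordinary torch.utils.data datasets, here is a minimal, hedged illustration of what that compatibility buys you: any map-style dataset of (label, text) pairs can be batched with a plain DataLoader and a custom collate function. The sample data below is a stand-in for illustration, not a torchtext API.

import torch
from torch.utils.data import DataLoader

# Stand-in (label, text) pairs; in practice these would come from a torchtext dataset
samples = [(0, "a great film"), (1, "a terrible plot"), (0, "loved every minute")]

def collate_batch(batch):
    # Batching happens here; tokenization and numericalization would normally go here too
    labels = torch.tensor([label for label, _ in batch])
    texts = [text for _, text in batch]
    return labels, texts

loader = DataLoader(samples, batch_size=2, shuffle=True, collate_fn=collate_batch)
for labels, texts in loader:
    print(labels, texts)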

They also made a tutorial.

If you press the ‘text’ symbol on their page, you can find a lot of examples and videos.

TORCHTEXT.DATA

The data module provides the following:

  • Ability to define a preprocessing pipeline
  • Batching, padding, and numericalizing (including building a vocabulary object)
  • Wrapper for dataset splits (train, validation, test)
  • Loader for a custom NLP dataset

All I can say is that it offers a rich set of ways to work with your data; a sketch of how the pieces fit together follows below. One example that stands out to me is tokenization.
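As a rough sketch of how those four bullets fit together, here is what a full pipeline can look like with the classic torchtext.data API (the one current when this article was written). The CSV file names and column layout are assumptions for illustration only.

from torchtext.data import Field, LabelField, TabularDataset, BucketIterator

# Preprocessing pipeline: how raw strings are tokenized and lower-cased
TEXT = Field(tokenize="basic_english", lower=True)
LABEL = LabelField()

# Wrapper for dataset splits: assumed CSV files with "text,label" columns and no header
train_data, valid_data = TabularDataset.splits(
    path="data", train="train.csv", validation="valid.csv",
    format="csv", fields=[("text", TEXT), ("label", LABEL)],
)

# Numericalizing: build vocabulary objects from the training split
TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)

# Batching and padding: iterators that group examples of similar length
train_iter, valid_iter = BucketIterator.splits(
    (train_data, valid_data), batch_size=32,
    sort_key=lambda ex: len(ex.text), sort_within_batch=True,
)

for batch in train_iter:
    text, labels = batch.text, batch.label  # text is a padded LongTensor of token ids
    break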

One such utility is get_tokenizer.

get_tokenizer

torchtext.data.get_tokenizer(tokenizer, language='en')

Generates a tokenizer function for a string sentence. Parameters:

  • tokenizer — the name of the tokenizer function. If None, it returns the split() function, which splits the string sentence by spaces. If basic_english, it returns the _basic_english_normalize() function, which normalizes the string first and then splits by spaces. If a callable function, it will return the function. If a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), it returns the corresponding library.
  • language — Default: en

Examples:

>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
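To tie this back to the “numericalizing” bullet above, here is a hedged follow-up that turns those tokens into integer ids with the classic Vocab class (the vocabulary API has since changed in newer torchtext releases):

>>> from collections import Counter
>>> from torchtext.vocab import Vocab       # classic Vocab class (pre-0.10 API)
>>> vocab = Vocab(Counter(tokens))          # adds <unk>/<pad> specials by default
>>> ids = [vocab.stoi[t] for t in tokens]   # the sentence as integer ids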

Datasets

If you want to experiment with the package, there are plenty of datasets to choose from! A hedged example of loading one follows below.
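With the classic API, one of the bundled loaders can download and split a dataset in a couple of lines (IMDB here; the Field objects mirror the pipeline sketch above, and the data is downloaded on first use):

from torchtext import data, datasets

TEXT = data.Field(tokenize="basic_english", lower=True)
LABEL = data.LabelField()

# Pre-built loader: downloads IMDB on first use and returns train/test splits
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)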

This is #500daysofAI and you are reading article 442. I am writing one new article about or related to artificial intelligence every day for 500 days.
