
What is torchtext?

Data processing utilities and popular datasets for natural language

Alex Moltzau
3 min read · Aug 19, 2020


When working with text in PyTorch, torchtext may not always seem like the obvious choice. NLTK, spaCy and other packages spring to mind first. I am not suggesting that torchtext beats any of these, rather that it is interesting to explore the variety of ways we can work with text. Torchtext is really well documented and worth exploring. I will try to summarise part of the documentation in this article, as part of my own exploration prior to using torchtext.

torchtext

If you want to work with NLP, it seems torchtext could be one to check out.

It consists mainly of:

  1. torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
  2. torchtext.datasets: Pre-built loaders for common NLP datasets

However, they are adding more to this mix.

They are redesigning the torchtext library to make it more compatible with core PyTorch (e.g. torch.utils.data).

Several datasets have been written with the new abstractions in the torchtext.experimental folder.
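Since the redesigned datasets are meant to behave like ordinary torch.utils.data datasets, here is a minimal, hedged illustration of what that compatibility buys you: any map-style dataset of (label, text) pairs can be batched with a plain DataLoader and a custom collate function. The sample data below is a stand-in for illustration, not a torchtext API.

import torch
from torch.utils.data import DataLoader

# Stand-in (label, text) pairs; in practice these would come from a torchtext dataset
samples = [(0, "a great film"), (1, "a terrible plot"), (0, "loved every minute")]

def collate_batch(batch):
    # Batching happens here; tokenization and numericalization would normally go here too
    labels = torch.tensor([label for label, _ in batch])
    texts = [text for _, text in batch]
    return labels, texts

loader = DataLoader(samples, batch_size=2, shuffle=True, collate_fn=collate_batch)
for labels, texts in loader:
    print(labels, texts)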

They also made a tutorial.

If you press the ‘text’ symbol on their page, you can find a lot of examples and videos.

TORCHTEXT.DATA

The data module provides the following:

  • Ability to define a preprocessing pipeline
  • Batching, padding, and numericalizing (including building a vocabulary object)
  • Wrapper for dataset splits (train, validation, test)
  • Loader for a custom NLP dataset

All I can say is that it offers a rich set of ways to work with your data; a sketch of how the pieces fit together follows below. One example that stands out to me is tokenization.
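As a rough sketch of how those four bullets fit together, here is what a full pipeline can look like with the classic torchtext.data API (the one current when this article was written). The CSV file names and column layout are assumptions for illustration only.

from torchtext.data import Field, LabelField, TabularDataset, BucketIterator

# Preprocessing pipeline: how raw strings are tokenized and lower-cased
TEXT = Field(tokenize="basic_english", lower=True)
LABEL = LabelField()

# Wrapper for dataset splits: assumed CSV files with "text,label" columns and no header
train_data, valid_data = TabularDataset.splits(
    path="data", train="train.csv", validation="valid.csv",
    format="csv", fields=[("text", TEXT), ("label", LABEL)],
)

# Numericalizing: build vocabulary objects from the training split
TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)

# Batching and padding: iterators that group examples of similar length
train_iter, valid_iter = BucketIterator.splits(
    (train_data, valid_data), batch_size=32,
    sort_key=lambda ex: len(ex.text), sort_within_batch=True,
)

for batch in train_iter:
    text, labels = batch.text, batch.label  # text is a padded LongTensor of token ids
    break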

One such utility is get_tokenizer.

get_tokenizer

torchtext.data.get_tokenizer(tokenizer, language='en')

Generates a tokenizer function for a string sentence. Parameters:

  • tokenizer — the name of the tokenizer function. If None, it returns the split() function, which splits the string sentence by spaces. If basic_english, it returns the _basic_english_normalize() function, which normalizes the string first and then splits by spaces. If a callable function, it will return the function. If a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), it returns the corresponding library.
  • language — Default: en

Examples:

>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
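To tie this back to the “numericalizing” bullet above, here is a hedged follow-up that turns those tokens into integer ids with the classic Vocab class (the vocabulary API has since changed in newer torchtext releases):

>>> from collections import Counter
>>> from torchtext.vocab import Vocab       # classic Vocab class (pre-0.10 API)
>>> vocab = Vocab(Counter(tokens))          # adds <unk>/<pad> specials by default
>>> ids = [vocab.stoi[t] for t in tokens]   # the sentence as integer ids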

Datasets

If you want to experiment with the package, there are plenty of datasets to choose from! A hedged example of loading one follows below.
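With the classic API, one of the bundled loaders can download and split a dataset in a couple of lines (IMDB here; the Field objects mirror the pipeline sketch above, and the data is downloaded on first use):

from torchtext import data, datasets

TEXT = data.Field(tokenize="basic_english", lower=True)
LABEL = data.LabelField()

# Pre-built loader: downloads IMDB on first use and returns train/test splits
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

TEXT.build_vocab(train_data, max_size=25000)
LABEL.build_vocab(train_data)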

This is #500daysofAI and you are reading article 442. I am writing one new article about or related to artificial intelligence every day for 500 days.
