What is torchtext?
Data processing utilities and popular datasets for natural language
When working with text in PyTorch, torchtext may not seem like the obvious choice: NLTK, spaCy and other packages spring to mind first. I am not suggesting that torchtext beats any of these, rather that it is interesting to explore the variety of ways we can work with text. Torchtext is well documented and worth exploring. In this article I summarise part of the documentation, as part of my own exploration prior to using torchtext.
torchtext - torchtext 0.8.0a0+c4a91f2 documentation
If you want to work with NLP, torchtext seems worth checking out.
It consists mainly of:
- torchtext.data: Generic data loaders, abstractions, and iterators for text (including vocabulary and word vectors)
- torchtext.datasets: Pre-built loaders for common NLP datasets
However, the maintainers are adding more to this mix. They are re-designing the torchtext library to make it more compatible with PyTorch (e.g. torch.utils.data). Several datasets have already been written with the new abstractions, in the torchtext.experimental folder. They have also made a tutorial.
If you select the 'Text' section on the PyTorch tutorials page, you can find a number of examples and videos.
Text Classification with TorchText - PyTorch Tutorials 1.6.0 documentation
The data module provides the following:
- Ability to define a preprocessing pipeline
- Batching, padding, and numericalizing (including building a vocabulary object)
- Wrapper for dataset splits (train, validation, test)
- Loader for a custom NLP dataset
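To make these steps concrete, here is a minimal pure-Python sketch of what such a pipeline does (tokenize, build a vocabulary, numericalize, pad a batch). The helper names are my own, not torchtext's API; torchtext wraps the same ideas in its data abstractions:

```python
from collections import Counter

def tokenize(text):
    # Naive whitespace tokenizer (torchtext's default when no tokenizer is named)
    return text.lower().split()

def build_vocab(texts, specials=("<pad>", "<unk>")):
    # Map each token to an integer index, most frequent tokens first
    counter = Counter(tok for text in texts for tok in tokenize(text))
    itos = list(specials) + [tok for tok, _ in counter.most_common()]
    return {tok: idx for idx, tok in enumerate(itos)}

def numericalize(text, vocab):
    # Unknown tokens fall back to the <unk> index
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]

def pad_batch(sequences, pad_idx=0):
    # Right-pad every sequence to the length of the longest one
    max_len = max(len(seq) for seq in sequences)
    return [seq + [pad_idx] * (max_len - len(seq)) for seq in sequences]

texts = ["the cat sat", "the dog sat on the mat"]
vocab = build_vocab(texts)
batch = pad_batch([numericalize(t, vocab) for t in texts])
```

After this, `batch` is a rectangular list of integer lists, ready to be turned into a tensor.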
All I can say is that this module offers many different ways to work with your data. One example is tokenization, with the get_tokenizer function.
Generate a tokenizer function for a string sentence. Parameters:
- tokenizer — the name of the tokenizer function. If None, it returns the split() function, which splits the string on whitespace. If "basic_english", it returns the _basic_english_normalize() function, which normalizes the string first and then splits on whitespace. If a callable, that function is returned as-is. If the name of a tokenizer library (e.g. spacy, moses, toktok, revtok, subword), the corresponding library's tokenizer is returned.
- language — default is en
>>> import torchtext
>>> from torchtext.data import get_tokenizer
>>> tokenizer = get_tokenizer("basic_english")
>>> tokens = tokenizer("You can now install TorchText using pip!")
>>> tokens
['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
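To show roughly what the basic_english option is doing behind the scenes, here is a sketch in plain Python. This is my own approximation, not torchtext's code; the real _basic_english_normalize applies a fixed list of regex rules:

```python
import re

def basic_english_sketch(line):
    # Rough approximation of basic_english normalization:
    # lowercase, split common punctuation off as separate tokens,
    # then split on whitespace.
    line = line.lower()
    line = re.sub(r"([.,!?\"'])", r" \1 ", line)
    return line.split()

basic_english_sketch("You can now install TorchText using pip!")
# → ['you', 'can', 'now', 'install', 'torchtext', 'using', 'pip', '!']
```

For this sentence the sketch matches torchtext's output above: the sentence is lowercased and the trailing "!" becomes its own token.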
If you want to experiment with the package there are plenty of datasets!
This is #500daysofAI and you are reading article 442. I am writing one new article about or related to artificial intelligence every day for 500 days.