Photo by @mana5280

Tiny and Powerful NLP for Text With pQRNN

Natural language processing with projection-based modelling and Quasi-Recurrent Neural Networks (QRNN) from Google AI

The Google AI blog is one to follow. I will do my best to cover an article written on the 21st of September 2020 by Prabhu Kaliamoorthi, Software Engineer at Google Research.

What is a projection-based model architecture? According to a paper from 2018:

“A large class of model reduction methods are projection-based; that is, they derive the low-dimensional approximation by projection of the original model onto a low-dimensional subspace (or, more generally, a low-dimensional manifold).”
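To make the idea of "projection onto a low-dimensional subspace" in the quote concrete, here is a minimal sketch in plain Python. It is only an illustration of the general linear-algebra idea and not code from the paper or from Google. The names `project`, `reconstruct` and `basis` are my own, and I assume the basis vectors are orthonormal.

```python
# Illustrative sketch: approximate a 3-D vector by its projection onto a
# 2-D subspace. `basis` holds orthonormal vectors spanning that subspace
# (an assumption of this sketch).

def project(x, basis):
    """Coordinates of x in the subspace: one dot product per basis vector."""
    return [sum(b_i * x_i for b_i, x_i in zip(b, x)) for b in basis]

def reconstruct(coords, basis):
    """Map low-dimensional coordinates back into the original space."""
    dim = len(basis[0])
    return [sum(c * b[i] for c, b in zip(coords, basis)) for i in range(dim)]

basis = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]          # the x-y plane inside 3-D space
x = [3.0, 4.0, 0.5]

coords = project(x, basis)         # 2 numbers instead of 3
approx = reconstruct(coords, basis)
print(coords, approx)
```

The low-dimensional `coords` keep the part of `x` that lies in the chosen subspace and drop the rest, which is the trade-off every projection-based reduction makes.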

Kaliamoorthi builds a state-of-the-art lightweight text classification model.

One such model, pQRNN, shows that this new architecture can nearly achieve BERT-level performance, despite using 300x fewer parameters and being trained on only the supervised data.

They have open-sourced the PRADO model they built last year, and encourage the community to use it as a jumping-off point for new model architectures.

PRADO is a neural architecture from 2019. With fewer than 200K parameters, it reached ‘state of the art’ performance.

There is a need for NLP models that can be run on-device rather than in data centers.

Their new model, pQRNN, advances NLP performance with a minimal model size.

“The novelty of pQRNN is in how it combines a simple projection operation with a quasi-RNN encoder for fast, parallel processing.”

According to a paper on the topic that the author has linked:

“…quasi-recurrent neural networks (QRNNs), an approach to neural sequence modeling that alternates convolutional layers, which apply in parallel across timesteps, and a minimalist recurrent pooling function that applies in parallel across channels.”
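The two ingredients in that quote can be sketched in a few lines of plain Python. This is a minimal sketch of a single QRNN layer with a filter width of 2 and the simplest pooling variant from the QRNN paper ("f-pooling", h_t = f_t * h_{t-1} + (1 - f_t) * z_t). It is not Kaliamoorthi's implementation. The helper names and the tiny weight matrices are my own assumptions for illustration.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def causal_conv(xs, W_prev, W_curr, activation):
    """Width-2 masked convolution: output t only sees x[t-1] and x[t].
    Each timestep is independent of the others, so this part can run in parallel."""
    zero = [0.0] * len(xs[0])
    out = []
    for t, x in enumerate(xs):
        prev = xs[t - 1] if t > 0 else zero
        pre = [a + b for a, b in zip(matvec(W_prev, prev), matvec(W_curr, x))]
        out.append([activation(p) for p in pre])
    return out

def f_pooling(zs, fs):
    """Recurrent pooling: h_t = f_t * h_{t-1} + (1 - f_t) * z_t.
    Sequential over time, but purely elementwise, so channels are independent."""
    h = [0.0] * len(zs[0])
    hs = []
    for z, f in zip(zs, fs):
        h = [fi * hi + (1.0 - fi) * zi for fi, hi, zi in zip(f, h, z)]
        hs.append(h)
    return hs

def qrnn_layer(xs, Wz, Wf):
    zs = causal_conv(xs, Wz[0], Wz[1], math.tanh)   # candidate values
    fs = causal_conv(xs, Wf[0], Wf[1], sigmoid)     # forget gates
    return f_pooling(zs, fs)

# Hypothetical toy weights: 2 input features, 2 hidden channels.
Wz = ([[0.5, -0.2], [0.1, 0.3]], [[0.4, 0.0], [0.0, 0.4]])
Wf = ([[0.0, 0.0], [0.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]])
states = qrnn_layer([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], Wz, Wf)
print(states)
```

Unlike an LSTM, there are no matrix multiplications inside the time loop. All the heavy work happens in the convolutions, which is why the encoder is fast.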

How does it work in Kaliamoorthi’s model?

According to the author [arrows added]:

  1. “Normally, the text input to NLP models is first processed into a form that is suitable for the neural network,
    → by segmenting text into pieces (tokens) that correspond to values in a predefined universal dictionary (a list of all possible tokens).
  2. The neural network then uniquely identifies each segment using a trainable parameter vector, which comprises the embedding table.
    → However, the way in which text is segmented has a significant impact on the model performance, size, and latency.”
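The two steps in the quote can be sketched as follows. This is a toy illustration and not code from the Google post. The five-word dictionary, the whitespace tokenizer and the embedding size are all made up for the example.

```python
import random

# Hypothetical "universal dictionary": every known token gets an integer id.
vocab = {"<unk>": 0, "the": 1, "movie": 2, "was": 3, "great": 4}

EMBED_DIM = 4
rng = random.Random(0)
# The embedding table: one trainable parameter vector per dictionary entry.
embedding_table = [[rng.uniform(-0.1, 0.1) for _ in range(EMBED_DIM)]
                   for _ in vocab]

def tokenize(text):
    """Step 1 (simplified): segment text into word tokens."""
    return text.lower().split()

def embed(text):
    """Step 2: look up each token's id, then its vector in the table."""
    ids = [vocab.get(tok, vocab["<unk>"]) for tok in tokenize(text)]
    return ids, [embedding_table[i] for i in ids]

def embedding_parameter_count(vocab_size, embed_dim):
    """The table grows linearly with the size of the dictionary."""
    return vocab_size * embed_dim

ids, vectors = embed("The movie was GREAT fun")
print(ids)
print(embedding_parameter_count(len(vocab), EMBED_DIM))
```

The parameter count shows why segmentation matters for model size: a bigger dictionary means a bigger table, and any word outside the dictionary (here "fun") collapses into the unknown token.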

The figure below, made by Kaliamoorthi, shows the spectrum of approaches used by the NLP community, with their pros and cons.

Image from Google AI blog retrieved the 23rd of September 2020

In short, what he says is that not all NLP models need to know everything.

Most tasks can be solved by knowing a small subset of segments.

“Hence, allowing the network to determine the most relevant segments for a given task results in better performance.”

There is a difference in complexity to consider:

Image from Google AI blog retrieved the 23rd of September 2020

Their previous model that this one builds on, PRADO, was designed to: “…learn clusters of text segments from words rather than word pieces or characters, which enabled it to achieve good performance on low-complexity NLP tasks. Since word units are more meaningful, and yet the most relevant words for most tasks are reasonably small, many fewer model parameters are needed to learn such a reduced subset of relevant word clusters.”

The pQRNN model has three building blocks according to Kaliamoorthi:

  1. A projection operator that converts tokens in text to a sequence of ternary vectors.
  2. A dense bottleneck layer.
  3. A stack of Quasi-Recurrent Neural Networks (QRNN) encoders.
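The first two building blocks can be sketched roughly like this. It is my own guess at one plausible way to do it, not the authors' code: I assume the token is hashed with SHA-256, that each pair of hash bits is mapped to -1, 0 or 1, and that the bottleneck is a dense layer with a ReLU. The function names and the bit-pair mapping are hypothetical. The QRNN encoder stack is left out here.

```python
import hashlib

# Assumed mapping from a pair of hash bits to a ternary value.
BIT_PAIR_TO_TERNARY = {0b00: 0, 0b01: 1, 0b10: -1, 0b11: 0}

def ternary_projection(token, num_features=8):
    """Map a token to a vector in {-1, 0, 1}^num_features.
    No dictionary or embedding table is needed: the vector comes from a hash."""
    if not 0 < num_features <= 128:
        raise ValueError("SHA-256 provides 256 bits, enough for 128 features")
    digest = hashlib.sha256(token.encode("utf-8")).digest()
    bits = int.from_bytes(digest, "big")
    return [BIT_PAIR_TO_TERNARY[(bits >> (2 * i)) & 0b11]
            for i in range(num_features)]

def project_text(text, num_features=8):
    """Building block 1: text -> sequence of ternary vectors."""
    return [ternary_projection(tok, num_features)
            for tok in text.lower().split()]

def dense_bottleneck(vec, weights, bias):
    """Building block 2 (assumed form): a small dense layer with ReLU that
    turns the fixed ternary vector into a learned representation."""
    return [max(0.0, sum(w * v for w, v in zip(row, vec)) + b)
            for row, b in zip(weights, bias)]

sequence = project_text("the movie was great")
print(sequence)
```

The point of the design is that the projection itself has no trainable parameters, so the model only pays for the bottleneck and the encoders, and any word, including one never seen in training, still gets a vector.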

Kaliamoorthi's post illustrates the architecture with a figure.

This is #500daysofAI and you are reading article 477. I am writing one new article about or related to artificial intelligence every day for 500 days.

AI Policy and Ethics at www.nora.ai. Student at University of Copenhagen MSc in Social Data Science. All views are my own. twitter.com/AlexMoltzau