
Transfer Learning in NLP

Text-to-Text Transfer Transformer

Alex Moltzau
4 min read · Jul 19, 2020


On the 24th of October 2019, a paper named “Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer” was published on arXiv.

A diagram of their text-to-text framework. Every task they consider — including translation, question answering, and classification — is cast as feeding their model text as input and training it to generate some target text. This allows them to use the same model, loss function, hyperparameters, etc. across their diverse set of tasks. It also provides a standard testbed for the methods included in their empirical survey. “T5” refers to their model, which they dub the “Text-to-Text Transfer Transformer”.

Their goal, however, is to provide a comprehensive perspective on where the field stands, rather than to propose new methods.

They discuss their base model and its implementation, their procedure for formulating every problem as a text-to-text task, and the suite of tasks they consider.
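
To make this more concrete, here is a minimal Python sketch (my own illustration, not code from the paper) of how diverse tasks can all be framed as text in, text out. The task prefixes and the example pairs follow the figure in the paper; the helper function itself is purely illustrative.

```python
# Illustrative sketch: every task becomes an (input text, target text) pair
# by prepending a task-specific prefix, so a single model, loss function and
# decoding procedure can handle translation, classification and summarization.

def to_text_to_text(prefix: str, text: str, target: str) -> tuple:
    """Cast a task example into the text-to-text format."""
    return prefix + text, target

examples = [
    # Translation (WMT English-German)
    to_text_to_text("translate English to German: ", "That is good.", "Das ist gut."),
    # Classification (CoLA grammatical acceptability); the label is emitted as text
    to_text_to_text("cola sentence: ", "The course is jumping well.", "not acceptable"),
    # Summarization (CNN/Daily Mail), abbreviated here with "..."
    to_text_to_text("summarize: ", "state authorities dispatched emergency crews tuesday to survey the damage ...",
                    "six people hospitalized after a storm in attala county ..."),
]

for source, target in examples:
    print(f"input:  {source}")
    print(f"target: {target}")
    print()
```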

Schematic of the objective they use in their baseline model. In this example, they process the sentence “Thank you for inviting me to your party last week.” The words “for”, “inviting” and “last” (marked with an ×) are randomly chosen for corruption. Each consecutive span of corrupted tokens is replaced by a sentinel token (shown as <X> and <Y>) that is unique over the example. Since “for” and “inviting” occur consecutively, they are replaced by a single sentinel <X>. The output sequence then consists of the dropped-out spans, delimited by the sentinel tokens used to replace them in the input, plus a final sentinel token <Z>.
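
To show how that works in practice, here is a simplified word-level Python sketch of the span-corruption objective (again my own illustration; the actual implementation operates on subword tokens rather than whole words), reproducing the example from the caption above.

```python
def span_corrupt(words, corrupt_indices):
    """Word-level sketch of the span-corruption objective described above."""
    sentinels = iter(["<X>", "<Y>", "<Z>", "<W>"])  # unique within the example, as in the figure
    corrupted = [i in corrupt_indices for i in range(len(words))]
    inputs, target = [], []
    i = 0
    while i < len(words):
        if corrupted[i]:
            tok = next(sentinels)
            inputs.append(tok)      # the corrupted span is replaced by one sentinel in the input
            target.append(tok)
            while i < len(words) and corrupted[i]:
                target.append(words[i])  # the dropped-out words are collected in the target
                i += 1
        else:
            inputs.append(words[i])
            i += 1
    target.append(next(sentinels))  # a final sentinel closes the target sequence
    return " ".join(inputs), " ".join(target)

words = "Thank you for inviting me to your party last week .".split()
inp, tgt = span_corrupt(words, corrupt_indices={2, 3, 8})  # "for", "inviting", "last"
print(inp)  # Thank you <X> me to your party <Y> week .
print(tgt)  # <X> for inviting <Y> last <Z>
```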

Summed up, their outlook was:

  • “The inconvenience of large models. An unsurprising but important result from our study is that larger models tend to perform better. The fact that the hardware used for running these models is continually getting cheaper and more powerful suggests that scaling up may continue to be a promising way to achieve better performance [Sutton, 2019]. However, it will always be the case that there are applications and scenarios where using a smaller or less expensive model is helpful, for example when performing client-side inference or federated learning [Konečný et al., 2015, 2016]. Relatedly, one beneficial use of transfer learning is the possibility of attaining good performance on low-resource tasks. Low-resource tasks often occur (by definition) in settings where one lacks the assets to label more data. It follows that low-resource applications often also have limited access to computational resources which can incur additional costs. As a result, we advocate for research on methods that achieve stronger performance with cheaper models so that transfer learning can be applied where it will have the most impact. Some current work along these lines include distillation [Hinton et al., 2015; Sanh et al., 2019; Jiao et al., 2019], parameter sharing [Lan et al., 2019], and conditional computation [Shazeer et al., 2017].
  • More efficient knowledge extraction. Recall that one of the goals of pre-training is (loosely speaking) to provide the model with general-purpose “knowledge” that improves its performance on downstream tasks. The method we use in this work, which is currently common practice, is to train the model to denoise corrupted spans of text. We suspect that this simplistic technique may not be a very efficient way to teach the model general-purpose knowledge. More concretely, it would be useful to be able to attain good fine-tuning performance without needing to train our models on 1 trillion tokens of text first. Some concurrent work along these lines improves efficiency by pre-training a model to distinguish between real and machine-generated text [Anonymous, 2019].
  • Formalizing the similarity between tasks. We observed that pre-training on unlabeled in-domain data can improve performance on downstream tasks (Section 3.4). This finding mostly relies on basic observations like the fact that SQuAD was created using data from Wikipedia. It would be useful to formulate a more rigorous notion of the “similarity” between the pre-training and downstream tasks, so that we could make more principled choices about what source of unlabeled data to use. There is some early empirical work along these lines in the field of computer vision [Huh et al., 2016; Kornblith et al., 2018; He et al., 2018]. A better notion of the relatedness of tasks could also help choose supervised pre-training tasks, which has been shown to be helpful for the GLUE benchmark [Phang et al., 2018].
  • Language-agnostic models. We were disappointed to find that English-only pre-training did not achieve state-of-the-art results on the translation tasks we studied. We also are interested in avoiding the logistical difficulty of needing to specify which languages a vocabulary can encode ahead of time. To address these issues, we are interested in further investigating language-agnostic models, i.e. models that can perform a given NLP task with good performance regardless of the text’s language. This is an especially pertinent issue given that English is not the native language for the majority of the world’s population.”

If you are interested, the paper provides an empirical overview of the field and a perspective on where it stands.

This is #500daysofAI, and you are reading article 411. I am writing one new article about or related to artificial intelligence every day for 500 days.


Alex Moltzau

AI Policy, Governance, Ethics and International Partnerships at www.nora.ai. All views are my own. twitter.com/AlexMoltzau