The Gardens by the Bay, Singapore. Photo by @touann

Model Size and Efficient Training in AI

Different Approaches to Compute-Efficient Training

After I started a thread about data centre emission numbers, I ended up discussing the topic with another member. This article does not cover the research paper in question in depth; it is rather meant to spark your interest in the topic.

In that thread, Alberto, who teaches at Berkeley, said the following:

“For optimally compute-efficient training, most of the increase should go towards increased model size. Spending a relative ton of electricity on a large model (possibly with billions of parameters) and then aggressively pruning and quantizing the model, may be overall more efficient than training a smaller model.”

He further referred to a paper on arXiv about model sizes (Li et al., 2020, cited at the end of this article).

So I decided to check out the paper.

Early on I found a statement that may not apply in all cases, but that I found interesting:

“…the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.”

The accompanying figure from the paper was presented with the description: “Under the usual presumption that models are trained to convergence, only small models that are fast-to-execute are feasible in resource-constrained settings. Our work shows that the most compute-efficient training scheme is instead to train very large models, stop them well short of convergence, and then heavily compress them to meet test-time constraints.”

Because hardware resources are limited, the objective is to maximise accuracy within the time and memory constraints of both training and inference.

In everyday usage, an inference is a conclusion reached on the basis of evidence and reasoning. In machine learning, in plain English, statistical algorithms learn from existing data, a process called training, in order to make decisions about new data, a process called inference. During training, patterns and relationships in the data are identified to build a model.
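
To make the distinction concrete, here is a minimal sketch in Python. It is not from the paper: the function names `train` and `infer` and the data are made up for illustration, and the "model" is just a straight line fitted by least squares.

```python
def train(xs, ys):
    """Training: fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def infer(model, x_new):
    """Inference: apply the already-trained model to an unseen input."""
    slope, intercept = model
    return slope * x_new + intercept


# Existing (made-up) data is used once, for training.
model = train([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])

# New data only needs the trained model, not the training data.
prediction = infer(model, 5.0)
print(prediction)
```

Training happens once and is where most of the compute goes; inference is repeated every time the model is used, which is why its time and memory cost matter as well.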

This paper argues that large models are: “…more robust to compression techniques such as quantization and pruning than small models.”
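
For readers unfamiliar with the two compression techniques, the following is a rough sketch of what they do to a list of weights. It is an illustration under my own assumptions, not the paper's implementation: the helper names `magnitude_prune` and `quantize`, the 50% sparsity, the 8-bit setting and the weight values are all hypothetical.

```python
def magnitude_prune(weights, sparsity):
    """Zero out the given fraction of weights with the smallest absolute value."""
    n_prune = int(len(weights) * sparsity)
    # Indices sorted from smallest to largest magnitude.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    to_zero = set(order[:n_prune])
    return [0.0 if i in to_zero else w for i, w in enumerate(weights)]


def quantize(weights, bits=8):
    """Round weights onto a uniform grid of signed integer levels, then map back to floats."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    levels = 2 ** (bits - 1) - 1  # 127 levels on each side of zero for 8 bits
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


weights = [0.8, -0.05, 0.3, 0.01, -0.6, 0.02]  # made-up values
pruned = magnitude_prune(weights, sparsity=0.5)
compressed = quantize(pruned, bits=8)
print(pruned)
print(compressed)
```

Pruning removes the weights that contribute least, and quantization stores the remaining ones with fewer bits. Both shrink the model for inference at the cost of some precision, and the paper's argument is that large models tolerate this loss better than small ones.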

In the current deep learning paradigm, using more compute (e.g., increasing model size, dataset size, or training steps) typically leads to higher model accuracy. Recent successes in pre-training have allowed training to scale to massive amounts of unlabelled data and, in some cases, very large neural models.

When this is the case, computational resources increasingly become the critical constraint on improving model accuracy.

One goal has therefore been to maximise “…compute efficiency: how to achieve the highest model accuracy given a fixed amount of hardware and training time.”
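
A fixed budget forces a trade-off between model size and the number of training steps. The sketch below only does the budget accounting, under a simplifying assumption of my own (not a claim from the paper) that the cost of one training step grows linearly with the number of parameters. The function name `affordable_steps` and all numbers are made up.

```python
def affordable_steps(budget, n_params, cost_per_param_step=1.0):
    """Whole training steps a budget covers, assuming the cost of one step
    grows linearly with parameter count (a simplifying assumption)."""
    return int(budget // (n_params * cost_per_param_step))


budget = 1_000_000_000  # arbitrary compute units, made up

for name, n_params in [("small", 10_000), ("large", 100_000)]:
    print(name, affordable_steps(budget, n_params))
```

Under this assumption, a model ten times larger gets ten times fewer steps from the same budget. The sketch says nothing about accuracy; the paper's claim is that the larger model can still come out ahead despite taking fewer steps.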

The common practice of training until convergence has made larger models seem less viable for small budgets. This paper seeks to challenge this assumption.

They show that the fastest way to train Transformer models is to substantially increase model size but stop training very early.

Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., & Gonzalez, J. E. (2020). Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. arXiv preprint arXiv:2002.11794.

This is #500daysofAI and you are reading article 292. I am writing one new article about or related to artificial intelligence every day for 500 days. My current focus for days 200–300 is national and international strategies for artificial intelligence. I have decided to spend the last 25 days of my AI strategy writing focusing on the climate crisis.

Written by

AI Policy and Ethics at Student at University of Copenhagen MSc in Social Data Science. All views are my own.
