The Gardens by the Bay, Singapore. Photo by @touann

Model Size and Efficient Training in AI

Different Approaches to Compute-Efficient Training

Alex Moltzau
3 min read · Mar 22, 2020


On climatechange.ai I was talking with another member after having made a thread about data centre emission numbers. This article does not cover the research paper in question in any depth; it is rather meant to spark your interest in the topic.

In that thread, Alberto, who teaches at Berkeley, said the following:

“For optimally compute-efficient training, most of the increase should go towards increased model size. Spending a relative ton of electricity on a large model (possibly with billions of parameters) and then aggressively pruning and quantizing the model, may be overall more efficient than training a smaller model.”

Further, he referred to a paper on arXiv about model size:

As such, I decided to check out the paper.

Early on I found a statement that may not be applicable in all cases, but which I nevertheless found interesting:

“…the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.”

This figure was presented with the description: “Under the usual presumption that models are trained to convergence, only small models that are fast-to-execute are feasible in resource-constrained settings. Our work shows that the most compute-efficient training scheme is instead to train very large models, stop them well short of convergence, and then heavily compress them to meet test-time constraints.”
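
To make this concrete, here is a minimal sketch of what "stop well short of convergence" can look like in code. This is my own illustration, not code from the paper: the model is a placeholder feed-forward network standing in for a Transformer, the data is random, and the step budget is an arbitrary stand-in for a fixed compute budget.

```python
import torch
import torch.nn as nn

# Placeholder "large" model; the paper's experiments use Transformer models.
model = nn.Sequential(nn.Linear(512, 4096), nn.ReLU(), nn.Linear(4096, 2))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

STEP_BUDGET = 100  # hypothetical budget: stop well short of convergence

for step in range(STEP_BUDGET):
    # Random tensors stand in for a real batch of training data.
    inputs = torch.randn(32, 512)
    targets = torch.randint(0, 2, (32,))

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    optimizer.step()

# Training ends when the compute budget is spent, not when the loss plateaus.
```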

With hardware resources being limited, the objective is to maximise accuracy. This depends on training time, memory constraints, and the cost of inference.

Inference is a conclusion reached on the basis of evidence and reasoning. Within machine learning, in plain English, statistical algorithms learn from existing data (a process called training) in order to make decisions about new data; this second process is called inference. During training, patterns and relationships in the data are identified to build a model.
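
As a toy illustration of that training/inference split, here is a short scikit-learn example of my own (not from the paper); the data is random and purely illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Training: learn patterns and relationships from existing, labelled data.
X_train = np.random.rand(100, 4)              # toy features
y_train = (X_train[:, 0] > 0.5).astype(int)   # toy labels
model = LogisticRegression().fit(X_train, y_train)

# Inference: use the trained model to make decisions about new, unseen data.
X_new = np.random.rand(5, 4)
predictions = model.predict(X_new)
print(predictions)
```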

This paper argues that large models are: “…more robust to compression techniques such as quantization and pruning than small models.”
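
For readers unfamiliar with these compression techniques, here is a rough sketch of what pruning and quantization look like in practice, using PyTorch's built-in utilities. This is my own example, not the paper's pipeline: the model, layer sizes, and pruning amount are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

# A placeholder "large" trained model; in the paper this would be a Transformer.
model = nn.Sequential(nn.Linear(512, 4096), nn.ReLU(), nn.Linear(4096, 2))

# Pruning: zero out the 50% of weights with the smallest magnitude per layer.
for module in model:
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")  # make the pruning permanent

# Quantization: store Linear weights as 8-bit integers instead of 32-bit floats.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)
```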

In the current deep learning paradigm, using more compute (e.g., increasing model size, dataset size, or training steps) typically leads to higher model accuracy. Recent successes in pre-training have, in some cases, allowed training to scale to massive amounts of unlabelled data and very large neural models.

When this is the case, computational resources increasingly become the critical constraint on improving model accuracy.

One goal has therefore been to maximise: “…compute efficiency: how to achieve the highest model accuracy given a fixed amount of hardware and training time.”

The convention of training until convergence has made larger models seem less viable for small budgets. This paper seeks to challenge that assumption.

They show that the fastest way to train Transformer models is to substantially increase model size but stop training very early.

Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., & Gonzalez, J. E. (2020). Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. arXiv preprint arXiv:2002.11794.

This is #500daysofAI and you are reading article 292. I am writing one new article about or related to artificial intelligence every day for 500 days. My current focus for days 200–300 is national and international strategies for artificial intelligence. I have decided to spend the last 25 days of my writing on AI strategy focusing on the climate crisis.


Alex Moltzau

AI Policy, Governance, Ethics and International Partnerships at www.nora.ai. All views are my own. twitter.com/AlexMoltzau