Model Size and Efficient Training in AI
On the climatechange.ai forum I was discussing data centre emission numbers with a member, after having started a thread on the topic. This article is not a full review of the research paper in question; it is rather meant to spark your interest in the topic.
Data Centres Emission Numbers of the Largest Tech Companies
As you might know, cooling data centres, as well as their material cost, has a major impact on the environment. I…
In that thread Alberto, who teaches at Berkeley, said the following:
“For optimally compute-efficient training, most of the increase should go towards increased model size. Spending a relative ton of electricity on a large model (possibly with billions of parameters) and then aggressively pruning and quantizing the model, may be overall more efficient than training a smaller model.”
Further, he referred to a paper on arXiv on model sizes:
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of…
Since hardware resources are limited, the objective of training deep learning models is typically to maximize accuracy…
As such, I decided to check out the paper. Early on I found a statement that may not be applicable in all cases, but which I found interesting:
“…the most compute-efficient training strategy is to counterintuitively train extremely large models but stop after a small number of iterations.”
With hardware resources being limited, the objective is typically to maximise accuracy subject to constraints on training time, memory, and inference cost.
Inference is a conclusion reached on the basis of evidence and reasoning. Within machine learning, in plain English, statistical algorithms learn from existing data, a process called training, in order to make decisions about new data, a process called inference. During training, patterns and relationships in the data are identified to build a model.
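To make the training/inference distinction concrete, here is a minimal toy example using a linear model fitted with numpy. The task and all names here are illustrative only, not drawn from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Training: learn patterns (here, a slope and an intercept) from existing data.
X = rng.uniform(-1, 1, size=(100, 1))
y = 3.0 * X[:, 0] + 0.5                        # the underlying relationship
X_design = np.hstack([X, np.ones((100, 1))])   # add a bias column
weights, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# --- Inference: apply the learned model to new, unseen data.
x_new = np.array([[2.0, 1.0]])                 # new input, with bias term appended
prediction = x_new @ weights                   # should recover 3.0 * 2.0 + 0.5
```

Training is the expensive step (fitting `weights`); inference is the cheap, repeated step of applying them, which is why the paper weighs the two costs separately.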
This paper argues that large models are: “…more robust to compression techniques such as quantization and pruning than small models.”
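For readers unfamiliar with the two compression techniques the paper names, here is a hedged sketch in numpy: magnitude pruning zeroes out the smallest weights, and quantization stores weights at lower precision (simulated here as an int8 round trip). The function names and values are my own illustration, not code from the paper:

```python
import numpy as np

def magnitude_prune(weights, sparsity):
    """Zero out the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

def quantize_int8(weights):
    """Map float weights to int8 and back (simulated 8-bit quantization)."""
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q.astype(np.float32) * scale

w = np.array([0.02, -0.9, 0.5, -0.01, 1.2], dtype=np.float32)
pruned = magnitude_prune(w, sparsity=0.4)   # the two smallest weights become zero
quantized = quantize_int8(w)                # coarse 8-bit approximation of w
```

The paper's claim is that a large trained model tolerates this kind of lossy surgery with less accuracy degradation than a small model does.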
In the current deep learning paradigm, using more compute (e.g., increasing model size, dataset size, or the number of training steps) typically leads to higher model accuracy. Recent successes in pre-training have allowed training to scale to massive amounts of unlabelled data and very large neural models.
When this is the case, computational resources increasingly become the critical constraint on improving model accuracy.
One goal has therefore been to maximise: “…compute efficiency: how to achieve the highest model accuracy given a fixed amount of hardware and training time.”
The convention of training until convergence has made larger models seem less viable on small budgets. This paper seeks to challenge that assumption.
They show that the fastest way to train Transformer models is to substantially increase model size but stop training very early.
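A back-of-envelope calculation shows the trade-off involved. Using the common rule of thumb (not from this paper) that training cost scales roughly with parameters times tokens processed, a fixed compute budget can buy many steps of a small model or few steps of a large one; the paper's finding is that the latter often reaches higher accuracy first. The model sizes below are illustrative:

```python
def training_flops(params, tokens, flops_per_param_token=6):
    """Rough training cost under the common ~6 * params * tokens approximation."""
    return flops_per_param_token * params * tokens

# Budget for a hypothetical 110M-parameter model trained on 10B tokens.
budget = training_flops(params=110e6, tokens=10e9)

# With 3x the parameters, the same budget covers only 1/3 of the tokens,
# i.e. the large model must stop training much earlier.
large_model_tokens = budget / (6 * 330e6)
```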
Li, Z., Wallace, E., Shen, S., Lin, K., Keutzer, K., Klein, D., & Gonzalez, J. E. (2020). Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers. arXiv preprint arXiv:2002.11794.
This is #500daysofAI and you are reading article 292. I am writing one new article about or related to artificial intelligence every day for 500 days. My current focus, for days 200–300, is national and international strategies for artificial intelligence. I have decided to spend the last 25 days of my AI strategy writing on the climate crisis.