arxiv:2309.14322

Small-scale proxies for large-scale Transformer training instabilities

Published on Sep 25, 2023
· Submitted by akhaliq on Sep 26, 2023

Abstract

Teams that have trained large Transformer-based models have reported training instabilities at large scale that did not appear when training with the same hyperparameters at smaller scales. Although the causes of such instabilities are of scientific interest, the amount of resources required to reproduce them has made investigation difficult. In this work, we seek ways to reproduce and study training stability and instability at smaller scales. First, we focus on two sources of training instability described in previous work: the growth of logits in attention layers (Dehghani et al., 2023) and divergence of the output logits from the log probabilities (Chowdhery et al., 2022). By measuring the relationship between learning rate and loss across scales, we show that these instabilities also appear in small models when training at high learning rates, and that mitigations previously employed at large scales are equally effective in this regime. This prompts us to investigate the extent to which other known optimizer and model interventions influence the sensitivity of the final loss to changes in the learning rate. To this end, we study methods such as warm-up, weight decay, and the μParam (Yang et al., 2022), and combine techniques to train small models that achieve similar losses across orders of magnitude of learning rate variation. Finally, to conclude our exploration we study two cases where instabilities can be predicted before they emerge by examining the scaling behavior of model activation and gradient norms.
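
The abstract's "sensitivity of the final loss to changes in the learning rate" can be made concrete with a small sketch. This is one plausible summary statistic, not necessarily the paper's exact definition: sweep the learning rate, clip diverged runs to the loss at initialization, and average the excess over the best run. The function name `lr_sensitivity` and all numbers below are made up for illustration; they are not results from the paper.

```python
import numpy as np

def lr_sensitivity(final_losses, init_loss):
    """Mean excess loss over the best run in a learning-rate sweep.

    Assumption: diverged runs (NaN/inf) and runs that end worse than
    the loss at initialization are clipped to `init_loss`, so a single
    blow-up cannot dominate the average.
    """
    losses = np.asarray(final_losses, dtype=float)
    losses = np.where(np.isfinite(losses), losses, init_loss)
    losses = np.minimum(losses, init_loss)
    return float(np.mean(losses - losses.min()))

# Toy sweep (illustrative values only): one run per learning rate,
# the last one diverged.
toy_losses = [3.5, 3.2, 3.3, float("inf")]
print(lr_sensitivity(toy_losses, init_loss=10.0))

# A perfectly flat loss-vs-learning-rate curve has zero sensitivity.
print(lr_sensitivity([3.2, 3.2, 3.2], init_loss=10.0))
```

A lower value means the final loss depends less on the learning rate, which is the property the interventions in the abstract aim for.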

Community

Ok, here's what I got out of this one. My favorite one today actually:

Training giant AI models requires ridiculous resources - thousands of GPUs for months. As a solo researcher I wouldn't be able to reproduce experiments at that scale.

But by cranking up the learning rate, I can recreate weird behaviors seen in huge models, specifically attention collapse and logit divergence. The authors also found that the fixes for those behaviors work in the small-model analogs just as well as in larger models.
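
To make "logit divergence" a bit more concrete, here is a rough NumPy sketch. The abstract attributes this instability to Chowdhery et al. (2022), and that work adds a small auxiliary penalty on the squared log-partition function (often called z-loss) to keep it near zero. The page itself doesn't name the fix, so treat the choice of penalty, the coefficient `1e-4`, and the function names as assumptions for illustration.

```python
import numpy as np

def log_partition(logits):
    """Numerically stable log Z = log(sum(exp(logits))) over the last axis."""
    m = logits.max(axis=-1, keepdims=True)
    return (m + np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))).squeeze(-1)

def cross_entropy_with_z_loss(logits, targets, z_coeff=1e-4):
    """Return (mean cross-entropy, mean z-loss penalty).

    Assumption: z_coeff=1e-4 is used here only as a plausible default.
    """
    log_z = log_partition(logits)
    target_logits = np.take_along_axis(logits, targets[:, None], axis=-1).squeeze(-1)
    ce = log_z - target_logits          # standard softmax cross-entropy
    z_penalty = z_coeff * log_z ** 2    # pulls log Z toward 0
    return float(ce.mean()), float(z_penalty.mean())

# Shifting every logit by a constant leaves the cross-entropy unchanged
# but moves log Z away from 0, which is what the penalty discourages.
targets = np.array([0, 1])
base = np.zeros((2, 4))
print(cross_entropy_with_z_loss(base, targets))
print(cross_entropy_with_z_loss(base + 10.0, targets))
```

The point of the toy example: softmax is shift-invariant, so nothing in the plain cross-entropy stops the logits from drifting, and the extra term is what anchors them.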

  • Longer warmup helps more for bigger models
  • Decoupling weight decay from the LR improves stability
  • Depth increases sensitivity much faster than width
  • You can predict upcoming issues from scaling trends
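
For the weight decay bullet above, here is a minimal sketch of the difference as I understand it. In many AdamW implementations the decay step is multiplied by the learning rate, so changing the LR silently changes how hard the weights are shrunk. The alternative is to apply the decay on its own. The function names are hypothetical and the paper's exact update rule isn't reproduced on this page, so this shows only the decay part of the step, under that assumption.

```python
import numpy as np

def decay_coupled(w, lr, wd):
    """Decay scaled by the learning rate: w <- w * (1 - lr * wd)."""
    return w * (1.0 - lr * wd)

def decay_independent(w, wd):
    """Decay applied independently of the learning rate: w <- w * (1 - wd)."""
    return w * (1.0 - wd)

w = np.ones(3)

# Coupled: a 100x larger learning rate means 100x stronger shrinkage.
print(decay_coupled(w, lr=1e-3, wd=0.1))   # shrink factor 1 - 1e-4
print(decay_coupled(w, lr=1e-1, wd=0.1))   # shrink factor 1 - 1e-2

# Independent: the shrinkage stays the same whatever the learning rate is.
print(decay_independent(w, wd=1e-4))
```

With the coupled form, a learning-rate sweep is really a sweep over two things at once, which is one intuition for why separating them could make the final loss less sensitive to the LR.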

If the authors are right, this is one more tool that lets researchers study and even help train giant models without Google-size resources. Small models can guide large-model development, sort of like how you can build a scale train set to study and improve how a railroad system works... for a lot less money than starting your own railroad company, buying land, building real tracks, etc.

More thoughts on this one here - it's about a 5 min read: https://notes.aimodels.fyi/deepmind-study-large-ai-instabilities-without-tons-of-gpus/

