No More Adam: Learning Rate Scaling at Initialization is All You Need, by Minghao Xu and 3 other authors
Abstract: In this work, we question the necessity of adaptive gradient methods for training deep neural networks. SGD-SaI is a simple yet effective enhancement to stochastic gradient descent with momentum (SGDM). SGD-SaI performs learning rate Scaling at Initialization (SaI) for distinct parameter groups, guided by their respective gradient signal-to-noise ratios (g-SNR). By adjusting learning rates without relying on adaptive second-order momentum, SGD-SaI helps prevent training imbalances from the very first iteration and cuts the optimizer's memory usage by half compared to AdamW. Despite its simplicity and efficiency, SGD-SaI consistently matches or outperforms AdamW in training a variety of Transformer-based tasks, effectively overcoming a long-standing challenge of using SGD for training Transformers. SGD-SaI excels in ImageNet-1K classification with Vision Transformers (ViT) and GPT-2 pretraining for large language models (LLMs, transformer decoder-only), demonstrating robustness to hyperparameter variations and practicality for diverse applications. We further tested its robustness on tasks like LoRA fine-tuning for LLMs and diffusion models, where it consistently outperforms state-of-the-art optimizers. From a memory efficiency perspective, SGD-SaI achieves substantial memory savings for optimizer states, reducing memory usage by 5.93 GB for GPT-2 (1.5B parameters) and 25.15 GB for Llama2-7B compared to AdamW in full-precision training settings.
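The abstract describes a concrete recipe: probe each parameter group's gradient once at initialization, compute its gradient signal-to-noise ratio (g-SNR), fix a per-group learning-rate scale from that ratio, and then train with plain SGDM. Below is a minimal PyTorch-style sketch of that idea, not the authors' released implementation: the g-SNR proxy (gradient norm over element-wise standard deviation), the max-normalization of the scales, and the `model`, `loss_fn`, and `batch` objects are all illustrative assumptions.

```python
import torch


def gsnr(grad: torch.Tensor, eps: float = 1e-8) -> float:
    # Illustrative g-SNR proxy (an assumption, not necessarily the paper's
    # exact formula): gradient L2 norm relative to element-wise gradient noise.
    return (grad.norm() / (grad.std() + eps)).item()


def sgd_sai_param_groups(model, loss_fn, batch, base_lr=1e-3):
    # One forward/backward pass at initialization to probe per-parameter gradients.
    model.zero_grad()
    loss_fn(model, batch).backward()
    params = dict(model.named_parameters())
    ratios = {n: gsnr(p.grad) for n, p in params.items() if p.grad is not None}
    max_r = max(ratios.values())
    # Fix a per-group learning rate proportional to its g-SNR (normalized by the
    # largest ratio here for readability); these scales are computed once and
    # reused for the whole run, i.e. Scaling at Initialization.
    return [{"params": [params[n]], "lr": base_lr * r / max_r}
            for n, r in ratios.items()]


# Usage sketch: plain SGD with momentum over the pre-scaled groups. No
# per-element second-order (adaptive) state is stored, which is where the
# memory saving relative to AdamW comes from.
# groups = sgd_sai_param_groups(model, loss_fn, batch)
# optimizer = torch.optim.SGD(groups, momentum=0.9, weight_decay=1e-2)
```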