Paper summary — Decoupled Weight Decay Regularization

Sahil Chachra
4 min read · Oct 29, 2021

If you search for the difference between L2 regularization and weight decay regularization, the most frequent answer is that the two are essentially the same. Yes, both lead to the same result for Stochastic Gradient Descent with momentum, but not for adaptive gradient optimizers. The concept and behaviour of weight decay in adaptive optimizers (e.g. Adam) is different.

I highly recommend reading the paper Decoupled Weight Decay Regularization by Ilya Loshchilov & Frank Hutter (link).

Note: all the content in this summary/blog was written by referring to the above paper. At times I quote directly from the paper, because not every sentence can be reframed without losing its meaning. Wherever I quote from the paper, the text is italicized and placed in quotes.

Abstract

In the abstract, the authors state that L2 regularization and weight decay regularization are the same for standard SGD, but that this is not the case for adaptive algorithms such as Adam. They therefore propose a simple modification that recovers the original idea of weight decay by "decoupling the weight decay from the optimization steps taken w.r.t. the loss function". Later in the paper, the authors provide experiments as evidence that this modification is effective.

Introduction

L2 regularization and weight decay are not the same for adaptive algorithms, but they are equivalent in the case of SGD. When L2 regularization is used with Adam, weights whose gradients have historically been large end up being regularized less than they would be under decoupled weight decay.
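To make the SGD case concrete, here is a minimal NumPy sketch (a toy example of mine, not from the paper) showing that plain SGD with L2 regularization and plain SGD with weight decay produce exactly the same update once the coefficients are tied together by lambda_wd = lr * lambda_l2:

```python
import numpy as np

lr, lambda_l2 = 0.1, 0.01
lambda_wd = lr * lambda_l2          # the coupling between the two coefficients

w = np.array([1.0, -2.0, 3.0])      # toy weights
grad = np.array([0.5, 0.1, -0.3])   # gradient of the loss f alone

# SGD with L2 regularization: the regularizer's gradient is added to grad.
w_l2 = w - lr * (grad + lambda_l2 * w)

# SGD with (decoupled) weight decay: shrink the weights directly.
w_wd = (1 - lambda_wd) * w - lr * grad

print(np.allclose(w_l2, w_wd))      # True: identical for vanilla SGD
```

For adaptive optimizers this identity breaks, because the added regularizer gradient also gets rescaled by the optimizer's per-parameter statistics.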

The introduction highlights the following observations:

- "L2 regularization is not effective in Adam", largely because deep learning libraries implement only L2 regularization and "not the original weight decay".
- Decoupled weight decay is equally effective in both SGD and Adam.
- The optimal weight decay factor depends on the total number of batch passes/weight updates (which in turn depends on the batch size and the number of epochs): the more weight updates, the smaller the optimal weight decay.
- It is advisable to use a learning-rate multiplier/scheduler with Adam.
- The "main contribution of this paper is to improve regularization in Adam by decoupling the weight decay from the gradient-based update".

Decoupling the Weight Decay from the Gradient-Based Weight Update

Algorithm 1, from the original paper (link): SGD with L2 regularization vs. SGD with decoupled weight decay (SGDW).

In the paper, the authors propose to decay the weights directly in the weight update of the current iteration for SGD (line 9 of Algorithm 1). In this way, λ and α are decoupled (independent of each other), whereas before, line 6 contained λ and line 8 contained α, coupling the two.
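Below is a rough Python sketch of that SGDW step (lines 8 and 9 of Algorithm 1), written from the description above; the function name, variable names and default values are mine, not from the paper's code.

```python
import numpy as np

def sgdw_step(theta, m, grad, lr=0.01, momentum=0.9, wd=1e-4, eta=1.0):
    """One SGD-with-momentum step with decoupled weight decay.

    grad is the gradient of the loss f alone; unlike the L2 variant,
    no lambda * theta term is added to it. eta is the schedule multiplier.
    """
    m = momentum * m + eta * lr * grad    # line 8: momentum buffer
    theta = theta - m - eta * wd * theta  # line 9: decay applied to the weights directly
    return theta, m

# Toy single-step usage
theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
theta, m = sgdw_step(theta, m, grad=np.array([0.3, -0.1]))
```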

Algorithm 2, from the original paper (link): Adam with L2 regularization vs. Adam with decoupled weight decay (AdamW).

When Adam is run with L2 regularization, it is observed that "weights that tend to have large gradients in f do not get regularized as much as they would with decoupled weight decay, since the gradient of the regularizer gets scaled along with the gradient of f".

How do L2 regularization and weight decay differ for adaptive gradient algorithms? With L2 regularization, the sum of the gradient of the loss function and the gradient of the regularizer is adapted, whereas with decoupled weight decay only the gradient of the loss function is adapted. With L2, both types of gradients are normalized by their typical magnitudes, and therefore weights x with a large typical gradient magnitude s are regularized by a smaller relative amount than other weights.

In contrast, decoupled weight decay regularizes all weights at the same rate λ, effectively regularizing weights x with large s more than standard L2 regularization does.
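To make the contrast concrete, here is a rough NumPy sketch of one update step under each rule, paraphrased from Algorithm 2 of the paper; the function and variable names are my own, and the decay term in the AdamW branch is scaled by the learning rate, as in common implementations (the paper uses a separate schedule multiplier).

```python
import numpy as np

def adam_l2_vs_adamw(theta, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, lam=1e-2):
    """One step of Adam with L2 regularization vs. AdamW. t is 1-based."""
    b1, b2 = betas

    # Adam + L2: the decay term is folded into the gradient, so it is later
    # divided by sqrt(v_hat) just like the loss gradient.
    g = grad + lam * theta
    m_l2 = b1 * m + (1 - b1) * g
    v_l2 = b2 * v + (1 - b2) * g ** 2
    m_hat, v_hat = m_l2 / (1 - b1 ** t), v_l2 / (1 - b2 ** t)
    theta_l2 = theta - lr * m_hat / (np.sqrt(v_hat) + eps)

    # AdamW: only the loss gradient is adapted; the decay is applied to the
    # weights directly, outside the adaptive scaling.
    m_w = b1 * m + (1 - b1) * grad
    v_w = b2 * v + (1 - b2) * grad ** 2
    m_hat, v_hat = m_w / (1 - b1 ** t), v_w / (1 - b2 ** t)
    theta_w = theta - lr * m_hat / (np.sqrt(v_hat) + eps) - lr * lam * theta

    return theta_l2, theta_w

# Toy single-step call, just to show the flow (not meaningful training).
theta0, grad0 = np.array([1.0, 1.0]), np.array([5.0, 0.01])
print(adam_l2_vs_adamw(theta0, grad0, m=np.zeros(2), v=np.zeros(2), t=1))
```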

Figure 1, from the original paper (link): comparison of learning-rate schedules for Adam and AdamW.

From Figure 1 we can see that AdamW (Adam with decoupled weight decay) combined with a cosine annealing learning-rate schedule gives the best performance; hence the authors suggest using learning-rate schedules with adaptive gradient algorithms as well.
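If you want to try this combination yourself, a minimal PyTorch sketch might look like the following; the tiny model and random data are placeholders just to show the wiring, not the paper's setup.

```python
import torch

model = torch.nn.Linear(10, 2)  # stand-in for a real network
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

x, y = torch.randn(32, 10), torch.randn(32, 2)  # dummy batch
for epoch in range(100):
    loss = torch.nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()      # AdamW applies decoupled weight decay here
    scheduler.step()      # cosine-annealed learning rate, stepped per epoch
```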

Figure 2, from the original paper (link): Top-1 test error as a function of learning rate and weight decay / L2 factor for SGD, SGDW, Adam and AdamW.

From the top-left plot of Figure 2 we can see that for SGD, L2 regularization is not decoupled from the learning rate: the basin of best hyperparameters is diagonal, meaning the initial learning rate and the L2 factor are interdependent. If we change only one of them we may get worse results, so to get the best results we have to tune both simultaneously, which enlarges the hyperparameter search space.

On the other hand (Figure 2, top-right plot), SGD with decoupled weight decay (SGDW) shows that the initial learning rate and the weight decay factor are decoupled. Even if the learning rate is not well tuned, fixing it at some value and changing only the weight decay factor yields good results. This is also visible in the plot, whose basin of good hyperparameters is no longer diagonal.

Coming to Adam with L2 regularization (Figure 2 — bottom left graph), we see that it performs even worse than SGD.

Adam with decoupled weight decay, or AdamW (Figure 2, bottom-right plot), largely decouples the learning rate and the weight decay factor: we can keep one constant and optimize the other. Its performance was better than SGD, Adam and SGDW!

Do refer to the paper for more insights into the experiments, performance and mathematical proofs!

Thanks for reading the summary.
Connect with me on LinkedIn. :)


Sahil Chachra

AI Engineer @ SparkCognition | Applied Deep Learning & Computer Vision | Nvidia Jetson AI Specialist