AdaBelief: A state-of-the-art optimization algorithm.

Ashish Ranjan
5 min read · May 3, 2021

1. Introduction:

Gradient-based optimization algorithms serve as the basis for loss reduction in neural networks, allowing the model to learn the underlying distribution of the data in the form of trainable parameters. Common examples of gradient-based optimization algorithms include Stochastic Gradient Descent (SGD), RMSprop, and Adaptive Moment Estimation (Adam).

Importantly, the effectiveness of an optimizer is determined by the following:

1) Convergence rate — this accounts for training time, i.e., how quickly the loss function is driven to its minimum.

2) Model generalization — the model's performance on new, unseen data.

While the “Adam” optimizer has a good convergence rate, an optimizer like “SGD” in practice achieves better model generalization.

Turning out to be the next major development, researchers from Yale University recently introduced a novel optimizer called “AdaBelief”, which is a generalized version of the “Adam” optimizer. They state that the proposed optimizer combines the strengths of several successful existing optimizers.

The authors point out the following benefits of the AdaBelief optimizer:

1) Faster Convergence: This is a characteristic shared with the “Adam” optimizer, which is an adaptive algorithm.

2) Better Generalization: Similar to the one achieved with the “SGD” optimizer.

3) Stable Training

2. Gradient Exponential Moving Average (EMA):

An exponential moving average (EMA) is a weighted average technique that places a greater weight and significance on the most recent data points.

Applying a moving average to the gradient is a useful way to estimate the (first) moment of the gradient. This helps dampen the negative impact of sudden gradient changes produced by some mini-batches.

A high EMA indicates large momentum, and therefore calls for a large stepsize, and vice versa.

While computing gradients during training, a few mini-batches may contain outliers that produce sudden changes in the gradient and cause unstable training. To compensate for this, computing the EMA of the gradients at every mini-batch helps minimize the effect of such uninformative gradients. The weighted average of the new and old gradients yields a smoother, more controlled gradient for the weight update.
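As a rough illustration, the sketch below smooths a handful of per-mini-batch gradients with a bias-corrected EMA; the decay factor beta = 0.9 and the gradient values (including the outlier 5.0) are illustrative choices, not taken from the paper.

```python
def ema_update(ema, grad, beta=0.9):
    """Blend the previous EMA with the new gradient; a larger beta gives a smoother average."""
    return beta * ema + (1 - beta) * grad

# Illustrative per-mini-batch gradients; the 5.0 is a sudden outlier.
grads = [0.50, 0.48, 5.00, 0.52, 0.49]
ema, beta = 0.0, 0.9
for t, g in enumerate(grads, start=1):
    ema = ema_update(ema, g, beta)
    ema_hat = ema / (1 - beta ** t)  # bias correction, as in Adam-style optimizers
    print(f"step {t}: raw grad = {g:+.2f}, smoothed grad = {ema_hat:+.2f}")
```

Even at the outlier step, the smoothed gradient moves far less than the raw one, which is exactly the damping effect described above.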

3. AdaBelief Optimizer:

The AdaBelief optimizer is closely related to the “Adam” optimizer, though there is one notable difference between them.

The AdaBelief optimizer is designed to make a stepsize according to the “belief” in the current observed gradient direction. The exponential moving average (EMA) of the gradient is viewed as the predicted gradient at the next time step.

— If the observed gradient at time t, denoted gₜ, deviates greatly from the predicted gradient mₜ, trust in the current observed gradient decreases and a small step is taken.

— However, if the observed gradient gₜ is close to the predicted gradient mₜ, trust is maintained and a large step is taken.

Roughly, the “belief in observation” is computed as 1 / (gₜ - mₜ)².

The algorithmic steps for both Adam and AdaBelief are given below; a short code sketch of the two update rules follows the symbol definitions.

Fig: AdaBelief Optimizer: Adapting Stepsizes by the Belief in Observed Gradients

gₜ → stands for the gradient.

mₜ → stands for the EMA of the gradient. Larger gradients in recent steps produce larger momentum and hence a larger stepsize, and vice versa, since mₜ appears in the numerator of the update.

vₜ → stands for the EMA of gₜ².

sₜ → stands for EMA of (gₜ - mₜ)². This ensures that AdaBelief takes a larger stepsize when the value of the gradient is close to its EMA (i.e., belief on the current gradient is maintained), and otherwise, a smaller stepsize when the two values are different (i.e., belief decreases for the current gradient).
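To make the single difference concrete, here is a minimal NumPy sketch of one parameter update for both variants, following the definitions above. The function and variable names, and the hyperparameter defaults (lr, beta1, beta2, eps), are illustrative choices rather than the authors' reference implementation.

```python
import numpy as np

def update(param, grad, m, second_moment, t, mode="adabelief",
           lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam/AdaBelief step; the two differ only in the second-moment term."""
    m = beta1 * m + (1 - beta1) * grad                                 # m_t: EMA of the gradient
    if mode == "adam":
        second_moment = beta2 * second_moment + (1 - beta2) * grad**2              # v_t: EMA of g^2
    else:
        second_moment = beta2 * second_moment + (1 - beta2) * (grad - m)**2 + eps  # s_t: EMA of (g - m)^2
    m_hat = m / (1 - beta1**t)                                         # bias correction
    sm_hat = second_moment / (1 - beta2**t)
    param = param - lr * m_hat / (np.sqrt(sm_hat) + eps)               # stepsize shrinks as the denominator grows
    return param, m, second_moment
```

Note that the extra ε added inside the sₜ update mirrors the paper's pseudocode and keeps the denominator from collapsing toward zero when the gradient exactly matches its EMA.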

To better understand the advantage of the AdaBelief optimizer, consider the example from the paper:

Figure: An ideal optimizer considers the curvature of the loss function, instead of taking a large (small) step where the gradient is large (small). Taken from the paper.

In the paper, the authors demonstrate that there are regions of the loss curve that require different treatment for an optimizer to work efficiently. The regions highlighted in the figure are as follows:

Region 1: The curve in this region is flat, with the gradient close to 0. Ideally, an optimizer should take a large stepsize here. AdaBelief takes a large step because the denominator term (the bias-corrected “sₜ” in the update step) is small.

Region 2: The curve in this region is a “steep and narrow valley”, where the gradient is mostly large. An ideal optimizer should take a small stepsize here. AdaBelief takes a small step because the denominator term (the bias-corrected “sₜ”) is large.

Region 3: The curve in this region corresponds to the case of a “large gradient and small curvature”. An ideal optimizer should take an increased stepsize here. AdaBelief takes a large step because the denominator term (the bias-corrected “sₜ”) is small.

Note:

Adam shows behavior similar to AdaBelief in regions 1 and 2, but it fails to handle region 3, the “large gradient and small curvature” case. Adam usually takes a small stepsize in region 3, because its denominator term (the bias-corrected “vₜ”) is large.

SGD, on the contrary, fails to handle regions 1 and 2: it takes a small stepsize in region 1 and a large stepsize in region 2. In region 3, SGD successfully takes a large stepsize. A small numeric check of the region 3 case is sketched below.
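To see the region 3 behavior numerically, the toy loop below feeds a large, constant gradient (a stand-in for “large gradient and small curvature”) to both update rules; the gradient value 10.0 and the step count are arbitrary illustrative choices.

```python
import numpy as np

g, beta1, beta2, lr, eps = 10.0, 0.9, 0.999, 1e-3, 1e-8
m = v = s = 0.0
for t in range(1, 501):
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2              # Adam: EMA of g^2 stays large
    s = beta2 * s + (1 - beta2) * (g - m)**2 + eps  # AdaBelief: (g - m)^2 shrinks toward 0

m_hat, v_hat, s_hat = m / (1 - beta1**t), v / (1 - beta2**t), s / (1 - beta2**t)
print(f"Adam step      ~ {lr * m_hat / (np.sqrt(v_hat) + eps):.4f}")  # ~ lr, i.e. small
print(f"AdaBelief step ~ {lr * m_hat / (np.sqrt(s_hat) + eps):.4f}")  # substantially larger
```

Because the gradient barely changes, (gₜ - mₜ)² goes to zero and sₜ stays small, so AdaBelief's denominator shrinks and its step grows, while Adam's vₜ ≈ gₜ² keeps its step pinned near lr.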

4. Conclusion

  1. The AdaBelief optimizer is closely derived from Adam, with a change to a single term: it tracks sₜ, the EMA of (gₜ - mₜ)², in place of Adam's vₜ.
  2. It is advantageous in terms of both (1) faster convergence and (2) good generalization on unseen data. Moreover, it performs well in the “large gradient and small curvature” case, which is not the case for the Adam optimizer.
  3. It is an adaptive optimizer and chooses its stepsize according to the “belief” in the current gradient direction: it takes a large step when the observation is close to the prediction, and a small step when the observation deviates greatly from the prediction.

Thanks for reading.
