Learning Algorithms (Optimizers) - Another Factor Influencing Training
There are several different gradient descent algorithms in use in Machine Learning. Their goal, while reducing the loss, is to escape local minima and find a good solution (ideally the best one). This is an "optimization problem", and many algorithms have been developed to solve it. Here are a few that are commonly used.
The following is taken and modified from "A Visual Explanation of Gradient Descent Methods (Momentum, AdaGrad, RMSProp, Adam)".
Plain Gradient Descent
The basic gradient descent algorithm follows the idea that the direction opposite to the gradient points toward lower values of the loss. So it iteratively takes steps in the direction opposite to the gradient. For each parameter theta (a weight in the network), it does the following. This works fine if your starting point is in the valley of the global minimum.
delta = - learning_rate * gradient
theta += delta
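To make this concrete, here is a minimal Python sketch of plain gradient descent on a toy one-dimensional loss. The loss f(theta) = (theta - 3)², the learning rate, and the number of steps are assumptions chosen only for illustration.

def grad(theta):
    # gradient of the assumed toy loss f(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0              # starting point
learning_rate = 0.1

for step in range(100):
    delta = -learning_rate * grad(theta)   # step opposite to the gradient
    theta += delta

print(theta)             # approaches the global minimum at theta = 3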
Momentum
The gradient descent with momentum algorithm (or Momentum for short) borrows an idea from physics. Imagine rolling a ball down the inside of a frictionless bowl. Instead of stopping at the bottom, the momentum it has accumulated pushes it forward, and the ball keeps rolling back and forth. Here is an illustration of the idea:
We can apply the concept of momentum to our vanilla gradient descent algorithm. In each step, in addition to the regular gradient, it also adds on the movement from the previous step. Mathematically, it is commonly expressed as:
delta = - learning_rate * gradient + previous_delta * decay_rate (eq. 1)
theta += delta (eq. 2)
This can be rewritten as:
sum_of_gradient = gradient + previous_sum_of_gradient * decay_rate (eq. 3)
delta = -learning_rate * sum_of_gradient (eq. 4)
theta += delta (eq. 5)
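Here is a minimal Python sketch of Momentum (eqs. 1-2) on the same assumed toy loss f(theta) = (theta - 3)²; the hyperparameter values are illustrative, not prescriptive.

def grad(theta):
    # gradient of the assumed toy loss f(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1
decay_rate = 0.9          # fraction of the previous step carried forward
previous_delta = 0.0

for step in range(200):
    delta = -learning_rate * grad(theta) + previous_delta * decay_rate   # eq. 1
    theta += delta                                                       # eq. 2
    previous_delta = delta

print(theta)   # converges toward the minimum at theta = 3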
Comparing Plain Gradient Descent & Momentum
AdaGrad
Instead of keeping track of a sum of gradients like Momentum, the Adaptive Gradient algorithm, or AdaGrad for short, keeps track of the sum of squared gradients and uses that to adapt the gradient in different directions. Often the equations are expressed with tensors. I will avoid tensors to simplify the language here. For each dimension:
sum_of_gradient_squared = previous_sum_of_gradient_squared + gradient²
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
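A minimal Python sketch of AdaGrad on the same assumed toy loss follows. The small epsilon added to the denominator is a standard guard against division by zero that the equations above omit; all constants are illustrative assumptions.

import math

def grad(theta):
    # gradient of the assumed toy loss f(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 1.0
epsilon = 1e-8                    # small constant to avoid division by zero
sum_of_gradient_squared = 0.0

for step in range(500):
    g = grad(theta)
    sum_of_gradient_squared += g ** 2
    delta = -learning_rate * g / (math.sqrt(sum_of_gradient_squared) + epsilon)
    theta += delta

print(theta)   # approaches theta = 3; steps shrink as the squared-gradient sum grows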
Comparison of gradient descent and AdaGrad
RMSProp
A problem with AdaGrad is that it is slow. This is because the sum of squared gradients only grows and never shrinks, so the step size keeps shrinking. RMSProp (Root Mean Square Propagation) fixes this issue by adding a decay factor.
sum_of_gradient_squared = previous_sum_of_gradient_squared * decay_rate + gradient² * (1 - decay_rate)
delta = -learning_rate * gradient / sqrt(sum_of_gradient_squared)
theta += delta
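Here is a minimal Python sketch of RMSProp on the same assumed toy loss, again with an epsilon in the denominator for numerical stability; the decay rate and learning rate are illustrative assumptions.

import math

def grad(theta):
    # gradient of the assumed toy loss f(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1
decay_rate = 0.9                  # how quickly old squared gradients are forgotten
epsilon = 1e-8                    # small constant to avoid division by zero
sum_of_gradient_squared = 0.0

for step in range(300):
    g = grad(theta)
    sum_of_gradient_squared = sum_of_gradient_squared * decay_rate + g ** 2 * (1 - decay_rate)
    delta = -learning_rate * g / (math.sqrt(sum_of_gradient_squared) + epsilon)
    theta += delta

print(theta)   # hovers near the minimum at theta = 3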
Comparing AdaGrad and RMSProp
Adam
Adam (Adaptive Moment Estimation) takes the best of both worlds of Momentum and RMSProp. Adam works well empirically, and thus in recent years it has commonly been the go-to choice for deep learning problems.
Let’s take a look at how it works:
sum_of_gradient = previous_sum_of_gradient * beta1 + gradient * (1 - beta1) [Momentum]
sum_of_gradient_squared = previous_sum_of_gradient_squared * beta2 + gradient² * (1 - beta2) [RMSProp]
delta = -learning_rate * sum_of_gradient / sqrt(sum_of_gradient_squared)
theta += delta
Beta1 is the decay rate for the first moment, the sum of gradients (aka momentum), and is commonly set at 0.9.
Beta2 is the decay rate for the second moment, the sum of squared gradients, and is commonly set at 0.999.
Adam gets the speed from momentum and the ability to adapt gradients in different directions from RMSProp. The combination of the two makes it powerful.
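A minimal Python sketch of Adam on the same assumed toy loss is shown below. To stay close to the equations above, it omits the bias-correction step found in the original Adam paper, and the epsilon in the denominator is the usual guard against division by zero; all values are illustrative assumptions.

import math

def grad(theta):
    # gradient of the assumed toy loss f(theta) = (theta - 3)**2
    return 2.0 * (theta - 3.0)

theta = 0.0
learning_rate = 0.1
beta1, beta2 = 0.9, 0.999
epsilon = 1e-8                    # small constant to avoid division by zero
sum_of_gradient = 0.0
sum_of_gradient_squared = 0.0

for step in range(500):
    g = grad(theta)
    sum_of_gradient = sum_of_gradient * beta1 + g * (1 - beta1)                        # Momentum part
    sum_of_gradient_squared = sum_of_gradient_squared * beta2 + g ** 2 * (1 - beta2)   # RMSProp part
    delta = -learning_rate * sum_of_gradient / (math.sqrt(sum_of_gradient_squared) + epsilon)
    theta += delta

print(theta)   # settles close to the minimum at theta = 3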
Which algorithm? Of the above, I recommend Adam. There are more variants: AMSGrad, NAdam, ....
Use framework implementations, e.g. the TensorFlow optimizers.
Try different ones, build experience, and look to the literature for examples; this is ALWAYS an active research area.
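As one possible illustration of using framework implementations, here is a minimal sketch with tf.keras (assuming TensorFlow is installed); the toy regression data and hyperparameter values are made up for the example.

import numpy as np
import tensorflow as tf

# toy regression data y = 2x + 1 plus noise (an assumption for the example)
x = np.random.rand(256, 1).astype("float32")
y = 2.0 * x + 1.0 + 0.05 * np.random.randn(256, 1).astype("float32")

model = tf.keras.Sequential([tf.keras.Input(shape=(1,)), tf.keras.layers.Dense(1)])

# Swap in tf.keras.optimizers.SGD(momentum=0.9), RMSprop(), Adagrad(), ... to compare.
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.01), loss="mse")
model.fit(x, y, epochs=20, verbose=0)

Switching optimizers is a one-line change in compile(), which makes it easy to try different ones as suggested above.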