August 11, 2019 · 4 mins read

# Lesson 11 - Adam Optimzer in Python

Optimizers are one of the most important components of neural networks. They control the learning rate and update the weights and biases of neural networks using the gradients computed by backpropagation. There are many different types of optimizers. Stochastic gradient descent, or SGD, and Adam are by far the most popular choices. In this post, I will discuss the latter.

This post is based on the original Adam paper by Diederik P. Kingma and Jimmy Lei Ba: “Adam: A Method for Stochastic Optimization”. Read the parper on arXiv.

This is the eleventh post of my fast.ai journey. Read all posts here.

While optimizers like SGD work very well on neural networks, they often require a lot of memory and they are computationally inefficient. By combining the advantages of AdaGrad and RMSProp (other optimization techniques), Adam is able to resolve this issue resulting in more efficient training phases.

The following graphs of the training cost at every epoch for a CNN copied from the paper support this claim.  ## Overview of the Algorithm

The weights of the network are denoted by $\theta$. $f(\theta)$ is the cost function of the network which can be differentiated with respect to the parameters $\theta$.

Adam takes more parameters than just the learning rate like SGD does. Here are some of Adam’s parameters:

• $a$: stepsize / learning rate. The paper proposes a default value of $1 \cdot 10 ^ {-3}$.
• $\beta_1, \beta_2 \in [0, 1)$: Exponential decay rates for the moment estimates. The paper proses $\beta_1 = 0.9$ and $\beta_2 = 0.999$ as default values. Tweaking these values helps one reduce overfitting.
• $\epsilon$: a very small value to avoid dividing by zero. The paper proposes a value of $10 ^ {-8}$.

The paper denotes the epoch by $t$. $f_t$ is function $f$ at timestamp $t$. The gradient of $f_t$, $g_t$, is $\nabla_\theta f_t(\theta)$.

The algorithm uses $m_0$ and $v_0$, both initialized as 0, to save the momenta.

The pseudo code for the Adam algorithm is as follows: The wonderful thing about this algorithm is that it limits the step size to $\|\Delta_t\| \lessapprox \alpha$ creating a trust region around the current value of $\theta$. The step size is also limited by $\frac{\hat{m}_t}{\sqrt{\hat{v}_t}}$ which is virtually always larger than $\alpha$ except when $\theta$ approaches a minimum causing it to slow down in the final stage which is desirable. The latter value is formally known as the signal-to-noise ratio, or SNR for short. For a proof and more elaborate explanation of this, check out section 2.1 of the paper linked to in the introduction.

## In Python

In Python, the algorithm would look something like this:

g = # code to compute gradient

m = beta_1 * m + (1 - beta_1) * g
v = beta_2 * v + (1 - beta_2) * (g * g)

m_hat = (m / (1 - beta_1 ** t))
v_hat = (v / (1 - beta_2 ** t))

weights -= alpha * (m_hat / (np.sqrt(v_hat) + epsilon))


Note that this algorithm has to be applied to every layer separately.

## Training a CNN with Adam vs SGD

I used a simple convolutional neural network as described in this tutorial by PyTorch to see what effect using Adam vs SGD has. I found that Adam was always a step ahead of SGD on accuracy per 1000 batches. A graph of training phase: After two epochs, SGD got $53\%$ accuracy on CIFAR10 where Adam got $58\%$ with an equal amount of training.

## Conclusion

Adam is a great optimizer to use for deep learning. It is faster and more scalable than stochastic gradient descent and it converges faster.

A huge thanks to Sam Miserendino for proofreading this post!