This write-up is particularly detailed:
Deep Learning Optimization Algorithms Explained (Momentum, RMSProp, Adam)
Adam (Adaptive Moment Estimation)
Initialization: v_dW = 0, s_dW = 0, v_db = 0, s_db = 0
On iteration t:
    Compute dW, db on the current mini-batch
    v_dW = β₁·v_dW + (1−β₁)·dW,   v_db = β₁·v_db + (1−β₁)·db        (momentum step)
    s_dW = β₂·s_dW + (1−β₂)·dW²,  s_db = β₂·s_db + (1−β₂)·db²       (RMSProp step)
    v_dW^corrected = v_dW / (1−β₁ᵗ),   v_db^corrected = v_db / (1−β₁ᵗ)
    s_dW^corrected = s_dW / (1−β₂ᵗ),   s_db^corrected = s_db / (1−β₂ᵗ)
    W := W − α·v_dW^corrected / (√(s_dW^corrected) + ε)
    b := b − α·v_db^corrected / (√(s_db^corrected) + ε)
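The update above can be sketched in NumPy. The function name `adam_step` and the toy quadratic objective are illustrative choices, not part of the original lecture; the defaults match the standard values (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸).

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array W.
    v, s: running first- and second-moment estimates (same shape as W).
    t: 1-based iteration count, used for bias correction."""
    v = beta1 * v + (1 - beta1) * dW       # first moment: mean of the gradients
    s = beta2 * s + (1 - beta2) * dW**2    # second moment: mean of squared gradients
    v_hat = v / (1 - beta1**t)             # bias-corrected first moment
    s_hat = s / (1 - beta2**t)             # bias-corrected second moment
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

# Toy usage (hypothetical example): minimize f(W) = W², whose gradient is 2W.
W = np.array([1.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
for t in range(1, 201):
    dW = 2 * W
    W, v, s = adam_step(W, dW, v, s, t, alpha=0.1)
print(W)  # W is driven toward the minimum at 0
```

Note that on the very first step the bias correction matters: v = 0.1·dW but v_hat = dW, so without the correction the early updates would be shrunk by roughly a factor of ten.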
When implementing Adam, what people usually do is just use the default values: β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. I don't think anyone ever really tunes ε. Then try a range of values of the learning rate α to see what works best.
So, where does the term ‘Adam’ come from?
Adam stands for Adaptive Moment Estimation. β₁ is used to compute v_dW, the exponentially weighted average of the derivatives; this is called the first moment. β₂ is used to compute s_dW, the exponentially weighted average of the squares of the derivatives, and that's called the second moment. So that gives rise to the name Adaptive Moment Estimation.