This write-up is particularly detailed:
Deep Learning Optimization Algorithms Explained (Momentum, RMSProp, Adam)
Adam (Adaptive Moment Estimation)
Initialization: v_dW = 0, s_dW = 0, v_db = 0, s_db = 0
On iteration t:
    Compute dW, db on the current mini-batch
    v_dW = β₁·v_dW + (1−β₁)·dW,   v_db = β₁·v_db + (1−β₁)·db        (momentum step)
    s_dW = β₂·s_dW + (1−β₂)·dW²,  s_db = β₂·s_db + (1−β₂)·db²       (RMSProp step)
    v_dW^corrected = v_dW / (1−β₁ᵗ),   v_db^corrected = v_db / (1−β₁ᵗ)
    s_dW^corrected = s_dW / (1−β₂ᵗ),   s_db^corrected = s_db / (1−β₂ᵗ)
    W := W − α·v_dW^corrected / (√(s_dW^corrected) + ε)
    b := b − α·v_db^corrected / (√(s_db^corrected) + ε)
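The update above can be sketched in NumPy. The function name `adam_step` and the toy quadratic objective are illustrative choices, not part of the original lecture; the defaults match the standard values (β₁ = 0.9, β₂ = 0.999, ε = 10⁻⁸).

```python
import numpy as np

def adam_step(W, dW, v, s, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a parameter array W.
    v, s: running first- and second-moment estimates (same shape as W).
    t: 1-based iteration count, used for bias correction."""
    v = beta1 * v + (1 - beta1) * dW       # first moment: mean of the gradients
    s = beta2 * s + (1 - beta2) * dW**2    # second moment: mean of squared gradients
    v_hat = v / (1 - beta1**t)             # bias-corrected first moment
    s_hat = s / (1 - beta2**t)             # bias-corrected second moment
    W = W - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return W, v, s

# Toy usage (hypothetical example): minimize f(W) = W², whose gradient is 2W.
W = np.array([1.0])
v = np.zeros_like(W)
s = np.zeros_like(W)
for t in range(1, 201):
    dW = 2 * W
    W, v, s = adam_step(W, dW, v, s, t, alpha=0.1)
print(W)  # W is driven toward the minimum at 0
```

Note that on the very first step the bias correction matters: v = 0.1·dW but v_hat = dW, so without the correction the early updates would be shrunk by roughly a factor of ten.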
When implementing Adam, what people usually do is just use the default values: β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. I don't think anyone ever really tunes ε. Then try a range of values of the learning rate α to see what works best.
So, where does the term ‘Adam’ come from?
Adam stands for Adaptive Moment Estimation. β₁ is used to compute v_dW, the exponentially weighted average of the derivatives; this is called the first moment. β₂ is used to compute s_dW, the exponentially weighted average of the squares of the derivatives, and that's called the second moment. So that gives rise to the name Adaptive Moment Estimation.