Rather than looping over every training example, mini-batch gradient descent sums the gradient over a smaller number of examples determined by the batch size and performs one update per batch. Stochastic gradient descent is a widely used optimization method: it is conceptually simple and can usually be implemented efficiently. However, it has a parameter, the step size, that must be tuned by hand, and several schemes have been proposed to automate this tuning. One successful scheme is AdaGrad. Whereas standard stochastic subgradient methods largely follow a predetermined procedural scheme that ignores the characteristics of the observed data, AdaGrad dynamically incorporates knowledge of the data's geometry observed in earlier iterations to perform more informative gradient-based learning.

Two versions of AdaGrad have been published. Diagonal AdaGrad, the version used in practice, maintains and adapts one learning rate per dimension; the second version, known as Full AdaGrad, maintains one learning rate per direction (i.e., a full PSD matrix). The Adaptive Gradient Algorithm (AdaGrad) is an algorithm for gradient-based optimization that adapts the learning rate component-wise by incorporating knowledge of past observations: it performs larger updates (i.e., higher learning rates) for parameters associated with infrequent features and smaller updates (i.e., lower learning rates) for frequent ones. As a result, it is well suited to sparse data, as in NLP or image recognition; giving each parameter its own learning rate improves performance on problems with sparse gradients.
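The per-dimension adaptation of diagonal AdaGrad can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' reference implementation; the function and variable names (`adagrad_step`, `hist`) and the toy quadratic objective are assumptions chosen for the example. Each coordinate accumulates its own history of squared gradients, so coordinates that receive large or frequent gradients get an automatically shrinking effective step size, while rarely updated coordinates keep a comparatively large one:

```python
import numpy as np

def adagrad_step(w, grad, hist, lr=0.5, eps=1e-8):
    # Accumulate squared gradients per dimension: this is the
    # "diagonal" state, one scalar of history per parameter.
    hist = hist + grad ** 2
    # Divide the step by the root of the accumulated history, so the
    # effective learning rate of each coordinate adapts independently.
    w = w - lr * grad / (np.sqrt(hist) + eps)
    return w, hist

# Toy example: minimize f(w) = 0.5 * ||w||^2, whose gradient is w.
w = np.array([1.0, -2.0])
hist = np.zeros_like(w)
for _ in range(500):
    grad = w
    w, hist = adagrad_step(w, grad, hist)
```

In a real training loop, `grad` would be the (mini-batch) stochastic gradient of the loss rather than the exact gradient used in this toy problem; the update rule itself is unchanged.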