Examples of using Gradient descent in English and their translations into Chinese
Lesson 9: Why is mini-batch gradient descent used?
Figure 2: Gradient descent with different learning rates.
This is called batch gradient descent.
Hence, gradient descent will not work well with this loss function.
It turns out there's a method called gradient descent.
Challenge 2: Gradient Descent may have trouble finding the absolute minimum.
Add momentum-based stochastic gradient descent to network2.py.
The gradient descent then repeats this process, edging ever closer to the minimum.
Some will use stochastic gradient descent with momentum.
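For context, the momentum in these sentences refers to keeping a running velocity of past gradients. A standard textbook form of the update (not quoted from any of the sources above; η is the learning rate and β the momentum coefficient) is:

\[ v \leftarrow \beta v - \eta \nabla C(w), \qquad w \leftarrow w + v \]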
If we define the batch size to be 1, this is called stochastic gradient descent.
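To illustrate how the batch size selects the variant, here is a minimal Python sketch; the function name, toy least-squares gradient, and all parameters are assumptions for this example, not code from the quoted sources.

```python
import numpy as np

# batch_size = len(X)  -> batch gradient descent
# batch_size = 1       -> stochastic gradient descent
# anything in between  -> mini-batch gradient descent
def gradient_descent(X, y, w, lr=0.01, batch_size=1, epochs=10):
    n = len(X)
    for _ in range(epochs):
        idx = np.random.permutation(n)                    # shuffle examples each epoch
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            Xb, yb = X[batch], y[batch]
            grad = 2 * Xb.T @ (Xb @ w - yb) / len(batch)  # least-squares gradient
            w = w - lr * grad                             # step opposite the gradient
    return w
```

With batch_size=1 each update uses a single example's gradient, while batch_size=len(X) reproduces the batch gradient descent mentioned above.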
Gradient descent is a simple optimization procedure that you can use with many machine learning algorithms.
The network can be trained using backpropagation and gradient descent.
In stochastic gradient descent we define our cost function as the cost of a single example.
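The formula itself does not appear on this page; a standard way to write the resulting update (an assumed reconstruction, with C_x the cost on a single randomly chosen training example x and η the learning rate) is:

\[ C \approx C_x, \qquad w \leftarrow w - \eta \, \nabla C_x(w) \]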
For instance, learning (i.e. optimization) is usually done iteratively through backpropagation using gradient descent algorithms.
Gradient Descent can be thought of as climbing down to the bottom of a valley, instead of climbing up a hill.
And we will also figure out how to apply gradient descent to fit the parameters of logistic regression.
Thus gradient descent always converges (assuming the learning rate α is not too large) to the global minimum.
The weights corresponding to these gates are also updated using stochastic gradient descent via BPTT as it seeks to minimize a cost function.
Stochastic Gradient Descent is sensitive to feature scaling, so it is highly recommended to scale your data.
By increasing the learning rate suddenly, gradient descent may “hop” out of a local minimum and find its way toward the global minimum.
Gradient Descent is therefore prone to getting stuck in a local minimum, depending on the nature of the terrain (or function, in ML terms).
We trained the model using batch stochastic gradient descent, with specific values for momentum and weight decay.
Gradient descent is a simple procedure, where TensorFlow simply shifts each variable a little bit in the direction that reduces the cost.
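As a hedged sketch of that idea in current TensorFlow (the toy variable and quadratic cost are invented for illustration; this is not the code from the quoted tutorial):

```python
import tensorflow as tf

w = tf.Variable(5.0)                              # toy parameter to optimize
opt = tf.keras.optimizers.SGD(learning_rate=0.1)  # plain gradient descent

for step in range(50):
    with tf.GradientTape() as tape:
        cost = (w - 2.0) ** 2                     # toy cost, minimized at w = 2
    grads = tape.gradient(cost, [w])
    opt.apply_gradients(zip(grads, [w]))          # shift w a little bit downhill
```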
We trained our models using stochastic gradient descent with a batch size of 128 examples, momentum of 0.9, and weight decay of 0.0005.
If we're doing Batch Gradient Descent, we will get stuck here since the gradient will always point to the local minimum.
Variations such as SGD (stochastic gradient descent) or mini-batch gradient descent typically perform better in practice.
We repeat the gradient descent update until the derivative reaches the minimum error, and each step size is determined by the steepness of the slope (gradient).
The only difference between stochastic gradient descent and vanilla gradient descent is the fact that the former uses a noisy approximation of the gradient.
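To make that difference concrete, here is a small NumPy sketch (the least-squares setup and all names are assumptions for illustration) comparing the full-batch gradient with the single-example estimate that stochastic gradient descent uses:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)
w = np.zeros(3)

full_grad = 2 * X.T @ (X @ w - y) / len(y)   # vanilla (batch) gradient

i = rng.integers(len(y))                     # one randomly chosen example
noisy_grad = 2 * X[i] * (X[i] @ w - y[i])    # SGD's noisy approximation

print("batch gradient:", full_grad)
print("single-example gradient:", noisy_grad)

w -= 0.01 * noisy_grad                       # one stochastic update step
```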