
Gradient descent update rule
where L is the number of layers and αα is the learning rate. All parameters should be stored in the
parametersdictionary.Note that the iterator
lstarts at 0 in thefor loopwhile the first parameters areW[1]andb[1].You need to shift
ltol+1when coding.
def update_parameters_with_gd(parameters, grads, learning_rate):
"""
Update parameters using one step of gradient descent
Arguments:
parameters -- python dictionary containing your parameters to be updated:
parameters['W' + str(l)] = Wl
parameters['b' + str(l)] = bl
grads -- python dictionary containing your gradients to update each parameters:
grads['dW' + str(l)] = dWl
grads['db' + str(l)] = dbl
learning_rate -- the learning rate, scalar.
Returns:
parameters -- python dictionary containing your updated parameters
"""
L = len(parameters) // 2 # number of layers in the neural networks
# Update rule for each parameter
for l in range(L):
### START CODE HERE ### (approx. 2 lines)
parameters["W" + str(l+1)] = parameters["W" + str(l+1)] - learning_rate * grads["dW" + str(l + 1)]
parameters["b" + str(l+1)] = parameters["b" + str(l+1)] - learning_rate * grads["db" + str(l + 1)]
### END CODE HERE ###
return parameters
Different between GD & SGD
- (Batch) Gradient Descent
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
# Forward propagation
a, caches = forward_propagation(X, parameters)
# Compute cost.
cost = compute_cost(a, Y)
# Backward propagation.
grads = backward_propagation(a, caches, parameters)
# Update parameters.
parameters = update_parameters(parameters, grads)
- Stochastic Gradient Descent:
X = data_input
Y = labels
parameters = initialize_parameters(layers_dims)
for i in range(0, num_iterations):
for j in range(0, m):
# Forward propagation
a, caches = forward_propagation(X[:,j], parameters)
# Compute cost
cost = compute_cost(a, Y[:,j])
# Backward propagation
grads = backward_propagation(a, caches, parameters)
# Update parameters.
parameters = update_parameters(parameters, grads)
In Stochastic Gradient Descent, you use only 1 training example before updating the gradients. When the training set is large, SGD can be faster. But the parameters will “oscillate” toward the minimum rather than converge smoothly. Here is an illustration of this:

“+” denotes a minimum of the cost. SGD leads to many oscillations to reach convergence. But each step is a lot faster to compute for SGD than for GD, as it uses only one training example (vs. the whole batch for GD).
Note also that implementing SGD requires 3 for-loops in total:
- Over the number of iterations
- Over the $m$ training examples
- Over the layers (to update all parameters, from $(W^{[1]},b^{[1]})$ to $(W^{[L]},b^{[L]})$)
Mini-Batch Gradient descent
2 Steps
- Shuffle

- Partition: Partition the shuffled (X, Y) into
mini_batch_size

Note that the last mini-batch might end up smaller than
mini_batch_size=64. Let $lfloor s rfloor$ represents $s$ rounded down to the nearest integer (this ismath.floor(s)in Python). If the total number of examples is not a multiple ofmini_batch_size=64then there will be $lfloor frac{m}{mini_batch_size}rfloor$ mini-batches with a full 64 examples, and the number of examples in the final mini-batch will be $m-mini_batch_size times lfloor frac{m}{mini_batch_size}rfloor$
num_complete_minibatches = math.floor(m/mini_batch_size) # number of mini batches of size mini_batch_size in your partitionning
for k in range(0, num_complete_minibatches):
### START CODE HERE ### (approx. 2 lines)
mini_batch_X = shuffled_X[:, k * mini_batch_size :(k + 1) * mini_batch_size]
mini_batch_Y = shuffled_Y[:, k * mini_batch_size :(k + 1) * mini_batch_size]
### END CODE HERE ###
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
# Handling the end case (last mini-batch < mini_batch_size)
if m % mini_batch_size != 0:
### START CODE HERE ### (approx. 2 lines)
mini_batch_X = shuffled_X[:,num_complete_minibatches * mini_batch_size:]
mini_batch_Y = shuffled_Y[:,num_complete_minibatches * mini_batch_size:]
### END CODE HERE ###
mini_batch = (mini_batch_X, mini_batch_Y)
mini_batches.append(mini_batch)
Momentum
- Initialize
for l in range(L):
v["dW" + str(l+1)] = np.zeros_like(parameters["W"+str(l+1)])
v["db" + str(l+1)] = np.zeros_like(parameters["b"+str(l+1)])
- Update
where L is the number of layers, $beta$ is the momentum and $alpha$ is the learning rate. All parameters should be stored in the
parametersdictionary. Note that the iteratorlstarts at 0 in theforloop while the first parameters are $W^{[1]}$ and $b^{[1]}$ (that’s a “one” on the superscript). So you will need to shiftltol+1when coding.
for l in range(L):
# compute velocities
v["dW" + str(l+1)] = beta * v["dW" + str(l + 1)] + (1 - beta) * grads['dW' + str(l + 1)]
v["db" + str(l+1)] = beta * v["db" + str(l + 1)] + (1 - beta) * grads['db' + str(l + 1)]
# update parameters
parameters["W" + str(l+1)] = parameters["W" + str(l + 1)] - learning_rate * v["dW"+str(l + 1)]
parameters["b" + str(l+1)] = parameters["b" + str(l + 1)] - learning_rate * v["db"+str(l + 1)]
Adam
Synopsis
-
It calculates an exponentially weighted average of past gradients, and stores it in variables $v$ (before bias correction) and $v^{corrected}$ (with bias correction).
-
It calculates an exponentially weighted average of the squares of the past gradients, and stores it in variables $s$ (before bias correction) and $s^{corrected}$ (with bias correction).
-
It updates parameters in a direction based on combining information from “1” and “2”.
Update
where:
- t counts the number of steps taken of Adam
- L is the number of layers
- $beta_1$ and $beta_2$ are hyperparameters that control the two exponentially weighted averages.
- $alpha$ is the learning rate
- $varepsilon$ is a very small number to avoid dividing by zero
for l in range(L):
# Moving average of the gradients. Inputs: "v, grads, beta1". Output: "v".
v["dW" + str(l+1)] = beta1 * v["dW" + str(l + 1)] + (1 - beta1) * grads['dW' + str(l + 1)]
v["db" + str(l+1)] = beta1 * v["db" + str(l + 1)] + (1 - beta1) * grads['db' + str(l + 1)]
# Compute bias-corrected first moment estimate. Inputs: "v, beta1, t". Output: "v_corrected".
v_corrected["dW" + str(l+1)] = v["dW" + str(l+1)]/(1 - np.power(beta1, t))
v_corrected["db" + str(l+1)] = v["db" + str(l+1)]/(1 - np.power(beta1, t))
# Moving average of the squared gradients. Inputs: "s, grads, beta2". Output: "s".
s["dW" + str(l+1)] = beta2 * s["dW" + str(l + 1)] + (1 - beta2) * np.power(grads['dW' + str(l + 1)], 2)
s["db" + str(l+1)] = beta2 * s["db" + str(l + 1)] + (1 - beta2) * np.power(grads['db' + str(l + 1)], 2)
# Compute bias-corrected second raw moment estimate. Inputs: "s, beta2, t". Output: "s_corrected".
s_corrected["dW" + str(l+1)] = s["dW" + str(l+1)]/(1 - np.power(beta2, t))
s_corrected["db" + str(l+1)] = s["db" + str(l+1)]/(1 - np.power(beta2, t))
# Update parameters. Inputs: "parameters, learning_rate, v_corrected, s_corrected, epsilon". Output: "parameters".
parameters["W" + str(l+1)] = parameters["W" + str(l + 1)] - learning_rate * v_corrected["dW" + str(l + 1)] / np.sqrt(s_corrected["dW" + str(l + 1)] + epsilon)
parameters["b" + str(l+1)] = parameters["b" + str(l + 1)] - learning_rate * v_corrected["db" + str(l + 1)] / np.sqrt(s_corrected["db" + str(l + 1)] + epsilon)




近期评论