A neural network is put together by hooking together many of our simple “neurons,” so that the output of a neuron can be the input of another. For example, here is a small neural network:

### Neuron

A neural network is composed of many such simple neurons.

This neuron is a computational unit that takes as input $x_1, x_2, x_3$ (and a +1 intercept term) and outputs $$h_{W,b}(x) = f\Big(\sum_{i} W_i x_i + b\Big), \qquad f(z) = \frac{1}{1+e^{-z}}$$

Here $f$ is called the activation function. Normally we choose the sigmoid function as the activation function.
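In MATLAB, the sigmoid activation can be sketched as follows (a minimal sketch; the function name `sigmoid` is our choice here, matching the helper assumed by the later snippets):

```matlab
% Sigmoid activation, applied element-wise, so z may be a
% scalar, a vector, or a matrix.
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
end
```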

### Feed Forward

This is how we calculate the output of each layer:

If we want to calculate the output of the first hidden layer, the MATLAB code looks something like this:

```matlab
Z2 = [ones(m, 1) X] * Theta1';   % weighted input to the hidden layer
A2 = sigmoid(Z2);                % activation (output) of the hidden layer
```

where X is the input matrix (not including the intercept term), Theta1 is the weight (parameter) matrix between the input layer and the first hidden layer, and A2 is the output of the first hidden layer.
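For a three-layer network the same pattern repeats once more to produce the final output. A sketch of the full forward pass, assuming variables `X`, `Theta1`, `Theta2` and a `sigmoid` helper are in scope:

```matlab
% Full forward pass for a 3-layer network (input, hidden, output).
m  = size(X, 1);
A1 = [ones(m, 1) X];      % add the +1 intercept column to the input
Z2 = A1 * Theta1';
A2 = sigmoid(Z2);         % output of the first hidden layer
A2 = [ones(m, 1) A2];     % add the intercept column again
Z3 = A2 * Theta2';
A3 = sigmoid(Z3);         % network output h(x)
```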

### Cost Function

Here is the formula for calculating the (cross-entropy) cost function: $$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1 - y_k^{(i)}\big) \log\Big(1 - \big(h_\Theta(x^{(i)})\big)_k\Big) \Big]$$

The cost function in MATLAB looks something like this:

```matlab
J = sum(sum(-Y .* log(A3) - (1 - Y) .* log(1 - A3))) / m;
```

If using a squared-error cost function instead, the overall cost function is: $$J(W,b) = \frac{1}{m} \sum_{i=1}^{m} \frac{1}{2} \left\| h_{W,b}\big(x^{(i)}\big) - y^{(i)} \right\|^2 + \frac{\lambda}{2} \sum_{l} \sum_{i} \sum_{j} \Big( W_{ji}^{(l)} \Big)^2$$

The first term in the definition of J(W,b) is an average sum-of-squares error term. The second term is a regularization term (also called a weight decay term) that tends to decrease the magnitude of the weights, and helps prevent overfitting.
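The weight decay term can be added to the MATLAB cost above along these lines (a sketch; it assumes `Theta1`, `Theta2`, `lambda`, and `m` are in scope, and follows the common convention of excluding the first column, the bias weights, from regularization):

```matlab
% Regularization (weight decay) term, skipping the bias column
% of each weight matrix.
reg = (lambda / (2 * m)) * (sum(sum(Theta1(:, 2:end) .^ 2)) + ...
                            sum(sum(Theta2(:, 2:end) .^ 2)));
J = J + reg;
```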

Note that computing the value of the cost function is not strictly required for the parameter updates themselves, which only use its gradient; still, monitoring J during training is a useful check that optimization is converging.

### Backpropagation

Recall that the intuition behind the backpropagation algorithm is as follows. Given a training example $(x^{(t)},y^{(t)})$, we will first run a “forward pass” to compute all the activations throughout the network, including the output value of the hypothesis $h_Θ(x)$. Then, for each node j in layer l, we would like to compute an “error term” $δ_j^{(l)}$ that measures how much that node was “responsible” for any errors in our output.

Our goal is to minimize J(W,b) as a function of W and b. To train our neural network, we will initialize each parameter $W^{(l)}_{ij}$ and each $b_i^{(l)}$ to a small random value near zero, and then apply an optimization algorithm such as batch gradient descent. Since J(W,b) is a non-convex function, gradient descent is susceptible to local optima; however, in practice gradient descent usually works fairly well. Finally, note that it is important to initialize the parameters randomly, rather than to all 0’s. If all the parameters start off at identical values, then all the hidden layer units will end up learning the same function of the input.
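Random initialization can be sketched like this (the range `epsilon_init` is a common heuristic, and the size variables `input_size`, `hidden_size`, `num_labels` are assumed names for the layer dimensions):

```matlab
% Break symmetry by initializing each weight uniformly in
% [-epsilon_init, epsilon_init]; the exact range is a heuristic.
epsilon_init = 0.12;
Theta1 = rand(hidden_size, input_size + 1) * 2 * epsilon_init - epsilon_init;
Theta2 = rand(num_labels, hidden_size + 1) * 2 * epsilon_init - epsilon_init;
```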

One iteration of gradient descent updates the parameters W, b as follows: $$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b), \qquad b_{i}^{(l)} := b_{i}^{(l)} - \alpha \frac{\partial}{\partial b_{i}^{(l)}} J(W,b)$$

where α is the learning rate. The key step is computing the partial derivatives above. We will now describe the backpropagation algorithm, which gives an efficient way to compute these partial derivatives.
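In MATLAB, one such update step might look like this (a sketch; `Theta1_grad` and `Theta2_grad` stand for the partial derivatives that backpropagation will compute, and `alpha` is the learning rate):

```matlab
% One gradient-descent step: move each parameter matrix a small
% distance against its gradient.
Theta1 = Theta1 - alpha * Theta1_grad;
Theta2 = Theta2 - alpha * Theta2_grad;
```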

Suppose we have a three-layer neural network: an input layer (layer 1), one hidden layer (layer 2), and an output layer (layer 3).

Here are the steps for calculating the gradient:

- 1) Perform a feedforward pass, computing the activations $(z^2, a^2, z^3, a^3)$ for layers 2 and 3.
- 2) For each output unit k in layer 3 (the output layer), set $δ_k^{(3)} = a_k^{(3)} - y_k$
- 3) For the hidden layer l = 2, set $δ^{(2)} = \big(Θ^{(2)}\big)^T δ^{(3)} \odot g'\big(z^{(2)}\big)$, where $\odot$ is the element-wise product and $g'(z) = g(z)\big(1 - g(z)\big)$ is the sigmoid gradient.
- 4) Accumulate the gradient from this example using $Δ^{(l)} = Δ^{(l)} + δ^{(l+1)} \big(a^{(l)}\big)^T$. Note that you should skip or remove $δ_0^{(2)}$.
- 5) Obtain the gradient for the neural network cost function by dividing the accumulated gradients by m: $\frac{\partial}{\partial Θ_{ij}^{(l)}} J(Θ) = \frac{1}{m} Δ_{ij}^{(l)} + \frac{λ}{m} Θ_{ij}^{(l)}$ for $j \ge 1$ (no regularization for the bias column $j = 0$).

Note that if not using the regularization term, then the gradient is simply $\frac{\partial}{\partial Θ_{ij}^{(l)}} J(Θ) = \frac{1}{m} Δ_{ij}^{(l)}$.
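The steps above (without the regularization term) can be sketched in MATLAB as follows, assuming the forward-pass quantities `A1`, `A2` (both with their +1 intercept columns), `Z2`, `A3`, the labels `Y`, and `Theta2` are already in scope:

```matlab
% Vectorized backpropagation over all m examples, unregularized.
Delta3 = A3 - Y;                            % step 2: output-layer error
G2 = sigmoid(Z2) .* (1 - sigmoid(Z2));      % sigmoid gradient g'(z^(2))
Delta2 = (Delta3 * Theta2) .* [ones(m, 1) G2];
Delta2 = Delta2(:, 2:end);                  % step 3: drop delta_0^(2)
Theta1_grad = (Delta2' * A1) / m;           % steps 4-5: accumulate and
Theta2_grad = (Delta3' * A2) / m;           % average over the m examples
```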