Sunday, July 24, 2016

Cost Functions and the Backpropagation Derivation



There are dozens of ways to construct neural networks and it can be very difficult to decipher which are best used for which problems and why. However, while there are stark differences between different neural networks, there are elements common to all. We will call these the vital elements. They are the DNA of the neural network: each network is different, but each adheres to the same general principles. These vital elements are the cost function and the backpropagation calculation (the activation function is another vital element; we will discuss it somewhat here, but I will go into more depth about the activation function in future posts). In this post I want to provide the general derivation of the backpropagation algorithm. The principles from this derivation are used to compute the gradients of any neural network. I will also provide a few of the more popular cost functions and their gradients at the conclusion of this post. We will not go into great detail about the benefits of each, but my hope is that by understanding the cost function and backpropagation in general we can better understand their applications in the specific.



Before I start I want to get some nomenclature out of the way just so we know what we are talking about later on in the post. Refer back to this as often as you need because this will be the basis for all the derivations to come. Here we are talking about a feedforward neural network and this is the notation I will be using.
  • Input - $X$
  • Labelled data - $y$
  • Output estimate - $a^L$
  • Weights (sometimes called thetas) - $w$
  • Bias - $b$
  • Gradient - $\nabla$
  • Superscripts refer to the layer; $L$ denotes the output layer, $l$ denotes any other layer
  • Subscripts refer to the connection of individual neurons between layers, usually denoted $jk$ where $j$ refers to the neuron in the $l$th layer and $k$ refers to the neuron in layer $l-1$ 
  • Activation function - $\sigma$
  • $a^l_j = \sigma(\sum\limits_k (w^l_{jk} \cdot a^{l-1}_k) + b^l_j)$
  • $z^l_j = \sum\limits_k (w^l_{jk} \cdot a^{l-1}_k) + b^l_j$
  • $a^l_j$ is the activation of the $j$th neuron in the $l$th layer, $a^1_j$ is the activation of the $j$th neuron in the input layer

This is an example of a network with the notation we will be using.
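To make the notation concrete, here is a minimal NumPy sketch of the forward pass through one layer, assuming a sigmoid activation and toy sizes of my own choosing (the names `W`, `b`, `a_prev` are illustrative, not from any library):

```python
import numpy as np

def sigmoid(z):
    # The sigmoid activation function, sigma(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical sizes: 3 neurons in layer l-1, 2 neurons in layer l
a_prev = np.array([0.5, 0.1, 0.9])   # a^{l-1}, activations of the previous layer
W = np.random.randn(2, 3)            # w^l, where W[j, k] connects neuron k in l-1 to neuron j in l
b = np.random.randn(2)               # b^l, one bias per neuron in layer l

z = W.dot(a_prev) + b                # z^l_j = sum_k w^l_{jk} a^{l-1}_k + b^l_j
a = sigmoid(z)                       # a^l_j = sigma(z^l_j)
```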

There are two rules which all cost functions must adhere to that I want to mathematically spell out here as well. In order to perform backpropagation it must be possible to write the cost, $C$, as an average over the costs $C_x$ of individual training examples $x$.
$$C = \frac{1}{n}\sum_{x}C_x$$
This is important because backpropagation computes the gradient one training example at a time; if we could not average the per-example costs we could not recover the gradient of the total cost with respect to our weights and biases.
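As a quick illustration of this rule, here is a hedged sketch (using the quadratic cost defined later in the post and invented toy data) showing the total cost written as an average of per-example costs $C_x$:

```python
import numpy as np

# Toy outputs a^L and labels y for n = 4 training examples (one example per row)
a_L = np.array([[0.8, 0.2], [0.1, 0.9], [0.6, 0.4], [0.3, 0.7]])
y   = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

# Per-example quadratic cost: C_x = 0.5 * ||y - a^L||^2
C_x = 0.5 * np.sum((y - a_L) ** 2, axis=1)

# Total cost as the average over individual costs: C = (1/n) * sum_x C_x
C = np.mean(C_x)
```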

The other rule is that the cost, $C$, must not depend upon any activation layers apart from $a^L$, where $a^L$ refers to the activation of the output layer. This is because if the cost were to rely on any activation units apart from $a^L$, backpropagation would become invalid: it would be impossible to move backwards through the network because parts of the cost would refer to different hidden layers. Let's now work through the principles of backpropagation to mathematically show why our network learns.

Not All Cost Functions Are Created Equal 
The goal of any neural network is to find weights which drive the cost towards 0. Different cost functions accomplish this goal differently and at different speeds. Our goal is to choose an appropriate cost function which will effectively drive our cost close to 0. The speed at which the cost and activation functions are able to work together to accomplish this goal will play a huge part in determining our selection. Part of the problem is what is known as the learning slowdown: as the weights get closer to their correct values they start to take smaller and smaller steps towards those values. Depending on how early this slowdown starts, it can take the network a very long time to converge. Another problem is that in some network structures certain neurons learn much more slowly, if at all, than others. Together these problems can cause certain network structures to behave very poorly with certain data types and cost functions. I will tackle the learning slowdown in more depth in a future post, but keep in mind for now that these are the main reasons we have so many different cost functions.

In order to understand how the gradient minimizes the cost function, let's mathematically generalize what we mean by backpropagation. These formulas will be helpful, particularly as we dive deeper into discussions about neural networks, because their properties can be applied to any network and cost function. At first glance most of these formulas will appear abstract and a bit confusing; however, an understanding of them will lead to a much deeper understanding of any neural network's structure.

Backpropagation is used to calculate how much of the cost (also occasionally referred to as loss or error) can be attributed to any specific weight or bias. To start this calculation we must first find the error, i.e. how much of the cost can be attributed to each neuron in the output layer, $a^L$.
$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$$
This gives the general equation for the rate of change of the cost with respect to each output neuron ($^L_j$ refers to the $j$th neuron in the $L$th layer). $\frac{\partial C}{\partial a^L_j}$ tells us how fast the cost changes with respect to the $j$th output activation. $\sigma'(z^L_j)$ tells us how fast the activation function, $\sigma$, is changing at $z^L_j$. A value near 0 for either $\frac{\partial C}{\partial a^L_j}$ or $\sigma'(z^L_j)$ means the error $\delta^L_j$ is small and therefore $C$ relies very little upon that unit. Rewriting the above equation in matrix notation we get,
$$\delta^L = \frac{\partial C}{\partial a^L} \sigma'(z^L) = \nabla_a C \odot \sigma'(z^L) \tag{1}$$
where $\nabla_a C$ is the gradient vector representation of $\frac{\partial C}{\partial a^L_j}$. The gradient is generally a vector of all the partial derivatives of a given function, but in this case we are merely taking the partials with respect to the $a^L_j$. The gradient is usually denoted with a $\nabla$, but in this instance our vector contains only the derivatives of the cost with respect to $a^L$, which is why we use the notation $\nabla_a C$ rather than $\nabla C$. It might help to think of this as a partial gradient vector. The $\odot$ is the Hadamard product, i.e. elementwise multiplication, and $\sigma'(z^L)$ is the same as it was above. With this in mind we can see that the two equations are equivalent, simply written in different notation.
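Here is a minimal sketch of equation (1) in NumPy, assuming a quadratic cost (so that $\nabla_a C = a^L - y$, as computed later in the post) and a sigmoid output layer; the toy values and variable names are mine:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid: sigma'(z) = sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z_L = np.array([0.4, -1.2, 2.0])        # weighted inputs z^L to the output layer
a_L = sigmoid(z_L)                      # output activations a^L
y   = np.array([0.0, 0.0, 1.0])         # labels

grad_a  = a_L - y                       # nabla_a C for the quadratic cost
delta_L = grad_a * sigmoid_prime(z_L)   # equation (1): elementwise (Hadamard) product
```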

Now that we have an equation for the error of $a^L$ let's figure out how to find the equation for the error of $a^l$. To find this we will need to use the information we obtained from computing $\delta^L$ and move backwards through the network, hence the name backpropagation. The next equation we will need is
$$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \tag{2}$$
In order to compute the error of the layer just prior to $a^L$, $\delta^{L-1}$, we will need the weights which attach to the output layer, $w^{L}$, the error associated with the output, $\delta^{L}$, and the derivative of the activation function at layer $L-1$. All of this information we already have! We have not yet computed $\sigma'(z^l)$, but we can easily find it by taking the derivative of the activation function with respect to $z$ and plugging our numbers in. After finding $\delta^{L-1}$ we can move backward to find $\delta^{L-2}$. We can continue this process all the way backward to $\delta^2$. We do not compute $\delta^1$ because those are our inputs, which are constants and therefore cannot be responsible for any of the error.
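Below is a hedged sketch of this backward sweep, equation (2) applied layer by layer. The container layout (`weights[i]` is $w^l$ for layer $l = i+2$, `zs[i]` is $z^l$ for the same layer) is my own convention, not from any particular library:

```python
import numpy as np

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

def backpropagate_errors(weights, zs, delta_L, sigma_prime):
    """Apply equation (2) repeatedly, walking backwards from delta^L to delta^2."""
    deltas = [delta_L]
    # Pair each w^{l+1} with z^l, moving from the back of the network towards the front
    for W_next, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        # delta^l = ((w^{l+1})^T delta^{l+1}) elementwise-times sigma'(z^l)
        delta = W_next.T.dot(deltas[0]) * sigma_prime(z)
        deltas.insert(0, delta)   # prepend so deltas ends up ordered from layer 2 to layer L
    return deltas

# Toy usage: layer sizes 3 -> 4 -> 2, so weights = [w^2, w^3] and zs = [z^2, z^3]
weights = [np.random.randn(4, 3), np.random.randn(2, 4)]
zs = [np.random.randn(4), np.random.randn(2)]
delta_L = np.random.randn(2)
deltas = backpropagate_errors(weights, zs, delta_L, sigmoid_prime)   # [delta^2, delta^3]
```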

Now that we have calculated the errors of each activation layer, the only steps left are to find out how much of that error can be attributed to each weight, $w$, and each bias, $b$. The calculations are almost identical, but there is a slight difference. First, let's tackle the tougher one to compute, the weights.
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{3}$$
This is read as the partial derivative of the cost with respect to the weight in layer $l$ which connects neuron $k$ in layer $l-1$ with neuron $j$ in layer $l$ (remember that the subscripts of the weights are read backwards: neuron $k$ in layer $l-1$ connects to neuron $j$ in layer $l$). It is easier to conceptualize if you forget about the subscripts and merely read this as $\frac{\partial C}{\partial w} = a_{in}\delta_{out}$. Here is a visualization to help if the weight subscripts are hard to picture.



The last equation we need is the rate of change of the cost with respect to the biases.
$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{4}$$
The reason the $a^{l-1}_k$ term drops out is that the bias is added after the previous layer's activations have already been weighted and summed, so $z^l_j$ depends on $b^l_j$ with a coefficient of 1 and the derivative carries no information from the previous layer. We now have all the information to compute the gradient with respect to the biases and weights. With this information we can update our weights and biases and hopefully make more accurate estimates of our data.
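A short sketch of equations (3) and (4) for one layer: the weight gradient is the outer product of the layer's error with the previous layer's activations, and the bias gradient is the error itself (the toy values and names are mine):

```python
import numpy as np

# delta_l: error vector for layer l (2 neurons); a_prev: activations a^{l-1} (3 neurons)
delta_l = np.array([0.2, -0.5])
a_prev  = np.array([0.5, 0.1, 0.9])

# Equation (3): dC/dw^l_{jk} = a^{l-1}_k * delta^l_j  ->  an outer product
grad_w = np.outer(delta_l, a_prev)   # same shape as w^l, here (2, 3)

# Equation (4): dC/db^l_j = delta^l_j
grad_b = delta_l
```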

Let's quickly recap the equations we just went over and what they are used for. After computing the cost, $C$, we need to find the error of the output layer, which we compute by taking the partial derivative with respect to the output layer, $a^L$. This is given by the equation $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$, or written in vector notation, $\delta^L = \nabla_a C \odot \sigma'(z^L)$. These equations tell us how much the cost, $C$, changes as the output, $a^L$, changes. After we have calculated the error of the output we find the error of each layer $l$ with the equation $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$. We move backwards through the network, using the weights and errors of layer $l+1$ together with $\sigma'(z^l)$ to find the error of layer $l$. Finally, we use this information to compute the rate of change of the cost with respect to each weight and bias. These are given by the equations $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$ respectively.

If you have never seen these equations before, that is a lot to take in and probably doesn't all quite make sense yet, but in the next section we will work through their derivations and hopefully that will solidify the information a little more. As I mentioned at the top of the post there are tons of different cost functions, but all adhere to the two rules we mentioned earlier and all can be generalized through the four equations for computing gradients we just went over. Apart from that, though, each cost function can behave very differently. Next I will provide the proofs of these theorems and then I will present a few cost functions and their gradient calculations. In a future post I will go into more detail about the positives and negatives of certain cost functions, but my hope with this post is that we can see what cost functions are available and maybe gain a little insight into them.

Proof of the generalized theorems of backpropagation
If you remember the chain rule from calculus this will be fairly easy for you. If not, we will go over things, but the basic rule of backpropagation is to just keep chain ruling. Let's take a look at our equation for computing the error of the output layer, $\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j)$. Since the second part of our equation, $\sigma'(z^L_j)$, is really just the derivative of the activation function with respect to $z^L_j$, we can rewrite it as $\frac{\partial a^L_j}{\partial z^L_j}$. Our original equation for the error can now be rewritten
$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}$$
Hopefully the structure of this equation looks familiar, because it is just the chain rule. All we need to do is multiply $\frac{\partial C}{\partial a^L_j}$ by $\frac{\partial a^L_j}{\partial z^L_j}$ and we are left with $\frac{\partial C}{\partial z^L_j}$, or in other words,
$$\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) = \frac{\partial C}{\partial z^L_j}$$
and with that we have shown that our delta calculation is equivalent to the rate of change of $C$ with respect to the weighted inputs $z^L$ of layer $L$.

Now that we have this information let's keep working our way backwards in order to compute $\delta^l_j$. When computing $\delta^l_j$ what we are really doing is calculating $\frac{\partial C}{\partial z^l_j}$. To calculate this we will need to use the chain rule again, this time using $\delta^{l+1}_k$ (note that here $k$ indexes the neurons in layer $l+1$, each of which receives input from neuron $j$ in layer $l$). We have already computed $\delta^{l+1}_k$, so we can use it to calculate $\delta^l_j$.
$$\delta^l_j = \frac{\partial C}{\partial z^l_j} = \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} = \sum_k \delta^{l+1}_k \frac{\partial z^{l+1}_k}{\partial z^l_j}$$
The reason we need to sum over $k$ is that neuron $j$ in layer $l$ feeds into every neuron $k$ in layer $l+1$, so the cost depends on $z^l_j$ through all of the $z^{l+1}_k$; summing over $k$ collects all of those contributions. From the definition of
$$z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k$$
we find, by differentiating with respect to $z^l_j$,
$$\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j)$$
we get the last piece of the puzzle. We can plug this back into our equation for $\delta^l_j$ and now we get
$$\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j)$$ or written another way $$\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$$
Now that we have proved the equations for $\delta^L$ and $\delta^l$, the only things left to do are prove the equations for the explicit rates of change with respect to the weights and biases. In order to do that we will need to utilize some of the information we have already gathered. We want to compute $\frac{\partial C}{\partial w^l_{jk}}$, and because we already have the calculation $\frac{\partial C}{\partial z^l_j} = \delta^l_j$, all we need to do is use the chain rule once more to compute,
$$\frac{\partial C}{\partial w^l_{jk}} = \delta^l_j \frac{\partial z^l_j}{\partial w^l_{jk}}$$
From the definition of $z$ we know that $z^l_j = \sum_k w^l_{jk} a^{l-1}_k + b^l_j$. If we differentiate this with respect to $w^l_{jk}$ we are left with $a^{l-1}_k$. Plugging this into our equation we find that
$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$$ which is exactly what we would expect to find.

Lastly, let's prove $\frac{\partial C}{\partial b^l_j} = \delta^l_j$. For this we can use the exact same definition of $z^l_j$ that we just used, however this time we will differentiate with respect to $b^l_j$. Since $b^l_j$ appears in $z^l_j$ with a coefficient of 1, the partial derivative $\frac{\partial z^l_j}{\partial b^l_j}$ is simply 1. We can plug this into
$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^l_j}\frac{\partial z^l_j}{\partial b^l_j} = \delta^l_j$$
And there we have it! We have now proven all the general theorems of backpropagation. This might not make total sense yet, but hopefully over the course of this and future blog posts I can clear up any uncertainties. This is a lot to take in, but once you really understand it you can apply it to any neural network. That's what is really useful about these formulas and why I wanted to spend so much time here to prove them. This is the basis for all neural networks: we just proved how the backpropagation algorithm works for any cost function. If a cost function does not adhere to these formulas then it is not a proper cost function capable of backpropagation.
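One way to convince yourself that the four equations are correct is a numerical gradient check: compare the backpropagated gradient against a finite-difference estimate on a tiny network. This is a sketch under my own assumptions (one hidden layer, sigmoid activations, quadratic cost), not code from the post:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W2, b2, W3, b3):
    z2 = W2.dot(x) + b2
    a2 = sigmoid(z2)
    z3 = W3.dot(a2) + b3
    a3 = sigmoid(z3)
    return z2, a2, z3, a3

def cost(x, y, W2, b2, W3, b3):
    # Quadratic cost for a single training example
    return 0.5 * np.sum((y - forward(x, W2, b2, W3, b3)[3]) ** 2)

np.random.seed(0)
x, y = np.random.randn(3), np.array([0.0, 1.0])
W2, b2 = np.random.randn(4, 3), np.random.randn(4)
W3, b3 = np.random.randn(2, 4), np.random.randn(2)

# Backpropagation using the equations derived above
z2, a2, z3, a3 = forward(x, W2, b2, W3, b3)
delta3 = (a3 - y) * a3 * (1 - a3)           # equation (1)
delta2 = W3.T.dot(delta3) * a2 * (1 - a2)   # equation (2)
grad_W3 = np.outer(delta3, a2)              # equation (3)

# Finite-difference check on one weight, e.g. W3[0, 1]
eps = 1e-6
W3_plus, W3_minus = W3.copy(), W3.copy()
W3_plus[0, 1] += eps
W3_minus[0, 1] -= eps
numeric = (cost(x, y, W2, b2, W3_plus, b3) - cost(x, y, W2, b2, W3_minus, b3)) / (2 * eps)
print(np.isclose(grad_W3[0, 1], numeric))   # should print True
```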

Let's now take a look at a few cost functions. I'm not going to go into a lengthy derivation of each function's full gradient. Instead I will provide the formulas plus the derivative with respect to the output layer. I hope to do a whole series on cost functions where I will go into more detail about each of the functions and provide their full gradient derivations. If you would like, you can attempt to calculate the gradients with respect to the weights and biases yourself. For now, however, I hope an overview of some of the more common cost functions will suffice. We are far from done with a complete understanding of the cost function, but if you have followed me up to this point we are on the right path.

Quadratic
$$C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2$$
This function adheres to the two rules we stated earlier: the only activation layer it relies upon is $a^L$ and it is possible to average it across all inputs. ($\|y-a^L\|$ is the norm of $y - a^L$, a measure of the distance between the two vectors.) Computing the gradient we get
$$\nabla_a C = a^L - y$$
This cost is also commonly referred to as the mean squared error or sum of squared errors (and corresponds to maximum likelihood estimation under Gaussian noise).

Cross-Entropy
$$C = - \frac{1}{n} \sum_x \sum_j [y_j \ln a^L_j + (1-y_j)\ln(1-a^L_j)]$$
gradient
$$\frac{\partial C}{\partial a^L_j} = \frac{a^L_j - y_j}{a^L_j(1-a^L_j)}$$
This cost is typically used for binary classification with the sigmoid activation function, defined as $\sigma(z) = 1/(1+e^{-z})$. Since $\sigma'(z^L_j) = a^L_j(1-a^L_j)$, the output error simplifies to $\delta^L_j = a^L_j - y_j$, which helps avoid the learning slowdown.
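A hedged sketch of the cross-entropy cost and its gradient for a single training example with sigmoid outputs (toy values and variable names are mine); note how multiplying by $\sigma'(z^L) = a^L(1-a^L)$ collapses $\delta^L$ to $a^L - y$:

```python
import numpy as np

z_L = np.array([1.5, -0.5])
a_L = 1.0 / (1.0 + np.exp(-z_L))     # sigmoid output activations
y   = np.array([1.0, 0.0])

# Cross-entropy cost for one example
C = -np.sum(y * np.log(a_L) + (1 - y) * np.log(1 - a_L))

# Gradient with respect to the output activations
grad_a = (a_L - y) / (a_L * (1 - a_L))

# Multiplying by sigma'(z^L) = a_L * (1 - a_L) cancels the denominator
delta_L = grad_a * a_L * (1 - a_L)   # equals a_L - y
```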

Log-Likelihood
$$C \equiv -\ln a^L_y$$
gradient (taken with respect to the weighted inputs $z^L$, since the softmax couples all the output activations)
$$\delta^L_j = \frac{\partial C}{\partial z^L_j} = a^L_j - y_j$$
This function is essentially the same as cross-entropy except it is used for multi-class classification rather than binary classes. The softmax activation function is used here rather than the sigmoid, where the softmax is defined as $a^L_j = \frac{e^{z^L_j}}{\sum_k e^{z^L_k}}$.
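Here is a small sketch of the softmax output and log-likelihood cost for one example, assuming the correct class is stored as an index (the values and names are illustrative):

```python
import numpy as np

z_L = np.array([2.0, 1.0, 0.1])
a_L = np.exp(z_L) / np.sum(np.exp(z_L))   # softmax: a^L_j = e^{z^L_j} / sum_k e^{z^L_k}

correct_class = 0                          # index of the true label y
C = -np.log(a_L[correct_class])            # log-likelihood cost: -ln a^L_y

# Output-layer error: delta^L_j = a^L_j - y_j, with y one-hot
y = np.zeros_like(a_L)
y[correct_class] = 1.0
delta_L = a_L - y
```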

Kullback-Leibler Divergence
Kullback-Leibler divergence is a measure of how much information is lost when one probability distribution is used to approximate another and is denoted
$$D_{\mathrm{KL}}(P\|Q) = \sum_i P(i) \, \ln\frac{P(i)}{Q(i)}$$
This measures how much information is lost when $Q$ is used to approximate $P$. In our instance the distributions we are concerned with are $y$ and $a^L$, specifically how much information is lost when we attempt to approximate $y$ with $a^L$. This is a regression-style cost function used to fit continuous probability distributions, and it is denoted,
$$C = \sum\limits_j y_j \ln \frac{y_j}{a^L_j}$$
gradient
$$\frac{\partial C}{\partial a^L_j} = -\frac{y_j}{a^L_j}$$
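A sketch of the KL divergence cost between a target distribution $y$ and the network output $a^L$, using toy probability vectors of my own choosing:

```python
import numpy as np

y   = np.array([0.7, 0.2, 0.1])   # target distribution
a_L = np.array([0.6, 0.3, 0.1])   # network output (assumed to sum to 1)

# C = sum_j y_j * ln(y_j / a^L_j)
C = np.sum(y * np.log(y / a_L))

# Partial derivative with respect to each output activation: dC/da^L_j = -y_j / a^L_j
grad_a = -y / a_L
```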

Hinge Loss
This is often used with support vector machines, but even for neural networks it is still a useful cost/loss function to be familiar with.
$$C = \max(0,1-y \cdot a^L)$$
Here the activation function outputs a value for $a^L$ between -1 and +1, and the labels $y$ take values in $\{-1, +1\}$.
gradient
$$\nabla_a C = \begin{cases} -y & \text{if } y \cdot a^L < 1 \\ 0 & \text{if } y \cdot a^L \geq 1 \end{cases}$$
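And a sketch of the hinge loss with a label in $\{-1, +1\}$ and a scalar output (toy values, names are mine):

```python
y   = 1.0                         # label in {-1, +1}
a_L = 0.4                         # scalar network output

C = max(0.0, 1.0 - y * a_L)       # hinge loss

# Piecewise gradient with respect to a^L
grad_a = -y if y * a_L < 1 else 0.0
```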

This is far from an exhaustive list of all the cost functions, but these are a few of the more popular ones. This post is quite lengthy, but if you made it this far I hope you learned a thing or two. I am planning to start a series which delves deeper into each of the cost functions (and a few more) that I only briefly touched on. I hope that this post will serve as a resource for future posts, in which I can point back here where the fundamentals are laid out. I'm also planning to do a comprehensive post on activation functions. Since the cost function and the activation function go hand in hand, I plan to work on these simultaneously. Next week I plan to write a coding example on convolutional neural networks using theano and lasagne. I'll post a few links about how to set up CUDA, theano and lasagne to go along with that, because getting those to work is a huge pain! If you have any complaints or suggestions please let me know in the comments or send me an email.
