In one of my previous posts I showed how to build a neural network from scratch. Convolutional Neural Networks are a little more complex than the basic batch gradient descent model, but the characteristics are largely the same. I will not be building a CNN from scratch here (I would like to in the near future); instead, I want to go over the basics of how a CNN works, how it differs from a fully connected network, and why it performs so well at computer vision tasks. I will provide another post where I show how to code up a CNN that gets us a score of well over 99% accuracy on the MNIST data set. We can also try it out on some more interesting data sets as well.
In order to understand why CNNs are so powerful, let's first look at some of the problems that undermine the fully connected network. A fully connected neural network consists of every neuron in layer $l$ having a connection with every neuron in layer $l+1$. For us to create a flexible network which adapts well to new information we want to include many hidden layers. A deeper network means more neurons, which in turn give us the ability to learn more precise features. The problem arises, however, when we realize that each successive layer adds more and more free parameters. The network must learn each of these free parameters, so deeper and deeper networks quickly become computationally expensive. As the network gets deeper, learning slows down because all the extra free parameters must be updated on each iteration. Another problem is that backpropagation with certain activation functions tends to update weights close to the output much faster than weights near the inputs. This is what is known as the vanishing gradient problem. Both of these problems contribute to deep, fully connected nets learning rather slowly and failing to converge to a good minimum. The CNN provides a solution to both of these issues.
Local Receptive Field
Let's look at the computation issue first, since a solution to it will give us the structure of the CNN. The problem with the fully connected network was that we were trying to do everything at once. There really is no need to connect each neuron in layer $l$ to every neuron in layer $l+1$. So let's try a different approach and see what happens. Instead, we take a subset of the data (say a 5x5 pixel area) and connect each of these 25 inputs to just one hidden neuron. This subset of our data is called a local receptive field.
After computing the weighted sum over this local receptive field and adding a bias unit, we move the field over slightly (say 1 pixel). This distance is called the stride length. We then scan this new section and map it to the next hidden neuron, and we keep moving over until we reach the end of the row, at which point we move the field down one stride and scan across again. We continue this process until every section has been scanned and mapped to its own neuron. Think of it like we are sliding across the image. The mathematical term for this sliding operation is convolution, which is where the CNN gets its name.
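To make the sliding concrete, here is a minimal sketch in Python/NumPy; the 28x28 image size and the variable names are my own assumptions for illustration.

```python
import numpy as np

# A toy 28x28 grayscale image (the size is an assumption, e.g. an MNIST digit).
image = np.random.rand(28, 28)

field = 5    # size of the local receptive field (5x5)
stride = 1   # how far the field moves between neighbouring hidden neurons

# Slide the field across the image; each position feeds one hidden neuron.
patches = []
for j in range(0, image.shape[0] - field + 1, stride):
    for k in range(0, image.shape[1] - field + 1, stride):
        patches.append(image[j:j + field, k:k + field])  # the 25 inputs for this neuron

# A 28x28 image with a 5x5 field and stride 1 gives 24 x 24 = 576 hidden neurons.
print(len(patches))  # 576
```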
Shared Weights and Kernels
This is sounding less computationally taxing, but if we map a new set of weights to each neuron, that is still a lot of free parameters. This is where the CNN becomes really powerful. Instead of creating a different mapping with a different set of weights for each local receptive field, every neuron in our hidden layer shares the same set of weights. Let me repeat that: every neuron shares the same weights and bias.
At this point you might be more confused than anything, and if you are thinking that a layer which shares the same weights and bias sounds like it can't do a whole lot, you are exactly right. By itself it is virtually useless, but if we stack many of these kernels, or filters, together we can learn a wide variety of distinct and complex features. Instead of building our network deep we are building it wide.
If we think about the way we view objects it may become a bit clearer why this is such a valuable asset. When perceiving an object the first things we decipher are the borders. The borders are how objects distinguish themselves from the surrounding space. Borders possess sharp, distinct edges, making them easy to identify. We identify an object in much the same way we might build a puzzle: we start at the edges because the sharp corners are easily distinguishable, then work our way inward, associating pieces with similar characteristics with one another. Borders and distinct lines are also the first things a CNN perceives.
[Figure: a depiction of the types of features the first layer in a CNN will decipher. Notice the stark contrast in textures and how the lines point in all different directions.]
Shared weights also allow us to identify objects no matter where they are in an image or how much of the image they take up. A network which can only find large birds centered in an image is not of much use. We want our network to have the ability to distinguish the bird no matter where it is in the image. Shared weights which convolve over the image give us this ability. If there is a bird in the image that a human could see, we want our network to see it too.
The equation for calculating a hidden neuron's activation from the shared weights and bias in the CNN is as follows:
$$\sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4 w_{l,m} a_{j+l, k+m} \right)$$
where $\sigma$ denotes the activation function, $b$ is the shared bias, $w_{l,m}$ are the shared weights associated with the 5x5 local receptive field, and $a_{x,y}$ is the input activation at position $x, y$; the indices $j, k$ give the position of the receptive field's top-left corner.
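As a rough sketch of this equation in code (the sigmoid activation, the 28x28 input, and the helper name `feature_map` are assumptions of mine, not part of any particular library), one shared kernel applied at every position produces one feature map:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feature_map(a, w, b):
    """Apply one shared 5x5 kernel w and shared bias b to the input activations a,
    following the equation above (stride 1, no padding)."""
    field = w.shape[0]
    rows = a.shape[0] - field + 1
    cols = a.shape[1] - field + 1
    out = np.zeros((rows, cols))
    for j in range(rows):
        for k in range(cols):
            # b + sum over l, m of w[l, m] * a[j + l, k + m]
            out[j, k] = sigmoid(b + np.sum(w * a[j:j + field, k:k + field]))
    return out

a = np.random.rand(28, 28)   # hypothetical 28x28 input activations
w = np.random.randn(5, 5)    # one shared 5x5 weight kernel
b = 0.0                      # shared bias
print(feature_map(a, w, b).shape)  # (24, 24)
```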
After learning the characteristics of the edges, each successive layer learns more and more precise features. If we return to our puzzle analogy, this is much the same way that after completing the edges we may start building some of the very distinct structures within the puzzle. Perhaps there is a large red object in one section of the puzzle; we begin to associate all the red pieces with that part of the puzzle. Our network begins by identifying the edges, then moves toward identifying general structures, each layer becoming more and more precise. Because the features being learned by each layer are becoming more and more precise, it is useful to employ successively larger sets of kernels in each layer. This allows each layer's feature-recognition abilities to become more exact and diverse than its predecessor's. Once we reach the end of our network we throw on one or two fully connected layers which pull the whole thing together and finally spit out an output. This approach allows us to greatly reduce the number of free parameters a fully connected network would have while simultaneously teaching the network to identify very specific features.

Another advantage a CNN has over a fully connected network is the interpretability of its layers. Rather than being a jumbled mess of different weights, a CNN learns recognizable patterns which become more distinct as the network grows deeper.
Zero Padding
Another idea which you might find troubling is how we deal with the pixels on the edge. Sure, they will be accounted for in at least one local receptive field, but if we are using a local receptive field of size 5x5 and a stride length of 1, the pixels near the center of the image will be accounted for 25 times. It seems like we might not be paying enough attention to the pixels at the edge. To account for this there is a technique called zero-padding, where rows and columns of zeros are added around the edge of the image. This allows us to include extra local receptive fields that account for the pixels at the edges. A network which includes zero-padding is said to use wide convolution, while a network which does not is said to use narrow convolution.
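Here is a small sketch of zero-padding with NumPy, assuming the same 28x28 image and 5x5 field as before:

```python
import numpy as np

image = np.random.rand(28, 28)   # hypothetical 28x28 input
field, stride = 5, 1

# Pad with (field - 1) / 2 = 2 rows and columns of zeros on every side so that
# a 5x5 receptive field can be centred on the edge pixels as well.
pad = (field - 1) // 2
padded = np.pad(image, pad, mode="constant", constant_values=0)

# Without padding the output shrinks to 24x24; with padding it stays 28x28.
print((image.shape[0] - field) // stride + 1)   # 24
print((padded.shape[0] - field) // stride + 1)  # 28
```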
Pooling Layer
One other note about the hidden layers of a CNN is that it is common practice to include a pooling layer after each convolutional layer. A pooling layer takes a subset (say 2x2) of a kernel's feature map and maps its information to a single neuron. A common pooling practice is max-pooling, where only the maximum activation from that subset is passed on to the pooling layer. The pooling layer is used to reduce the spatial size of the network and, with it, the number of free parameters. This promotes the CNN's overall goal of being computationally light while still having the ability to learn a diverse set of features. Apart from pooling, another way to reduce the number of free parameters while still leaving the network flexible is to simply increase the stride length and remove the pooling layer altogether. The two ideas are different approaches to the same problem, and maybe in a future post I'll dig deeper into both of them.
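A minimal sketch of max-pooling, assuming a 24x24 feature map like the one produced above (the helper name is hypothetical):

```python
import numpy as np

def max_pool(fmap, size=2):
    """2x2 max-pooling with stride equal to the pool size: each 2x2 block
    of the feature map is replaced by its single largest activation."""
    rows, cols = fmap.shape
    pooled = np.zeros((rows // size, cols // size))
    for j in range(0, rows - size + 1, size):
        for k in range(0, cols - size + 1, size):
            pooled[j // size, k // size] = fmap[j:j + size, k:k + size].max()
    return pooled

fmap = np.random.rand(24, 24)   # hypothetical 24x24 feature map
print(max_pool(fmap).shape)     # (12, 12): a quarter of the activations remain
```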
Activation Function
Multiple kernels at each hidden layer have solved the computation problem, so now let's look at how to solve the vanishing gradient issue. As I mentioned earlier, the vanishing gradient is a product of the sigmoid activation function. The sigmoid is nice because its range falls between $(0, 1)$ and can therefore be interpreted as a probability. If you know a little statistics it might help to think of the sigmoid like a CDF, which means its derivative is the corresponding PDF. The problem is that the shapes of the PDF and CDF are very different, and in particular the difference in range will give us trouble: the range of the derivative of the sigmoid is $(0, 0.25]$. In the backpropagation algorithm we need the derivative of the sigmoid in order to compute the gradient (remember the output-layer error is $\nabla_a C \odot \sigma'(z^L)$, where $\sigma'(z^L)$ is the derivative of the activation function). This means that even when the weighted input $z$ is close to 0, where the derivative peaks, at most 25% of the gradient can be passed from one layer back to the previous one.
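A quick check of that 25% ceiling (a sketch of my own, not from any particular library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1.0 - s)

z = np.linspace(-10, 10, 2001)
print(sigmoid_prime(z).max())  # 0.25, reached at z = 0
```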
As we move backwards to the layers nearer and nearer the inputs we lose exponentially more information. In order to fix this problem we need an activation function which is capable of passing all of its information backwards to the previous layers; in other words, something whose derivative is 1, or very close to it. Enter the rectifier.
$$f(x) = \text{max}(0,x)$$
The derivative of the rectifier is
$$f'(x) = \begin{cases}
0 & \text{if } x \le 0 \\
1 & \text{if } x > 0
\end{cases}$$
This looks like an extremely simple activation function, and that's because it is. There are some really nice properties, though, that make it very useful. The sigmoid is nice because its output is easy to interpret. The rectifier is nice because its derivative is 1 for positive inputs, which means it carries all the information back through the network. The rectifier's range is $[0, \infty)$, and its gradient is 0 if $x$ is less than 0 and 1 if $x$ is greater than 0. This has the property of transferring all the information back through backpropagation. It is also nice because it serves partially as its own form of regularization: by sending all inputs less than 0 to 0, it immediately prunes those connections from the network. This creates the property of relying only on the weights that are necessary rather than relying a little on a lot of different weights. The network is sparse but very adaptable to new information. The hard-max version, $f(x) = \max(0, x)$, is what defines the Rectified Linear Unit (ReLU); its smooth approximation is the softplus function, given by $f(x) = \ln (1 + \text{e}^x)$.
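Here is a short sketch of the rectifier, its derivative, and the softplus approximation (the helper names are my own):

```python
import numpy as np

def relu(z):
    # max(0, x): negative inputs are pruned to zero
    return np.maximum(0.0, z)

def relu_prime(z):
    # gradient is 1 wherever the unit is active, 0 otherwise,
    # so the full gradient is passed back through active units
    return (z > 0).astype(float)

def softplus(z):
    # the smooth approximation ln(1 + e^z) mentioned above
    return np.log1p(np.exp(z))

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(z))        # [0, 0, 0, 0.5, 2]
print(relu_prime(z))  # [0, 0, 0, 1, 1]
print(softplus(z))    # smooth, always-positive values
```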
Both the ReLU and the softplus do a very good job of helping the network find a good minimum. The ReLU has become the more popular choice, but both work very well. There are other activation functions that are used with CNNs, but because of the properties I just mentioned, versions of the rectifier are currently the most popular.
CNNs are very effective at interpreting visual information. They are able to do this through their ability to 'mimic' the way we consciously perceive objects. The current trend is to build learning algorithms which do better and better at mimicking the biology of perception (or at least how we believe perception works). Perhaps in the future we will find alternatives to this approach, but it is currently the best we have. CNNs are the next step on the road toward computer vision, and who knows, maybe they are the answer. I have my reservations, but we will never make the next step if we do not pursue all the avenues afforded to us by CNNs.