Neural networks are one of the most powerful learning techniques, and with Python libraries such as TensorFlow, Theano, and Torch their implementation is only getting easier. Machine learning and neural networks are buzzwords thrown around a lot in business and computer science circles. People seem to think these ideas will unlock the big breakthrough in AI we have been waiting 60 years for. They may be right, but before we crown machine learning as the prized jewel of 21st century ingenuity, let's first figure out how these algorithms work. This will hopefully give us some intuition about what is and is not possible with machine learning. To a majority of the public neural networks are a black box. We want to get past that and understand some of the practical uses of neural networks, as well as some of their potential drawbacks. I believe that the best way to learn about something is to build it yourself, test it, break it, and test it again. That process really challenges our understanding of the algorithm. In this post my goal is to show you how I built a neural network from scratch.
For brevity's sake I will not be going over the full derivations in this post. This post is designed to provide a working example and an intuition for how backpropagation works. If you want a deeper mathematical proof you can read about it here. This is the first coding tutorial on this blog, so please play with the code and send me recommendations.
The neural network we will be working through is a fully connected, feedforward net trained with backpropagation and gradient descent. In terms of neural networks this is about as vanilla as they come, but that is not to understate either the importance or the complexity of the algorithm. Modern artificial neural networks are built on the foundations this algorithm provides. The same principles that underlie the backpropagation algorithm we will be using also generalize to more complicated deep learning architectures such as convolutional and recurrent neural networks.
The task at hand is to create a neural network which can read the pixel data of handwritten digits and accurately predict which digit it is looking at. This data set is great for testing out new ideas because each image is only 784 pixels (28x28). Because the images are so small, even computers without a powerful GPU can build very effective learning algorithms for it. Training on the MNIST data set is the neural network equivalent of the "Hello World" script; almost everyone starts here. Building a model from scratch will be more challenging, but this is more about the learning experience than the actual results.
First, let's get familiar with the data. Here are a couple of examples from the dataset.
The training set consists of 42,000 rows, each row containing the pixel information for a handwritten digit. The training set also contains the proper label for each of the 42,000 digits. Because we are given the proper label for each example, this is a supervised learning experiment. The neurons in our network will be learning the attributes of each of the digits 0-9. We are also provided with a test set consisting of 28,000 examples. The test set does not contain the proper labels, though, because after we have properly trained our network we would like to test how well it adapts to new information.
Now that we have a general idea of how the data is configured, let's load it into Python. In this tutorial I have broken the program up into multiple scripts. This could easily be compiled into one script, but it is good practice, especially once we work on larger projects, to break a project up into multiple pieces. This tactic makes debugging substantially easier and increases readability for others. Our program has been broken up into a datainit.py, a functions.py, and a run.py file. The datainit.py file loads, initializes, and shapes the data into the proper form. The functions.py file is where the algorithm lives; there we have our cost, backpropagation, and feedforward functions. The run.py file is where we run the algorithm using the initialized data.
The majority of this post will be dedicated to the functions file, but I will quickly go over the datainit and run files. The functions file may be the meat of the neural network, but a well planned initialization can make our ANN (artificial neural network) extremely flexible. Flexibility will allow us to easily manipulate the total number of hidden layers and the number of neurons in each hidden layer. If we incorporate this methodology we can easily test different combinations to see which gives us the best results.
import pandas as pd
import numpy as np
import os


def get(hidden_layers, output_layer):
    # load training and test data
    train = pd.read_csv(os.path.join(os.getcwd(), 'train.csv'))
    test = pd.read_csv(os.path.join(os.getcwd(), 'test.csv'))
    # set X and y values (first 40,000 rows for training, last 2,000 for cross validation)
    X = np.matrix(train.loc[:39999, 'pixel0':])
    y = np.matrix(train.loc[:39999, 'label'])
    X_cv = np.matrix(train.loc[40000:, 'pixel0':])
    y_cv = np.matrix(train.loc[40000:, 'label'])
    X_test = np.matrix(test.loc[:, 'pixel0':])
    # set layer values
    input_layer = np.size(X, 1)
    hidden_nodes = hidden_layers + 23
    # create weights info
    layers = []
    layers.append(input_layer)
    for i in range(hidden_layers):
        layers.append(hidden_nodes)
    layers.append(output_layer)
    weights_info = [(layers[i+1], layers[i]) for i in range(len(layers) - 1)]
    weights = [(layers[i+1], layers[i] + 1) for i in range(len(layers) - 1)]
    # create thetas
    orig_thetas = []
    for i in range(len(weights_info)):
        orig_thetas.append(np.matrix(np.random.randn(*weights_info[i])))
        orig_thetas[i] = orig_thetas[i] / np.sqrt(X.shape[1])
        orig_thetas[i] = np.insert(orig_thetas[i], 0, np.random.randn(), 1)
    return X, y, orig_thetas, weights, test, X_cv, y_cv, X_test
In datainit.py we first load the train and test files so that they can easily be passed into the functions file. We also need to split the train data into a training set and a cross validation set. The cross validation set is partly used to check that our model is learning to fit new data, and it is also used to compare different hyper-parameter combinations (i.e., lambda and model sizes). We will get to these later, but it is important now to have a general idea of what the cross validation set is used for and why it is so important. We train our network on the training data, then test it and tweak our hyper-parameters on the cross validation set, before finally running it on the test set. 40,000 rows of the training set are placed into train and the last 2,000 are placed in cv. A general rule of thumb is the 70-30 rule, which says to put 70% of the data into train and 30% into cv. In this experiment I have gone a little heavier on the training side for a couple of reasons: 1) This is a fairly small data set, and generally in machine learning the more data we train with, the more accurate our results will be. 2) Because the rows of information are so similar, 2,000 new examples should be enough to tune our hyper-parameters fairly well. If our dataset were more complex, I would adhere more closely to the 70-30 rule.

The next code block creates our input, hidden layer, and output sizes. By tweaking the hidden_nodes variable we can manipulate the size of each hidden layer. I currently have it set to the number of hidden layers plus 23, which equals 25. The number of hidden layers is instantiated by the __init__ method in run.py. By changing the number of layers it takes, we can easily adjust how many hidden layers our network contains. Play around with that to see what combinations give the best results.

A word of warning: if your data is ordered, make sure to shuffle it before splitting it into training and cross validation sets. If we were to feed the model the digits 0-7 in the training set and 8-9 in the cross validation set, our model would adapt extremely poorly because it has never seen 8's or 9's before. In this case our data comes pre-shuffled, so that's not something we need to worry about.
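If it were not, a minimal shuffle step might look like the sketch below. This is purely illustrative and not part of the actual program; train stands in for the DataFrame loaded in datainit.py.

import pandas as pd

# purely illustrative: randomly reorder the rows before splitting into train and cv
train = pd.read_csv('train.csv')
train = train.sample(frac=1, random_state=0).reset_index(drop=True)
# the first 40,000 rows and the last 2,000 rows are now both representative of all ten digits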
Near the end of the get function we save the shapes of our thetas in a list of tuples called weights so we can recover them later. We also include the bias unit here, which is what the +1 is for. The function ends with orig_thetas. This is where we create our original guess for theta. It will no doubt be a poor estimate, since we are creating these thetas randomly, but how we create them goes a long way toward helping our model avoid saturation. Saturation occurs when the weighted inputs to our neurons grow so large in magnitude that the activation outputs get pushed toward the flat ends of the activation function (for the Sigmoid, toward 0 or 1; different activation functions have different saturation points), where the gradients become tiny and learning stalls. In order to avoid saturation we want our initialized thetas to be clustered around 0, which leaves them plenty of room to move around. The thetas associated with the bias unit are initialized using a Gaussian distribution with μ = 0 and σ = 1. The rest of the thetas (also called weights) are initialized using the same parameters but divided by the square root of the number of features in our data. This clusters our initial thetas very close to zero. We now have all the necessary information to start creating our ANN.
The first thing we will need is our activation function. For this network I have chosen the Sigmoid function. There are loads of other activation functions, and choosing which ones work best for which problems is one of the hottest current research topics in machine learning. I chose the Sigmoid function because it is relatively intuitive, and for a simple learning algorithm like this the choice of activation function will not make a huge impact. The Sigmoid is also often the first activation function taught when learning about neural networks because its output is so simple to interpret.
def sigmoid(z):
    return 1/(1 + np.exp(-z))
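The derivative of the Sigmoid will show up again during backpropagation, where it appears as the activation multiplied by one minus the activation. The gradient code later computes it inline, but a small helper like the hypothetical sigmoid_prime below (not part of the program, just an illustration) makes the relationship explicit:

def sigmoid_prime(z):
    # derivative of the sigmoid, written in terms of its own output:
    # sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z))
    s = sigmoid(z)
    return np.multiply(s, 1 - s)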
We can now use sigmoid in other parts of our code. The next topic we should tackle is the cost function. The idea behind our cost function is that we will feed our inputs forward through the neural network. Our inputs are multiplied by our thetas and the result is run through the Sigmoid function. This output becomes the input for the next layer. We repeat this process until we reach the output layer. The output layer produces an estimate which we compare to the correct answer, y. We can look at how many we got right and how many we got wrong; this is our prediction accuracy. The initial guess we make will be terrible, but we can use this information to compute a cost. If our guess is terrible we will have a really high cost. The closer we get to the actual answer, the smaller our cost gets. Our goal with backpropagation is to reduce the cost. We do this by determining how much of the cost can be attributed to each neuron and then reducing its influence. This is the basic intuition behind the cost function and backpropagation. We will go into more detail in a little bit, but for now let's code up the cost function.
# imports needed in functions.py
import numpy as np
from sklearn.preprocessing import OneHotEncoder


def cost(all_thetas, weights, X, y, lamb):
    thetas = unpack_thetas(all_thetas, weights)
    # normalize the grayscale values and add a column of 1's (bias unit)
    X = X/255
    a1 = np.insert(X, 0, 1, 1)
    # create a binary index matrix of y data and initialize activation layers
    encoder = OneHotEncoder(sparse=False)
    y_matrix = encoder.fit_transform(y.T)
    act_layers = activation_layers(a1, thetas)
    # cost function created in separate parts
    first = np.multiply(-y_matrix, np.log(act_layers[-1]))
    second = np.multiply(1 - y_matrix, np.log(1 - act_layers[-1]))
    # regularization
    reg_1 = lamb/(2 * len(X))
    reg_2 = 0
    for i in range(len(thetas)):
        reg_2 += np.power(thetas[i][..., 1:], 2).sum()
    J = 1/len(X) * (first - second).sum() + (reg_1 * reg_2)
    print('Current Cost')
    print(J)
    print('*' * 20)
    return J
Let's go over what's going on here. The unpack_thetas function at the top is a function we will write later; for now, just know that it creates a list of matrices with our thetas shaped into the proper dimensions. Our X's are grayscale values ranging from 0 to 255 (0 being white and 255 being black). Dividing our inputs by 255 normalizes the data to values between 0 and 1. Our program will still work without this, but it speeds up learning quite a bit. Next we insert a column of 1's into the first column of our X matrix. This is the bias unit, and by adding it we have activated our inputs. Think of the bias unit as the intercept in a linear equation. Without the bias unit the activation function (Sigmoid) only has the ability to adjust the slope of the curve. The bias unit gives the model the flexibility to shift along the axis, just like the intercept in a linear equation allows the line to move up and down along the y-axis. The Sigmoid function outputs values between 0 and 1, and the output layer gives the likelihood that the given inputs correspond to each of the possible labels; the value with the highest likelihood is our estimate for that image. In this instance our output matrix needs to be of dimension m (where m is the number of rows in our input matrix, in this case 42,000 if we are using the full training set) x 10 (the number of possible choices the output can take on). In order to compare these with the y's we need to transform the y's into a comparable format. The way we do this is with OneHotEncoder. This function transforms our y's into an m x 10 matrix with a 1 in the column which corresponds to the proper y label and a 0 in all other columns. Our output layer is now m x 10 and our y_matrix is m x 10. Each column of the output layer corresponds to the likelihood of the given input being equal to its column index. For example, the first column of the output layer gives us the likelihood that each input was a 0.
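To make the encoding concrete, here is a small toy example with made-up labels drawn from the digits 0-2 (in our actual data all ten digits appear, so the encoded matrix has ten columns):

import numpy as np
from sklearn.preprocessing import OneHotEncoder

# toy labels, one per row, just for illustration
labels = np.array([[2], [0], [1]])
encoder = OneHotEncoder(sparse=False)
print(encoder.fit_transform(labels))
# [[0. 0. 1.]
#  [1. 0. 0.]
#  [0. 1. 0.]]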
The act_layers variable holds the result of feeding the first activation layer (a1) forward through the network and creating all the subsequent activation layers. For this we use the activation_layers function, which we build further down the program. Let's take a look at how we do this now.
def activation_layers(a1, thetas):
    act_layers = []
    act_layers.append(a1)
    for i in range(len(thetas)):
        act_layers.append(sigmoid(act_layers[i] * thetas[i].T))
        if i != (len(thetas) - 1):
            act_layers[i+1] = np.insert(act_layers[i + 1], 0, 1, 1)
    return act_layers
We create an empty list and append our first activation layer to it. Next we run a for loop which appends each new activation layer. We use matrix multiplication to multiply the previous activation layer by the transpose of the corresponding theta, run the Sigmoid activation function on the result, and append that matrix to the act_layers list. So on the first loop we multiply the first element of act_layers, which is the newly appended a1, by the first element of thetas. We do this for each element of thetas, so the length of act_layers ends up being the length of thetas + 1. We also add a bias unit to each matrix except the final one. We do not add a bias to the final matrix because it is our output matrix; the bias is used as the intercept for future calculations through the feedforward network, but we will not be using the final element of act_layers to construct any new layers. Note: NumPy has slightly confusing syntax for matrix versus element-wise multiplication: np.multiply is element-wise and * is matrix multiplication (on np.matrix objects).
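As a sanity check on dimensions, here is a hedged sketch that traces the shapes through a hypothetical 784-25-10 network on 100 made-up examples. It assumes the sigmoid and activation_layers functions above are in scope; the layer sizes and example count are chosen only for illustration.

import numpy as np

m = 100                                                       # number of examples (illustration only)
a1 = np.insert(np.matrix(np.random.rand(m, 784)), 0, 1, 1)    # 100 x 785 after adding the bias column
theta1 = np.matrix(np.random.randn(25, 785))                  # hidden layer weights (bias column included)
theta2 = np.matrix(np.random.randn(10, 26))                   # output layer weights (bias column included)

act_layers = activation_layers(a1, [theta1, theta2])
print([a.shape for a in act_layers])
# [(100, 785), (100, 26), (100, 10)]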
Our activation calculations are now completed and we can look at how to calculate the cost.
first = np.multiply(-y_matrix, np.log(act_layers[-1]))
second = np.multiply(1 - y_matrix, np.log(1 - act_layers[-1]))
The technical term for this cost function is the cross-entropy cost function. That is just a fancy way of saying that if our output is close to 1 and the corresponding entry in y_matrix is 1, the cost will be very close to 0, while if our output is close to 0 but it should be close to 1, that term heads toward infinity. That is the intuition behind first: it penalizes false negatives. The opposite is true for second: it penalizes false positives.
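To make that concrete, here is a quick numeric check with made-up output probabilities:

import numpy as np

# confident and correct: the network outputs 0.99 and the true label is 1
print(-np.log(0.99))   # ~0.01, almost no penalty
# confident and wrong: the network outputs 0.01 but the true label is 1
print(-np.log(0.01))   # ~4.6, and the penalty grows toward infinity as the output approaches 0

The final part of the cost function is regularization.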
reg_1 = lamb/(2 * len(X))
reg_2 = 0
for i in range(len(thetas)):
    reg_2 += np.power(thetas[i][..., 1:], 2).sum()
Regularization is basically the idea that a simpler model is better than a complex model. There is no a priori reason that this must be true, but empirically simpler models generally learn to interpret new data more accurately than more complex models. We are using L2 regularization, which is the squared sum of the thetas multiplied by our tuning parameter lambda and divided by twice the number of elements in X. Our model gets an extra penalty for having large thetas. The bigger the value of lambda (I call it lamb since lambda is a reserved keyword in Python), the faster our thetas will shrink. If we choose an appropriately large lambda, the unnecessary thetas should quickly tend toward 0; if we have too many large thetas, our model is likely to overfit the training data. We do not regularize the thetas associated with the bias unit because we want the bias to be able to move freely; we are not concerned with minimizing those thetas. Finally we put it all together and return our cost, J.

Let's tackle the gradient function now. For the gradient we will be using backpropagation. The basic idea behind backprop is that we take the derivative of the cost function with respect to each of the individual thetas. We start by computing the error term at the output and then work backward using the chain rule to find the derivatives for the thetas feeding into each layer. The derivative of the cost with respect to a theta tells us how much of the cost is related to that theta. To put it another way, backprop tells us: if we were to change theta by a given amount, how much would that affect the cost? What is the rate of change of the cost with respect to theta? Since our goal is to minimize the cost function, all we need to do is take the information from the gradient and subtract it from our thetas (it is also smart to multiply the gradient by a learning rate less than 1 to prevent the new thetas from bouncing around too much and failing to converge). In this example I used an optimization function, but we could just as easily have used a for loop that updates the thetas with the technique I just mentioned (a sketch of such a loop appears after the gradient code below). The reason I use an optimization algorithm rather than a plain loop is that a plain loop occasionally gets stuck in local minima and produces really bad estimates. The optimization algorithm is slightly more complicated, but the intuition is the same.
def gradient(all_thetas, weights, X, y, lamb):
    thetas = unpack_thetas(all_thetas, weights)
    # normalize the grayscale values and add a column of 1's (bias unit)
    X = X/255
    a1 = np.insert(X, 0, 1, 1)
    # create a binary index matrix of y data and activation layers
    encoder = OneHotEncoder(sparse=False)
    y_matrix = encoder.fit_transform(y.T)
    act_layers = activation_layers(a1, thetas)
    # slice out the bias-unit column from each theta
    theta_delta = []
    for i in range(len(thetas)):
        theta_delta.append(thetas[i][:, 1:])
    d = []
    Delta = []
    theta_grad = []
    for i in range(len(thetas)):
        if i == 0:
            d.insert(0, act_layers[-(i+1)] - y_matrix)  # work backwards through act_layers
        else:
            d.insert(0, np.multiply(d[0] * theta_delta[-i], np.multiply(
                act_layers[-(i+1)][:, 1:], (1 - act_layers[-(i+1)][:, 1:]))))
        # create Deltas
        Delta.insert(0, d[0].T * act_layers[-(i+2)])
        # create theta_grad
        theta_grad.insert(0, Delta[0] / len(y_matrix))
    # place a column of 0's in the first column of the regularization terms (bias is not regularized)
    for i in range(len(thetas)):
        theta_delta[i] = lamb/len(y_matrix) * theta_delta[i]
        theta_grad[i] += np.insert(theta_delta[i], 0, 0, 1)
    gradient_theta = pack_thetas(theta_grad)
    print(act_layers[-1])
    return gradient_theta
The first several lines are the same as in the cost function because we need to reuse this information. We could put it in its own function, but a few lines of repetition is not a huge issue. If you would like to read a fuller explanation and mathematical proof of the gradient you can find that here. I hope the explanation above provided enough intuition about what the backpropagation algorithm is trying to do so that we can work through the code here.

The derivative with respect to the bias unit is slightly different from that of the other thetas, so first we slice out the bias columns and append what remains of each theta to theta_delta. Next, we compute the error by subtracting y_matrix from the final element in act_layers and inserting that into d. Our algorithm works backwards through act_layers while inserting each new matrix into the 0th position of d. After that we compute Delta by multiplying the newly created d by the act_layers entries, indexed starting at -2 and working backwards. The first iteration multiplies the error term (the only element of d at that point) by act_layers indexed at -2; in other words, it multiplies the error associated with layer n by activation layer n-1. This calculation gives us the unnormalized derivatives associated with each theta. We then normalize our Deltas by dividing them by m and inserting them into theta_grad. These are our gradients without regularization. These last several steps are the trickiest part of the algorithm, and if they don't make total sense, that's okay. A proof of the gradient will help clear a lot of the steps up. As long as the code makes sense, we are good.
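To keep the indexing straight, here is how the shapes line up for the hypothetical 784-25-10 network used earlier, with m examples (illustration only):

# Shape bookkeeping for a 784-25-10 network with m examples (illustration only):
#   output error:  d3 = act_layers[-1] - y_matrix                          -> m x 10
#   hidden error:  d2 = np.multiply(d3 * theta2[:, 1:],                    (m x 10)(10 x 25) = m x 25
#                       np.multiply(a2[:, 1:], 1 - a2[:, 1:]))             element-wise with m x 25 -> m x 25
#   Delta for theta2 = d3.T * a2   -> (10 x m)(m x 26)  = 10 x 26, the same shape as theta2
#   Delta for theta1 = d2.T * a1   -> (25 x m)(m x 785) = 25 x 785, the same shape as theta1
# Dividing each Delta by m gives the unregularized gradients stored in theta_grad.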
The final part is fairly simple. This is where we take regularization into account. The derivative of L2 regularization is just a simple application of the power rule. We replace the column associated with the bias unit with 0's because we do not regularize the bias unit. We add the regularized error terms to our gradient, return gradient_theta, and we are done.
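For completeness, here is a minimal sketch of the plain gradient-descent loop mentioned earlier, as an alternative to the optimization routine. It assumes pack_thetas returns a flat NumPy vector (so all_thetas and the gradient can be subtracted directly); the learning rate and iteration count are made-up values.

def train_with_gradient_descent(all_thetas, weights, X, y, lamb,
                                learning_rate=0.3, iterations=500):
    # all_thetas is the unrolled parameter vector, the same format cost() and gradient() expect
    for i in range(iterations):
        grad = gradient(all_thetas, weights, X, y, lamb)
        all_thetas = all_thetas - learning_rate * grad   # step downhill along the gradient
        if i % 50 == 0:
            cost(all_thetas, weights, X, y, lamb)        # cost() prints the current cost
    return all_thetas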
pack_thetas, unpack_thetas, and forward_propagate are the three functions I have not gone over yet, so I will touch on them briefly here. The reason we need to pack and unpack thetas is that the optimization function only accepts vectors. pack_thetas unrolls the gradient matrices into one really long vector and unpack_thetas puts them back into matrix form. forward_propagate is simply used to test our accuracy on new data after we have trained our network. I currently have it structured to show the results on the cross validation set, but this can easily be changed by commenting out lines 91 - 95 and uncommenting lines 97 - 102. You will also need to remove y from the accepted arguments in forward_propagate. In the predict_all method in run.py you will also need to replace self.X_cv and self.y_cv with just self.X_test. If you do this, it will create a csv of predictions for the test set which you can submit to Kaggle. I trained this network for 2000 iterations and received a score just shy of 97%. You can go for longer if you would like, but the gains will be minimal; 2000 iterations took roughly an hour on my machine. 97% is quite good and not too far from human accuracy, but it is possible to do better, and in future posts we will tackle just how to do that using convolutional neural networks.
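The pack and unpack helpers themselves never appear in this post, so here is a minimal sketch of what they might look like, assuming pack_thetas produces a flat vector and weights holds the (rows, cols) shape tuples saved in datainit.py. The actual implementation in functions.py may differ.

def pack_thetas(thetas):
    # flatten each theta matrix and join them into one long vector
    return np.concatenate([np.asarray(theta).ravel() for theta in thetas])


def unpack_thetas(all_thetas, weights):
    # reshape the long vector back into matrices using the saved (rows, cols) tuples
    thetas = []
    start = 0
    for rows, cols in weights:
        end = start + rows * cols
        thetas.append(np.matrix(all_thetas[start:end]).reshape(rows, cols))
        start = end
    return thetas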
