Neural networks achieve state-of-the-art performance on a variety of machine learning tasks such as speech recognition and image classification. In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net. One especially interesting non-technical article about neural networks is here, where, given an image, a trained network is used to emphasize the features in the image to which the network responds. These types of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it can be to train and use these nets. Ignoring large convolutional networks for now (which are typical for these image problems), the first step is to just code up a toy neural net. Below I'll discuss the details and mention some of the issues I encountered, but the following image illustrates some of the results.

Before getting to the neural net, I'll mention data. I wanted to generate two-dimensional training data under a variety of models (most not linearly separable) with two or three classes, simply so I can nicely plot the results. I decided to produce 100 data points per class and haphazardly wrote the Python code below. It generates data and labels from the selected model, centers the data, and scales it to lie in $[-1,1] \times [-1,1]$.

import numpy as np

class Data():
    def __init__(self):
        '''Generates 100 points per class in R2 under a variety of models.
        Use the comments below to pick the data model. Data is centered and
        then scaled to lie in [-1,1] x [-1,1]'''
        points = two_gaussians()
        #points = ball_and_cup()
        #points = shifted_cubics()
        #points = bulls_eye()
        #points = crescent_moons()
        #points = shifted_sinusoids()
        #points = bulls_eye_3_categories()
        #points = shifted_sinusoids_3_categories()
        points = points - np.mean(points, axis=1, keepdims=True)  # center data
        points = np.dot(np.diag(1/np.max(np.absolute(points), axis=1)), points)  # scale axes
        label = np.zeros(len(points.T), dtype='uint8')
        for i in range(len(points.T)//100):
            label[100*i:100*(i+1)] = i
        self.points = points
        self.dim = len(points)
        self.num_points = points.shape[1]
        self.num_classes = len(set(label))
        self.label = label

################################################################################
# Functions for the data models
################################################################################
def two_gaussians():
    data = np.random.randn(2,200)
    shift = 1.6*np.ones([2,200])
    shift[:,100:200] = -1*shift[:,100:200]
    return data + shift
#-------------------------------------------------------------------------------
def ball_and_cup():
    temp = np.zeros([2,100])
    data = np.zeros([2,200])
    temp[0,:] = np.random.uniform(0,0.5,[1,100])
    temp[1,:] = np.random.uniform(0,2*np.pi,[1,100])
    data[0,0:100] = np.multiply(temp[0,:], np.cos(temp[1,:]))
    data[1,0:100] = np.multiply(temp[0,:], np.sin(temp[1,:]))
    data[0,100:200] = np.random.uniform(-1.4,1.4,[1,100])
    data[1,100:200] = data[0,100:200]**2 + 0.1*np.random.randn(100) - 0.8
    return data
#-------------------------------------------------------------------------------
def shifted_cubics():
    x = np.linspace(-8, 8, 100)
    y = np.linspace(-8, 8, 100)
    xx = x**3 + 20*np.random.randn(1,100)
    yy = y**3 + 150 + 20*np.random.randn(1,100)
    data = np.zeros([2,200])
    data[0,0:100] = x
    data[1,0:100] = xx
    data[0,100:200] = y
    data[1,100:200] = yy
    return data
#-------------------------------------------------------------------------------
def bulls_eye():
    temp = np.zeros([2,200])
    data = np.zeros([2,200])
    temp[0,0:100] = np.random.uniform(0,1,[1,100])
    temp[0,100:200] = np.random.uniform(1.5,2.5,[1,100])
    temp[1,:] = np.random.uniform(0,2*np.pi,[1,200])
    data[0,:] = np.multiply(temp[0,:], np.cos(temp[1,:]))
    data[1,:] = np.multiply(temp[0,:], np.sin(temp[1,:]))
    return data
#-------------------------------------------------------------------------------
def crescent_moons():
    temp = np.zeros([2,200])
    data = np.zeros([2,200])
    temp[0,:] = np.random.uniform(2,3,[1,200])
    temp[1,0:100] = np.random.uniform(0,np.pi,[1,100])
    temp[1,100:200] = np.random.uniform(np.pi,2*np.pi,[1,100])
    data[0,:] = np.multiply(temp[0,:], np.cos(temp[1,:]))
    data[1,:] = np.multiply(temp[0,:], np.sin(temp[1,:]))
    data[0,100:200] += 2.5
    data[1,100:200] += 1.7
    return data
#-------------------------------------------------------------------------------
def shifted_sinusoids():
    t = np.random.uniform(0,4*np.pi,[1,200])
    data = np.zeros([2,200])
    data[0,:] = t
    data[1,0:100] = np.sin(t[0,0:100])
    data[1,100:200] = np.sin(t[0,100:200]) + 1
    data[1,0:100] += 0.1*np.random.randn(100)
    data[1,100:200] += 0.1*np.random.randn(100)
    return data
#-------------------------------------------------------------------------------
def bulls_eye_3_categories():
    temp = np.zeros([2,300])
    data = np.zeros([2,300])
    temp[0,0:100] = np.random.uniform(0,1,[1,100])
    temp[0,100:200] = np.random.uniform(1.5,2.5,[1,100])
    temp[0,200:300] = np.random.uniform(3,4,[1,100])
    temp[1,:] = np.random.uniform(0,2*np.pi,[1,300])
    data[0,:] = np.multiply(temp[0,:], np.cos(temp[1,:]))
    data[1,:] = np.multiply(temp[0,:], np.sin(temp[1,:]))
    return data
#-------------------------------------------------------------------------------
def shifted_sinusoids_3_categories():
    t = np.random.uniform(0,4*np.pi,[1,300])
    data = np.zeros([2,300])
    data[0,:] = t
    data[1,0:100] = np.sin(t[0,0:100])
    data[1,100:200] = np.sin(t[0,100:200]) + 1.2
    data[1,200:300] = np.sin(t[0,200:300]) + 2.4
    data[1,0:100] += 0.1*np.random.randn(100)
    data[1,100:200] += 0.1*np.random.randn(100)
    data[1,200:300] += 0.1*np.random.randn(100)
    return data
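As a quick sanity check on the center-and-scale preprocessing above, here is a standalone sketch (not part of the class; the Gaussian point cloud is just an arbitrary stand-in) confirming that subtracting the per-axis mean and dividing each axis by its max absolute value really does map the data into [-1,1] x [-1,1]:

```python
import numpy as np

# Standalone check of the center-and-scale step used in Data.__init__.
rng = np.random.default_rng(0)
points = rng.normal(loc=3.0, scale=2.0, size=(2, 200))  # arbitrary 2 x N cloud

points = points - np.mean(points, axis=1, keepdims=True)       # center
points = np.diag(1 / np.max(np.abs(points), axis=1)) @ points  # scale axes

print(np.max(np.abs(points)))  # every coordinate now lies in [-1, 1]
```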

Moving on to creating the neural net, I’ll forgo the basics as there are many resources available; I’m not sure how people learned anything before the internet. In particular, this site has some interesting best practices. I’ll set some notation and discuss what I learned from coding up a neural net.

My toy neural net is a feed-forward net with fully connected layers. I want to be able to adjust the structure of my net, so my code will initialize a neural net given the dimension of my data, the number of data classes, the number of hidden layers, and the number of neurons per layer. To set some notation, if $x$ is an input vector, I will use $h_1 = f(W_1 x + b_1)$ to represent the operation of passing $x$ to the first hidden layer. Here $f$ is my non-linear activation function (I will choose a ReLU, $f(z) = \max(0, z)$ applied entrywise, a common choice in practice which has some computational benefits too). $W_1$ and $b_1$ are my weights and bias associated with the linear operator between the layers. Yes, adding the bias is not a linear operation, but adding a dimension to the input layer and incorporating the bias as another column in $W_1$ makes the operation linear. I prefer to keep the bias separate from $W_1$ as I will apply regularization to the $W_i$'s and not the $b_i$'s. Similarly, $h_{i+1} = f(W_{i+1} h_i + b_{i+1})$ represents the forward pass from layer $i$ to layer $i+1$. The last layer (whose values are called scores) is $s = W_L h_{L-1} + b_L$. The output $s$ has dimension equal to the number of classes, and note the ReLU operation is missing. At this point we need to take the scores and create an objective function which measures dissimilarity from the desired labels; this is the first place where I actually learned something new.
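The bias-as-extra-column trick mentioned above is easy to verify numerically; a small sketch (the dimensions here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))   # weights: 5 neurons, 3 inputs
b = rng.standard_normal((5, 1))   # bias
x = rng.standard_normal((3, 1))   # input vector

# The affine map W x + b ...
affine = W @ x + b

# ... equals a purely linear map once b is appended as a column of W
# and a constant 1 is appended to the input vector.
W_aug = np.hstack([W, b])                 # 5 x 4
x_aug = np.vstack([x, np.ones((1, 1))])   # 4 x 1
linear = W_aug @ x_aug

print(np.allclose(affine, linear))  # True
```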

My naïve choice of objective would be a normalization of the scores $s$ followed by least squares. Unfortunately this is more sensitive to outliers or mislabeled data than other methods. I will instead normalize with a softmax function and follow this with cross-entropy as my objective function. One interpretation of this follows by assuming each score entry $s_j$ is related to the log of our belief that the input $x$ is in class $j$. More specifically, for two classes, we suppose

$s_1 = \log\big(P(\text{class } 1 \mid x)\big) + c,$

$s_2 = \log\big(P(\text{class } 2 \mid x)\big) + c.$

Here $c$ ensures the probabilities sum to one. By solving each equation above for the probabilities and summing, we find $e^{s_1 - c} + e^{s_2 - c} = 1$, i.e. $e^c = e^{s_1} + e^{s_2}$. Thus we find

$P(\text{class } j \mid x) = \dfrac{e^{s_j}}{e^{s_1} + e^{s_2}}.$

This is precisely the output of the softmax function applied to $s$. In simpler terms, the softmax emphasizes the max entry of $s$ while the smaller entries decay exponentially (hence the name softmax). Interpreting this as our belief in the class assignment, cross-entropy measures the difference between this categorical probability distribution and the one given by the training labels. By minimizing it, we make the distribution representing our belief in the label assignment align with the actual labels. See KL-divergence for more information.
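A quick numerical illustration of this (the scores here are made up): the softmax turns the scores into a categorical distribution, and the cross-entropy against a one-hot label is just the negative log of the probability assigned to the true class.

```python
import numpy as np

scores = np.array([2.0, 0.5, -1.0])   # made-up scores for 3 classes
true_class = 0

# softmax: subtract the max first for numerical stability
shifted = scores - np.max(scores)
probs = np.exp(shifted) / np.sum(np.exp(shifted))
print(probs)  # the largest score gets most of the probability mass

# cross-entropy against the one-hot label: -log(prob of the true class)
loss = -np.log(probs[true_class])
print(loss)
```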

To minimize the cross-entropy, I perform standard gradient descent along with backtracking in case a minimum is overshot. Here the chain rule is your friend. The gradient is backpropagated via repeated applications of the chain rule, which makes for nice code; I run a loop to propagate the gradient from layer to layer. Of course, this also means that when the per-layer factors are small, the gradient may become very small in the early layers. I did in fact observe that when I made my nets deep, the gradient for the first few layers was essentially zero. Any precision errors will also compound for the early layers, but this wasn't a clear issue for me.
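This vanishing-gradient effect can be reproduced in a few lines. The sketch below is a toy linear chain (not my actual net, and the width, depth, and 0.3 scale are arbitrary choices): backpropagation multiplies one Jacobian per layer, and when their entries are small the product tends to decay geometrically as it travels toward the first layer.

```python
import numpy as np

rng = np.random.default_rng(2)
depth, width = 20, 5

# A chain of small random layer Jacobians, multiplied the way backprop would.
grad = np.eye(width)
norms = []
for _ in range(depth):
    J = 0.3 * rng.standard_normal((width, width))  # small-entry Jacobian
    grad = J.T @ grad                              # chain rule, one layer back
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the gradient magnitude typically shrinks fast
```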

I'm not really going to address initialization of the net, but some good heuristics for initialization can be found here. I also regularized to avoid overfitting by instituting an $\ell_2$ penalty on my weights. Putting this all together, the code below contains all the tools to create, initialize, and train my neural net given data produced in the previous block of code. Each instance of the class NeuralNet initializes the net and performs a single forward pass with the provided data. The code for training and testing the net should be rather self-explanatory.

################################################################################
# hyperparameters
################################################################################
learning_rate = 1e0   # initial multiplier for step in gradient descent
reg = 1e-5            # l2 regularization parameter
iter_test = 500       # number of iterations after which to compare loss function
size_hidden_layers = [5,5]  # each entry is the number of nodes in a hidden layer
################################################################################
################################################################################
import numpy as np

class NeuralNet:
    def __init__(self, Data, size_hidden_layers, reg, iter_test, learning_rate):
        '''Initializes a fully connected, feed-forward neural net with the
        number of layers and neurons per layer according to
        'size_hidden_layers.' Weights are initialized with small random
        values and the bias is zero. Layer values are set according to an
        initial forward pass with 'Data' as the input.'''
        self.size_layers = size_hidden_layers + [Data.num_classes]
        self.W = []
        self.b = []
        self.layer_vals = []
        self.scores = []
        self.reg = reg
        self.iter_test = iter_test
        self.learning_rate = learning_rate
        dim_layer_in = Data.dim
        for size_layer in self.size_layers:
            self.W.append(np.sqrt(2./dim_layer_in)
                          * np.random.randn(size_layer, dim_layer_in))
            self.b.append(np.zeros([size_layer,1]))
            dim_layer_in = size_layer
        self.forward(Data.points)
#-------------------------------------------------------------------------------
    def forward(self, points):
        '''Performs a forward pass with 'points' as input to populate the
        layer values'''
        layer_in = points
        self.layer_vals = []
        for i in range(len(self.W)):
            if i == len(self.W) - 1:  # no activation function on last layer
                self.scores = np.dot(self.W[i], layer_in) + self.b[i]
            else:
                layer_out = f(np.dot(self.W[i], layer_in) + self.b[i])
                self.layer_vals.append(layer_out)
                layer_in = layer_out
#-------------------------------------------------------------------------------
    def calc_loss(self, Data):
        '''Evaluates the objective function (loss) we seek to minimize'''
        scores_shift = self.scores - np.max(self.scores, axis=0, keepdims=True)
        scores_norm = np.exp(scores_shift) \
                      / np.sum(np.exp(scores_shift), axis=0, keepdims=True)
        scores_log = -np.log(scores_norm)
        # average cross-entropy loss
        loss_data = np.sum(scores_log[Data.label, range(Data.num_points)]) \
                    / Data.num_points
        loss_reg = 0
        for W in self.W:
            loss_reg += 0.5*self.reg*np.sum(W**2)  # regularization loss
        self.loss = loss_data + loss_reg  # total loss
#-------------------------------------------------------------------------------
    def train(self, Data):
        '''Trains the neural net via standard gradient descent with
        backtracking'''
        loss_prev = 1e10
        W_prev = [W.copy() for W in self.W]  # copies, since updates are in place
        b_prev = [b.copy() for b in self.b]
        for i in range(50000):
            self.forward(Data.points)
            self.calc_loss(Data)
            dW, db = gradient(self, Data)
            # test if the loss function has grown
            if i % self.iter_test == 0 and self.loss - loss_prev > 1e-9:
                self.learning_rate /= 5  # adjusting step size
                print("step size too large - adjust learning rate to %f"
                      % self.learning_rate)
                self.W = [W.copy() for W in W_prev]  # backtracking
                self.b = [b.copy() for b in b_prev]
                continue
            # status information displayed
            if i % self.iter_test == 0:
                print("iteration %d: loss %f" % (i, self.loss))
            # update weights and bias
            for j in range(len(self.W)):
                self.W[j] -= self.learning_rate * dW[j]
                self.b[j] -= self.learning_rate * db[j]
            loss_prev = self.loss
            W_prev = [W.copy() for W in self.W]
            b_prev = [b.copy() for b in self.b]
#-------------------------------------------------------------------------------
    def test(self, Data):
        '''Calculates the percentage of misclassified points given the points
        and labels in Data'''
        self.forward(Data.points)
        predicted_class = np.argmax(self.scores, axis=0).flatten()
        misclass_pct = 100.0*np.sum(predicted_class != Data.label)/Data.num_points
        print("%.1f%% of %i points misclassified"
              % (misclass_pct, Data.num_points))
################################################################################
# Functions for the neural net forward pass and gradient backpropagation
################################################################################
def f(x):
    '''Activation function to be used - I'll choose ReLU'''
    return np.maximum(0, x)
#-------------------------------------------------------------------------------
def back_prop_parameter(dloss_dout, inpt, W, reg):
    '''Propagates the gradient to the weights and bias between layers
    inpt and out'''
    dloss_dW = np.dot(dloss_dout, inpt.T) + reg*W
    dloss_db = np.sum(dloss_dout, axis=1, keepdims=True)
    return dloss_dW, dloss_db
#-------------------------------------------------------------------------------
def back_prop_layer(dloss_dout, W, layer_vals):
    '''Propagates the gradient to the layer (with values layer_vals) just
    before out (W sits between them)'''
    dloss_dlayer = np.dot(W.T, dloss_dout)
    dloss_dlayer[layer_vals <= 0] = 0  # gradient through the non-linearity
    return dloss_dlayer
#-------------------------------------------------------------------------------
def gradient(Net, Data):
    '''Calculates the gradient of the weight and bias parameters'''
    scores_shift = Net.scores - np.max(Net.scores, axis=0, keepdims=True)
    scores_exp = np.exp(scores_shift)
    scores_norm = scores_exp / np.sum(scores_exp, axis=0, keepdims=True)
    dloss_dscores = scores_norm
    dloss_dscores[Data.label, range(Data.num_points)] -= 1
    dloss_dscores /= Data.num_points
    dW = []
    db = []
    # find the gradients for the last set of parameters
    dloss_dW, dloss_db = back_prop_parameter(dloss_dscores, Net.layer_vals[-1],
                                             Net.W[-1], Net.reg)
    dW.append(dloss_dW)
    db.append(dloss_db)
    # loop to backpropagate the gradient through the rest of the layers
    dloss_dlayer_z = dloss_dscores
    for i in reversed(range(1, len(Net.layer_vals))):
        # gradient through the hidden layers
        dloss_dlayer = back_prop_layer(dloss_dlayer_z, Net.W[i+1],
                                       Net.layer_vals[i])
        dloss_dW, dloss_db = back_prop_parameter(dloss_dlayer,
                                                 Net.layer_vals[i-1],
                                                 Net.W[i], Net.reg)
        dW.insert(0, dloss_dW)
        db.insert(0, dloss_db)
        dloss_dlayer_z = dloss_dlayer
    # find the gradients for the first set of parameters
    dloss_dlayer = back_prop_layer(dloss_dlayer_z, Net.W[1], Net.layer_vals[0])
    dloss_dW, dloss_db = back_prop_parameter(dloss_dlayer, Data.points,
                                             Net.W[0], Net.reg)
    dW.insert(0, dloss_dW)
    db.insert(0, dloss_db)
    return dW, db
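One thing I found worthwhile when debugging backpropagation code like the above is a finite-difference gradient check. Here is a self-contained sketch (not my class above, just the same softmax cross-entropy loss on a tiny hypothetical linear classifier) comparing the analytic gradient to a centered-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((2, 10))        # 2-d inputs, 10 points
y = rng.integers(0, 3, 10)              # labels for 3 classes
W = 0.1 * rng.standard_normal((3, 2))   # weights of a linear classifier

def loss_fn(W):
    '''Average softmax cross-entropy loss of scores W @ X against labels y'''
    scores = W @ X
    shifted = scores - np.max(scores, axis=0, keepdims=True)
    probs = np.exp(shifted) / np.sum(np.exp(shifted), axis=0, keepdims=True)
    return -np.mean(np.log(probs[y, range(10)]))

# analytic gradient: (probs - one_hot)/N backpropagated onto W
scores = W @ X
shifted = scores - np.max(scores, axis=0, keepdims=True)
probs = np.exp(shifted) / np.sum(np.exp(shifted), axis=0, keepdims=True)
dscores = probs.copy()
dscores[y, range(10)] -= 1
dscores /= 10
dW = dscores @ X.T

# numerical gradient via centered differences, one entry at a time
h = 1e-5
dW_num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += h
        Wm[i, j] -= h
        dW_num[i, j] = (loss_fn(Wp) - loss_fn(Wm)) / (2 * h)

print(np.max(np.abs(dW - dW_num)))  # should be tiny if the gradient is right
```

The same trick extends to the full net: perturb one weight, recompute the loss, and compare against the backpropagated gradient.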

Obviously, there is nothing special about two classes. The code above works for data with any number of classes. I produced a few examples with three classes just to test.

Everything seems to be working pretty well, so I thought I would try to generate some more challenging data. I tried the shifted sine waves below assuming that it would be difficult to fit a neural net to the data.

Indeed, I had to add a second hidden layer to capture this model. I also tried fitting even more complex models with neural nets using a variety of different architectures. My overall observation was something that is very well known: training deeper neural nets is difficult. As I reached 4 or 5 hidden layers, my training process tended to converge to local minima which did not fit the data well. Some sort of pre-training step would be necessary to initialize the net so that it (hopefully) starts close to a "good" local minimum. This has been applied in the literature and in applications with only moderate success. I didn't really explore this much, as the point of all of this was to get a simple neural net up and running. I'll likely hit up a few more toy applications of neural nets in the future, but now it is on to creating convolutional nets and playing with real data.