## Neural Net Toy Problems

Neural networks achieve state-of-the-art performance for a variety of machine learning tasks such as speech recognition and image classification.  In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net.  One especially interesting colloquial article about neural networks is here where, given an image, a trained network is used to emphasize features in the image to which the network responds.  These types of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it can be to train and use these nets.  Ignoring large convolutional networks for now (which are typical for these image problems), the first step is to just code up a toy neural net.  Below I’ll talk details and mention some of the issues encountered, but the following image illustrates some of the results. Left: Training data.  Center: Linear classifier.  Right: Feed forward neural net with one hidden layer containing 5 neurons.

Before getting to the neural net, I’ll mention data. I wanted to generate two dimensional training data under a variety of models (most not linearly separable) with two or three classes.  This is simply so I can nicely plot the results.  I decided to produce 100 data points per class and haphazardly wrote the python code below.  This generates data and labels from the selected model, centers the data, and scales it to be in $[-1,1]\times [-1,1]$.

import numpy as np

class Data():
def __init__(self):
'''Generates 100 points per class in R2 under a variety of models.
Use comments below to pick the data model.  Data is centered and then scaled
to lie in [-1,1] x [-1,1]'''

points = two_gaussians()
#points = ball_and_cup()
#points = shifted_cubics()
#points = bulls_eye()
#points = crecent_moons()
#points = shifted_sinosoids()
#points = bulls_eye_3_categories()
#points = shifted_sinosoids_3_categories()

points = points - np.mean(points, axis=1,keepdims=True) # center data
points = np.dot(np.diag(1/np.max(np.absolute(points),axis=1)),points) # scale axes
label = np.zeros(len(points.T), dtype='uint8')
for i in xrange(len(points.T)/100):
label[100*i:100*(i+1)] = i

self.points = points
self.dim = len(points)
self.num_points = points.shape
self.num_classes = len(set(label))
self.label = label

################################################################################
# Functions for the data models
################################################################################

def two_gaussians():
data = np.random.randn(2,200)
shift = 1.6*np.ones([2,200])
shift[:,100:200] = -1*shift[:,100:200]
return data + shift
#-------------------------------------------------------------------------------
def ball_and_cup():
temp = np.zeros([2,100])
data = np.zeros([2,200])
temp[0,:] = np.random.uniform(0,0.5,[1,100])
temp[1,:] = np.random.uniform(0,2*np.pi,[1,100])
data[0,0:100]=np.multiply(temp[0,:],np.cos(temp[1,:]))
data[1,0:100]=np.multiply(temp[0,:],np.sin(temp[1,:]))
data[0,100:200]=np.random.uniform(-1.4,1.4,[1,100])
data[1,100:200]=data[0,100:200]**2+0.1*np.random.randn(100)-0.8
return data
#-------------------------------------------------------------------------------
def shifted_cubics():
x = np.linspace(-8,8, 100)
y = np.linspace(-8,8, 100)
xx = x**3+20*np.random.randn(1,100)
yy = y**3+150+ 20*np.random.randn(1,100)
data = np.zeros([2,200])
data[0,0:100] = x
data[1,0:100] = xx
data[0,100:200] = y
data[1,100:200] = yy
return data
#-------------------------------------------------------------------------------
def bulls_eye():
temp = np.zeros([2,200])
data = np.zeros([2,200])
temp[0,0:100] = np.random.uniform(0,1,[1,100])
temp[0,100:200] = np.random.uniform(1.5,2.5,[1,100])
temp[1,:] = np.random.uniform(0,2*np.pi,[1,200])
data[0,:]=np.multiply(temp[0,:],np.cos(temp[1,:]))
data[1,:]=np.multiply(temp[0,:],np.sin(temp[1,:]))
return data
#-------------------------------------------------------------------------------
def crecent_moons():
temp = np.zeros([2,200])
data = np.zeros([2,200])
temp[0,:] = np.random.uniform(2,3,[1,200])
temp[1,0:100] = np.random.uniform(0,np.pi,[1,100])
temp[1,100:200] = np.random.uniform(np.pi,2*np.pi,[1,100])
data[0,:]=np.multiply(temp[0,:],np.cos(temp[1,:]))
data[1,:]=np.multiply(temp[0,:],np.sin(temp[1,:]))
data[0,100:200]+=2.5
data[1,100:200]+=1.7
return data
#-------------------------------------------------------------------------------
def shifted_sinosoids():
t = np.random.uniform(0,4*np.pi,[1,200])
data = np.zeros([2,200])
data[0,:] = t
data[1,0:100] = np.sin(t[0,0:100])
data[1,100:200] = np.sin(t[0,100:200])+1
data[1,0:100] += 0.1*np.random.randn(100)
data[1,100:200] += 0.1*np.random.randn(100)
return data
#-------------------------------------------------------------------------------
def bulls_eye_3_categories():
temp = np.zeros([2,300])
data = np.zeros([2,300])
temp[0,0:100] = np.random.uniform(0,1,[1,100])
temp[0,100:200] = np.random.uniform(1.5,2.5,[1,100])
temp[0,200:300] = np.random.uniform(3,4,[1,100])
temp[1,:] = np.random.uniform(0,2*np.pi,[1,300])
data[0,:]=np.multiply(temp[0,:],np.cos(temp[1,:]))
data[1,:]=np.multiply(temp[0,:],np.sin(temp[1,:]))
return data
#-------------------------------------------------------------------------------
def shifted_sinosoids_3_categories():
t = np.random.uniform(0,4*np.pi,[1,300])
data = np.zeros([2,300])
data[0,:] = t
data[1,0:100] = np.sin(t[0,0:100])
data[1,100:200] = np.sin(t[0,100:200])+1.2
data[1,200:300] = np.sin(t[0,200:300])+2.4
data[1,0:100] += 0.1*np.random.randn(100)
data[1,100:200] += 0.1*np.random.randn(100)
data[1,200:300] += 0.1*np.random.randn(100)
return data


Moving on to creating the neural net, I’ll forgo the basics as there are many resources available;  I’m not sure how people learned anything before the internet.  In particular, this site has some interesting best practices.  I’ll set some notation and discuss what I learned from coding up a neural net.

My toy neural net is a feed forward net with fully connected layers.  I want to be able to adjust the structure of my net, so my code will initialize a neural net given the dimension of my data, the number of data classes, the number of hidden layers, and the number of neurons per layer.  To set some notation, if $x_0$ is an input vector, I will use $f(W_0x_0+b_0)=x_1$ to represent the operation of passing to the first hidden layer.  Here $f$ is my non-linear activation function (I will choose an ReLU – this is a common choice in practice and has some computational benefits too). $W_0$ and $b_0$ are my weights and bias associated with the linear operator between the layers.  Yes, adding the bias is not a linear operation, but adding a dimension to the input layer and incorporating the bias as another column in $W$ makes the operation linear.  I prefer to keep the bias separate from $W$ as I will apply regularization to the $W$‘s and not the $b$‘s.  Similarly, $f(W_nx_n+b_n)=x_{n+1}$ represents the forward pass from layer $n$ to layer $n+1$.  The last layer (whose values are called scores) is $W_{N-1}x_{N-1}+b_{N-1}=x_N$.  The output $x_N$ has dimension equal to the number of classes, and note the ReLU operation is missing.  At this point we need to take the scores and create an objective function which measures dissimilarity from the desired labels; this is the first place where I actually learned something new.

My naïve choice of objective would be a normalization of $x_N$  followed by least squares.  Unfortunately this is more sensitive to outliers or mislabeled data than other methods.  I will normalize with a softmax function and instead follow this with cross-entropy as my objective function.  One interpretation of this follows by assuming each score entry $x_N[i]$ is related to the log of our belief that $x_0$ is in class $i$.  More specifically, for two classes, we suppose $\ln(\mbox{Pr} (x_0 \in \mbox{class 0})) = x_N - \ln(\alpha)$. $\ln(\mbox{Pr} (x_0 \in \mbox{class 1})) = x_N - \ln(\alpha)$.

Here $\alpha$ ensures the probabilities sum to one.  By solving each equation above for the probabilities and summing, we find $\alpha = e^{x_N} + e^{x_N}$.  Thus we find $\mbox{Pr} (x_0 \in \mbox{class 0}) = e^{x_N}/ (e^{x_N} + e^{x_N})$ $\mbox{Pr} (x_0 \in \mbox{class 1}) = e^{x_N}/ (e^{x_N} + e^{x_N})$.

This is precisely the output of the softmax function given $x_N$.  In simpler terms, the softmax emphasizes the max entry in $x_N$ while the smaller terms decay exponentially (hence the name softmax).  Interpreting this as our belief in class assignment, cross-entropy measures the difference between this categorical probability distribution and the one given according to the training labels.  By minimizing, we make the distribution representing our belief in label assignment align with the actual labels.  See KL-divergence for more information.

To minimize the cross-entropy, I perform standard gradient decent along with backtracking in case a minimum is overshot.  Here the chain rule is your friend.  The gradient is backpropagated via repeated applications of the chain rule which makes for nice code; I run a loop to propagate the gradient from layer to layer.  Of course this also means that for small values, the gradient may become very small in early layers.  I did in fact observe that when I made my nets deep, the gradient for the first few layers was essentially zero.  Any precision errors will also compound for early layers, but this wasn’t a clear issue for me.

I’m not really going to address initialization of the net, but some good heuristics for initialization can be found here.  I also regularized to avoid over fitting by instituting an $\ell_2$ penalty on my weights.   Putting this all together, the code below contains all the tools to create, initialize, and train my neural net given data produced in the previous block of code.  Each instance of the class NeuralNet will initialize the net and perform a single forward pass with the provided data.  The code for training and testing the net should be rather self explanatory.

################################################################################
################################################################################

# hyperparameters
learning_rate = 1e0 # initial multiplier for step in gradient decent
reg = 1e-5 # l2 regularization parameter
iter_test = 500 # number of iterations after which to compare loss function
size_hidden_layers = [5,5] # vector where each entry represents the number of nodes in a hidden layer

################################################################################
################################################################################
import numpy as np

class NeuralNet:
def __init__(
self,
Data,
size_hidden_layers,
reg,
iter_test,
learning_rate):

'''Initializes a fully connected, feed forward nueral net with the number of layers
and neurons per layer according to 'size_hidden_layers.'  Weights are initialized
with small random weights and bias is zero. Layer values are set accordong to an
initial forward pass with 'Data' as the input'''

self.size_layers = size_hidden_layers + [Data.num_classes]
self.W = []
self.b = []
self.layer_vals = []
self.scores = []
self.reg = reg
self.iter_test = iter_test
self.learning_rate = learning_rate

dim_layer_in = Data.dim
for size_layer in self.size_layers:
self.W.append(np.sqrt(2./dim_layer_in) * np.random.randn(size_layer, dim_layer_in))
self.b.append(np.zeros([size_layer,1]))
dim_layer_in = size_layer

self.forward(Data.points)

#-------------------------------------------------------------------------------

def forward(self, points):
'''Performs a foward pass with 'points' as input to populate layer values'''

layer_in = points
self.layer_vals = []
for i in xrange(len(self.W)):
if i == len(self.W) - 1: # no activation function on last layer
self.scores = np.dot(self.W[i], layer_in) + self.b[i]
else:
layer_out = f(np.dot(self.W[i], layer_in) + self.b[i])
self.layer_vals.append(layer_out)
layer_in = layer_out

#-------------------------------------------------------------------------------

def calc_loss(self, Data):
'''evaluates the objective function (loss) we seek minimize'''

scores_shift = self.scores - np.max(self.scores)
scores_norm = np.exp(scores_shift)/ np.sum(scores_shift, axis=0, keepdims=True)
scores_log = -np.log(scores_norm)
# average cross-entropy loss
loss_data = np.sum(scores_log[Data.label, range(Data.num_points)])/Data.num_points
loss_reg = 0
for W in self.W:
loss_reg = loss_reg +  0.5*self.reg*np.sum(W**2) # regularization loss
self.loss = loss_data + loss_reg # total loss

#-------------------------------------------------------------------------------

def train(self, Data):
'''Trains neural net via standard gradient decent with backtracking'''

loss_prev = 1e10
W_prev = self.W
b_prev = self.b
for i in xrange(50000):
self.forward(Data.points)
self.calc_loss(Data)

# test if loss function has grown
if i % iter_test == 0 and self.loss - loss_prev &gt; 1e-9:
self.learning_rate /= 5 # adjusting step size
print &quot;step size too large - adjust learning rate to %f&quot; %learning_rate
self.W = W_prev # backtracking
self.b = b_prev
continue

# status infomation displayed
if i % iter_test == 0:
print &quot;iteration %d: loss %f&quot; % (i, self.loss)

# update weights and bias
for j in xrange(len(self.W)):
self.W[j] -= self.learning_rate * dW[j]
self.b[j] -= self.learning_rate * db[j]

loss_prev = self.loss
W_prev = self.W
b_prev = self.b

#-------------------------------------------------------------------------------

def test(self, Data):
'''Calculates the percentage of misclassified points given points and labels in Data'''

self.forward(Data.points)
predicted_class = np.argmax(self.scores, axis=0).flatten()
misclass = predicted_class - Data.label
misclass[misclass !=0] = 1
misclass_pct = float(np.sum(misclass))/Data.num_points
print &quot;%f our of %i points misclassified&quot; %(misclass_pct, Data.num_points)

################################################################################
# Functions for neural net forward pass and gradient backpropogation
################################################################################

def f(x):
'''Activation function to be used - I'll choose ReLU'''

y = np.maximum(0,x)
return y

#-------------------------------------------------------------------------------

def back_prop_parameter(dloss_dout, inpt, W, reg):
'''Propogates the gradient to the weights and bias between layers inpt and out'''

dloss_dW = np.dot(dloss_dout, inpt.T) + reg*W
dloss_db = np.sum(dloss_dout, axis=1, keepdims=True)
return dloss_dW, dloss_db

#-------------------------------------------------------------------------------

def back_prop_layer(dloss_dout, W, layer_vals):
'''Propogates the gradient to the layer (with layer_vals) just before out (W is between them)'''

dloss_dlayer = np.dot(W.T, dloss_dout)
dloss_dlayer[layer_vals &lt;= 0] = 0 # gradient through non-linearity in hidden layers
return dloss_dlayer

#-------------------------------------------------------------------------------

'''Calculates gradient of weights and bias parameters'''

scores_exp = np.exp(Net.scores)
scores_norm = scores_exp / np.sum(scores_exp, axis=0, keepdims=True)
dloss_dscores = scores_norm
dloss_dscores[Data.label, range(Data.num_points)] -= 1
dloss_dscores /= Data.num_points

dW = []
db = []

# find the gradients for the last set of parameters
dloss_dW, dloss_db = back_prop_parameter(dloss_dscores, Net.layer_vals[-1], Net.W[-1], Net.reg)
dW.append(dloss_dW)
db.append(dloss_db)

# loop to backpropogate the gradient through the rest of the layers
dloss_dlayer_z = dloss_dscores
for i in reversed(xrange(1,len(Net.layer_vals))): #gradient through hidden layers
dloss_dlayer = back_prop_layer(dloss_dlayer_z, Net.W[i+1], Net.layer_vals[i])
dloss_dW,dloss_db = back_prop_parameter(dloss_dlayer, Net.layer_vals[i-1], Net.W[i], Net.reg)
dW.insert(0,dloss_dW)
db.insert(0,dloss_db)
dloss_dlayer_z = dloss_dlayer

# find the gradients for the first set of parameters
dloss_dlayer = back_prop_layer(dloss_dlayer_z, Net.W, Net.layer_vals)
dloss_dW, dloss_db =  back_prop_parameter(dloss_dlayer, Data.points, Net.W, Net.reg)
dW.insert(0,dloss_dW)
db.insert(0,dloss_db)

return dW, db


Obviously, there is nothing special about two classes.  The code above works for data with any number of classes.  I produced a few examples with three classes just to test. Left: Training data.  Right:Feed forward neural net with one hidden layer with 5 neurons

Everything seems to be working pretty well, so I thought I would try to generate some more challenging data.  I tried the shifted sine waves below assuming that it would be difficult to fit a neural net to the data. Left: Training data.  Center: Neural net with one hidden layer with 5 neurons.  Right: Neural net with two hidden layers with 5 neurons each

Indeed, I had to add a second hidden layer to capture this model.  I also tried fitting even more complex models with neural nets using a variety of different architectures.  My overall observation was something that is very well known: training deeper neural nets is difficult.  As I reached 4 or 5 hidden layers, my training process tended to converge to local minima which did not well fit the data.  Some sort of pre-training step would be necessary to initialize the net to (hopefully) start close to a “good” local minimum.  This has been applied in literature/applications with only moderate success.   I didn’t really explore this much as the point of all of this was to get a simple neural net up and running.  I’ll likely hit up a few more toy applications of neural nets in the future, but now it is on to creating convolutional nets and playing with real data.