Neural Net Toy Problems

Neural networks achieve state-of-the-art performance on a variety of machine learning tasks such as speech recognition and image classification.  In fact, if you give verbal commands to your phone, your speech is likely processed using a neural net.  One especially interesting popular article about neural networks is here, where a trained network is used to emphasize the features in an image to which the network responds.  These kinds of results likely require millions of clean training images and large computational resources; I want to create neural nets from scratch and see just how challenging it is to train and use them.  Ignoring large convolutional networks for now (which are typical for these image problems), the first step is to code up a toy neural net.  Below I’ll talk through the details and mention some of the issues I encountered, but the following image illustrates some of the results.

Left: Training data.  Center: Linear classifier.  Right: Feed-forward neural net with one hidden layer containing 5 neurons.

Before getting to the neural net, I’ll mention data.  I wanted to generate two-dimensional training data under a variety of models (most not linearly separable) with two or three classes, simply so I can nicely plot the results.  I decided to produce 100 data points per class and haphazardly wrote the Python code below.  It generates data and labels from the selected model, centers the data, and scales it to lie in [-1,1]\times [-1,1].

import numpy as np

class Data():
    def __init__(self):
        '''Generates 100 points per class in R2 under a variety of models.
        Use comments below to pick the data model.  Data is centered and then scaled
        to lie in [-1,1] x [-1,1]'''

        points = two_gaussians()
        #points = ball_and_cup()
        #points = shifted_cubics()
        #points = bulls_eye()
        #points = crecent_moons()
        #points = shifted_sinosoids()
        #points = bulls_eye_3_categories()
        #points = shifted_sinosoids_3_categories()

        points = points - np.mean(points, axis=1,keepdims=True) # center data
        points = np.dot(np.diag(1/np.max(np.absolute(points),axis=1)),points) # scale axes
        label = np.zeros(len(points.T), dtype='uint8')
        for i in xrange(len(points.T)/100):
            label[100*i:100*(i+1)] = i

        self.points = points
        self.dim = len(points)
        self.num_points = points.shape[1]
        self.num_classes = len(set(label))
        self.label = label

################################################################################
# Functions for the data models
################################################################################

def two_gaussians():
    data = np.random.randn(2,200)
    shift = 1.6*np.ones([2,200])
    shift[:,100:200] = -1*shift[:,100:200]
    return data + shift
#-------------------------------------------------------------------------------
def ball_and_cup():
    temp = np.zeros([2,100])
    data = np.zeros([2,200])
    temp[0,:] = np.random.uniform(0,0.5,[1,100])
    temp[1,:] = np.random.uniform(0,2*np.pi,[1,100])
    data[0,0:100]=np.multiply(temp[0,:],np.cos(temp[1,:]))
    data[1,0:100]=np.multiply(temp[0,:],np.sin(temp[1,:]))
    data[0,100:200]=np.random.uniform(-1.4,1.4,[1,100])
    data[1,100:200]=data[0,100:200]**2+0.1*np.random.randn(100)-0.8
    return data
#-------------------------------------------------------------------------------
def shifted_cubics():
    x = np.linspace(-8,8, 100)
    y = np.linspace(-8,8, 100)
    xx = x**3+20*np.random.randn(1,100)
    yy = y**3+150+ 20*np.random.randn(1,100)
    data = np.zeros([2,200])
    data[0,0:100] = x
    data[1,0:100] = xx
    data[0,100:200] = y
    data[1,100:200] = yy
    return data
#-------------------------------------------------------------------------------
def bulls_eye():
    temp = np.zeros([2,200])
    data = np.zeros([2,200])
    temp[0,0:100] = np.random.uniform(0,1,[1,100])
    temp[0,100:200] = np.random.uniform(1.5,2.5,[1,100])
    temp[1,:] = np.random.uniform(0,2*np.pi,[1,200])
    data[0,:]=np.multiply(temp[0,:],np.cos(temp[1,:]))
    data[1,:]=np.multiply(temp[0,:],np.sin(temp[1,:]))
    return data
#-------------------------------------------------------------------------------
def crecent_moons():
    temp = np.zeros([2,200])
    data = np.zeros([2,200])
    temp[0,:] = np.random.uniform(2,3,[1,200])
    temp[1,0:100] = np.random.uniform(0,np.pi,[1,100])
    temp[1,100:200] = np.random.uniform(np.pi,2*np.pi,[1,100])
    data[0,:]=np.multiply(temp[0,:],np.cos(temp[1,:]))
    data[1,:]=np.multiply(temp[0,:],np.sin(temp[1,:]))
    data[0,100:200]+=2.5
    data[1,100:200]+=1.7
    return data
#-------------------------------------------------------------------------------
def shifted_sinosoids():
    t = np.random.uniform(0,4*np.pi,[1,200])
    data = np.zeros([2,200])
    data[0,:] = t
    data[1,0:100] = np.sin(t[0,0:100])
    data[1,100:200] = np.sin(t[0,100:200])+1
    data[1,0:100] += 0.1*np.random.randn(100)
    data[1,100:200] += 0.1*np.random.randn(100)
    return data
#-------------------------------------------------------------------------------
def bulls_eye_3_categories():
    temp = np.zeros([2,300])
    data = np.zeros([2,300])
    temp[0,0:100] = np.random.uniform(0,1,[1,100])
    temp[0,100:200] = np.random.uniform(1.5,2.5,[1,100])
    temp[0,200:300] = np.random.uniform(3,4,[1,100])
    temp[1,:] = np.random.uniform(0,2*np.pi,[1,300])
    data[0,:]=np.multiply(temp[0,:],np.cos(temp[1,:]))
    data[1,:]=np.multiply(temp[0,:],np.sin(temp[1,:]))
    return data
#-------------------------------------------------------------------------------
def shifted_sinosoids_3_categories():
    t = np.random.uniform(0,4*np.pi,[1,300])
    data = np.zeros([2,300])
    data[0,:] = t
    data[1,0:100] = np.sin(t[0,0:100])
    data[1,100:200] = np.sin(t[0,100:200])+1.2
    data[1,200:300] = np.sin(t[0,200:300])+2.4
    data[1,0:100] += 0.1*np.random.randn(100)
    data[1,100:200] += 0.1*np.random.randn(100)
    data[1,200:300] += 0.1*np.random.randn(100)
    return data

Moving on to creating the neural net, I’ll forgo the basics as there are many resources available;  I’m not sure how people learned anything before the internet.  In particular, this site has some interesting best practices.  I’ll set some notation and discuss what I learned from coding up a neural net.

My toy neural net is a feed-forward net with fully connected layers.  I want to be able to adjust the structure of the net, so my code will initialize a neural net given the dimension of my data, the number of data classes, the number of hidden layers, and the number of neurons per layer.  To set some notation, if x_0 is an input vector, I will use f(W_0x_0+b_0)=x_1 to represent the operation of passing to the first hidden layer.  Here f is my non-linear activation function (I will choose a ReLU – this is a common choice in practice and has some computational benefits too), and W_0 and b_0 are the weights and bias associated with the linear operator between the layers.  Yes, adding the bias is not a linear operation, but adding a dimension to the input layer and incorporating the bias as another column in W makes the operation linear.  I prefer to keep the bias separate from W since I will apply regularization to the W‘s and not the b‘s.  Similarly, f(W_nx_n+b_n)=x_{n+1} represents the forward pass from layer n to layer n+1.  The last layer (whose values are called scores) is W_{N-1}x_{N-1}+b_{N-1}=x_N.  The output x_N has dimension equal to the number of classes, and note that the ReLU is not applied to this last layer.  At this point we need to take the scores and create an objective function which measures dissimilarity from the desired labels; this is the first place where I actually learned something new.
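
To tie the notation together, here is a minimal sketch of a single forward pass for a net with one hidden layer; it is a throwaway illustration (the variable names x0, W0, and so on just mirror the notation above), not the class defined later.

import numpy as np

def relu(x):
    return np.maximum(0, x)                 # the activation f

# toy shapes: 2-d inputs, 5 hidden neurons, 2 classes, 10 points stored as columns
x0 = np.random.randn(2, 10)
W0, b0 = 0.1*np.random.randn(5, 2), np.zeros([5, 1])
W1, b1 = 0.1*np.random.randn(2, 5), np.zeros([2, 1])

x1 = relu(np.dot(W0, x0) + b0)              # f(W_0 x_0 + b_0) = x_1
scores = np.dot(W1, x1) + b1                # last layer: no ReLU applied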

My naïve choice of objective would be a normalization of x_N followed by least squares.  Unfortunately, this is more sensitive to outliers or mislabeled data than other methods.  Instead, I will normalize with a softmax function and follow this with cross-entropy as my objective function.  One interpretation of this follows by assuming each score entry x_N[i] is related to the log of our belief that x_0 is in class i.  More specifically, for two classes, we suppose

\ln(\mbox{Pr} (x_0 \in \mbox{class 0})) = x_N[0] - \ln(\alpha).
\ln(\mbox{Pr} (x_0 \in \mbox{class 1})) = x_N[1] - \ln(\alpha).

Here \alpha ensures the probabilities sum to one.  By solving each equation above for the probabilities and summing, we find \alpha = e^{x_N[0]} + e^{x_N[1]}.  Thus we find

\mbox{Pr} (x_0 \in \mbox{class 0}) = e^{x_N[0]}/ (e^{x_N[0]} + e^{x_N[1]})
\mbox{Pr} (x_0 \in \mbox{class 1}) = e^{x_N[1]}/ (e^{x_N[0]} + e^{x_N[1]}) .

This is precisely the output of the softmax function given x_N.  In simpler terms, the softmax emphasizes the max entry in x_N while the smaller terms decay exponentially (hence the name softmax).  Interpreting this as our belief in class assignment, cross-entropy measures the difference between this categorical probability distribution and the one given according to the training labels.  By minimizing, we make the distribution representing our belief in label assignment align with the actual labels.  See KL-divergence for more information.
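
As a quick numerical sketch of the softmax and cross-entropy formulas above (the scores here are arbitrary):

import numpy as np

x_N = np.array([2.0, 0.5])                  # scores for a single input, two classes
probs = np.exp(x_N)/np.sum(np.exp(x_N))     # softmax: roughly [0.82, 0.18]
true_class = 0                              # the training label for this input
loss = -np.log(probs[true_class])           # cross-entropy, roughly 0.20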

To minimize the cross-entropy, I perform standard gradient descent along with backtracking in case a minimum is overshot.  Here the chain rule is your friend.  The gradient is backpropagated via repeated applications of the chain rule, which makes for nice code; I run a loop to propagate the gradient from layer to layer.  Of course, this also means that when the intermediate factors are small, the gradient can become vanishingly small in the early layers.  I did in fact observe that when I made my nets deep, the gradient for the first few layers was essentially zero.  Any precision errors will also compound for early layers, but this wasn’t a clear issue for me.
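
Here is the backtracking idea in isolation, minimizing a one-variable quadratic rather than the cross-entropy (a hedged sketch, not the training code itself): take a gradient step, and if the objective grew, undo the step and shrink the step size (the training loop below shrinks by the same factor of 5).

def loss(w):
    return (w - 3.0)**2                     # toy objective with its minimum at w = 3

def grad(w):
    return 2.0*(w - 3.0)

w, step, loss_prev = 0.0, 1.2, 1e10
for _ in range(100):
    w_new = w - step*grad(w)                # gradient descent step
    if loss(w_new) > loss_prev:             # overshot the minimum
        step /= 5                           # backtrack: shrink the step and retry
        continue
    w, loss_prev = w_new, loss(w_new)
# w ends up very close to 3.0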

I’m not really going to address initialization of the net, but some good heuristics for initialization can be found here.  I also regularized to avoid overfitting by placing an \ell_2 penalty on my weights.  Putting this all together, the code below contains all the tools to create, initialize, and train my neural net given data produced in the previous block of code.  Each instance of the class NeuralNet will initialize the net and perform a single forward pass with the provided data.  The code for training and testing the net should be rather self-explanatory.
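
Before the full listing, a quick sketch of those two ingredients as they appear in the class below: a ReLU-friendly initialization scale of \sqrt{2/n_{in}} for the weights, and an \ell_2 penalty 0.5\,\mbox{reg}\,\|W\|^2 whose gradient contribution is simply \mbox{reg}\,W (the toy sizes here are arbitrary).

import numpy as np

n_in, n_out, reg = 2, 5, 1e-5
W = np.sqrt(2./n_in)*np.random.randn(n_out, n_in)   # scaled random initialization for a ReLU layer

loss_reg = 0.5*reg*np.sum(W**2)                     # penalty added to the data loss
dloss_dW_reg = reg*W                                # corresponding term added to the weight gradient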

################################################################################
################################################################################

# hyperparameters
learning_rate = 1e0 # initial multiplier for step in gradient descent
reg = 1e-5 # l2 regularization parameter
iter_test = 500 # check the loss (and print status) every this many iterations
size_hidden_layers = [5,5] # vector where each entry represents the number of nodes in a hidden layer

################################################################################
################################################################################
import numpy as np

class NeuralNet:
    def __init__(
        self,
        Data,
        size_hidden_layers,
        reg,
        iter_test,
        learning_rate):

        '''Initializes a fully connected, feed-forward neural net with the number of layers
        and neurons per layer according to 'size_hidden_layers.'  Weights are initialized
        with small random values and the biases are zero.  Layer values are set according to an
        initial forward pass with 'Data' as the input'''

        self.size_layers = size_hidden_layers + [Data.num_classes]
        self.W = []
        self.b = []
        self.layer_vals = []
        self.scores = []
        self.reg = reg
        self.iter_test = iter_test
        self.learning_rate = learning_rate

        dim_layer_in = Data.dim
        for size_layer in self.size_layers:
            self.W.append(np.sqrt(2./dim_layer_in) * np.random.randn(size_layer, dim_layer_in))
            self.b.append(np.zeros([size_layer,1]))
            dim_layer_in = size_layer

        self.forward(Data.points)

#------------------------------------------------------------------------------- 

    def forward(self, points):
        '''Performs a forward pass with 'points' as input to populate layer values'''

        layer_in = points
        self.layer_vals = []
        for i in xrange(len(self.W)):
            if i == len(self.W) - 1: # no activation function on last layer
                self.scores = np.dot(self.W[i], layer_in) + self.b[i]
            else:
                layer_out = f(np.dot(self.W[i], layer_in) + self.b[i])
                self.layer_vals.append(layer_out)
            layer_in = layer_out

#------------------------------------------------------------------------------- 

    def calc_loss(self, Data):
        '''Evaluates the objective function (loss) we seek to minimize'''

        scores_shift = self.scores - np.max(self.scores, axis=0, keepdims=True) # shift for numerical stability
        scores_norm = np.exp(scores_shift)/np.sum(np.exp(scores_shift), axis=0, keepdims=True) # softmax
        scores_log = -np.log(scores_norm)
        # average cross-entropy loss
        loss_data = np.sum(scores_log[Data.label, range(Data.num_points)])/Data.num_points
        loss_reg = 0
        for W in self.W:
            loss_reg = loss_reg +  0.5*self.reg*np.sum(W**2) # regularization loss
        self.loss = loss_data + loss_reg # total loss

#------------------------------------------------------------------------------- 

    def train(self, Data):
        '''Trains the neural net via standard gradient descent with backtracking'''

        loss_prev = 1e10
        W_prev = [W.copy() for W in self.W] # keep copies so backtracking can actually restore them
        b_prev = [b.copy() for b in self.b]
        for i in xrange(50000):
            self.forward(Data.points)
            self.calc_loss(Data)
            dW, db = gradient(self, Data)

            # test if loss function has grown
            if i % self.iter_test == 0 and self.loss - loss_prev > 1e-9:
                self.learning_rate /= 5 # adjusting step size
                print "step size too large - adjusting learning rate to %f" %self.learning_rate
                self.W = W_prev # backtracking
                self.b = b_prev
                continue

            # status information displayed
            if i % iter_test == 0:
                print "iteration %d: loss %f" % (i, self.loss)

            # update weights and bias
            for j in xrange(len(self.W)):
                self.W[j] -= self.learning_rate * dW[j]
                self.b[j] -= self.learning_rate * db[j]

            loss_prev = self.loss
            W_prev = [W.copy() for W in self.W]
            b_prev = [b.copy() for b in self.b]

#-------------------------------------------------------------------------------  

    def test(self, Data):
        '''Calculates the percentage of misclassified points given points and labels in Data'''

        self.forward(Data.points)
        predicted_class = np.argmax(self.scores, axis=0).flatten()
        misclass = predicted_class - Data.label
        misclass[misclass !=0] = 1
        misclass_pct = float(np.sum(misclass))/Data.num_points
        print "%f our of %i points misclassified" %(misclass_pct, Data.num_points) 

################################################################################
# Functions for neural net forward pass and gradient backpropagation
################################################################################

def f(x):
    '''Activation function to be used - I'll choose ReLU'''

    y = np.maximum(0,x)
    return y

#------------------------------------------------------------------------------- 

def back_prop_parameter(dloss_dout, inpt, W, reg):
    '''Propagates the gradient to the weights and bias between layers inpt and out'''

    dloss_dW = np.dot(dloss_dout, inpt.T) + reg*W
    dloss_db = np.sum(dloss_dout, axis=1, keepdims=True)
    return dloss_dW, dloss_db

#------------------------------------------------------------------------------- 

def back_prop_layer(dloss_dout, W, layer_vals):
    '''Propagates the gradient through W and the ReLU back to the previous layer (whose values are layer_vals)'''

    dloss_dlayer = np.dot(W.T, dloss_dout)
    dloss_dlayer[layer_vals <= 0] = 0 # gradient through non-linearity in hidden layers
    return dloss_dlayer

#------------------------------------------------------------------------------- 

def gradient(Net, Data):
    '''Calculates gradient of weights and bias parameters'''

    scores_exp = np.exp(Net.scores - np.max(Net.scores, axis=0, keepdims=True)) # shift for numerical stability
    scores_norm = scores_exp / np.sum(scores_exp, axis=0, keepdims=True) # softmax probabilities
    dloss_dscores = scores_norm
    dloss_dscores[Data.label, range(Data.num_points)] -= 1 # gradient of softmax/cross-entropy
    dloss_dscores /= Data.num_points

    dW = []
    db = []

    # find the gradients for the last set of parameters
    dloss_dW, dloss_db = back_prop_parameter(dloss_dscores, Net.layer_vals[-1], Net.W[-1], Net.reg)
    dW.append(dloss_dW)
    db.append(dloss_db)

    # loop to backpropagate the gradient through the rest of the layers
    dloss_dlayer_z = dloss_dscores
    for i in reversed(xrange(1,len(Net.layer_vals))): #gradient through hidden layers
        dloss_dlayer = back_prop_layer(dloss_dlayer_z, Net.W[i+1], Net.layer_vals[i])
        dloss_dW,dloss_db = back_prop_parameter(dloss_dlayer, Net.layer_vals[i-1], Net.W[i], Net.reg)
        dW.insert(0,dloss_dW)
        db.insert(0,dloss_db)
        dloss_dlayer_z = dloss_dlayer

    # find the gradients for the first set of parameters
    dloss_dlayer = back_prop_layer(dloss_dlayer_z, Net.W[1], Net.layer_vals[0])
    dloss_dW, dloss_db =  back_prop_parameter(dloss_dlayer, Data.points, Net.W[0], Net.reg)
    dW.insert(0,dloss_dW)
    db.insert(0,dloss_db)

    return dW, db
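
Assuming the data class, the hyperparameters, and the NeuralNet class above all live in the same script, everything gets wired together roughly like this (a usage sketch, not a verbatim driver):

data = Data()                # pick the data model inside Data.__init__
net = NeuralNet(data, size_hidden_layers, reg, iter_test, learning_rate)
net.train(data)
net.test(data)               # reports how many training points are misclassified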

Obviously, there is nothing special about two classes.  The code above works for data with any number of classes.  I produced a few examples with three classes just to test.

Left: Training data.  Right: Feed-forward neural net with one hidden layer with 5 neurons.

Everything seems to be working pretty well, so I thought I would try to generate some more challenging data.  I tried the shifted sine waves below, expecting that it would be difficult to fit a neural net to the data.

Left: Training data.  Center: Neural net with one hidden layer with 5 neurons.  Right: Neural net with two hidden layers with 5 neurons each.

Indeed, I had to add a second hidden layer to capture this model.  I also tried fitting even more complex models with neural nets using a variety of different architectures.  My overall observation was something that is very well known: training deeper neural nets is difficult.  As I reached 4 or 5 hidden layers, my training process tended to converge to local minima which did not fit the data well.  Some sort of pre-training step would be necessary to initialize the net to (hopefully) start close to a “good” local minimum.  This has been applied in the literature and in applications with only moderate success.  I didn’t really explore this much, as the point of all of this was to get a simple neural net up and running.  I’ll likely hit up a few more toy applications of neural nets in the future, but for now it is on to creating convolutional nets and playing with real data.

 
