August 06, 2019 · 7 mins read

# Lesson 5 - Neural Network Fundamentals

To really understand deep learning, you need to understand how neural networks, the mechanism that makes deep learning deep, work. This tutorial explains how to write a neural network from scratch using PyTorch.

This is the fifth post on my fast.ai journey. Check out the other posts here.

## The most basic neural network

Neural networks are a beautiful technique for artificial intelligence inspired by the (human?) brain. A neural network consists of three main parts: an input layer, hidden layers and an output layer. The input layer holds the features of the network. In the case of image classification, these are the pixels. There can be any number of hidden layers. The more hidden layers there are, the more complex the functions the network can represent. Having many hidden layers also has drawbacks, such as much longer training time and a higher chance of overfitting. A schematic (source):

Neurons, denoted by circles in the schematic, are computational boxes. The edges between nodes are the weights. The numbers start at the input layer and through a series of computations they end up in the output layer. The output layer can be used to interpret the classification of the network.

Each node in the output layer represents one of the classes. The value of each neuron in the output layer lies in the range 0 to 1 and represents the network's confidence that the input belongs to that class.

### Neurons

Each neuron has a so-called activation. An activation can be seen as the value of the neuron.

The input of each neuron is the output of the previous layer. Because there are multiple neurons in the previous layer, this input is represented as a vector.

### Weights

The weights are at the heart of the neural network. By adjusting the weights, the activations of the neurons change resulting in different values of the output layer.
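
As a small illustration (a sketch with made-up numbers, not values from a trained network), the activation of a single neuron can be computed as a weighted sum of its inputs passed through an activation function. The function name `neuron_activation` and the choice of ReLU are assumptions made for this example. Changing one weight changes the activation:

```python
import torch

# Hypothetical outputs of three neurons in the previous layer
inputs = torch.tensor([0.5, -1.0, 2.0])

def neuron_activation(weights, inputs):
    """Weighted sum of the inputs followed by a ReLU."""
    return torch.relu(torch.dot(weights, inputs))

# Two arbitrary weight vectors for the same neuron; only the first weight differs
before = neuron_activation(torch.tensor([0.1, 0.2, 0.3]), inputs)
after = neuron_activation(torch.tensor([0.4, 0.2, 0.3]), inputs)

print(before.item(), after.item())
```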

When someone talks about “training a neural network”, all they mean is adjusting the weights so that the output layer produces better values. More on training later.

### Activation functions

Activation functions introduce nonlinearities in the model, enabling it to approximate any continuous function arbitrarily well given enough hidden units, according to the universal approximation theorem (Hornik et al., 1989; Cybenko, 1989).

Popular activation functions are (with $x \in \mathbb{R}^n$, where $n$ is the number of features):

• ReLU, Rectified Linear Unit: $\max(x, 0)$, applied element-wise. This is the most commonly used activation function.
• Sigmoid: $\frac{1}{1 + e^{-x}}$. This used to be the most popular activation function, but it has fallen out of favor for hidden layers because it saturates for large inputs, which leads to vanishing gradients, and because the exponential is more expensive to compute than ReLU.
• Softmax: $\frac{e^{x_i}}{\displaystyle\sum_j e^{x_j}}$. This activation function scales every activation in a layer to the range $(0, 1)$ such that they sum to $1$.
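
The three functions can be written out directly with tensor operations. This is a minimal sketch on an arbitrary example vector, meant to mirror the formulas rather than to replace the built-in versions:

```python
import torch

x = torch.tensor([-2.0, 0.0, 3.0])  # arbitrary example activations

relu = torch.clamp(x, min=0)                 # max(x, 0), element-wise
sigmoid = 1 / (1 + torch.exp(-x))            # squashes each value into (0, 1)
softmax = torch.exp(x) / torch.exp(x).sum()  # positive values that sum to 1

print(relu, sigmoid, softmax, sep="\n")
```

In practice `torch.relu`, `torch.sigmoid` and `torch.softmax` compute the same values and are numerically more robust.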

### Computing the activations

Because computing the activation for each neuron individually would be very slow, we usually use linear algebra to compute the activations for an entire layer at once. Because the activations of a layer depend on the previous layer, the layers have to be computed one after another rather than all at once.

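Say $a^{(j)}$ is a vector containing the activations of layer $j$ and $\Theta^{(j)}$ is the matrix of weights from layer $j-1$ to layer $j$. You can then compute the activations of the layer like so:

$$a^{(j)} = g\left(\Theta^{(j)} a^{(j-1)}\right)$$

where $g$ is the activation function.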

In fancy terms, the process of computing the activations in this manner is called feedforward.
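
A feedforward pass can be sketched as a loop over the layers, with one matrix product per layer. The layer sizes, the random weights, the explicit bias vectors and the use of ReLU on every layer are all assumptions made for this illustration:

```python
import torch

torch.manual_seed(0)

# Hypothetical layer sizes: 4 inputs -> 3 hidden units -> 2 outputs
sizes = [4, 3, 2]
# One weight matrix and one bias vector per pair of consecutive layers
thetas = [torch.randn(n_out, n_in) for n_in, n_out in zip(sizes[:-1], sizes[1:])]
biases = [torch.zeros(n_out) for n_out in sizes[1:]]

def feedforward(x, thetas, biases):
    """Compute the activations layer by layer, starting from the input."""
    a = x
    for theta, b in zip(thetas, biases):
        a = torch.relu(theta @ a + b)  # whole layer in one matrix product
    return a

output = feedforward(torch.randn(4), thetas, biases)
print(output.shape)
```

Each iteration needs the result of the previous one, which is why the layers cannot be computed in parallel.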

### Backpropagation

Adjusting the weights is done using an algorithm called backpropagation. This algorithm is quite complex, so it really deserves its own post, which I’ll write later.

At a high level, backpropagation computes the gradient (first derivative) of the cost function with respect to every weight. Gradient descent then subtracts the gradient, scaled by a learning rate, from the weight matrix.
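
The idea can be tried on a toy problem. The sketch below uses a single weight and a made-up cost function $(w - 3)^2$ whose minimum is known. The learning rate and number of steps are arbitrary choices. PyTorch's autograd does the backpropagation part:

```python
import torch

# A single weight and a toy cost function with its minimum at w = 3
w = torch.tensor(0.0, requires_grad=True)
lr = 0.1  # learning rate (arbitrary choice)

for _ in range(100):
    cost = (w - 3) ** 2
    cost.backward()          # backpropagation: stores d(cost)/dw in w.grad
    with torch.no_grad():
        w -= lr * w.grad     # gradient descent step
    w.grad.zero_()           # reset the gradient for the next iteration

print(w.item())
```

After enough steps `w` ends up very close to 3, the value that minimizes the cost.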

### The cost function

In machine learning, classifiers use a different cost function from models that predict values in a continuous space. This is what it looks like:

, where:

• $L$: Total number of layers in the network
• $s_l$: number of units (not counting bias unit) in layer $l$
• $K$: number of output classes
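
For reference, a widely used form of the regularized cross-entropy cost that matches the symbols above is given below. This is an assumption about the intended formula, not a quote from it. The additional symbols are also assumptions: $m$ is the number of training examples, $\lambda$ the regularization strength, $h_\Theta(x^{(i)})$ the output of the network for example $i$, and $y_k^{(i)}$ is 1 if example $i$ belongs to class $k$ and 0 otherwise. The weights are indexed as in the feedforward section, so $\Theta^{(l)}$ connects layer $l-1$ to layer $l$:

$$J(\Theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{k=1}^{K} \left[ y_k^{(i)} \log\left(\left(h_\Theta(x^{(i)})\right)_k\right) + \left(1 - y_k^{(i)}\right) \log\left(1 - \left(h_\Theta(x^{(i)})\right)_k\right) \right] + \frac{\lambda}{2m} \sum_{l=2}^{L} \sum_{j=1}^{s_l} \sum_{i=1}^{s_{l-1}} \left(\Theta_{j,i}^{(l)}\right)^2$$

The first term penalizes wrong predictions, the second term penalizes large weights.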

If this function looks too complex, it just yields higher cost when the model was very certain about a wrong classification.

In deep learning jargon, this loss function is called cross entropy loss.
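
To make the "certain but wrong" behaviour concrete, here is a minimal sketch of cross entropy for a single example in the form PyTorch's `nn.CrossEntropyLoss` uses: the negative log of the probability assigned to the correct class. The probabilities are made up for illustration and the helper name `cross_entropy` is hypothetical:

```python
import math

def cross_entropy(probabilities, true_class):
    """Negative log of the probability assigned to the correct class."""
    return -math.log(probabilities[true_class])

true_class = 0
confident_right = cross_entropy([0.9, 0.05, 0.05], true_class)
unsure = cross_entropy([0.4, 0.3, 0.3], true_class)
confident_wrong = cross_entropy([0.01, 0.98, 0.01], true_class)

print(confident_right, unsure, confident_wrong)
```

The loss grows from the confident correct prediction to the unsure one, and is largest when the model is confident about the wrong class.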

## In PyTorch

Now that you know the theory, we can implement a neural network in Python using PyTorch.

## Constructing the network

All the images in MNIST are 28 by 28 pixels, so the size of the input layer is $28 \times 28 = 784$. The size of the hidden layer is 50. Because there are 10 classes, the output layer has 10 neurons.

All neural networks in PyTorch inherit from nn.Module. Creating the neural network is simple:

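    import torch.nn as nn
    import torch.nn.functional as F  # used for F.relu in forward

    class Mnist_NN(nn.Module):
        def __init__(self):
            super().__init__()
            # 784 input pixels -> 50 hidden units -> 10 output classes
            self.lin1 = nn.Linear(784, 50, bias=True)
            self.lin2 = nn.Linear(50, 10, bias=True)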
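
The layer sizes determine how many weights the network has to learn. As a quick check, the sketch below counts the parameters of two standalone layers with the same sizes. The helper `count_parameters` is a hypothetical name introduced for this example:

```python
import torch.nn as nn

lin1 = nn.Linear(784, 50, bias=True)
lin2 = nn.Linear(50, 10, bias=True)

def count_parameters(layer):
    """Total number of weights and biases in a layer."""
    return sum(p.numel() for p in layer.parameters())

# 784*50 weights + 50 biases, and 50*10 weights + 10 biases
print(count_parameters(lin1), count_parameters(lin2))
```

That is 39,250 parameters for the first layer and 510 for the second, 39,760 in total.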


Because PyTorch can’t infer the forward pass, we have to implement the forward method ourselves. As you can see, this network uses a ReLU after the hidden layer.

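        # Method of Mnist_NN: the feedforward pass
        def forward(self, xb):
            x = self.lin1(xb)    # input layer -> hidden layer
            x = F.relu(x)        # nonlinearity
            return self.lin2(x)  # hidden layer -> raw class scores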
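
Putting both pieces together, the network can be tried on a made-up mini-batch to check the shapes. The random input below is only a stand-in for real MNIST images:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mnist_NN(nn.Module):
    def __init__(self):
        super().__init__()
        self.lin1 = nn.Linear(784, 50, bias=True)
        self.lin2 = nn.Linear(50, 10, bias=True)

    def forward(self, xb):
        x = F.relu(self.lin1(xb))
        return self.lin2(x)

# A made-up mini-batch of 64 flattened 28x28 images
xb = torch.randn(64, 784)
scores = Mnist_NN()(xb)
print(scores.shape)  # one row per image, one column per class
```

Calling the model like a function runs `forward` under the hood.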


To use an NVIDIA GPU, tell PyTorch to use CUDA.

    model = Mnist_NN().cuda()
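
Calling `.cuda()` fails on a machine without a CUDA-capable GPU. A common alternative, sketched here with a plain `nn.Linear` as a stand-in for the model so the snippet runs on its own, is to pick the device at runtime:

```python
import torch
import torch.nn as nn

# Fall back to the CPU when no CUDA-capable GPU is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for Mnist_NN so the snippet is self-contained
model = nn.Linear(784, 10).to(device)

# Inputs have to live on the same device as the model
xb = torch.randn(8, 784, device=device)
print(model(xb).device)
```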


## Training

Cross entropy loss is predefined in PyTorch. nn.CrossEntropyLoss applies the softmax itself, which is why forward returns raw scores:

    loss_func = nn.CrossEntropyLoss()
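
To see what the predefined loss computes, the sketch below compares it with a hand-written version on made-up scores. The random logits and the labels are placeholders, not MNIST data:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
loss_func = nn.CrossEntropyLoss()

# Made-up raw scores (logits) for 4 images and 10 classes, plus class labels
logits = torch.randn(4, 10)
targets = torch.tensor([3, 0, 9, 1])

loss = loss_func(logits, targets)

# The same value by hand: mean negative log-softmax of the true class
manual = -F.log_softmax(logits, dim=1)[torch.arange(4), targets].mean()
print(loss.item(), manual.item())
```

Both numbers agree, which shows that the loss expects raw scores and integer class labels rather than probabilities and one-hot vectors.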


In order to train the network, we need to write an update method that runs backpropagation and then updates the weights.

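    def update(x, y, lr):
        # `optimizer` is not defined in this post; it is assumed to be created
        # beforehand, e.g. optimizer = torch.optim.Adam(model.parameters())
        for group in optimizer.param_groups:
            group["lr"] = lr  # apply the requested learning rate

        # Calculate the loss for this mini-batch
        y_hat = model(x)
        loss = loss_func(y_hat, y)

        # Find the gradient and take a step
        optimizer.zero_grad()  # clear gradients left over from the previous batch
        loss.backward()
        optimizer.step()

        # Return the loss
        return loss.item()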


An epoch is one full pass over the training set: for every mini-batch we perform feedforward and one step of backpropagation. We can perform one epoch by looping over the training data like so:

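    import matplotlib.pyplot as plt

    # data.train_dl: the training DataLoader, assumed to be defined elsewhere
    losses = [update(x, y, 1e-3) for x, y in data.train_dl]
    plt.plot(losses)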
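
For readers who want to run an epoch without the MNIST data at hand, here is a self-contained sketch. Everything in it is a stand-in: the random dataset replaces MNIST, a plain PyTorch `DataLoader` replaces `data.train_dl`, and Adam with a learning rate of 1e-3 is an assumed optimizer choice:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)

# Random stand-in for MNIST: 256 flattened images with random labels
dataset = TensorDataset(torch.randn(256, 784), torch.randint(0, 10, (256,)))
train_dl = DataLoader(dataset, batch_size=64, shuffle=True)

model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
loss_func = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

losses = []
for x, y in train_dl:  # one epoch = one pass over every mini-batch
    loss = loss_func(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(len(losses))  # one loss value per mini-batch
```

With 256 examples and a batch size of 64, one epoch consists of four weight updates.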


This model has a validation accuracy of 96.1% which is great for a first neural network from scratch!
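
Accuracy is the fraction of examples for which the highest-scoring class equals the true label. A minimal sketch with made-up scores follows. The helper name `accuracy` is hypothetical, and the numbers are unrelated to the 96.1% above:

```python
import torch

def accuracy(scores, targets):
    """Fraction of rows whose highest-scoring class matches the target."""
    predictions = scores.argmax(dim=1)
    return (predictions == targets).float().mean().item()

# Made-up scores for 4 examples and 3 classes
scores = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 1.5, 0.1],
                       [0.3, 0.2, 0.9],
                       [1.2, 0.4, 0.8]])
targets = torch.tensor([0, 1, 2, 2])

print(accuracy(scores, targets))
```

Three of the four predictions match their target, so the accuracy here is 0.75.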

## Conclusion

Neural networks consist of neurons connected by weights. These weights determine the values of the neurons, which allows the model to learn to represent data. Activation functions introduce nonlinearities, allowing the model to approximate any continuous function, and therefore a very wide range of patterns, given enough neurons and computational power.

There’s one thing left to learn: how the weights get updated using backpropagation. Stay tuned for that!

A huge thanks to Sam Miserendino for proofreading this post!