Vanilla Neural Network Forward Pass
This tutorial will show how to perform the forward pass of a vanilla neural network.
Neural Network Overview
A neural network is an algorithm that takes an input and produces an output. Neural networks are typically used for applications where the underlying process is difficult (if not impossible) to model directly. They are highly parameterized, with the largest ones containing billions of parameters. The values of these parameters are learned from data. The goal is to have the network provide an approximation of some real process. Thus, neural networks are often called "function approximators".
The following are some common applications of neural networks:
| Input | Output |
|---|---|
| Image | Classification of image, locations of objects in image |
| Audio File | Speech to text, language translation |
| Stock Data | Future price predictions |
| Data from Sensors | Instructions to actuators (e.g. car gas or steering) |
The most basic neural networks (often referred to as vanilla neural networks) consist of two components: linear layers (also known as fully-connected (FC) layers) and activation functions.
Linear Layers:
A linear layer is a linear transformation, computed by multiplying the layer's input by a weight matrix. The input has dimensions [batch size, nfeatures].
batch size is the number of different instances (also called examples) that will be passed through the network. For example, 32 different images may be passed through the network at once.
nfeatures is the number of features each instance has. For example, if the input is an image of size 28x28 pixels, that has been unrolled to a vector (size 784 = 28 * 28), then we say this has 784 features.
The input, \(X\), is multiplied by the weights matrix, \(W\). The shape of \(W\) is [infeatures, outfeatures], where outfeatures is the number of features after the transformation.
Additionally, a bias term, \(b\), is added. This is a vector of size outfeatures, and it adds a different scalar to each of the features calculated during the multiplication of \(X\) and \(W\). The complete linear layer is shown below, and the output of this layer is often referred to as \(Z\): \[Z = XW + b\] Here \(Z\) will have shape [batch size, outfeatures].
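As a sketch of this computation, the layer is a single matrix multiply plus a broadcast add in NumPy (the shapes and values here are illustrative, not from any particular network):

```python
import numpy as np

rng = np.random.default_rng(0)

X = rng.standard_normal((2, 3))   # [batch size, in_features]
W = rng.standard_normal((3, 4))   # [in_features, out_features]
b = np.zeros(4)                   # one bias per output feature

Z = X @ W + b                     # b is broadcast across every row of the batch
print(Z.shape)                    # (2, 4)
```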
Activation functions:
ReLU:
In order to allow the neural network to be expressive enough to capture a wide variety of processes, it needs to be non-linear. This is done by passing the output of the linear layer through non-linear activation functions.
The most common activation function is the Rectified Linear Unit (ReLU): \[\textrm{ReLU}(x) = \begin{cases}0 & x<0\\ x & x \geq 0\\ \end{cases}\] When \(x\) is greater than or equal to \(0\), \(x\) is unchanged; otherwise the output is \(0\).
Sigmoid:
Another common activation function is the sigmoid function: \[\textrm{Sigmoid}(x) = \dfrac{1}{1+e^{-x}}\] Sigmoid is used less often, primarily because its gradient gets very small for inputs of large magnitude (whether positive or negative), leading to slow learning.
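Both activations are one-liners in NumPy; a minimal sketch:

```python
import numpy as np

def relu(x):
    # Elementwise max(0, x): negative entries become 0
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-x))

x = np.array([-2.0, 0.0, 2.0])
print(relu(x))        # [0. 0. 2.]
print(sigmoid(0.0))   # 0.5
```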
Softmax:
Unlike the ReLU and sigmoid which take scalars as input, the softmax function takes a vector as its input and outputs a normalized vector of the same length. That is, the softmax scales each element of the input vector so that the sum of all the elements in the output equals 1. \[\textrm{Softmax}(\overrightarrow{z})_i = \dfrac{e^{z_i}}{\sum_{j=1}^K e^{z_j}}\] In words this is saying, "Calculate \(e^{z_j}\) for every element \(z_j\) in \(\overrightarrow{z}\), and then sum them together. The softmax value for a specific element \(z_i\) is \(e^{z_i}\) divided by this sum."
Here's an example: \[\textrm{Softmax}([2.0,\ 0.5,\ 0.1]) \approx [0.73,\ 0.16,\ 0.11]\]
The softmax function is used to create a probability distribution. This is useful so that neural networks can provide a probability for how certain they are in their predictions. If the example above is the prediction of which of three classes an input image belongs to, the network is 73% sure the image belongs to the first class.
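A small NumPy sketch of the softmax (subtracting the maximum before exponentiating is a standard numerical-stability trick; it does not change the result):

```python
import numpy as np

def softmax(z):
    # Shift by the max for numerical stability, exponentiate, then normalize
    e = np.exp(z - np.max(z))
    return e / e.sum()

p = softmax(np.array([2.0, 0.5, 0.1]))
print(p.round(2))   # [0.73 0.16 0.11]
# The entries always sum to 1 (up to floating-point error)
```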
Loss Functions:
Loss functions are not needed when making predictions. They are used when training the model. They give an evaluation for how good the prediction is compared to the true known value (the 'ground truth'), and this error is used to train the model.
The two most common and basic loss functions are mean squared error (MSE) and cross entropy loss (CE), which is sometimes called log loss.
MSE:
MSE is simply the average of the squared errors: for every prediction, calculate the difference from the ground truth, square this value, and average over all the instances. \[\textrm{MSE} = \dfrac{1}{n}\sum_{i=1}^n (Y_i - \hat{Y}_i)^2\] MSE is usually used for regression tasks, where the network is trying to predict a specific value, for example the price of a house.
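The formula translates directly to NumPy; a minimal sketch with made-up values:

```python
import numpy as np

def mse(y_true, y_pred):
    # Mean of the squared differences over all instances
    return np.mean((y_true - y_pred) ** 2)

y_true = np.array([3.0, -0.5, 2.0])
y_pred = np.array([2.5,  0.0, 2.0])
print(mse(y_true, y_pred))  # (0.25 + 0.25 + 0) / 3 ≈ 0.1667
```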
Cross Entropy:
Cross entropy loss, on the other hand, is usually used for categorical predictions, such as predicting what object is in an image. It is calculated by taking the negative log of the probability corresponding to the target class, and averaging this over all the instances. \[ CE = \dfrac{1}{n}\sum_{i=1}^n -\log(P^i_k)\] Where \(P^i_k\) is the predicted probability corresponding to the target class of the \(i^{th}\) instance. (Note this is not necessarily the prediction with the highest value, for example in the case where the model strongly predicts the wrong class.)
If the model is perfectly accurate and predicts \(1\) for the target class, the loss is \(0\), since \(-\log(1)\) is \(0\). And if the model is as far off as possible, and predicts \(0\) for the target class, then the loss is \(\infty\), since \(-\log(0)\) is \(\infty\). In reality, since the probabilities are calculated using the softmax function, the probability for a class will never be exactly \(0\), and thus the loss will never be \(\infty\).
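A short NumPy sketch of the calculation (the probabilities and target labels here are made up for illustration):

```python
import numpy as np

def cross_entropy(probs, targets):
    # probs: [batch size, n_classes], each row a probability distribution
    # targets: the index of the true class for each instance
    n = len(targets)
    # Pick out the probability of the target class for each instance
    return np.mean(-np.log(probs[np.arange(n), targets]))

probs = np.array([[0.1, 0.8, 0.1],
                  [0.7, 0.2, 0.1]])
targets = np.array([1, 0])
print(cross_entropy(probs, targets))  # -(log(0.8) + log(0.7)) / 2 ≈ 0.29
```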
Worked out Example
Now we'll go through the entire calculation from input to loss for a small neural network.
For this example the input will have 3 features and a batch size of 2. We'll pretend we are doing a classification problem with 3 classes, thus we'll use cross entropy loss.
The network architecture is as follows:
- Linear Layer
- ReLU Activation
- Linear Layer
- Softmax
- Cross Entropy Loss
Layers that are neither the input nor the output are often called hidden layers. Here the first linear layer produces a hidden representation and the second produces the output, so this is a network with 1 hidden layer.
Let's propagate an input through the network, looking at each component of the architecture in sequence.
Linear Layer 1:
We'll start with random data for the input, which has a shape of [2, 3], corresponding to the batch size and number of features respectively. \[ X = \begin{bmatrix} 1 & -1 & 0 \\ 3 & 1 & 2 \end{bmatrix} \] So our input, \(X\), has 2 instances with 3 features each. We have to decide how many features we want the output of the linear layer to have. Let's go with 4. This means the weight matrix, \(W_1\), for the linear layer will have shape [3, 4]. Once again, we'll assign it random weights: \[ W_1 = \begin{bmatrix} 1 & -1 & 3 & 1 \\ -2 & 0 & 4 & -3\\ 4 & 3 & -4 & 1 \end{bmatrix} \] For simplicity I've kept all the values as integers, but in practice the weights are floats, typically initialized in the range -1 to 1. Much research has been done on effective weight initialization schemes. A common one for linear layers is to generate values from a uniform distribution in the range \((-\sqrt{k}, \sqrt{k})\), where \(k = \dfrac{1}{in\_features}\).
We also need a bias term, \(b_1\) with one bias for each of the output features: \[ b_1 = \begin{bmatrix} 3 & 0 & 2 & 0 \end{bmatrix} \] The linear layer is computed below: \[ Z_1 = XW_1 + b_1 = \begin{bmatrix} 6 & -1 & 1 & 4 \\ 12 & 3 & 7 & 2 \end{bmatrix} \]
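This computation can be checked with a few lines of NumPy, using the exact values above:

```python
import numpy as np

X  = np.array([[1, -1, 0],
               [3,  1, 2]])
W1 = np.array([[ 1, -1,  3,  1],
               [-2,  0,  4, -3],
               [ 4,  3, -4,  1]])
b1 = np.array([3, 0, 2, 0])

Z1 = X @ W1 + b1   # matrix multiply, then broadcast-add the bias row
print(Z1)
# [[ 6 -1  1  4]
#  [12  3  7  2]]
```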
ReLU Activation:
For the ReLU activation we simply replace every negative value with a 0. This function is what gives the network its non-linearity, allowing it to be far more expressive. \[ A = \textrm{ReLU}(Z_1) = \begin{bmatrix} 6 & 0 & 1 & 4 \\ 12 & 3 & 7 & 2 \end{bmatrix} \]
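In NumPy this step is a single elementwise maximum over the \(Z_1\) values from the previous step:

```python
import numpy as np

Z1 = np.array([[ 6, -1, 1, 4],
               [12,  3, 7, 2]])
A = np.maximum(0, Z1)   # ReLU zeroes out the single negative entry
print(A)
# [[ 6  0  1  4]
#  [12  3  7  2]]
```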
Linear Layer 2:
For the previous linear layer we got to decide how many features its output should have (we went with 4). However, since this is the last linear layer, its output is the output of the network. And since we have defined the problem as a classification with 3 classes, we need 3 outputs, so that we can assign a probability to each of the classes. Thus, our weight matrix, \(W_2\), will have shape [4, 3] and \(b_2\) will be [1, 3]:
Once again, these values were just randomly selected. Computing them as before gets \(Z_2\): \[ Z_2 = AW_2 + b_2 = \begin{bmatrix} 1 & 2 & 3\\ 14 & 11 & 11 \end{bmatrix} \]
Softmax
If all we wanted was to use this network to make predictions there would be no reason to go further (assuming the model was already trained). This is because the ordering of the inputs to the softmax is the same as the ordering of the outputs. That is, the highest input to the softmax will also be the highest output, the lowest input will be the lowest output, etc.
However, if we want to see the actual probabilities, and of course if we want to set the model up to be able to train, we need the softmax:
\[ P = \textrm{Softmax}(Z_2) = \begin{bmatrix} 0.09 & 0.24 & 0.67 \\ 0.91 & 0.05 & 0.05 \end{bmatrix} \]
This means the first instance most likely belongs to class 3 (67% probability) and the second instance most likely belongs to class 1 (91% probability). Note how the sum of each row is 1 (aside from rounding error).
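The row-wise softmax can be checked numerically with the \(Z_2\) values from the previous step:

```python
import numpy as np

Z2 = np.array([[ 1.0,  2.0,  3.0],
               [14.0, 11.0, 11.0]])

# Row-wise softmax: subtract each row's max for stability, then normalize
e = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
P = e / e.sum(axis=1, keepdims=True)
print(P.round(2))
# [[0.09 0.24 0.67]
#  [0.91 0.05 0.05]]
```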
Cross Entropy Loss
If we were training the model, we would need to calculate the loss, in this case using cross entropy loss. Calculating the loss requires knowing the true label (that is, what class each instance truly belongs to). It is calculated as follows, where \(P_i^t\) is the probability of the target class for the \(i^{th}\) instance:
\[L_{CE} = \dfrac{1}{m}\sum_{i=1}^{m}-\log(P_i^t)\]In words this means: take the negative log of the probability corresponding to the true label for each instance in the batch, sum all these up, and divide by the batch size.
Let's say the first instance actually belongs to class 2, and the second instance belongs to class 1. The first instance has a predicted probability of 24% that it belongs to class 2. The second instance has a predicted probability of 91% that it belongs to class 1. Thus, the loss is calculated as follows:
Using the softmax probabilities before rounding (0.2447 and 0.9094): \( L_{CE} = \dfrac{-\log(0.2447) - \log(0.9094)}{2} \approx 0.75 \)
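This result can be verified numerically; the sketch below recomputes the softmax from \(Z_2\) so the loss uses the unrounded probabilities:

```python
import numpy as np

Z2 = np.array([[ 1.0,  2.0,  3.0],
               [14.0, 11.0, 11.0]])
e = np.exp(Z2 - Z2.max(axis=1, keepdims=True))
P = e / e.sum(axis=1, keepdims=True)

# Instance 1 belongs to class 2 (index 1); instance 2 to class 1 (index 0)
targets = np.array([1, 0])
loss = np.mean(-np.log(P[np.arange(2), targets]))
print(round(loss, 2))  # 0.75
```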
And that's everything for the forward pass, including the loss calculation!