CNN Neural Network Forward Pass

This tutorial will show how to perform the forward pass of a convolutional neural network.

CNN Overview

The main advantage of using CNN layers instead of fully connected layers is that the number of parameters in fully connected layers can get unreasonably large. For example, if an input image is 100x100 pixels, that is an input size of 10,000 when unrolled. If the hidden layer has 100 nodes, that means the first layer has 1 million weights, since every input is connected to every hidden node.

And 100x100 is not a particularly large image. A 1000x1000 image with 100 hidden nodes would be 100 million weights for the first layer alone!

CNNs address this by using shared weights. A kernel containing weights (also called a filter) is slid over the image, and the weights of this kernel are applied at each location. In deep learning this operation is known as a convolution. Let's look at an example of how this is done.

Convolution Operation

Single Channel Convolution:

For this example the input is 3x3x1. This is comparable to a grayscale image, which has only one color channel. The kernel in this example is 2x2. The values of the input and the kernel weights are listed below.

Input (3x3):          Kernel (2x2):
 0.1   0.5   0.0       2   1
-1.0   0.3  -0.5      -1   1
-0.2   0.9  -0.4

To perform the convolution, the kernel strides across the input, and at each step every kernel weight is multiplied by the corresponding input value. These products are then summed, and the sum is the output for that step.

Convolution (stride 1):

Input (3x3):          Kernel (2x2):      Output (2x2):
 0.1   0.5   0.0       2   1              2.0   0.2
-1.0   0.3  -0.5      -1   1             -0.6  -1.2
-0.2   0.9  -0.4
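The single-channel convolution above can be sketched in a few lines of NumPy. This is a minimal illustration, not an optimized implementation, and the function name conv2d is just for this example:

```python
import numpy as np

def conv2d(x, k):
    """Valid cross-correlation of 2D input x with 2D kernel k (stride 1)."""
    h, w = x.shape
    kh, kw = k.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise multiply the current window by the kernel, then sum.
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

x = np.array([[0.1, 0.5, 0.0],
              [-1.0, 0.3, -0.5],
              [-0.2, 0.9, -0.4]])
k = np.array([[2.0, 1.0],
              [-1.0, 1.0]])
y = conv2d(x, k)  # approximately [[2.0, 0.2], [-0.6, -1.2]]
```

Note that, matching the discussion above, the kernel is applied without flipping, so this is mathematically a cross-correlation.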

This is known as 2d convolution because the kernel moves around in two dimensions. NOTE: Although this operation is referred to as a convolution, mathematically it's actually a cross-correlation. In a true convolution, the kernel would be rotated 180°. However, in deep learning the kernel is NOT flipped, but we refer to the operation as a convolution anyway.

It's also possible to have a 1d convolution or even a 3d convolution. A typical input for a 1d convolution is a time series, along which a 1d kernel strides. A typical input for a 3d convolution is a video, where a 3d kernel strides in 3 dimensions: across the height of a frame, the width of a frame, and across multiple frames.

NOTE: Each kernel typically includes a single scalar bias, which is added to every output value (this is not shown in the examples above).

Multi-Channel Convolution:

In the previous example, the input and the kernel each had a single channel (i.e. their depth was each one). However, an input to a convolution layer usually has multiple channels. For example, a color image has 3 channels (rgb). The kernel does not stride across these channels. Instead, the kernel has its own dimension that corresponds to each of the input channels. And this is generally true for all convolution layers:

The depth of the kernel will equal the depth of the input.

That is, the number of channels that a kernel has must be the same as the number of channels of the input to the convolution layer. (Although it is possible to have a non-standard convolution where the kernel has a single channel that strides over the input depth.)

Here is an example of an input with two channels. The kernel also has two channels. Each channel of the kernel slides around the two spatial dimensions (height and width) of its corresponding input channel, and the values are summed. It can be thought of as multiple single channel convolutions, where the result is summed up over all the channels on each step of a stride. This is depicted below:

Figure 1: Convolution with 2 input channels (source)
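The multi-channel case extends the single-channel sketch by summing over the channel dimension as well. Again a minimal NumPy illustration with illustrative names, and a check that it matches the sum of per-channel convolutions:

```python
import numpy as np

def conv2d_multichannel(x, k):
    """x: (C, H, W) input; k: (C, kh, kw) kernel -> a single 2D feature map."""
    _, kh, kw = k.shape
    h, w = x.shape[1:]
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # The sum runs over the channel dimension too, collapsing depth to 1.
            out[i, j] = np.sum(x[:, i:i + kh, j:j + kw] * k)
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 3))  # 2-channel 3x3 input
k = rng.standard_normal((2, 2, 2))  # matching 2-channel 2x2 kernel
y = conv2d_multichannel(x, k)       # single-channel 2x2 output
```

The result is the same as convolving each channel separately and adding the per-channel outputs, which is exactly the "multiple single channel convolutions, summed" view described above.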

As can be seen, regardless of the number of input channels, a single kernel will produce an output with a single channel. So how does the output get a depth greater than one? Multiple kernels!

Multi-Kernel Convolution:

Any real world model will have many kernels per convolution layer in order to be sufficiently expressive. The general architecture is to increase the number of kernels as the layers get deeper. This works because the spatial dimensions shrink in deeper layers, so the number of channels can grow without the number of parameters getting too large.

In order to use multiple kernels, the operation shown in Figure 1 is repeated with each kernel independently. All of the individual 2d outputs (known as feature maps) are then stacked depthwise, resulting in a 3d output. Thus:

The depth of the output will be equal to the number of kernels.

An example of an input with two channels convolved with 2 kernels is shown in Figure 2 below.

Figure 2: Convolution with 2 input channels and 2 kernels

Strides:

In the examples above the kernel slid over the input one cell at a time. However, it's common for the kernel to take bigger steps. The size of these steps is known as the stride. The example below shows a stride of 2.

Convolution with stride 2:

Input (4x4):                 Kernel (2x2):      Output (2x2):
 0.1   0.5   0.0  -1.0        2   1             -0.1   0.1
 0.3  -0.5  -0.2   0.9       -1   1              0.4  -0.6
-0.4   0.5  -0.2  -0.1
-0.3   0.4   0.3   0.2

The main advantage of using a stride greater than 1 is that it greatly reduces the spatial dimensions of the output.

NOTE: Any position that puts the kernel outside the input is not included. For instance, in the example above if the stride was 3 the output would be shape 1x1 because shifting the kernel 3 spaces would move part of it outside the input.
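Adding a stride parameter to the earlier single-channel sketch only changes how the window origin advances; positions that would push the kernel past the edge are dropped, which is where the floor comes from. A minimal illustration:

```python
import numpy as np

def conv2d_strided(x, k, stride=1):
    """Valid cross-correlation with a configurable stride."""
    h, w = x.shape
    kh, kw = k.shape
    # Positions that would place the kernel outside the input are not included.
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = np.sum(x[r:r + kh, c:c + kw] * k)
    return out

x = np.array([[0.1, 0.5, 0.0, -1.0],
              [0.3, -0.5, -0.2, 0.9],
              [-0.4, 0.5, -0.2, -0.1],
              [-0.3, 0.4, 0.3, 0.2]])
k = np.array([[2.0, 1.0],
              [-1.0, 1.0]])
y = conv2d_strided(x, k, stride=2)  # approximately [[-0.1, 0.1], [0.4, -0.6]]
```

With stride=3 on the same 4x4 input the output shrinks to 1x1, matching the note above.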

Padding:

Even if the stride is 1, the output will be spatially reduced from the input (assuming the kernel size is greater than 1). In order to keep the output and input sizes the same, padding can be placed around the input. Padding is simply additional values placed around the perimeter of the input. Typically, these values are 0s. Below are examples of a 2x2 input padded with size 1 and size 2.

2x2 input with padding 1 (4x4 result):
 0     0     0     0
 0    -0.5  -0.7   0
 0     0.4  -0.2   0
 0     0     0     0

2x2 input with padding 2 (6x6 result):
 0     0     0     0     0     0
 0     0     0     0     0     0
 0     0    -0.5  -0.7   0     0
 0     0     0.4  -0.2   0     0
 0     0     0     0     0     0
 0     0     0     0     0     0
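Zero padding like this is a one-liner with NumPy's np.pad, whose default mode is constant zeros:

```python
import numpy as np

x = np.array([[-0.5, -0.7],
              [0.4, -0.2]])

# Zero-pad 1 cell on every side: 2x2 -> 4x4
padded1 = np.pad(x, pad_width=1, mode="constant", constant_values=0)

# Zero-pad 2 cells on every side: 2x2 -> 6x6 (constant-zero is the default mode)
padded2 = np.pad(x, pad_width=2)
```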

Padding is typically referred to by 3 names that may not be obvious. Here are definitions of the common padding terms:

Valid padding: no padding at all. The kernel only visits positions that lie fully inside the input, so the output is smaller than the input.

Same padding: just enough padding that, with a stride of 1, the output has the same spatial size as the input.

Full padding: enough padding that every input value is visited by every kernel weight, so the output is larger than the input.

Computing Output Size:

Square inputs and kernels are usually used, so the height is usually the same as the width. The output width, \(w_{out}\), given input width \(w_{in}\), kernel size \(k\), padding \(p\), and stride \(s\), is calculated as follows: \[ w_{out} = \textrm{floor}\left(\frac{w_{in} - k + 2p}{s}\right) + 1 \] Of course, if the height and width are different, the above formula can be used by substituting in the relevant height values.
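The formula is easy to check against the earlier examples. A small helper (the name is just for illustration; integer floor division implements the floor for non-negative arguments):

```python
def conv_output_size(w_in, k, p=0, s=1):
    """Output width given input width w_in, kernel size k, padding p, stride s."""
    return (w_in - k + 2 * p) // s + 1

conv_output_size(3, 2)         # 2: the 3x3 input with a 2x2 kernel, stride 1
conv_output_size(4, 2, s=2)    # 2: the 4x4 stride-2 example
conv_output_size(4, 2, s=3)    # 1: stride 3 drops positions outside the input
conv_output_size(2, 3, p=1)    # 2: padding can keep the output size up
```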

Pooling Operation

Convolutional neural networks typically have one additional component not found in vanilla neural networks: pooling layers. Pooling layers have no learnable parameters, but they reduce the spatial dimension.

Aside from reducing the memory requirements and computational load by reducing the size of the feature map, pooling layers can make the network more robust to translations. Specifically, max pooling layers introduce some translational invariance into the model. Translational invariance means that if the input is translated the output is unchanged. However, there is a limit to the amount of this invariance, and significant translations will still usually yield different outputs even with many max pooling layers.

The max pooling layers also add a small amount of rotational invariance. This rotational invariance is far smaller than the translational invariance, and tends to hold only for rotations of a few degrees.

The main downsides of max pooling layers are that they throw away information and that invariance may not be desired (for example with object detection or semantic segmentation, where precise locations of pixels are wanted).

Now let's look at an example of how max pooling works mechanically.

Max Pooling:

Max pooling is similar to convolution in that it strides a kernel over an input (or feature map). There are two main differences. First, rather than doing a set of elementwise multiplications followed by a sum, the max value is extracted from the region the kernel is over. Second, the pooling layer is done independently per channel, so the number of input channels equals the number of output channels. An example using a 2x2 kernel and a stride of 1 is shown below.

Max Pool (2x2 kernel, stride 1):

Input (3x3):           Output (2x2):
 0.1   0.5   0.0        0.5   0.5
-1.0   0.3  -0.5        0.9   1.1
-0.2   0.9   1.1
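A minimal NumPy sketch of max pooling over a single channel (per the text, a real pooling layer would apply this independently to every channel):

```python
import numpy as np

def max_pool2d(x, k=2, stride=1):
    """Max pooling of a single 2D feature map."""
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Take the largest value in the window instead of a weighted sum.
            out[i, j] = np.max(x[i * stride:i * stride + k, j * stride:j * stride + k])
    return out

x = np.array([[0.1, 0.5, 0.0],
              [-1.0, 0.3, -0.5],
              [-0.2, 0.9, 1.1]])
y = max_pool2d(x)  # [[0.5, 0.5], [0.9, 1.1]]
```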

Average Pooling:

Another type of pooling operation is average pooling. This simply averages the values the pooling kernel is over. However, this is used less often than max pooling since max pooling layers tend to perform better. The likely reason for the superior performance of max pooling is that it keeps the strongest features while getting rid of the less important ones. An example using a 2x2 kernel and a stride of 1 is shown below.

Average Pool (2x2 kernel, stride 1):

Input (3x3):           Output (2x2):
 0.1   0.5   0.0       -0.025   0.075
-1.0   0.3  -0.5        0.0     0.45
-0.2   0.9   1.1
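Average pooling is the same sliding-window sketch with np.mean in place of np.max:

```python
import numpy as np

def avg_pool2d(x, k=2, stride=1):
    """Average pooling of a single 2D feature map."""
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Average the window values instead of taking the max.
            out[i, j] = np.mean(x[i * stride:i * stride + k, j * stride:j * stride + k])
    return out

x = np.array([[0.1, 0.5, 0.0],
              [-1.0, 0.3, -0.5],
              [-0.2, 0.9, 1.1]])
y = avg_pool2d(x)  # approximately [[-0.025, 0.075], [0.0, 0.45]]
```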

Global Average Pooling:

A third type of pooling operation is global average pooling. This simply averages the values over entire feature maps. It's equivalent to using average pooling with a kernel equal to the spatial dimension of the feature map. This can be useful as an output layer.

Global Average Pool:

Input (3x3):           Output (1x1):
 0.1   0.5   0.0        ≈0.133
-1.0   0.3  -0.5
-0.2   0.9   1.1
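Since global average pooling is just the mean over an entire feature map, it needs no sliding window at all:

```python
import numpy as np

x = np.array([[0.1, 0.5, 0.0],
              [-1.0, 0.3, -0.5],
              [-0.2, 0.9, 1.1]])

# Global average pooling collapses each feature map to a single scalar.
y = np.mean(x)  # (sum of all 9 values) / 9 = 1.2 / 9, approximately 0.133
```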

Summary

CNNs typically stack repeating units of convolution layers, ReLU layers, and pooling layers (e.g. conv layer 1, ReLU, max pool, conv layer 2, ReLU, max pool, etc.). With this architecture the spatial dimensions are reduced while the number of channels increases.

At the end of the network fully connected (FC) layers with the desired number of outputs and a softmax are often used. In order to transition the convolution layers to the first fully connected layer, either the spatial dimension is reduced to one through the convolutions/pooling layers, or the 2d spatial dimension is unrolled (i.e. flattened) into a single dimension. This is shown in Figure 3 below.
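The whole forward pass described in this tutorial can be strung together in plain NumPy. This is a hedged toy sketch, not a real architecture: the input size, layer sizes, and random weights are all made up for illustration.

```python
import numpy as np

def conv(x, kernels, biases):
    """x: (C, H, W); kernels: (N, C, kh, kw); biases: (N,) -> (N, H', W')."""
    n, _, kh, kw = kernels.shape
    h, w = x.shape[1:]
    out = np.zeros((n, h - kh + 1, w - kw + 1))
    for m in range(n):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[m, i, j] = np.sum(x[:, i:i + kh, j:j + kw] * kernels[m]) + biases[m]
    return out

def relu(x):
    return np.maximum(x, 0)

def max_pool(x, k=2, s=2):
    """Pool each channel independently, so depth is unchanged."""
    c, h, w = x.shape
    out = np.zeros((c, (h - k) // s + 1, (w - k) // s + 1))
    for m in range(c):
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[m, i, j] = np.max(x[m, i * s:i * s + k, j * s:j * s + k])
    return out

def softmax(z):
    e = np.exp(z - np.max(z))  # shift for numerical stability
    return e / np.sum(e)

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 8, 8))              # fake 8x8 3-channel input
k1 = rng.standard_normal((4, 3, 3, 3)) * 0.1    # 4 kernels, 3 channels deep, 3x3
b1 = np.zeros(4)
W = rng.standard_normal((10, 4 * 3 * 3)) * 0.1  # FC layer mapping to 10 classes
b = np.zeros(10)

h1 = max_pool(relu(conv(x, k1, b1)))  # conv -> ReLU -> max pool: (4, 3, 3)
flat = h1.reshape(-1)                 # flatten the 3d volume to 36 values
probs = softmax(W @ flat + b)         # class probabilities summing to 1
```

Note how the conv/ReLU/pool stage shrinks the spatial dimensions (8x8 down to 3x3) while the depth grows from 3 to 4, and the flatten step is what bridges the convolutional part to the fully connected part.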

Figure 3: CNN architecture (source)