Beginner's Guide to Understanding Convolutional Neural Networks

Learn about the important components that make up a Convolutional Neural Network

Sabina Pokhrel

A Convolutional Neural Network (CNN) is a type of deep neural network that has proven to perform well in computer vision tasks such as image classification, object detection, object localization and neural style transfer. In this post, I will explain the different layers that make up a convolutional neural network: the convolution layer and the pooling layer.

A convolution layer transforms the input image in order to extract features from it. In this transformation, the image is convolved with a kernel (or filter).

Image convolution (source)

A kernel is a small matrix whose height and width are smaller than those of the image to be convolved. It is also known as a convolution matrix or convolution mask. The kernel slides across the height and width of the input image, and the dot product of the kernel and the image is computed at every spatial position. The number of pixels by which the kernel slides is known as the stride length. In the image below, the input image is of size 5X5, the kernel is of size 3X3 and the stride length is 1. The output image is also referred to as the convolved feature.
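The sliding-window computation described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than an optimized implementation; the function name `convolve2d` and the example values are my own, and, as in deep learning frameworks, no kernel flip is performed.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image and take a dot product at each position."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum
    return out

image = np.arange(25).reshape(5, 5)  # 5X5 input image
kernel = np.ones((3, 3))             # 3X3 kernel
print(convolve2d(image, kernel, stride=1).shape)  # (3, 3) convolved feature
```

With a 5X5 input, a 3X3 kernel and stride 1 the output is 3X3, matching the figure; increasing the stride shrinks the output further.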

When convolving a coloured (RGB) image, which has 3 channels, the kernel must have 3 channels as well. In other words, in convolution, the number of channels in the kernel must match the number of channels in the input image.

Convolution on RGB image (source)
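To make the channel-matching rule concrete, here is a small sketch: both the image and the kernel carry 3 channels, and the dot product sums over all of them, so a single kernel still produces a single-channel output. The function name `convolve_rgb` and the sizes are assumptions for illustration.

```python
import numpy as np

def convolve_rgb(image, kernel):
    """Convolve an (H, W, 3) image with a (kh, kw, 3) kernel into a 2-D output."""
    kh, kw, kc = kernel.shape
    assert image.shape[2] == kc  # channel counts must match
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # the sum runs over the height, width AND channels of the patch
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

rgb = np.random.rand(6, 6, 3)      # 6X6 RGB image
k = np.random.rand(3, 3, 3)        # 3X3 kernel with 3 channels
print(convolve_rgb(rgb, k).shape)  # (4, 4): one kernel gives one channel
```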

When we want to extract more than one feature from an image using convolution, we can use multiple kernels instead of just one. In that case, all the kernels must be the same size. The convolved features of the input image are stacked one after the other to create the output, so that the number of channels in the output equals the number of filters used. See the image below for reference.

Convolution of RGB image using multiple filters (kernels) (source)
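The stacking step can be sketched by applying each kernel separately and stacking the 2-D results along a new channel axis. `conv_multi` and the shapes below are my own illustrative choices.

```python
import numpy as np

def conv_multi(image, kernels):
    """Apply each (kh, kw, C) kernel to the image, then stack the outputs channel-wise."""
    maps = []
    for k in kernels:
        kh, kw, _ = k.shape
        out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * k)
        maps.append(out)
    return np.stack(maps, axis=-1)  # output channels == number of filters

image = np.random.rand(5, 5, 3)
kernels = [np.random.rand(3, 3, 3) for _ in range(4)]  # four same-size filters
print(conv_multi(image, kernels).shape)  # (3, 3, 4)
```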

An activation function is the last component of the convolution layer; it increases the non-linearity of the output. Generally, the ReLU or tanh function is used as the activation function in a convolution layer. Here is an image of a simple convolution layer, where a 6X6X3 input image is convolved with two kernels of size 4X4X3 to get a convolved feature of size 3X3X2, to which the activation function is applied to get the output, also referred to as the feature map.

A convolution layer (source)
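The layer in the figure can be reproduced end to end in NumPy: a 6X6X3 input, two 4X4X3 kernels, and ReLU applied element-wise to give a 3X3X2 feature map. The loop structure and variable names are my own sketch.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)  # element-wise non-linearity

image = np.random.randn(6, 6, 3)       # 6X6X3 input image
kernels = np.random.randn(2, 4, 4, 3)  # two 4X4X3 kernels

out = np.zeros((3, 3, 2))  # 6 - 4 + 1 = 3 in each spatial dimension
for f in range(2):
    for i in range(3):
        for j in range(3):
            out[i, j, f] = np.sum(image[i:i + 4, j:j + 4, :] * kernels[f])

feature_map = relu(out)  # the feature map of this convolution layer
print(feature_map.shape)  # (3, 3, 2)
```

Note that ReLU leaves the shape unchanged; it only zeroes out the negative entries of the convolved feature.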
