This page summarizes and explains what a Convolutional Neural Network is and why it is beneficial for image recognition problems.

Background

Multilayer neural networks trained with gradient descent have long been applied to problems such as image recognition and pattern recognition.

Traditional Method

This is the procedure followed by traditional machine learning algorithms. Several problems were present in the overall process.

1. Hand-designed feature extractor.

The collection of raw data and the extraction of useful features are done by hand. Information is inevitably lost in this process.

2. Too many parameters

A single image contains several hundred pixels. Feeding these directly into a multilayer network forces the first layer alone to have thousands of weights. In addition, for certain tasks, including but not limited to handwriting recognition, the shapes in the input images vary considerably. These facts demand more memory and larger training sets.
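To see the scale of the problem, compare the weight count of a fully-connected first layer against a single shared convolutional kernel. The layer sizes below are illustrative choices, not numbers from the paper:

```python
# Parameter counts for a 28x28 input (illustrative sizes, not from the paper).
input_pixels = 28 * 28          # 784 raw pixel inputs
hidden_units = 100              # hypothetical fully-connected hidden layer

fc_weights = input_pixels * hidden_units  # one weight per pixel-unit pair
conv_weights = 5 * 5                      # one shared 5x5 kernel, reused everywhere

print(fc_weights)    # 78400
print(conv_weights)  # 25
```

The fully-connected layer needs tens of thousands of weights before any useful depth is added; weight sharing cuts that to a handful per feature map.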

3. Loss of data in topology

The image is flattened into a 1D vector, which discards the spatial relationships between pixels that exist in 2D.
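A tiny numpy example makes the loss concrete. In the flattened vector, horizontal neighbours stay adjacent, but vertical neighbours end up a full row width apart:

```python
import numpy as np

img = np.arange(16).reshape(4, 4)   # toy 4x4 "image"; each pixel value equals its flat index
flat = img.ravel()

# (1, 1) and (1, 2) are horizontal neighbours: 1 apart in the flat vector.
assert flat[1 * 4 + 2] - flat[1 * 4 + 1] == 1
# (1, 1) and (2, 1) are vertical neighbours: a full row (4) apart after flattening.
assert flat[2 * 4 + 1] - flat[1 * 4 + 1] == 4
```

A network fed the flat vector has no built-in way to know that positions 5 and 9 were vertically adjacent pixels.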

LeCun attempted to solve these issues with the ideas proposed in his LeNet models.

LeNet

LeNet, the family of networks proposed by LeCun, comprises five different versions. This post demonstrates how LeNet-1 and LeNet-5 function.

LeNet-1

LeNet-1 was proposed in 1989. It already comprises features similar to those of modern Convolutional Neural Network models.

Architecture of LeNet-1

It can be seen that the model consists of alternating convolutional and pooling layers. Some of its notable features include:

  1. Creating “feature maps” using 5 x 5 convolutional kernels,
  2. Shared weights, in which the entire image shares the same weights and bias within a convolutional kernel,
  3. Subsampling using average pooling.

LeNet-5

The differences between LeNet-1 and LeNet-5 are the input size, the number of feature maps, and the size of the fully-connected layers.

Architecture of LeNet-5

The layers are abbreviated $C$, $S$, and $F$: $C$ stands for convolutional layer, $S$ for sub-sampling (pooling) layer, and $F$ for fully-connected layer.

The input size is increased to 32x32. The original images are 28x28, but they are padded so that strokes near the corners and outer edges can still fall within the centers of the receptive fields.
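The padding itself is a two-pixel border on every side, which a 5x5 valid convolution then shrinks back to 28x28:

```python
import numpy as np

digit = np.random.rand(28, 28)     # an MNIST-sized image
padded = np.pad(digit, 2)          # zero-pad 2 pixels on every side
print(padded.shape)                # (32, 32)

# A 5x5 valid convolution on the padded input yields a 28x28 feature map,
# so pixels at the original border can sit near the center of a receptive field.
assert padded.shape[0] - 5 + 1 == 28
```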

The number of feature maps generated is increased from 4 and 12 to 6 and 16. In addition, there are 3 fully-connected layers.

One aspect worth noting is that the number of feature maps increases from 6 to 16 at $C_3$. Here, not every feature map from the previous layer contributes to each $C_3$ map.

Input Layer C3

The first feature map among those 16 uses only feature maps 0, 1, and 2 from the previous layer. The purpose is to keep the amount of computation within reasonable bounds and to break symmetry, so that different feature maps are forced to extract different features.
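The full $S_2 \to C_3$ connection scheme, as given in Table I of the LeNet-5 paper, can be written down and checked against the parameter count the paper reports (1,516 trainable parameters for $C_3$):

```python
# S2 -> C3 connection table (Table I of the LeNet-5 paper):
# each of the 16 C3 maps lists the S2 maps it draws from.
C3_INPUTS = [
    (0, 1, 2), (1, 2, 3), (2, 3, 4), (3, 4, 5), (0, 4, 5), (0, 1, 5),
    (0, 1, 2, 3), (1, 2, 3, 4), (2, 3, 4, 5), (0, 3, 4, 5), (0, 1, 4, 5),
    (0, 1, 2, 5), (0, 1, 3, 4), (1, 2, 4, 5), (0, 2, 3, 5),
    (0, 1, 2, 3, 4, 5),
]

# Each connected S2 map contributes one 5x5 kernel; each C3 map adds one bias.
params = sum(len(inputs) * 25 + 1 for inputs in C3_INPUTS)
print(params)  # 1516
```

The first six maps take triples of contiguous $S_2$ maps, the next nine take quadruples, and the last takes all six, which is how 6 inputs fan out into 16 outputs without full connectivity.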

Results

For evaluating the results, the Modified NIST (MNIST) datasets are used.

Dataset Used

The learning rate was decreased manually over the course of training, from 0.0005 down to 0.00001.
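A manual schedule like this amounts to step decay. The sketch below uses the rates from the range above, but the epoch boundaries are hypothetical, not taken from the paper:

```python
# Step-decay sketch of a manually tuned learning-rate schedule.
# Rates span 0.0005 -> 0.00001 as stated; epoch boundaries are hypothetical.
def learning_rate(epoch):
    schedule = [(0, 0.0005), (2, 0.0002), (5, 0.0001), (8, 0.00005), (12, 0.00001)]
    rate = schedule[0][1]
    for start, r in schedule:
        if epoch >= start:
            rate = r          # keep the rate of the latest boundary passed
    return rate

print(learning_rate(0))   # 0.0005
print(learning_rate(20))  # 1e-05
```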

Result

This is the result of training and testing different algorithms on the dataset. In terms of error rate, the LeNets provided significant improvements over the others.

In addition, when tested on the traditional hardware of the time, the model proved advantageous in memory usage and timing compared to models that require pre-training.

The actual paper can be found here.