You Only Look Once - YOLO - is a traditional approach that is used in object detection. YOLO applies a single neural network to predict bounding box and class probability. Bounding box is the rectangular box that covers the object to identify the location of the object. This technique re-identified multi-task classification problem to regression problem.

Introduction

Prior to YOLO, the techniques like R-CNN utilized “region proposal” methods to first create potential bounding boxes then run a classifier on boxes. Then, post-processing refines the bounding boxes, delete overlapping predicitons, and rescore the boxes.

Since YOLO reframed object detection into a single regression problem by training the full images and directly optimizing detection performance, there are several benefits:

  1. YOLO is fast. With the tools that are applied in the paper, the network runs at speed of 45 frames per second. Along with the efficiency, the model attained more-than-twice average precision compared to exisiting models when it was published.

  2. YOLO bases to a global image when making decisions. The model sees the entire image and test time.

  3. YOLO learns generalizable representations of objects. YOLO is less likely to malfunction with new domains and unexpected inputs.

Indeed, the accuracy may not be remarkable compared to exisiting models when it was published.

Architecture

YOLO is a single neural network model for object detection components. There are steps that YOLO takes to detect.

The model first divides input images into $S$ x $S$ grid. If the center of an object lies inside certain grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts $B$ number of bounding boxes and confidence score of it. Confidence score is a measurement for how trustable the bounding box is for containing certain object. It is calculated as follows:

\[Pr(Object) * IOU^{truth}_{pred}\]

IOU stands for intersection over union - the ratio of intersection of actual and predicted bounding box and union of the them. If grid cell does not have any object, $Pr = 0$, and vice versa. If condifidence score is same is IOU, the score is theoretical.

Each bounding box comprises of five prediction values - $x$, $y$, $w$, $h$, and confidence score. $x$ and $y$ are for comparative object location within the grid cell. $w$ and $h$ are for the size of the bounding box. They range from 0 to 1. Then, each grid cell predicts conditional class probabilities, which is calculated as follows:

\[C(CCP) = Pr(Class_i | Object)\]

This is a conditional probability of which class that object is under the circumstance that object lies on grid cell. Regardless of number of bounding boxes, the probability is calculated for only one class for each grid cell.

In testing procedure, conditional class probability is multiplied with confidence score for each bounding box - which is known as class-specific confidence score. This score represents the probability of specific class appearing in bounding box and that predicted bounding box fits the object.

Network Design

YOLO is a fully convolutional network that uses convolutional layers to extract features from input images. The network is composed of several convolutional layers, each followed by a max-pooling layer. The network also has a few fully connected layers.

Convolutional Network

This image demonstrates how convolutional neural network is applied in the model. The final output is 7 x 7 x 30 tensor of predictions.

Training

For training, the convolutional layers are pretrained with ImageNet dataset with 1,000 classes. The first 20 convolutional layers are only applied for pretraining, then the overall layers are connected later. For the pretrained model, the accuracy marked 88% for ImageNet 2012 validation dataset.

Leaky ReLU activation function is applied at every layer except the last layer with linear activation function. The error to be minimized is the sum-squared error. However, minimizing error does not perfectly suit with the goal to increase the mean average precision.

There are two losses calculated - localization loss for how well bounding box is located and classification loss for how well the class is predicted. To adhere to the issue of most of grid cells not having an object, the losses are adjusted with two parameters. Another is to understand that smaller bounding boxes are sensitive to locational adjustments.

Loss

This is how loss is calculated.

  1. Calculate loss of $x$ and $y$ with respect to bounding box predictor $j$ of grid cell $i$ in which object exists,
  2. Calculate loss of $w$ and $h$ with respect to bounding box predictor $j$ of grid cell $i$ in which object exists. For big boxes, small deviaion is to be indicated with square roots,
  3. Calculate loss of confidence score with respect to bounding box predictor $j$ of grid cell $i$ in which object exists,
  4. Calculate loss of confidence score with respect to bounding box predictor $j$ of grid cecll $i$ in which object does not exist,
  5. Calculate loss of conditional class probability in which object exists in grid cell $i$.

$\lambda_{coord}$ is balancing parameter for coordinates, and $\lambda_{noobj}$ is balancing parameter for box with and without the object.

There are different other techniques applied including varying learning rates, dropout and random scaling.

Inference

For inference, only one calculation should be done on the convolutional layer. Indeed, there may be multiple bounding boxes. This problem can be improved with non-maximal suppression - which increased mean average precision by 2~3%.

Limitations & Discussions

YOLO predicts two bounding boxes for one grid cell with one cell predicting one object. This yields spatial constraints that the detection may not be clear when two or more objects are located next to each other in one grid cell.

In addtion, if the new aspect ratio is input that is not trained, the accuracy drops.

Furthermore, with same parameters put on regardless of size of the bounding boxes, the performance may vary for smaller bounding boxes. This is also known as localization problem.

There are current models that base their logic on this model. They are to be explained later.

The overall paper can be found here.