You Only Look Once Explained

Authors: Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
Released: 2015
Programming language: Python

You Only Look Once (YOLO) is a series of real-time object detection systems based on convolutional neural networks. First introduced by Joseph Redmon et al. in 2015,[1] YOLO has undergone several iterations and improvements, becoming one of the most popular object detection frameworks.[2]

The name "You Only Look Once" refers to the fact that the algorithm requires only a single forward pass through the neural network to make its predictions, unlike earlier region-proposal techniques such as R-CNN, which require thousands of network evaluations for a single image.

Overview

Unlike previous methods such as R-CNN and OverFeat, which apply a model to an image at multiple locations and scales, YOLO applies a single neural network to the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region; these bounding boxes are weighted by the predicted probabilities.
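
As a concrete illustration of this one-pass design, the sketch below decodes a single output tensor into class-weighted box scores. It is a minimal sketch, not the authors' implementation; the tensor layout (C class probabilities per cell followed by B boxes of 5 values each) and the dimensions S = 7, B = 2, C = 20 are assumptions taken from the YOLOv1 description later in this article.

```python
import numpy as np

# Assumed dimensions: an S x S grid, B boxes per cell, C classes (YOLOv1 defaults).
S, B, C = 7, 2, 20

# Stand-in for the result of one forward pass of the network.
# Assumed layout per cell: C class probabilities, then B * (confidence, x, y, w, h).
output = np.random.rand(S, S, C + B * 5)

class_probs = output[..., :C]                       # (S, S, C)
box_fields = output[..., C:].reshape(S, S, B, 5)    # (S, S, B, 5)
confidences = box_fields[..., 0]                    # (S, S, B)

# Weight each box's confidence by its cell's class probabilities,
# giving one score per (cell, box, class) from a single forward pass.
scores = confidences[..., None] * class_probs[:, :, None, :]  # (S, S, B, C)
print(scores.shape)  # (7, 7, 2, 20)
```

In practice a detector would then threshold these scores and apply non-maximum suppression to produce the final detections.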

OverFeat

OverFeat was an early, influential model for simultaneous object classification and localization. For localization, its network regresses the (x, y) coordinates of two corners of the object's bounding box.

Versions

The YOLO series has two parts. The original part comprises YOLOv1, v2, and v3, all released on a website maintained by Joseph Redmon.[3]

YOLOv1

The original YOLO algorithm, introduced in 2015, divides the image into an S × S grid of cells. If the center of an object's bounding box falls into a grid cell, that cell is said to "contain" the object. Each grid cell predicts B bounding boxes along with confidence scores for those boxes; a confidence score reflects both how confident the model is that the box contains an object and how accurate it thinks the predicted box is.
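
A minimal sketch of this cell-assignment rule, assuming box centers given in coordinates normalized to [0, 1) (the helper name is hypothetical):

```python
def responsible_cell(center_x, center_y, S):
    """Return the (row, col) of the grid cell that "contains" an object,
    i.e. the cell into which the object's bounding-box center falls."""
    col = min(int(center_x * S), S - 1)  # clamp in case the center is exactly 1.0
    row = min(int(center_y * S), S - 1)
    return row, col

# Example: with S = 7, a box centered at (0.5, 0.32) falls in cell (2, 3).
print(responsible_cell(0.5, 0.32, S=7))
```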

In more detail, the network performs the same convolutional operation over each of the S^2 patches. The output of the network on each patch is a tuple (p_1, \dots, p_C, c_1, x_1, y_1, w_1, h_1, \dots, c_B, x_B, y_B, w_B, h_B), where:

- p_i is the conditional probability that the cell contains an object of class i, given that the cell contains at least one object;
- x_j, y_j, w_j, h_j are the center coordinates, width, and height of the j-th predicted bounding box centered in the cell. Multiple bounding boxes are predicted so that each prediction can specialize in one kind of bounding box: for example, slender objects might be predicted by j = 2 while stout objects might be predicted by j = 1;
- c_j is the predicted intersection over union (IoU) of the j-th bounding box with its corresponding ground truth.

The network architecture has 24 convolutional layers followed by 2 fully connected layers.
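
To make the tuple layout concrete, here is a short sketch that unpacks one cell's output into its named parts. It is illustrative only; the values of C and B match the original paper, but the variable names are hypothetical.

```python
import numpy as np

C, B = 20, 2  # classes and boxes per cell, as in the original paper

# One cell's output tuple: (p_1, ..., p_C, c_1, x_1, y_1, w_1, h_1, ..., h_B).
cell = np.random.rand(C + B * 5)

class_probs = cell[:C]  # p_1 ... p_C
boxes = []
for j in range(B):
    c_j, x_j, y_j, w_j, h_j = cell[C + 5 * j : C + 5 * (j + 1)]
    boxes.append({"iou_pred": c_j, "x": x_j, "y": y_j, "w": w_j, "h": h_j})
```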

During training, for each cell that contains a ground truth bounding box, only the predicted bounding box with the highest IoU against that ground truth is used for gradient descent. Concretely, let j be that predicted bounding box and let i be the ground truth class label; then x_j, y_j, w_j, h_j are trained by gradient descent to approach the ground truth, p_i is trained towards 1, and the other p_{i'} are trained towards zero.

If a cell contains no ground truth, then only c_1, c_2, \dots, c_B are trained by gradient descent to approach zero.
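
The responsible-box selection can be sketched as follows. This is a simplified illustration, assuming boxes in (x, y, w, h) center format with normalized coordinates; the function names are hypothetical:

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, w, h) center-format boxes."""
    ax1, ay1 = box_a[0] - box_a[2] / 2, box_a[1] - box_a[3] / 2
    ax2, ay2 = box_a[0] + box_a[2] / 2, box_a[1] + box_a[3] / 2
    bx1, by1 = box_b[0] - box_b[2] / 2, box_b[1] - box_b[3] / 2
    bx2, by2 = box_b[0] + box_b[2] / 2, box_b[1] + box_b[3] / 2
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))   # intersection width
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))   # intersection height
    inter = iw * ih
    union = box_a[2] * box_a[3] + box_b[2] * box_b[3] - inter
    return inter / union if union > 0 else 0.0

def responsible_box(predicted_boxes, ground_truth):
    """Index j of the prediction with the highest IoU against the ground
    truth; only this box's coordinates receive coordinate gradients."""
    ious = [iou(b, ground_truth) for b in predicted_boxes]
    return max(range(len(ious)), key=ious.__getitem__)

# A slender ground truth picks the slender prediction (index 0).
preds = [(0.5, 0.5, 0.2, 0.4), (0.5, 0.5, 0.4, 0.2)]
print(responsible_box(preds, (0.5, 0.5, 0.18, 0.42)))  # 0
```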

YOLOv2

Released in 2016, YOLOv2 (also known as YOLO9000)[4][5] improved upon the original model by incorporating batch normalization, a higher-resolution classifier, and anchor boxes for predicting bounding boxes. It could detect over 9,000 object categories. It was also released on GitHub under the Apache 2.0 license.
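
To illustrate the anchor-box mechanism, the sketch below decodes raw network offsets into a box following the parameterization described in the YOLOv2 paper (sigmoid-constrained center offsets within a grid cell, anchor dimensions scaled exponentially). The function and variable names are illustrative, not from the released code:

```python
import math

def decode_anchor_box(tx, ty, tw, th, col, row, anchor_w, anchor_h, S):
    """Decode YOLOv2-style offsets (tx, ty, tw, th) into a normalized
    (x, y, w, h) box, relative to grid cell (row, col) and an anchor prior."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    x = (col + sigmoid(tx)) / S   # sigmoid keeps the center inside its cell
    y = (row + sigmoid(ty)) / S
    w = anchor_w * math.exp(tw)   # anchor prior scaled exponentially
    h = anchor_h * math.exp(th)
    return x, y, w, h
```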

YOLOv3

YOLOv3, introduced in 2018, contained only "incremental" improvements, including the use of a more complex backbone network, multiple scales for detection, and a more sophisticated loss function.[6]

YOLOv4 and beyond

Subsequent versions of YOLO (v4, v5, etc.)[7][8][9][10] have been developed by different researchers, further improving performance and introducing new features. These versions are not officially associated with the original YOLO authors but build upon their work; as of 2023, versions up to YOLOv8 had been released.[2]

Notes and References

  1. Redmon, Joseph; Divvala, Santosh; Girshick, Ross; Farhadi, Ali (2016-05-09). "You Only Look Once: Unified, Real-Time Object Detection". arXiv:1506.02640 [cs.CV].
  2. Terven, Juan; Córdova-Esparza, Diana-Margarita; Romero-González, Julio-Alejandro (2023-11-20). "A Comprehensive Review of YOLO Architectures in Computer Vision: From YOLOv1 to YOLOv8 and YOLO-NAS". Machine Learning and Knowledge Extraction. 5 (4): 1680–1716. doi:10.3390/make5040083. arXiv:2304.00501. ISSN 2504-4990.
  3. "YOLO: Real-Time Object Detection". pjreddie.com. Retrieved 2024-09-12.
  4. Redmon, Joseph; Farhadi, Ali (2016-12-25). "YOLO9000: Better, Faster, Stronger". arXiv:1612.08242 [cs.CV].
  5. "YOLOv2: Real-Time Object Detection". pjreddie.com. Retrieved 2024-09-12.
  6. Redmon, Joseph; Farhadi, Ali (2018-04-08). "YOLOv3: An Incremental Improvement". arXiv:1804.02767 [cs.CV].
  7. Bochkovskiy, Alexey; Wang, Chien-Yao; Liao, Hong-Yuan Mark (2020-04-22). "YOLOv4: Optimal Speed and Accuracy of Object Detection". arXiv:2004.10934 [cs.CV].
  8. Wang, Chien-Yao; Bochkovskiy, Alexey; Liao, Hong-Yuan Mark (2021-02-21). "Scaled-YOLOv4: Scaling Cross Stage Partial Network". arXiv:2011.08036 [cs.CV].
  9. Li, Chuyi; Li, Lulu; Jiang, Hongliang; Weng, Kaiheng; Geng, Yifei; Li, Liang; Ke, Zaidan; Li, Qingyuan; Cheng, Meng; Nie, Weiqiang; Li, Yiduo; Zhang, Bo; Liang, Yufei; Zhou, Linyuan; Xu, Xiaoming (2022-09-07). "YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications". arXiv:2209.02976 [cs.CV].
  10. Wang, Chien-Yao; Bochkovskiy, Alexey; Liao, Hong-Yuan Mark (2022-07-06). "YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors". arXiv:2207.02696 [cs.CV].