YOLO-v1

Posted on 2018-07-08 | Edited on 2018-08-11

In this blog post, I will introduce the YOLO (version 1) algorithm. YOLO has become a popular object detection algorithm because of its speed and simple structure.

Original paper: You Only Look Once: Unified, Real-Time Object Detection

Architecture

Pipeline:

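Divide the input image into an $S\times S$ grid of cells.
Run the whole image through a single CNN once; for each cell, the network outputs $B$ predicted bounding boxes, each with a confidence score, plus a set of class probabilities.
Compute the loss and train the network.
Use the trained model to make predictions.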
Grid Cell
First, YOLO divides the input image into an $S\times S$ grid of cells (Picture 1).
Picture 1

Next, each cell predicts $B$ bounding boxes (also called predicted boxes or bboxes). Each bounding box consists of 5 predictions: the center point coordinates $(x, y)$, the height $h$ and width $w$, and a confidence score $S_{con}$. The coordinates $(x, y)$ are relative to the bounds of the cell, while $h$ and $w$ are relative to the whole image. The original system sets $S=7$ and $B=2$. For example, in Picture 2, bbox1 (yellow) and bbox2 (blue) are the two predicted boxes of the cell (red): $(x_1, y_1, h_1, w_1)$ belong to bbox1 and $(x_2, y_2, h_2, w_2)$ belong to bbox2.
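The cell-relative convention can be illustrated by going the other way: taking a ground-truth box, finding the cell that contains its center, and expressing the center as an offset inside that cell. This is a minimal sketch; `encode_box` is a hypothetical helper, and it assumes the box center and size are given as fractions of the image size.

```python
def encode_box(cx, cy, w, h, S=7):
    """Map a ground-truth box (center and size as fractions of the image) to
    its grid cell and YOLO-style coordinates: (x, y) relative to that cell,
    (w, h) still relative to the whole image. Hypothetical helper."""
    col = min(int(cx * S), S - 1)  # clamp so cx == 1.0 stays in the last column
    row = min(int(cy * S), S - 1)
    x = cx * S - col  # offset of the center inside the cell, in [0, 1]
    y = cy * S - row
    return row, col, (x, y, w, h)


print(encode_box(0.5, 0.5, 0.3, 0.4))  # (3, 3, (0.5, 0.5, 0.3, 0.4))
```

The center of the image falls in the middle cell (row 3, column 3 of a $7\times7$ grid) at offset $(0.5, 0.5)$, while $w$ and $h$ are unchanged.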
Picture 2

The confidence score $S_{con}$ of each predicted bbox is defined as:
$$S_{con} = Pr(object)\times IOU$$
$Pr(object)$: the probability that the bbox contains an object.
$IOU$: the intersection over union between the bbox and the ground truth box. For example, if bbox1 in Picture 2 contains an object ($Pr(object)=1$), its confidence score is
$$S_{con1}=\frac{area_{shadow}}{area_{union}}\leq1$$
Picture 3
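The IoU in the confidence formula can be computed directly from two boxes. The sketch below assumes boxes are given as (x_center, y_center, w, h) in the same coordinate frame; `iou` and `confidence` are illustrative names, not functions from the paper.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_center, y_center, w, h)."""
    ax1, ay1 = a[0] - a[2] / 2, a[1] - a[3] / 2
    ax2, ay2 = a[0] + a[2] / 2, a[1] + a[3] / 2
    bx1, by1 = b[0] - b[2] / 2, b[1] - b[3] / 2
    bx2, by2 = b[0] + b[2] / 2, b[1] + b[3] / 2
    # Area of the overlap (the "shadow" region); zero if the boxes are disjoint.
    inter = max(0.0, min(ax2, bx2) - max(ax1, bx1)) * max(0.0, min(ay2, by2) - max(ay1, by1))
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def confidence(pr_object, pred_box, gt_box):
    """S_con = Pr(object) * IOU."""
    return pr_object * iou(pred_box, gt_box)


print(iou((0, 0, 2, 2), (1, 1, 2, 2)))  # 1/7, about 0.142857
```

Two $2\times2$ boxes offset by one unit overlap in a $1\times1$ square, so the IoU is $1/(4+4-1)=1/7$; since the union always covers the intersection, the result never exceeds 1.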

Each cell also predicts $C$ conditional class probabilities $Pr(class_i|object)$, one per class: the probability that the object in the cell belongs to class $i$. Only one set of class probabilities is predicted per cell, regardless of $B$.
The paper uses the PASCAL VOC dataset, so the number of classes is $C=20$.
Thus, the main job of YOLO is to predict an $S\times S\times(B\times5+C) = 7\times7\times(2\times5+20) = 7\times7\times30$ tensor with a single CNN.
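To make the tensor's contents concrete, here is a sketch that splits a flat output of $7\times7\times30=1470$ values into boxes, confidences, and class probabilities. The per-cell layout used here (B groups of $(x, y, w, h, S_{con})$ followed by C class probabilities, cells in row-major order) is an assumption for illustration; real implementations may order the values differently. `decode_output` is a hypothetical name.

```python
def decode_output(output, S=7, B=2, C=20):
    """Split a flat output of length S*S*(B*5+C) into boxes, confidences and
    class probabilities.

    ASSUMED layout (hypothetical): cells in row-major order; inside each cell,
    B groups of (x, y, w, h, confidence) followed by C class probabilities.
    """
    depth = B * 5 + C
    if len(output) != S * S * depth:
        raise ValueError(f"expected {S * S * depth} values, got {len(output)}")
    boxes, confs, probs = [], [], []
    for row in range(S):
        for col in range(S):
            start = (row * S + col) * depth
            cell = output[start:start + depth]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:b * 5 + 5]
                # (x, y) are offsets inside the cell; convert them to image fractions.
                boxes.append(((col + x) / S, (row + y) / S, w, h))
                confs.append(conf)
            probs.append(cell[B * 5:])  # one set of C probabilities per cell
    return boxes, confs, probs


boxes, confs, probs = decode_output([0.0] * (7 * 7 * 30))
print(len(boxes), len(confs), len(probs), len(probs[0]))  # 98 98 49 20
```

The counts show the structure: $7\times7\times2=98$ boxes with one confidence each, but only 49 sets of 20 class probabilities, because class probabilities belong to cells rather than boxes.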

CNN Structure

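The deep convolutional neural network used in YOLO is inspired by GoogLeNet and has 24 convolutional layers followed by 2 fully connected layers. Instead of GoogLeNet's inception modules, it uses $1\times1$ reduction layers followed by $3\times3$ convolutional layers. The details are shown in Picture 4.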
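The layer counts can be checked with a small shape walk-through. The layer list below is transcribed from the architecture figure of the original paper (448×448×3 input) and assumes "same" padding, so that only strides and 2×2 max-pooling shrink the feature map; it tracks shapes only and is not a trainable network.

```python
# ("conv", kernel, filters, stride), or ("pool",) for 2x2 max-pooling with stride 2.
def yolo_v1_layers():
    layers = [("conv", 7, 64, 2), ("pool",), ("conv", 3, 192, 1), ("pool",)]
    layers += [("conv", 1, 128, 1), ("conv", 3, 256, 1),
               ("conv", 1, 256, 1), ("conv", 3, 512, 1), ("pool",)]
    layers += [("conv", 1, 256, 1), ("conv", 3, 512, 1)] * 4  # 1x1 reduce, then 3x3
    layers += [("conv", 1, 512, 1), ("conv", 3, 1024, 1), ("pool",)]
    layers += [("conv", 1, 512, 1), ("conv", 3, 1024, 1)] * 2
    layers += [("conv", 3, 1024, 1), ("conv", 3, 1024, 2)]
    layers += [("conv", 3, 1024, 1), ("conv", 3, 1024, 1)]
    return layers


def walk_shapes(input_size=448, S=7, B=2, C=20):
    """Track the feature-map shape, assuming 'same' padding for every convolution."""
    size, channels, n_conv = input_size, 3, 0
    for layer in yolo_v1_layers():
        if layer[0] == "conv":
            _, _kernel, filters, stride = layer
            size //= stride
            channels = filters
            n_conv += 1
        else:  # 2x2 max-pooling, stride 2
            size //= 2
    fc_sizes = [4096, S * S * (B * 5 + C)]  # the two fully connected layers
    return n_conv, (size, size, channels), fc_sizes


n_conv, conv_shape, fc_sizes = walk_shapes()
print(n_conv, conv_shape, fc_sizes)  # 24 (7, 7, 1024) [4096, 1470]
```

Six halvings ($448/2^6=7$) bring the map down to the $7\times7$ grid, and the last fully connected layer has $7\times7\times30=1470$ outputs, which is exactly the prediction tensor described above.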

Picture 4. CNN Architecture

Training

Loss Computation

Picture 5. Loss Computation

The loss has three parts (referring to the lines of the formula in Picture 5):
loss = localization loss (lines 1-2) + confidence loss (lines 3-4) + classification loss (line 5)
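For reference, the five lines of Picture 5 are the sum-squared loss from the original paper [4], transcribed below. Here $\mathbb{1}_{i}^{obj}$ is 1 if an object appears in cell $i$, $\mathbb{1}_{ij}^{obj}$ is 1 if the $j$-th bbox of cell $i$ is responsible for that object ($\mathbb{1}_{ij}^{noobj}$ is its complement), $S_{con}$ is the confidence score (the paper writes $C_i$), each squared term compares a prediction with its target, and the paper sets $\lambda_{coord}=5$ and $\lambda_{noobj}=0.5$.

$$
\begin{aligned}
loss ={} & \lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
& +\lambda_{coord}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
& +\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{obj}\left(S_{con,i}-\hat{S}_{con,i}\right)^2 \\
& +\lambda_{noobj}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{noobj}\left(S_{con,i}-\hat{S}_{con,i}\right)^2 \\
& +\sum_{i=0}^{S^2}\mathbb{1}_{i}^{obj}\sum_{c\in classes}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$

Taking square roots of $w$ and $h$ makes the same absolute error count more for small boxes than for large ones, and the small $\lambda_{noobj}$ keeps the many cells without objects from overwhelming the confidence loss.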

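For the confidence loss, the targets of $Pr(object)$ and $IoU$ are:
$Pr(object)=1$ if the cell contains the center point of a ground truth box, and $Pr(object)=0$ otherwise. Within such a cell, only the bbox with the highest IoU with the ground truth is "responsible" for that object.
$IoU$: the paper uses the IoU between the predicted bbox and the ground truth box, so the confidence target is $Pr(object)\times IoU$; some implementations use a fixed target $IoU=1$ instead.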
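These targets can be tied to the loss with a per-cell sketch in plain Python. `cell_loss` is a hypothetical helper, not code from the paper or a library. For simplicity it assumes all boxes are given in one shared coordinate frame with non-negative $w$ and $h$ (the paper regresses cell-relative $x, y$), and it uses the paper's IoU confidence target; replacing `ious[r]` with `1.0` gives the fixed-target variant.

```python
import math

LAMBDA_COORD = 5.0  # weight of the localization terms (value from the paper)
LAMBDA_NOOBJ = 0.5  # weight of the no-object confidence term (value from the paper)


def iou(a, b):
    """IoU of two (x_center, y_center, w, h) boxes in the same coordinate frame."""
    inter_w = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    inter_h = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def cell_loss(pred_boxes, pred_confs, pred_probs, gt_box=None, gt_class=None):
    """Loss contribution of one grid cell (hypothetical helper).

    pred_boxes: B predicted boxes (x, y, w, h); pred_confs: their B confidences;
    pred_probs: the cell's C class probabilities; gt_box / gt_class: the ground
    truth whose center falls in this cell, or None if there is none.
    """
    if gt_box is None:
        # Pr(object) = 0: every box's confidence target is 0 (line 4 only).
        return LAMBDA_NOOBJ * sum(c ** 2 for c in pred_confs)
    ious = [iou(p, gt_box) for p in pred_boxes]
    r = max(range(len(pred_boxes)), key=lambda j: ious[j])  # responsible bbox
    (x, y, w, h), (gx, gy, gw, gh) = pred_boxes[r], gt_box
    loss = LAMBDA_COORD * ((x - gx) ** 2 + (y - gy) ** 2)  # line 1
    loss += LAMBDA_COORD * ((math.sqrt(w) - math.sqrt(gw)) ** 2
                            + (math.sqrt(h) - math.sqrt(gh)) ** 2)  # line 2
    loss += (pred_confs[r] - ious[r]) ** 2  # line 3: target is the IoU
    # Line 4: non-responsible boxes in this cell are pushed toward confidence 0.
    loss += LAMBDA_NOOBJ * sum(c ** 2 for j, c in enumerate(pred_confs) if j != r)
    # Line 5: one-hot target for the ground-truth class.
    loss += sum((p - (1.0 if k == gt_class else 0.0)) ** 2 for k, p in enumerate(pred_probs))
    return loss


print(cell_loss([(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1)], [0.2, 0.4], [0.5, 0.5]))
```

The usage line evaluates an empty cell, where only the down-weighted confidence term remains: $0.5\times(0.2^2+0.4^2)=0.1$, up to floating-point rounding.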

Prediction

In the prediction step, we first compute the class-specific confidence scores:
$$\text{class-specific confidence score} = Pr(class_i|object)\times S_{con} = Pr(class_i|object)\times Pr(object)\times IOU = Pr(class_i)\times IOU$$
In other words, the class-specific confidence score is the box confidence score multiplied by the class probability. Multiplying each box's confidence score by each of its cell's 20 class probabilities gives $20\times(7\times7\times2)=20\times98$ class-specific confidence scores.
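The multiplication can be sketched in a few lines of Python. The function name and the input ordering (box confidences listed cell by cell, $B$ per cell) are assumptions for illustration; the point is that both boxes of a cell share that cell's class probabilities.

```python
def class_specific_scores(box_confs, class_probs, B=2):
    """Multiply every box confidence by every class probability of its cell.

    box_confs:   S*S*B box confidence scores, listed cell by cell (B per cell).
    class_probs: S*S rows, each holding the C values Pr(class_i | object) of one cell.
    Returns S*S*B rows of C class-specific confidence scores.
    """
    return [[conf * p for p in class_probs[k // B]]  # box k belongs to cell k // B
            for k, conf in enumerate(box_confs)]


S, B, C = 7, 2, 20
scores = class_specific_scores([0.5] * (S * S * B), [[0.05] * C] * (S * S), B)
print(len(scores), len(scores[0]))  # 98 20
```

The result is the $98\times20$ table of scores; each column (one class) is what the thresholding and NMS step works on.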
Next, for each class, set the scores below a threshold to zero and run the non-maximum suppression (NMS) algorithm to remove redundant bboxes (see reference [2] for details of prediction and NMS).
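Thresholding plus NMS for one class can be sketched as a standard greedy NMS. The function name and both threshold defaults (0.2 for scores, 0.5 for IoU) are illustrative assumptions, not values taken from the paper.

```python
def iou(a, b):
    """IoU of two (x_center, y_center, w, h) boxes."""
    inter_w = max(0.0, min(a[0] + a[2] / 2, b[0] + b[2] / 2) - max(a[0] - a[2] / 2, b[0] - b[2] / 2))
    inter_h = max(0.0, min(a[1] + a[3] / 2, b[1] + b[3] / 2) - max(a[1] - a[3] / 2, b[1] - b[3] / 2))
    inter = inter_w * inter_h
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0


def nms_one_class(boxes, scores, score_threshold=0.2, iou_threshold=0.5):
    """Greedy NMS for a single class; returns indices of kept boxes, best first.
    Both thresholds are illustrative defaults (assumptions)."""
    # Scores below the threshold count as zero, so those boxes are dropped.
    candidates = sorted((i for i, s in enumerate(scores) if s >= score_threshold),
                        key=lambda i: scores[i], reverse=True)
    keep = []
    while candidates:
        best = candidates.pop(0)  # highest remaining score
        keep.append(best)
        # Suppress remaining boxes that overlap the kept box too much.
        candidates = [i for i in candidates if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0.5, 0.5, 0.2, 0.2), (0.51, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1)]
print(nms_one_class(boxes, [0.9, 0.8, 0.7]))  # [0, 2]
```

Box 1 nearly coincides with box 0 (IoU about 0.9) and is suppressed, while box 2 does not overlap anything and survives. In YOLO this would be repeated for each of the 20 classes.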

References

[1] https://medium.com/@jonathan_hui/real-time-object-detection-with-yolo-yolov2-28b1b93e2088
[2] https://docs.google.com/presentation/d/1aeRvtKG21KHdD5lg6Hgyhx5rPq_ZOsGjG5rJ1HP7BbA/pub?start=false&loop=false&delayms=3000&slide=id.g137784ab86_4_2069
[3] https://blog.csdn.net/u014380165/article/details/72616238
[4] https://arxiv.org/abs/1506.02640

Database Collection

Posted on 2018-07-03 | Edited on 2018-07-05

In this article, I will collect some useful databases for Image Classification, Segmentation and Object Detection.

FAMOUS

CIFAR
CIFAR-10: 60000 color images (32x32) in 10 classes (6000 images per class). 50000 training images and 10000 test images.
CIFAR-100: 100 classes and 600 images each. 500 training images and 100 testing images per class. The 100 classes in the CIFAR-100 are grouped into 20 superclasses. Each image comes with a “fine” label (the class to which it belongs) and a “coarse” label (the superclass to which it belongs).
MNIST
The MNIST database contains handwritten digits, with 60,000 training images and 10,000 test images.
ImageNet
ImageNet is an image database organized according to the WordNet hierarchy. It has over 20,000 classes and more than 14 million images in total.
SVHN (Street View House Numbers)

SVHN is a real-world image dataset for developing machine learning and object recognition algorithms with minimal requirements on data preprocessing and formatting. It can be seen as similar in flavor to MNIST (e.g., the images are of small cropped digits), but incorporates an order of magnitude more labeled data (over 600,000 digit images) and comes from a significantly harder, unsolved, real-world problem (recognizing digits and numbers in natural scene images). SVHN is obtained from house numbers in Google Street View images.
STL-10
STL-10 has 10 classes, with 500 training images (10 pre-defined folds) and 800 test images per class. This database can be used to develop unsupervised feature learning, deep learning, and self-taught learning algorithms.
PASCAL VOC2012 dataset
Face
Labeled Faces in the Wild (LFW): http://vis-www.cs.umass.edu/lfw/
CUHK Face Alignment Database: http://mmlab.ie.cuhk.edu.hk/archive/CNN_FacePoint.htm
Annotated Facial Landmarks in the Wild: https://www.tugraz.at/institute/icg/research/team-bischof/lrs/downloads/aflw/#database
Multi-Task Facial Landmark: https://www.safaribooksonline.com/library/view/deep-learning-for/9781788295628/001f4f21-ab6e-48e1-8623-8a0ec35fcce9.xhtml
Large-scale CelebFaces Attributes: http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html
CBCL Face Database: http://cbcl.mit.edu/software-datasets/FaceData2.html
Pedestrian
CUHK Occlusion Dataset: http://mmlab.ie.cuhk.edu.hk/datasets/cuhk_occlusion/index.html
CUHK Person Re-identification Datasets: http://www.ee.cuhk.edu.hk/~xgwang/CUHK_identification.html
CBCL Pedestrian Database: http://cbcl.mit.edu/software-datasets/PedestrianData.html
INRIA Person Dataset: http://pascal.inrialpes.fr/data/human/
Fashion
Large-scale Fashion (DeepFashion) Database: http://mmlab.ie.cuhk.edu.hk/projects/DeepFashion.html
Color-Fashion Dataset: https://sites.google.com/site/fashionparsing/dataset
Car
CBCL Car Database: http://cbcl.mit.edu/software-datasets/CarData.html
INRIA Car Data Set:
Toyota Motor Europe (TME) Motorway Dataset: http://cmp.felk.cvut.cz/data/motorway/

Updating……

Hello World

Posted on 2018-05-29 | Edited on 2018-07-03 | In test

Welcome to Hexo! This is your very first post. Check documentation for more info. If you run into any problems when using Hexo, you can find the answer in troubleshooting or you can ask me on GitHub.

Quick Start

Create a new post

$ hexo new "My New Post"

More info: Writing

Run server

$ hexo server

More info: Server

Generate static files

$ hexo generate

More info: Generating

Deploy to remote sites

$ hexo deploy

More info: Deployment