We present YOLO, a new approach to object detection. Prior work on object detection repurposes classifiers to perform detection. Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.
In short, this paper proposes a new object detection method, YOLO. Prior detection work repurposes classifiers, with minor modifications, to perform detection, whereas this work frames bounding box prediction as a regression problem over spatially separated boxes together with associated class probabilities. One thing I did not fully understand: is the "prior work" aimed at Faster R-CNN or at other models? Faster R-CNN also splits bounding box prediction into a regression task and a classification task, just across two stages. The abstract then says YOLO is a single neural network that predicts both the bounding boxes and their classes on its own, with no auxiliary models; the whole pipeline is one model, and because there is only one model it can be optimized end-to-end directly on detection performance.
Our unified architecture is extremely fast. Our base YOLO model processes images in real-time at 45 frames per second. A smaller version of the network, Fast YOLO, processes an astounding 155 frames per second while still achieving double the mAP of other real-time detectors. Compared to state-of-the-art detection systems, YOLO makes more localization errors but is less likely to predict false positives on background. Finally, YOLO learns very general representations of objects. It outperforms other detection methods, including DPM and R-CNN, when generalizing from natural images to other domains like artwork.
This paragraph describes what YOLO achieves. It is extremely fast, reaching 45 FPS, which was remarkable at the time, when Faster R-CNN topped out at roughly 17 FPS with its fastest backbone. Detection quality is also good: Fast YOLO reaches double the mAP of other real-time detectors. That comparison is presumably not aimed at Faster R-CNN, whose accuracy was still quite high at the time; one-stage models did not surpass two-stage models in accuracy until RetinaNet appeared.
Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image.
This explains the "repurpose classifiers" phrase from the abstract: to detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in a test image. Models like DPM use a sliding-window approach, running the classifier at evenly spaced locations over the entire image. This approach is bound to be very inefficient.
More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each individual component must be trained separately.
Here the R-CNN family comes up again. R-CNN uses region proposal methods to first generate potential bounding boxes in the image, then runs classification and box regression on those proposals. Post-processing then refines the predicted boxes, eliminates duplicate detections, and rescores the boxes based on other objects in the scene. So there are two steps: first generate proposals from candidate regions, then predict on those proposals. The drawback of the R-CNN family is obvious: the pipeline consists of multiple models or stages, each of which must be trained separately, so the system as a whole is slow and hard to optimize.
We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.
Here is YOLO's contribution: it recasts object detection from a two-stage approach into a one-stage one, redefining it as a single regression problem that maps straight from image pixels to bounding box coordinates and class probabilities. (Note that it predicts class probabilities with a regression loss, without a softmax, which is why this is framed as regression rather than classification.)
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes. YOLO trains on full images and directly optimizes detection performance.
YOLO uses a single neural network to predict multiple bounding boxes and the class probabilities for those boxes. It trains on full images and directly optimizes detection performance, with no need to jointly tune multiple models the way Faster R-CNN does.
As the figure above shows, using YOLO is very simple: 1) resize the input image to 448×448; 2) feed it through a single convolutional network; 3) threshold the resulting detections by the model's confidence. A minimal sketch of these three steps follows.
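This sketch assumes a hypothetical trained `yolo_model` mapping a 448×448 input to a (7, 7, 30) tensor, with the first box's confidence at an assumed channel index; it is an illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

def detect(image, yolo_model, conf_threshold=0.2):
    """Three-step YOLO usage: resize, one forward pass, threshold."""
    # 1) Resize the input image (CHW tensor) to the fixed 448x448 resolution.
    x = F.interpolate(image.unsqueeze(0), size=(448, 448),
                      mode="bilinear", align_corners=False)
    # 2) Run a single convolutional network once.
    pred = yolo_model(x)[0]            # assumed output shape: (7, 7, 30)
    # 3) Keep only detections whose confidence clears the threshold.
    conf = pred[..., 4]                # confidence of the first box per cell
    return pred, conf > conf_threshold
```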
YOLO is extremely fast. Since we frame detection as a regression problem we don't need a complex pipeline. We simply run our neural network on a new image at test time to predict detections. Our base network runs at 45 frames per second with no batch processing on a Titan X GPU and a fast version runs at more than 150 fps.
The first advantage of YOLO is speed. Because both the boxes and the classes are framed as regression, no complicated pipeline is needed. At test time the authors simply run the network on new images, and the base model reaches 45 FPS, fast enough to run detection directly on video.
YOLO reasons globally about the image when making predictions. Unlike sliding window and region proposal-based techniques, YOLO sees the entire image during training and test time so it implicitly encodes contextual information about classes as well as their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects because it can’t see the larger context. YOLO makes less than half the number of background errors compared to Fast R-CNN.
The second advantage: unlike sliding-window and region-proposal-based techniques, YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about classes and their appearance. Fast R-CNN, a top detection method at the time, mistakes background patches for objects because it only looks at local regions and cannot see the larger context. YOLO makes less than half as many background errors as Fast R-CNN.
YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods like DPM and R-CNN by a wide margin.
The third advantage is generalization: YOLO performs well whether detecting in natural images or in artwork, generalizing far better than the other detectors of its time. So when applied to new domains or unexpected inputs, YOLO remains robust and is unlikely to break down.
YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images it struggles to precisely localize some objects, especially small ones.
Finally, the authors acknowledge YOLO's weaknesses: it still lags behind state-of-the-art detection systems in accuracy, and although it identifies objects quickly, it struggles to precisely localize some objects, especially small ones.
We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box. It also predicts all bounding boxes across all classes for an image simultaneously. This means our network reasons globally about the full image and all the objects in the image. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.
This section begins the concrete network design. YOLO unifies the multiple stages and components of earlier detectors into a single neural network. It uses features from the entire image to predict each bounding box, and it predicts all boxes across all classes simultaneously, meaning the network reasons globally about the full image and every object in it. This design enables end-to-end training at real-time speeds while maintaining high average precision.
Our system divides the input image into an S×S grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.
Here the actual model design starts: ==YOLO divides the input image into an S×S grid==. If the center of an object falls into a grid cell, that cell is responsible for detecting the object.
Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the box is that it predicts.
==Each grid cell predicts B bounding boxes and confidence scores for those boxes==. The confidence score reflects both how likely the trained model thinks the box is to contain an object and how accurate it thinks the predicted box is. Put simply: the higher this value, the more likely the box contains an object; the lower it is, the less likely.
Formally we define confidence as $\Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}}$. If no object exists in that cell, the confidence scores should be zero. Otherwise we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth.
The authors formally define confidence with the formula above. $\Pr(\text{Object})$ is the probability that the box contains an object (0 or 1 during training), and $\text{IOU}^{\text{truth}}_{\text{pred}}$ is the intersection over union between the predicted box and the ground-truth box; their product is the box's confidence score. Why design it this way? If no object falls into the cell, meaning the cell's boxes have nothing to detect, then $\Pr(\text{Object}) = 0$ and the confidence target is 0. If an object does fall into the cell, $\Pr(\text{Object}) = 1$ and the confidence target equals the IOU between the predicted box and the ground-truth box.
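A minimal sketch of how this confidence target could be computed; the helper names are mine and boxes are given as corner coordinates:

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def confidence_target(pred_box, gt_box, object_in_cell):
    """Pr(Object) * IOU: zero for empty cells, IOU with the GT otherwise."""
    return iou(pred_box, gt_box) if object_in_cell else 0.0
```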
Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are predicted relative to the whole image. Finally the confidence prediction represents the IOU between the predicted box and any ground truth box.
Each bounding box predicts five values: x, y, w, h, and confidence. The (x, y) coordinates represent the box center ==relative to the bounds of the grid cell it falls in== (i.e., the center is expressed relative to the cell, not the whole image, and always lies inside that cell), while the width and height are predicted relative to the whole image. Finally, the confidence prediction represents the IOU between the predicted box and the ground-truth box.
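A sketch of this parametrization, converting an absolute center-format ground-truth box into YOLO targets; the function name and conventions are my assumptions:

```python
def encode_box(gt_box, img_w, img_h, S=7):
    """Convert an absolute (cx, cy, w, h) ground-truth box into YOLO targets.

    Returns the responsible cell (row, col), the (x, y) center offsets
    relative to that cell, and (w, h) normalized by the whole image.
    """
    cx, cy, w, h = gt_box
    col = min(int(cx / img_w * S), S - 1)   # which cell the center falls into
    row = min(int(cy / img_h * S), S - 1)
    x = cx / img_w * S - col                # offset inside the cell, in [0, 1)
    y = cy / img_h * S - row
    return (row, col), (x, y), (w / img_w, h / img_h)
```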
Each grid cell also predicts C conditional class probabilities, Pr(Classi|Object). These probabilities are conditioned on the grid cell containing an object.
Each grid cell also predicts C conditional class probabilities $\Pr(\text{Class}_i \mid \text{Object})$, i.e., the class probabilities of the object whose center falls into that cell. These probabilities are conditioned on the cell containing an object: they are class probabilities given that an object is present.
We only predict one set of class probabilities per grid cell, regardless of the number of boxes B.
Each grid cell predicts only one set of class probabilities, regardless of the number of boxes B that it predicts.
At test time we multiply the conditional class probabilities and the individual box confidence predictions,
$$\Pr(\text{Class}_i \mid \text{Object}) * \Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}} = \Pr(\text{Class}_i) * \text{IOU}^{\text{truth}}_{\text{pred}}$$
which gives us class-specific confidence scores for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object.
At test time, each cell's set of conditional class probabilities $\Pr(\text{Class}_i \mid \text{Object})$ is multiplied by each of that cell's box confidences $\Pr(\text{Object}) * \text{IOU}^{\text{truth}}_{\text{pred}}$, giving a class-specific confidence score $\Pr(\text{Class}_i) * \text{IOU}^{\text{truth}}_{\text{pred}}$ for every box. These scores encode both the probability of the class appearing in the box and how well the predicted box fits the object.
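In tensor form, this test-time product could look like the following; shapes match the paper's S, B, C, but the implementation is mine:

```python
import numpy as np

def class_specific_scores(class_probs, box_confidences):
    """Pr(Class_i | Object) * (Pr(Object) * IOU) = Pr(Class_i) * IOU.

    class_probs:      (S, S, C)    one set of class probabilities per cell
    box_confidences:  (S, S, B)    one confidence per predicted box
    returns:          (S, S, B, C) class-specific score for every box
    """
    return box_confidences[..., :, None] * class_probs[..., None, :]
```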
Our system models detection as a regression problem. It divides the image into an S × S grid and for each grid cell predicts B bounding boxes, confidence for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor.
To summarize the section so far: the paper frames object detection as a regression problem. The image is divided into an S × S grid, and each cell predicts B bounding boxes, confidences for those boxes, and C class probabilities. These predictions are encoded as an S × S × (B ∗ 5 + C) tensor, where S is the grid size, B is the number of boxes each cell predicts, 5 = 4 + 1 (the four coordinates x, y, w, h plus the box confidence), and C is the number of class probabilities per cell.
On PASCAL VOC the authors use S=7 and B=2, and VOC has 20 classes, so the final prediction is a 7×7×30 tensor, as illustrated below.
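A quick sanity check of those numbers; the per-cell ordering shown in the comment is just one plausible layout, not prescribed by the paper:

```python
S, B, C = 7, 2, 20                      # PASCAL VOC settings from the paper
depth = B * 5 + C                       # 2*5 + 20 = 30 values per cell
print(S * S, "cells,", S * S * B, "boxes,", (S, S, depth), "output tensor")
# One plausible per-cell layout (ordering is an implementation choice):
# [x1, y1, w1, h1, conf1, x2, y2, w2, h2, conf2, p(class_0), ..., p(class_19)]
```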
The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.
The evaluation is done on PASCAL VOC. YOLO's initial convolutional layers extract features from the image, and the fully connected layers predict the output probabilities and coordinates.
Our network architecture is inspired by the GoogLeNet model for image classification. Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the inception modules used by GoogLeNet, we simply use 1×1 reduction layers followed by 3×3 convolutional layers.
The authors are explicit here: YOLO's architecture is inspired by GoogLeNet but differs from it. YOLO has 24 convolutional layers followed by 2 fully connected layers, and instead of GoogLeNet's Inception modules it simply uses 1×1 reduction layers followed by 3×3 convolutional layers. The figure below shows each layer's kernel size, channel count, and pooling size; for example, the second stage consists of a 3×3 convolution and a 2×2 pooling layer. Note that after the last convolutional layer we get a 7×7×1024 feature map. This feature map is first flattened into a vector of length 7×7×1024 = 50,176, compressed by a fully connected layer to 4096 dimensions, and then by another fully connected layer to 1470 dimensions. Why 1470? Because the final predictions of classes and bounding boxes live on a 7×7×30 grid, and 7×7×30 = 1470. This vector is then reshaped into a 7×7×30 tensor, and the forward pass is complete; predictions are read off that final tensor.
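A heavily abbreviated PyTorch sketch of the tail of this architecture; the real network has 24 convolutional layers, so only the final 1×1-reduction/3×3 pattern and the fully connected head are shown, and the layer sizes in the stand-in block are assumptions:

```python
import torch
import torch.nn as nn

class TinyYOLOHead(nn.Module):
    """Abbreviated sketch: 1x1 reduction + 3x3 conv ending in a 7x7x1024
    feature map, then FC-4096 -> FC-1470 -> reshape to (7, 7, 30)."""
    def __init__(self, S=7, B=2, C=20):
        super().__init__()
        self.tail = nn.Sequential(                  # stand-in for layers 21-24
            nn.Conv2d(1024, 512, kernel_size=1),    # 1x1 reduction layer
            nn.LeakyReLU(0.1),
            nn.Conv2d(512, 1024, kernel_size=3, padding=1),
            nn.LeakyReLU(0.1),
        )
        self.fc = nn.Sequential(
            nn.Flatten(),                           # 7*7*1024 = 50176
            nn.Linear(S * S * 1024, 4096),
            nn.LeakyReLU(0.1),
            nn.Linear(4096, S * S * (B * 5 + C)),   # 1470, linear output
        )
        self.S, self.B, self.C = S, B, C

    def forward(self, feat):                        # feat: (N, 1024, 7, 7)
        out = self.fc(self.tail(feat))
        return out.view(-1, self.S, self.S, self.B * 5 + self.C)
```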
We also train a fast version of YOLO designed to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24) and fewer filters in those layers. Other than the size of the network, all training and testing parameters are the same between YOLO and Fast YOLO.
Besides YOLO, the paper also proposes a smaller model called Fast YOLO, which uses only 9 convolutional layers (instead of 24) and fewer filters in those layers. Apart from the network size, all parameters are the same between the two.
We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224×224 input image) and then double the resolution for detection.
At the end of this subsection, the authors note that YOLO's backbone is pretrained on ImageNet at 224×224 resolution, and the resolution is then doubled for detection.
We pretrain our convolutional layers on the ImageNet 1000-class competition dataset. For pretraining we use the first 20 convolutional layers from Figure 3 followed by an average-pooling layer and a fully connected layer. We use the Darknet framework for all training and inference.
As mentioned above, the backbone is pretrained on the 1000-class ImageNet dataset. Unlike the detection setup, pretraining uses only the first 20 convolutional layers from the architecture figure (not all 24), followed by an average-pooling layer and a fully connected layer; in other words, during pretraining YOLO is trained purely as a classification model. The authors report that this backbone reaches 88% top-5 accuracy on ImageNet. Finally, they note that all training and inference is done in Darknet, the authors' own framework.
We then convert the model to perform detection. Ren et al. show that adding both convolutional and connected layers to pretrained networks can improve performance. Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information so we increase the input resolution of the network from 224×224 to 448×448.
The pretrained model is then fine-tuned for detection. Citing a paper by Ren et al. showing that adding both convolutional and fully connected layers to pretrained networks can improve performance, the authors add four convolutional layers and two fully connected layers with randomly initialized weights. And because detection often requires fine-grained visual information, the input resolution is doubled from 224×224 to 448×448.
Our final layer predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parametrize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.
YOLO's final layer predicts both the class probabilities and the bounding box coordinates. To keep the targets well scaled, the box width and height are normalized by the image width and height so they fall between 0 and 1, and the box x and y coordinates are parametrized as offsets of a particular grid cell location, so they are also bounded between 0 and 1.
We use a linear activation function for the final layer and all other layers use the following leaky rectified linear activation:
$$\phi(x)=\begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}$$

All of YOLO's layers use this leaky ReLU activation, except the final layer, which uses a plain linear activation.
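In code, this activation is trivial; a NumPy version:

```python
import numpy as np

def leaky_relu(x, slope=0.1):
    """The paper's leaky rectified linear activation: x if x > 0, else 0.1 x."""
    return np.where(x > 0, x, slope * x)
```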
We optimize for sum-squared error in the output of our model. We use sum-squared error because it is easy to optimize, however it does not perfectly align with our goal of maximizing average precision.
YOLO optimizes the sum-squared error of the model's output. Sum-squared error is used because it is easy to optimize, but it does not align perfectly with the goal of maximizing average precision.
It weights localization error equally with classification error which may not be ideal. Also, in every image many grid cells do not contain any object. This pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on.
Sum-squared error weights localization error equally with classification error, which is not ideal. Moreover, in every image many grid cells contain no object at all. This pushes those cells' confidence scores toward zero, often overpowering the gradient from the cells that do contain objects, which can make the model unstable and cause training to diverge early on.
To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj to accomplish this. We set λcoord=5 and λnoobj=0.5.
To remedy this, the authors increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that contain no object, using two parameters, λcoord and λnoobj, set to 5 and 0.5 respectively.
Sum-squared error also equally weights errors in large boxes and small boxes. Our error metric should reflect that small deviations in large boxes matter less than in small boxes. To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.
Sum-squared error also weights errors in large and small boxes equally, but the error metric should reflect that a small deviation matters less in a large box than in a small one; with plain sum-squared error these deviations end up weighted the same. To partially address this, YOLO predicts the square root of the box width and height instead of the width and height directly, which makes the originally unbalanced deviations more balanced.
Intuitively, for boxes of different sizes, the same offset is far less tolerable for a small box than for a large one, because the small box is small to begin with. Suppose a box is 4×4 and is shifted 2 units left of the ground truth: the offset is 1/2 of its own size, whereas a 16×16 box shifted by the same 2 units is only off by 1/8. So small boxes suffer more from the same offset (see the figure above). After taking square roots, the gap between small-box and large-box losses is no longer so large.
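A quick numeric check of this effect:

```python
import math

# Same 2-unit width error on a small (4-wide) and a large (16-wide) box:
for w_true, w_pred in [(4, 6), (16, 18)]:
    plain = (w_true - w_pred) ** 2
    rooted = (math.sqrt(w_true) - math.sqrt(w_pred)) ** 2
    print(f"w={w_true}: plain SSE={plain:.2f}, sqrt SSE={rooted:.3f}")
# Plain SSE is 4.00 in both cases; with square roots the small box's
# error (~0.202) is penalized more than the large box's (~0.059).
```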
YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of object, improving overall recall.
YOLO predicts multiple bounding boxes per grid cell, but at training time we only want one box predictor to be responsible for each object. Concretely, the box with the highest current IOU with the ground-truth box is made "responsible" for predicting it (and for the loss). This leads to specialization among the box predictors: each one gets better at predicting certain sizes, aspect ratios, or classes of objects, improving overall recall.
During training we optimize the following, multi-part loss function:
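For reference, the multi-part loss from the paper is:

$$
\begin{aligned}
&\lambda_{\text{coord}}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\text{obj}}\left[(x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2\right] \\
+\;&\lambda_{\text{coord}}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\text{obj}}\left[\left(\sqrt{w_i}-\sqrt{\hat{w}_i}\right)^2+\left(\sqrt{h_i}-\sqrt{\hat{h}_i}\right)^2\right] \\
+\;&\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\text{obj}}\left(C_i-\hat{C}_i\right)^2
+\lambda_{\text{noobj}}\sum_{i=0}^{S^2}\sum_{j=0}^{B}\mathbb{1}_{ij}^{\text{noobj}}\left(C_i-\hat{C}_i\right)^2 \\
+\;&\sum_{i=0}^{S^2}\mathbb{1}_{i}^{\text{obj}}\sum_{c\,\in\,\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2
\end{aligned}
$$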
The formula above is the loss the authors define. It has three main parts: the bounding box loss, the confidence loss, and the classification loss. Breaking it down: the first two terms are the coordinate loss (box centers, then square-rooted widths and heights), both weighted by λcoord; the next two terms are the confidence loss, split between boxes responsible for an object and boxes with no object (the latter weighted by λnoobj); and the final term is the classification loss over cells that contain an object.
Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probability discussed earlier). It also only penalizes bounding box coordinate error if that predictor is “responsible” for the ground truth box (i.e. has the highest IOU of any predictor in that grid cell).
Note that the loss only penalizes classification error when an object is present in the grid cell (hence the conditional class probabilities discussed earlier); if the cell contains no object, classification error is ignored. Likewise, it only penalizes box coordinate error for the predictor that is "responsible" for the ground-truth box (i.e., the predictor with the highest IOU in that cell).
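A simplified PyTorch sketch of this loss under the 7×7×30 layout; the mask conventions, the per-cell ordering, and the function name are my assumptions, and the paper's exact implementation differs:

```python
import torch

def yolo_loss(pred, target, obj_mask, resp_mask, l_coord=5.0, l_noobj=0.5):
    """Sum-squared-error YOLO loss over (N, S, S, B*5+C) grids.

    obj_mask  (N, S, S):    1 where a cell contains an object
    resp_mask (N, S, S, B): 1 for the box with highest IOU vs. the GT
    Assumed layout per cell: B*(x, y, w, h, conf) then C class probs.
    """
    B = 2
    boxes_p = pred[..., :B * 5].reshape(*pred.shape[:-1], B, 5)
    boxes_t = target[..., :B * 5].reshape(*target.shape[:-1], B, 5)

    r = resp_mask.unsqueeze(-1)                       # responsible boxes only
    xy = ((boxes_p[..., :2] - boxes_t[..., :2]) ** 2 * r).sum()
    wh = ((boxes_p[..., 2:4].clamp(min=0).sqrt()      # sqrt of w, h
           - boxes_t[..., 2:4].sqrt()) ** 2 * r).sum()
    conf_err = (boxes_p[..., 4] - boxes_t[..., 4]) ** 2
    conf_obj = (conf_err * resp_mask).sum()           # cells with objects
    conf_noobj = (conf_err * (1 - resp_mask)).sum()   # down-weighted below
    cls = ((pred[..., B * 5:] - target[..., B * 5:]) ** 2
           * obj_mask.unsqueeze(-1)).sum()            # only object cells

    return l_coord * (xy + wh) + conf_obj + l_noobj * conf_noobj + cls
```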
Just like in training, predicting detections for a test image only requires one network evaluation. On PASCAL VOC the network predicts 98 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.
Here the authors turn to inference. Because YOLO is a one-stage network, detection requires only a single network evaluation, which makes it extremely fast compared with classifier-based methods. On PASCAL VOC the network predicts 98 (= 7×7×2) bounding boxes per image, plus class probabilities for each box.
The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into and the network only predicts one box for each object.
YOLO's grid design enforces spatial diversity in the box predictions. Usually it is clear which grid cell an object falls into, and the network predicts only one box per object.
However, some large objects or objects near the border of multiple cells can be well localized by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.
However, some large objects, or objects near the border of multiple cells, can be localized well by several cells at once. Non-maximum suppression (NMS) can be used to resolve these duplicate detections. Although NMS is far less critical to YOLO's performance than it is to R-CNN or DPM, it still adds 2-3% in mAP.
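A minimal greedy NMS sketch, reusing the `iou()` helper from the earlier confidence sketch:

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2) tuples."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)                 # highest-scoring remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep
```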
YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts two boxes and can only have one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model struggles with small objects that appear in groups, such as flocks of birds.
Here the authors discuss YOLO's limitations. The grid imposes strong spatial constraints on box predictions: on PASCAL VOC each cell predicts only two boxes and only one set of class probabilities, which limits how many nearby objects the model can predict. For small objects that appear in groups, such as flocks of birds, YOLO's predictions are comparatively poor.
Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations. Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.
Because YOLO learns to predict bounding boxes from data, it struggles to generalize to objects with new or unusual aspect ratios or configurations. It also uses relatively coarse features for box prediction, since the backbone applies multiple downsampling layers to the input image.
Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes. A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU. Our main source of error is incorrect localizations.
YOLO's loss function treats errors in small boxes and large boxes the same, yet a small error in a large box is usually benign while the same error in a small box has a much larger effect on IOU, as discussed above. The authors conclude that their main source of error is incorrect localization.
The remainder of the paper compares YOLO with the popular detectors of the time and presents the experiments; those parts offer little for current work, so I will skip them.

This paper is the pioneering work of one-stage detection, and arguably the first anchor-free deep-learning detection paper. Its core idea is to divide an image into a grid and detect objects on that grid. The division happens implicitly: the backbone (Darknet here) compresses the image from its original size into a feature map of the chosen resolution; for a 7×7 division, the network ends in a 7×7 feature map. Each grid cell is given two bounding boxes and is responsible for predicting the object whose center falls in that cell.

For each bounding box, YOLO predicts five values: the box center, width and height, and a confidence. The confidence indicates whether the box contains an object: 0 if not, and otherwise the IOU between the predicted box and the ground-truth box, so it expresses both whether an object is present and, if so, how well the box fits it. Because YOLO attaches only one set of class probabilities per cell rather than per box, multiplying each box's confidence by its cell's class probabilities yields per-box class scores, which is what makes a per-box classification loss possible. Each cell therefore predicts (4 + 1) × B + C values, which is exactly the channel depth of the final feature map, where B is the number of boxes per cell and C the number of classes.

YOLO's loss likewise has three parts: a loss on box coordinates and sizes, a confidence loss, and a classification loss. Since YOLOv1 uses sum-squared error, the box loss and the confidence loss of object-free boxes would otherwise carry equal weight, so the authors introduce two parameters to balance them. The confidence loss is accordingly split in two: the confidence loss of boxes that do contain an object, which is unweighted, and the confidence loss of boxes that do not, which is multiplied by its weight. On top of that, there is the imbalance caused by giving equal weight to offset losses of large and small boxes, which the authors address by taking square roots of width and height in the loss.

That wraps up the YOLOv1 paper; see you at YOLOv2.