Some notes on understanding and reimplementing YOLOv3 (the reimplementation does not use darknet-53 as its backbone).
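Improvements
Compared with YOLOv2, the main improvements are:
- Multi-scale: the backbone uses an FPN-style structure that fuses features learned at different granularities (semantic and pixel-level information); the detection head is also split into feature-map outputs at 3 scales, which makes it better suited to detecting objects of different sizes;
- Multi-label: the classification loss is changed to binary cross-entropy, so one object can belong to several classes at once, which is closer to real application scenarios.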
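Network structure
The original paper uses darknet-53. For the COCO dataset (80 classes, so each detection head outputs $3 \times (80 + 4 + 1) = 255$ channels) the network is: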
| layer | type | filters / from | size / stride | input | | output | BFLOPs |
|---|---|---|---|---|---|---|---|
| 0 | conv | 32 | 3 x 3 / 1 | 416 x 416 x 3 | -> | 416 x 416 x 32 | 0.299 BFLOPs |
| 1 | conv | 64 | 3 x 3 / 2 | 416 x 416 x 32 | -> | 208 x 208 x 64 | 1.595 BFLOPs |
| 2 | conv | 32 | 1 x 1 / 1 | 208 x 208 x 64 | -> | 208 x 208 x 32 | 0.177 BFLOPs |
| 3 | conv | 64 | 3 x 3 / 1 | 208 x 208 x 32 | -> | 208 x 208 x 64 | 1.595 BFLOPs |
| 4 | res | 1 | | 208 x 208 x 64 | -> | 208 x 208 x 64 | |
| 5 | conv | 128 | 3 x 3 / 2 | 208 x 208 x 64 | -> | 104 x 104 x 128 | 1.595 BFLOPs |
| 6 | conv | 64 | 1 x 1 / 1 | 104 x 104 x 128 | -> | 104 x 104 x 64 | 0.177 BFLOPs |
| 7 | conv | 128 | 3 x 3 / 1 | 104 x 104 x 64 | -> | 104 x 104 x 128 | 1.595 BFLOPs |
| 8 | res | 5 | | 104 x 104 x 128 | -> | 104 x 104 x 128 | |
| 9 | conv | 64 | 1 x 1 / 1 | 104 x 104 x 128 | -> | 104 x 104 x 64 | 0.177 BFLOPs |
| 10 | conv | 128 | 3 x 3 / 1 | 104 x 104 x 64 | -> | 104 x 104 x 128 | 1.595 BFLOPs |
| 11 | res | 8 | | 104 x 104 x 128 | -> | 104 x 104 x 128 | |
| 12 | conv | 256 | 3 x 3 / 2 | 104 x 104 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 13 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 14 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 15 | res | 12 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 16 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 17 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 18 | res | 15 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 19 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 20 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 21 | res | 18 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 22 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 23 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 24 | res | 21 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 25 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 26 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 27 | res | 24 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 28 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 29 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 30 | res | 27 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 31 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 32 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 33 | res | 30 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 34 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 35 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| $\color{SpringGreen}{36}$ | res | 33 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
| 37 | conv | 512 | 3 x 3 / 2 | 52 x 52 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 38 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 39 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 40 | res | 37 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 41 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 42 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 43 | res | 40 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 44 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 45 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 46 | res | 43 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 47 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 48 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 49 | res | 46 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 50 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 51 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 52 | res | 49 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 53 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 54 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 55 | res | 52 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 56 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 57 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 58 | res | 55 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 59 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 60 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| $\color{blue}{61}$ | res | 58 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
| 62 | conv | 1024 | 3 x 3 / 2 | 26 x 26 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 63 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 64 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 65 | res | 62 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
| 66 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 67 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 68 | res | 65 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
| 69 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 70 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 71 | res | 68 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
| 72 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 73 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 74 | res | 71 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
| 75 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 76 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 77 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 78 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| $\color{blue}{79}$ | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
| 80 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
| 81 | conv | 255 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 255 | 0.088 BFLOPs |
| 82 | $\color{red}{detection}$ | | | | | | |
| 83 | route | $\color{blue}{79}$ | | | | | |
| 84 | conv | 256 | 1 x 1 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 256 | 0.044 BFLOPs |
| 85 | upsample | | 2x | 13 x 13 x 256 | -> | 26 x 26 x 256 | |
| 86 | route | 85, $\color{blue}{61}$ | | | | | |
| 87 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 768 | -> | 26 x 26 x 256 | 0.266 BFLOPs |
| 88 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 89 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 90 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| $\color{SpringGreen}{91}$ | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
| 92 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
| 93 | conv | 255 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 255 | 0.177 BFLOPs |
| 94 | $\color{red}{detection}$ | | | | | | |
| 95 | route | $\color{SpringGreen}{91}$ | | | | | |
| 96 | conv | 128 | 1 x 1 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 128 | 0.044 BFLOPs |
| 97 | upsample | | 2x | 26 x 26 x 128 | -> | 52 x 52 x 128 | |
| 98 | route | 97, $\color{SpringGreen}{36}$ | | | | | |
| 99 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 384 | -> | 52 x 52 x 128 | 0.266 BFLOPs |
| 100 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 101 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 102 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 103 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
| 104 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
| 105 | conv | 255 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 255 | 0.353 BFLOPs |
| 106 | $\color{red}{detection}$ | | | | | | |
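The head width and the per-layer cost in the table can be checked by hand. The sketch below assumes each grid cell predicts 3 anchors, each with 4 box offsets, 1 objectness score and one score per class, and that one multiply-add counts as two floating-point operations. `head_channels` and `conv_bflops` are hypothetical helper names for illustration, not darknet functions.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Output channels of one detection head: anchors * (4 box + 1 objectness + classes)."""
    return anchors_per_scale * (4 + 1 + num_classes)


def conv_bflops(out_h, out_w, kernel, c_in, c_out):
    """Billions of FLOPs for one conv layer, counting a multiply-add as 2 operations."""
    return 2 * out_h * out_w * kernel * kernel * c_in * c_out / 1e9


print(head_channels(80))                            # 255 for COCO
print(round(conv_bflops(416, 416, 3, 3, 32), 3))    # layer 0: 0.299
print(round(conv_bflops(208, 208, 3, 32, 64), 3))   # layer 1: 1.595
print(round(conv_bflops(208, 208, 1, 64, 32), 3))   # layer 2: 0.177
```

The 3 x 3 and 1 x 1 layers keep the same cost (1.595 and 0.177) at every stage because each stride-2 step quarters the spatial size while the product of input and output channels grows fourfold.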
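Multi-scale
The backbone follows the FPN design.
YOLOv3 predicts at several scales (they are explained here from small to large because each larger scale depends on upsampling the smaller one):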
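- Small scale: $13 \times 13$. The network takes a $416 \times 416$ image and passes it through 5 stride-2 convolution modules to reach the layer-79 feature map ($13 \times 13 \times 512$); two more convolution layers follow for detection regression, giving the output feature map $13 \times 13 \times 255$;
- Medium scale: $26 \times 26$. Starting from layer 79 of the main backbone, a $1 \times 1$ convolution halves the channels and upsampling gives $26 \times 26 \times 256$; this is concatenated with the layer-61 branch of the main backbone ($26 \times 26 \times 512$) to give a $26 \times 26 \times 768$ feature map; 7 convolution layers follow for detection regression, giving the output feature map $26 \times 26 \times 255$;
- Large scale: $52 \times 52$. Starting from layer 91 of the medium-scale detection head, a $1 \times 1$ convolution halves the channels and upsampling gives $52 \times 52 \times 128$; this is concatenated with the layer-36 branch of the main backbone ($52 \times 52 \times 256$) to give a $52 \times 52 \times 384$ feature map; 7 convolution layers follow for detection regression, giving the output feature map $52 \times 52 \times 255$;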
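The three branches above can be traced as pure shape arithmetic. This is only a bookkeeping sketch: it tracks `(height, width, channels)` tuples and collapses each run of convolutions into a single channel change, so it is not a network implementation. The helper names are made up for illustration, and the layer shapes are read off the layer table for a 416 x 416 input.

```python
def upsample2x(shape):
    """Double the spatial size, keep the channel count."""
    h, w, c = shape
    return (2 * h, 2 * w, c)


def concat(a, b):
    """Channel-wise concatenation; spatial sizes must already match."""
    assert a[:2] == b[:2], "spatial sizes must match before concatenation"
    return (a[0], a[1], a[2] + b[2])


def set_channels(shape, channels):
    """Stand-in for stride-1 convolutions: only the channel count changes."""
    return (shape[0], shape[1], channels)


NUM_OUT = 3 * (80 + 5)  # 255 channels per head for COCO

# Feature maps taken from the layer table
layer36 = (52, 52, 256)
layer61 = (26, 26, 512)
layer79 = (13, 13, 512)

small = set_channels(layer79, NUM_OUT)                            # layers 80-81
medium_in = concat(upsample2x(set_channels(layer79, 256)), layer61)  # layers 83-86
layer91 = set_channels(medium_in, 256)                            # layers 87-91
medium = set_channels(layer91, NUM_OUT)                           # layers 92-93
large_in = concat(upsample2x(set_channels(layer91, 128)), layer36)   # layers 95-98
large = set_channels(large_in, NUM_OUT)                           # layers 99-105

print(small, medium_in, medium, large_in, large)
```

The concatenated inputs come out as 768 and 384 channels, matching the inputs of layers 87 and 99 in the table.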
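Correspondingly, the anchor boxes obtained by clustering must also be distributed across the three detection branches: large anchors go to the branch with the large receptive field (the one with the large downsampling stride), and small anchors go to the branch with the small receptive field (the one with the small downsampling stride).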
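One simple way to do this assignment is to sort the anchors by area and hand them out in groups of three, smallest group to the smallest stride. The sketch below assumes that approach. The anchor values are the nine COCO anchors from the public YOLOv3 configuration, not something given in these notes, and `assign_anchors` is a hypothetical helper.

```python
# Nine COCO anchors as (width, height) in pixels, from the public YOLOv3 config
# (an assumption here; a custom dataset would use its own clustered anchors).
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
STRIDES = [8, 16, 32]  # 52x52, 26x26 and 13x13 grids for a 416x416 input


def assign_anchors(anchors, strides, per_scale=3):
    """Map each stride to its anchors: small anchors to small strides."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])  # sort by area
    return {
        stride: ordered[i * per_scale:(i + 1) * per_scale]
        for i, stride in enumerate(sorted(strides))
    }


for stride, group in assign_anchors(ANCHORS, STRIDES).items():
    print(stride, group)
```

With these values the stride-32 branch (13 x 13 grid) receives the three largest anchors and the stride-8 branch (52 x 52 grid) the three smallest.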
In addition, multi-scale inputs were also used during training.
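A common way to implement multi-scale training is to resample the input resolution every few batches, keeping it a multiple of the largest stride (32) so that all three grids stay integer-sized. The sketch below assumes a range of 320 to 608 pixels and a switch every 10 batches; these numbers and the helper `pick_input_size` are illustrative assumptions, not settings taken from these notes.

```python
import random


def pick_input_size(rng, min_size=320, max_size=608, stride=32):
    """Pick a random square input size that is a multiple of the stride."""
    return stride * rng.randint(min_size // stride, max_size // stride)


rng = random.Random(0)  # seeded only to make the sketch repeatable
size = 416
for batch_idx in range(30):
    if batch_idx % 10 == 0:  # assumed schedule: change size every 10 batches
        size = pick_input_size(rng)
        grids = [size // s for s in (32, 16, 8)]
        print(f"batch {batch_idx}: input {size}x{size}, grids {grids}")
    # ... resize the batch to `size` and run the training step here ...
```

Because the network is fully convolutional, only the grid sizes of the three heads change with the input size; the 255 output channels stay the same.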
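Multi-label
The YOLOv3 loss function is the same as the YOLOv2 loss function; the only difference is the classification term, which becomes a binary cross-entropy summed over all classes.
In other words, because each class probability is now predicted with its own binary cross-entropy, the classification loss needs a loss term for every class, unlike the softmax computation used in YOLOv2.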
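To make the difference concrete, the per-box classification term can be written, in notation assumed here rather than taken from the original, as $L_{cls} = -\sum_{c} \left[ y_c \log p_c + (1 - y_c) \log (1 - p_c) \right]$ with $p_c = \sigma(z_c)$, where $z_c$ is the raw class score and $y_c \in \{0, 1\}$ is the label for class $c$. The sketch below compares this with a softmax cross-entropy using only the standard library; both function names are hypothetical.

```python
import math


def bce_class_loss(logits, targets):
    """Sum of independent sigmoid cross-entropy terms, one per class.

    Uses the numerically stable form max(z, 0) - z*t + log(1 + exp(-|z|)).
    """
    return sum(
        max(z, 0.0) - z * t + math.log1p(math.exp(-abs(z)))
        for z, t in zip(logits, targets)
    )


def softmax_class_loss(logits, target_index):
    """Softmax cross-entropy: exactly one class can be the target."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


# Hypothetical scores for three classes, e.g. "person", "woman", "car".
logits = [4.0, 3.5, -3.0]

# With BCE both "person" and "woman" can be positive at once.
print(bce_class_loss(logits, [1, 1, 0]))
# With softmax the two high scores compete for one unit of probability.
print(softmax_class_loss(logits, 0))
```

With independent sigmoids, a confident score for a second correct class lowers the loss; under softmax the same score raises it, which is why softmax fits poorly when labels overlap.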
 
        