Some notes on understanding and reproducing YOLOv3 (the reproduction does not use darknet53 as its backbone).
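Improvements
The main improvements over YOLOv2 are:
- Multi-scale: the network backbone uses an FPN-style structure that fuses features learned at different granularities (semantic and pixel-level information); the detection head is likewise split into feature map outputs at 3 scales, which suits the detection of objects of different sizes;
- Multi-label: the classification loss is changed to a binary cross-entropy loss, so the same object can belong to several classes, which is closer to real-world use.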
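Network structure
The original paper uses darknet-53 as the network structure. The table below is for the COCO dataset (80 classes, so each detection head outputs $3 \times (4 + 1 + 80) = 255$ channels):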
index | layer | filters / from | size / stride | input | | output | BFLOPs |
---|---|---|---|---|---|---|---|
0 | conv | 32 | 3 x 3 / 1 | 416 x 416 x 3 | -> | 416 x 416 x 32 | 0.299 BFLOPs |
1 | conv | 64 | 3 x 3 / 2 | 416 x 416 x 32 | -> | 208 x 208 x 64 | 1.595 BFLOPs |
2 | conv | 32 | 1 x 1 / 1 | 208 x 208 x 64 | -> | 208 x 208 x 32 | 0.177 BFLOPs |
3 | conv | 64 | 3 x 3 / 1 | 208 x 208 x 32 | -> | 208 x 208 x 64 | 1.595 BFLOPs |
4 | res | 1 | | 208 x 208 x 64 | -> | 208 x 208 x 64 | |
5 | conv | 128 | 3 x 3 / 2 | 208 x 208 x 64 | -> | 104 x 104 x 128 | 1.595 BFLOPs |
6 | conv | 64 | 1 x 1 / 1 | 104 x 104 x 128 | -> | 104 x 104 x 64 | 0.177 BFLOPs |
7 | conv | 128 | 3 x 3 / 1 | 104 x 104 x 64 | -> | 104 x 104 x 128 | 1.595 BFLOPs |
8 | res | 5 | | 104 x 104 x 128 | -> | 104 x 104 x 128 | |
9 | conv | 64 | 1 x 1 / 1 | 104 x 104 x 128 | -> | 104 x 104 x 64 | 0.177 BFLOPs |
10 | conv | 128 | 3 x 3 / 1 | 104 x 104 x 64 | -> | 104 x 104 x 128 | 1.595 BFLOPs |
11 | res | 8 | | 104 x 104 x 128 | -> | 104 x 104 x 128 | |
12 | conv | 256 | 3 x 3 / 2 | 104 x 104 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
13 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
14 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
15 | res | 12 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
16 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
17 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
18 | res | 15 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
19 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
20 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
21 | res | 18 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
22 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
23 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
24 | res | 21 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
25 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
26 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
27 | res | 24 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
28 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
29 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
30 | res | 27 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
31 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
32 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
33 | res | 30 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
34 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
35 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
$\color{SpringGreen}{36}$ | res | 33 | | 52 x 52 x 256 | -> | 52 x 52 x 256 | |
37 | conv | 512 | 3 x 3 / 2 | 52 x 52 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
38 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
39 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
40 | res | 37 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
41 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
42 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
43 | res | 40 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
44 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
45 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
46 | res | 43 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
47 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
48 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
49 | res | 46 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
50 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
51 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
52 | res | 49 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
53 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
54 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
55 | res | 52 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
56 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
57 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
58 | res | 55 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
59 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
60 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
$\color{blue}{61}$ | res | 58 | | 26 x 26 x 512 | -> | 26 x 26 x 512 | |
62 | conv | 1024 | 3 x 3 / 2 | 26 x 26 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
63 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
64 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
65 | res | 62 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
66 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
67 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
68 | res | 65 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
69 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
70 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
71 | res | 68 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
72 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
73 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
74 | res | 71 | | 13 x 13 x 1024 | -> | 13 x 13 x 1024 | |
75 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
76 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
77 | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
78 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
$\color{blue}{79}$ | conv | 512 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 512 | 0.177 BFLOPs |
80 | conv | 1024 | 3 x 3 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 1024 | 1.595 BFLOPs |
81 | conv | 255 | 1 x 1 / 1 | 13 x 13 x 1024 | -> | 13 x 13 x 255 | 0.088 BFLOPs |
82 | $\color{red}{detection}$ | ||||||
83 | route | $\color{blue}{79}$ | |||||
84 | conv | 256 | 1 x 1 / 1 | 13 x 13 x 512 | -> | 13 x 13 x 256 | 0.044 BFLOPs |
85 | upsample | | 2x | 13 x 13 x 256 | -> | 26 x 26 x 256 | |
86 | route | 85, $\color{blue}{61}$ | | | | | |
87 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 768 | -> | 26 x 26 x 256 | 0.266 BFLOPs |
88 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
89 | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
90 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
$\color{SpringGreen}{91}$ | conv | 256 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 256 | 0.177 BFLOPs |
92 | conv | 512 | 3 x 3 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 512 | 1.595 BFLOPs |
93 | conv | 255 | 1 x 1 / 1 | 26 x 26 x 512 | -> | 26 x 26 x 255 | 0.177 BFLOPs |
94 | $\color{red}{detection}$ | ||||||
95 | route | $\color{SpringGreen}{91}$ | |||||
96 | conv | 128 | 1 x 1 / 1 | 26 x 26 x 256 | -> | 26 x 26 x 128 | 0.044 BFLOPs |
97 | upsample | | 2x | 26 x 26 x 128 | -> | 52 x 52 x 128 | |
98 | route | 97, $\color{SpringGreen}{36}$ | | | | | |
99 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 384 | -> | 52 x 52 x 128 | 0.266 BFLOPs |
100 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
101 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
102 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
103 | conv | 128 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 128 | 0.177 BFLOPs |
104 | conv | 256 | 3 x 3 / 1 | 52 x 52 x 128 | -> | 52 x 52 x 256 | 1.595 BFLOPs |
105 | conv | 255 | 1 x 1 / 1 | 52 x 52 x 256 | -> | 52 x 52 x 255 | 0.353 BFLOPs |
106 | $\color{red}{detection}$ |
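Two numbers in the table can be checked by hand: the 255 filters of each detection head and the BFLOPs column. The sketch below assumes 3 anchors per scale and counts one multiply-add as 2 operations. The function names `head_channels` and `conv_bflops` are made up for this note and are not part of darknet.

```python
def head_channels(num_classes, anchors_per_scale=3):
    """Channels of one detection head: per anchor, 4 box offsets + 1 objectness + class scores."""
    return anchors_per_scale * (4 + 1 + num_classes)


def conv_bflops(kernel, in_ch, out_ch, out_h, out_w):
    """Billions of operations in one conv layer, counting a multiply-add as 2 operations."""
    return 2 * kernel * kernel * in_ch * out_ch * out_h * out_w / 1e9


print(head_channels(80))                            # 255
print(round(conv_bflops(3, 3, 32, 416, 416), 3))    # 0.299 (layer 0)
print(round(conv_bflops(3, 32, 64, 208, 208), 3))   # 1.595 (layer 1)
print([416 // stride for stride in (32, 16, 8)])    # [13, 26, 52]
```

The last line gives the grid sizes of the three detection heads for a 416 x 416 input.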
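Multi-scale
The backbone structure follows the FPN design.
YOLOv3 predicts at several scales (explained from small to large, because the larger scales depend on upsampling the smaller ones):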
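- Small scale ($13 \times 13$): the network takes a $416 \times 416 \times 3$ image and passes it through 5 stride-2 convolution modules, arriving at the layer-79 feature map ($13 \times 13 \times 512$); two more convolution layers follow for detection regression, giving the output feature map $13 \times 13 \times 255$;
- Medium scale ($26 \times 26$): continuing from layer 79 of the main backbone, a $1 \times 1$ kernel halves the channels, and upsampling then gives $26 \times 26 \times 256$; this is concatenated with the branch from layer 61 of the main backbone, giving a $26 \times 26 \times 768$ feature map; 7 convolution layers follow for detection regression, giving the output feature map $26 \times 26 \times 255$;
- Large scale ($52 \times 52$): continuing from layer 91 of the medium-scale detection head, a $1 \times 1$ kernel halves the channels, and upsampling then gives $52 \times 52 \times 128$; this is concatenated with the branch from layer 36 of the main backbone, giving a $52 \times 52 \times 384$ feature map; 7 convolution layers follow for detection regression, giving the output feature map $52 \times 52 \times 255$.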
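The route / upsample / concat wiring of the three branches can be traced with shapes alone. This is only a shape-tracing sketch, not a network implementation. The helpers `conv`, `upsample`, `concat` and `head` are hypothetical names, and the channel counts are read from the layer table for a 416 x 416 COCO model.

```python
NUM_OUT = 3 * (4 + 1 + 80)  # 255 output channels per head (COCO, 3 anchors per scale)


def conv(shape, out_ch, stride=1):
    """Shape after a 'same'-padded conv: spatial size divided by stride, channels replaced."""
    h, w, _ = shape
    return (h // stride, w // stride, out_ch)


def upsample(shape, factor=2):
    h, w, c = shape
    return (h * factor, w * factor, c)


def concat(a, b):
    assert a[:2] == b[:2], "spatial sizes must match before channel concatenation"
    return (a[0], a[1], a[2] + b[2])


def head(shape, mid_ch):
    """Five alternating 1x1 / 3x3 convs, then a 3x3 conv and the 1x1 output conv.

    Returns (feature reused by the next scale, head output)."""
    x = shape
    for ch in (mid_ch, mid_ch * 2, mid_ch, mid_ch * 2, mid_ch):
        x = conv(x, ch)
    return x, conv(conv(x, mid_ch * 2), NUM_OUT)


c36 = (52, 52, 256)   # backbone layer 36
c61 = (26, 26, 512)   # backbone layer 61
x79 = (13, 13, 512)   # layer 79

small = conv(conv(x79, 1024), NUM_OUT)             # layers 80-81
mid_in = concat(upsample(conv(x79, 256)), c61)     # layers 84-86
x91, mid_out = head(mid_in, 256)                   # layers 87-93
large_in = concat(upsample(conv(x91, 128)), c36)   # layers 96-98
_, large_out = head(large_in, 128)                 # layers 99-105

print(small, mid_in, mid_out, large_in, large_out)
```

Each upsampled branch has half the channels of the backbone feature it is concatenated with, which is where the 768 and 384 input channels come from.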
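Correspondingly, the anchor boxes obtained by clustering also have to be distributed over the three detection branches: large anchors go to the branch with the large receptive field (the branch with the larger downsampling stride), and small anchors go to the branch with the small receptive field (the branch with the smaller downsampling stride).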
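The assignment rule can be sketched as: sort the anchors by area, then give the smallest third to the stride-8 branch and the largest third to the stride-32 branch. The anchor values below are the COCO anchors from the public yolov3.cfg and are not listed in these notes. `assign_anchors` is a hypothetical helper name.

```python
# (width, height) in pixels for a 416 x 416 input; taken from yolov3.cfg, an assumption here
ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45),
           (59, 119), (116, 90), (156, 198), (373, 326)]
STRIDES = (8, 16, 32)


def assign_anchors(anchors, strides=STRIDES):
    """Map each stride to an equal share of the anchors, small anchors to small strides."""
    ordered = sorted(anchors, key=lambda wh: wh[0] * wh[1])
    per_scale = len(ordered) // len(strides)
    return {
        stride: ordered[i * per_scale:(i + 1) * per_scale]
        for i, stride in enumerate(sorted(strides))
    }


for stride, group in assign_anchors(ANCHORS).items():
    print(stride, 416 // stride, group)
```

With these values the 13 x 13 head (stride 32) receives the three largest anchors, and the 52 x 52 head (stride 8) the three smallest.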
In addition, multi-scale inputs were also used in the training experiments.
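The notes do not say how the training sizes were chosen. A common scheme, assumed here, is to draw a square input size that is a multiple of 32, so that the five stride-2 stages divide it evenly. The 320 to 608 range and the name `pick_input_size` are assumptions for illustration, not taken from this reproduction.

```python
import random


def pick_input_size(rng, low=320, high=608, stride=32):
    """Pick a square input side that five stride-2 stages can downsample without remainder."""
    return rng.choice(range(low, high + 1, stride))


rng = random.Random(0)  # seeded only to make the sketch repeatable
size = pick_input_size(rng)
print(size, [size // s for s in (32, 16, 8)])  # input side and the three grid sizes
```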
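Multi-label
The YOLOv3 loss function is the same as the YOLOv2 loss function; the only difference is the classification error term, which is replaced by a binary cross-entropy over the classes.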
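In other words, because each class probability is now predicted with its own binary cross-entropy, the classification loss has to include a loss term for every class, unlike the softmax-based computation in YOLOv2.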
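For one positive box with class logits $z_c$ and 0/1 targets $t_c$, the per-class binary cross-entropy is $-\sum_c \left[ t_c \log \sigma(z_c) + (1 - t_c) \log (1 - \sigma(z_c)) \right]$. The notation is chosen for this sketch and is not the original formula of these notes. The code below contrasts it with a softmax cross-entropy in plain Python; both function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bce_class_loss(logits, targets, eps=1e-12):
    """Sum of independent binary cross-entropies, one term per class."""
    loss = 0.0
    for z, t in zip(logits, targets):
        p = sigmoid(z)
        loss -= t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
    return loss


def softmax_ce_loss(logits, target_index):
    """Softmax cross-entropy: classes compete, so only one label can be the target."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target_index]


logits = [2.0, 1.5, -3.0]
print(bce_class_loss(logits, [1, 1, 0]))   # two classes can be positive at once
print(softmax_ce_loss(logits, 0))          # a single target index
```

With the sigmoid form, every class contributes a term whether its target is 0 or 1, which is what lets a box carry several labels.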