
Some Q&A on Using Caffe

Why does a trained model test poorly? What does a blob with the same name across different layers mean?

Question: A trained model performs very poorly when deployed for testing (relative to the loss, cls, etc. outputs seen during training).

Answer: Possible causes include:

  • A layer's name in deploy.prototxt does not match the corresponding layer in train.prototxt. You can debug with the snippet below (and the name-diff sketch after this list):
import cv2
import caffe

net = caffe.Net('deploy.prototxt', 'resnet18.caffemodel', caffe.TEST)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
# Transpose data read by cv2 or caffe from shape (image_height, image_width,
# channel_num) to (channel_num, image_height, image_width)
transformer.set_transpose('data', (2, 0, 1))
# channel_num, image_height, image_width: set these to match the input of deploy.prototxt
net.blobs['data'].reshape(1, channel_num, image_height, image_width)
image = cv2.imread('image.jpg') / 255.0
transformed_image = transformer.preprocess('data', image)
net.blobs['data'].data[...] = transformed_image
output = net.forward()  # output blob shape: (batch_size, channel_num, image_height, image_width)
print([(k, v[0].data) for k, v in net.params.items()])
# Inspect the layer weights; if a name does not match, that layer may have been
# zero-initialized, randomly initialized, or left in some other unexpected state
w1 = net.params['Convolution_top'][0].data
b1 = net.params['Convolution_top'][1].data
feature = net.blobs['Convolution_name'].data  # inspect the output feature of the blob with this name
  • The input image data is preprocessed differently from the image data used during training. For example, training read images with OpenCV's cv2.imread('image.jpg') while testing used caffe.io.load_image('image.jpg'). cv2.imread() returns a [0, 255] BGR image, whereas caffe.io.load_image() returns [0, 1] RGB data (see the equivalence sketch after this list):
import caffe

net = caffe.Net('deploy.prototxt', 'resnet18.caffemodel', caffe.TEST)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
# Transpose data read by cv2 or caffe from shape (image_height, image_width,
# channel_num) to (channel_num, image_height, image_width)
transformer.set_transpose('data', (2, 0, 1))
# Swap channels from RGB to BGR; only needed when using caffe.io.load_image()
transformer.set_channel_swap('data', (2, 1, 0))
net.blobs['data'].reshape(1, channel_num, image_height, image_width)
# If reading with cv2 instead, check whether the data preprocessing in
# train.prototxt normalized the input; if so, use cv2.imread('image.jpg') / 255
image = caffe.io.load_image('image.jpg')
transformed_image = transformer.preprocess('data', image)
net.blobs['data'].data[...] = transformed_image
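
For the first case, a quick way to catch name mismatches is to diff the layer names of the two prototxt files. This is a minimal sketch, assuming the standard pycaffe protobuf bindings; 'train.prototxt' and 'deploy.prototxt' stand in for your own files:

from google.protobuf import text_format
from caffe.proto import caffe_pb2

def layer_names(path):
    # Parse a prototxt into a NetParameter and collect its layer names
    net = caffe_pb2.NetParameter()
    with open(path) as f:
        text_format.Merge(f.read(), net)
    return {layer.name for layer in net.layer}

train_names = layer_names('train.prototxt')
deploy_names = layer_names('deploy.prototxt')
print('only in train :', sorted(train_names - deploy_names))  # data/loss layers are expected here
print('only in deploy:', sorted(deploy_names - train_names))  # learned layers here are suspicious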
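
For the second case, the following sketch makes the two readers' conventions explicit by converting the cv2 image to match caffe.io.load_image ('image.jpg' is a placeholder; the tolerance hedges against the two libraries using different JPEG decoders):

import cv2
import numpy as np
import caffe

img_cv2 = cv2.imread('image.jpg')             # uint8, BGR, values in [0, 255]
img_caffe = caffe.io.load_image('image.jpg')  # float, RGB, values in [0, 1]

# Bring the cv2 image to caffe.io.load_image's convention: RGB order, [0, 1] range
img_cv2_as_caffe = cv2.cvtColor(img_cv2, cv2.COLOR_BGR2RGB) / 255.0

# The two should now agree up to small decoder differences
print(np.allclose(img_cv2_as_caffe, img_caffe, atol=0.05))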

Question: While training, the model stops loading halfway through with the following error:

...
I0614 06:47:37.005009 6183 layer_factory.hpp:77] Creating layer res4b_res4b_0_split
I0614 06:47:37.005156 6183 net.cpp:84] Creating Layer res4b_res4b_0_split
I0614 06:47:37.005290 6183 net.cpp:406] res4b_res4b_0_split <- res4b
I0614 06:47:37.005432 6183 net.cpp:380] res4b_res4b_0_split -> res4b_res4b_0_split_0
I0614 06:47:37.005589 6183 net.cpp:380] res4b_res4b_0_split -> res4b_res4b_0_split_1
I0614 06:47:37.005784 6183 net.cpp:122] Setting up res4b_res4b_0_split
I0614 06:47:37.005924 6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.006058 6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.006186 6183 net.cpp:137] Memory required for data: 3266455552
# The two branches of a resnet_v2 residual block follow
# First branch: 1*1 conv, channel count grows (here by half: 256 -> 384), feature map stride 2 (6 * 18 -> 3 * 9)
I0614 06:47:37.006321 6183 layer_factory.hpp:77] Creating layer res5a_branch1
I0614 06:47:37.006469 6183 net.cpp:84] Creating Layer res5a_branch1
I0614 06:47:37.006597 6183 net.cpp:406] res5a_branch1 <- res4b_res4b_0_split_0
I0614 06:47:37.006743 6183 net.cpp:380] res5a_branch1 -> res5a_branch1
I0614 06:47:37.010433 6183 net.cpp:122] Setting up res5a_branch1
I0614 06:47:37.010591 6183 net.cpp:129] Top shape: 128 384 3 9 (1327104)
I0614 06:47:37.010730 6183 net.cpp:137] Memory required for data: 3271763968
# Second branch: two basic building blocks; the first one handles the channel growth and the stride-2 downsampling
# BN
I0614 06:47:37.010865 6183 layer_factory.hpp:77] Creating layer bn5a_branch2a
I0614 06:47:37.011039 6183 net.cpp:84] Creating Layer bn5a_branch2a
I0614 06:47:37.011173 6183 net.cpp:406] bn5a_branch2a <- res4b_res4b_0_split_1
I0614 06:47:37.011315 6183 net.cpp:380] bn5a_branch2a -> res5a_branch2a
I0614 06:47:37.011767 6183 net.cpp:122] Setting up bn5a_branch2a
I0614 06:47:37.011909 6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.012042 6183 net.cpp:137] Memory required for data: 3285919744
# scale
I0614 06:47:37.012198 6183 layer_factory.hpp:77] Creating layer scale5a_branch2a
I0614 06:47:37.012343 6183 net.cpp:84] Creating Layer scale5a_branch2a
I0614 06:47:37.012476 6183 net.cpp:406] scale5a_branch2a <- res5a_branch2a
I0614 06:47:37.012619 6183 net.cpp:367] scale5a_branch2a -> res5a_branch2a (in-place)
I0614 06:47:37.012815 6183 layer_factory.hpp:77] Creating layer scale5a_branch2a
I0614 06:47:37.013126 6183 net.cpp:122] Setting up scale5a_branch2a
I0614 06:47:37.013267 6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.013396 6183 net.cpp:137] Memory required for data: 3300075520
# ReLU
I0614 06:47:37.013535 6183 layer_factory.hpp:77] Creating layer res5a_branch2a_relu
I0614 06:47:37.013675 6183 net.cpp:84] Creating Layer res5a_branch2a_relu
I0614 06:47:37.013814 6183 net.cpp:406] res5a_branch2a_relu <- res5a_branch2a
I0614 06:47:37.013959 6183 net.cpp:367] res5a_branch2a_relu -> res5a_branch2a (in-place)
I0614 06:47:37.014320 6183 net.cpp:122] Setting up res5a_branch2a_relu
I0614 06:47:37.014461 6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.014588 6183 net.cpp:137] Memory required for data: 3314231296
# The problem is here: conv 3*3, pad 1, stride 2, channels 256 -> 384, so the output blob (top) and the input blob (bottom) no longer have the same shape.
# But in the prototxt I still gave these two blobs the same name, so when Caffe runs the in-place computation it tries to reshape the array and fails.
I0614 06:47:37.014735 6183 layer_factory.hpp:77] Creating layer res5a_branch2a
I0614 06:47:37.014889 6183 net.cpp:84] Creating Layer res5a_branch2a
I0614 06:47:37.015050 6183 net.cpp:406] res5a_branch2a <- res5a_branch2a
I0614 06:47:37.015213 6183 net.cpp:367] res5a_branch2a -> res5a_branch2a (in-place)
F0614 06:47:37.033099 6183 cudnn_conv_layer.cpp:138] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0) CUDNN_STATUS_BAD_PARAM
*** Check failure stack trace: ***
@ 0x7fb4b4a675cd google::LogMessage::Fail()
@ 0x7fb4b4a69433 google::LogMessage::SendToLog()
@ 0x7fb4b4a6715b google::LogMessage::Flush()
@ 0x7fb4b4a69e1e google::LogMessageFatal::~LogMessageFatal()
@ 0x7fb4b5074598 caffe::CuDNNConvolutionLayer<>::Reshape()
@ 0x7fb4b519dddb caffe::Net<>::Init()
@ 0x7fb4b51a061e caffe::Net<>::Net()
@ 0x7fb4b51a9775 caffe::Solver<>::InitTrainNet()
@ 0x7fb4b51aaba5 caffe::Solver<>::Init()
@ 0x7fb4b51aaebf caffe::Solver<>::Solver()
@ 0x7fb4b51bc3d1 caffe::Creator_AdamSolver<>()
@ 0x40bfb3 train()
@ 0x408660 main
@ 0x7fb4b3b9c830 __libc_start_main
@ 0x408fb9 _start
@ (nil) (unknown)
Aborted (core dumped)

Answer: As explained in the inline comments above. A blob with the same name across different layers means those layers reuse an array already allocated in the C++ backend (in-place computation), so the array's shape must stay identical across those layers.
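
Before looking at the fix, you can scan a prototxt for risky in-place layers statically. This is a heuristic sketch, not an official Caffe check; the set of shape-changing layer types below is my own assumption:

from google.protobuf import text_format
from caffe.proto import caffe_pb2

# Assumption: these layer types usually change the blob shape, so writing
# their output in-place (top name == bottom name) is almost certainly a bug
SHAPE_CHANGING = {'Convolution', 'Deconvolution', 'Pooling', 'InnerProduct'}

net_param = caffe_pb2.NetParameter()
with open('train.prototxt') as f:  # hypothetical path
    text_format.Merge(f.read(), net_param)

for layer in net_param.layer:
    # In-place computation: some top blob reuses a bottom blob's name
    if set(layer.bottom) & set(layer.top) and layer.type in SHAPE_CHANGING:
        print('suspicious in-place layer: %s (type %s)' % (layer.name, layer.type))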

The fix:

layer {
  bottom: "res5a_branch2a"
  top: "res5a_branch2a"
  name: "scale5a_branch2a"
  type: "Scale"
  scale_param {
    bias_term: true
  }
}
layer {
  bottom: "res5a_branch2a"
  top: "res5a_branch2a"
  name: "res5a_branch2a_relu"
  type: "ReLU"
}
layer {
  bottom: "res5a_branch2a"
  top: "res5a_branch2b"  # changed from "res5a_branch2a"
  name: "res5a_branch2a"
  type: "Convolution"
  convolution_param {
    num_output: 384
    kernel_size: 3
    pad: 1
    stride: 2
    weight_filler {
      type: "msra"
    }
    bias_term: false
  }
}

layer {
  bottom: "res5a_branch2b"  # changed from "res5a_branch2a"
  top: "res5a_branch2b"
  name: "bn5a_branch2b"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.95
  }
  include {
    phase: TRAIN
  }
}
layer {
  bottom: "res5a_branch2b"  # changed from "res5a_branch2a"
  top: "res5a_branch2b"
  name: "bn5a_branch2b"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: true
  }
  include {
    phase: TEST
  }
}
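
After this rename the net should load without the cuDNN reshape failure. A quick sanity check, assuming the fixed train.prototxt and its data source are available (the expected shapes follow the log above):

import caffe

net = caffe.Net('train.prototxt', caffe.TRAIN)  # hypothetical path
print(net.blobs['res5a_branch2a'].data.shape)  # (128, 256, 6, 18): input to the stride-2 conv
print(net.blobs['res5a_branch2b'].data.shape)  # (128, 384, 3, 9): its output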