Trained model performs poorly at test time? What do same-named blobs in different layers mean?
Question: A trained model performs very poorly when deployed for testing (judged against the training-time loss, cls outputs, and so on).
Answer: Possible causes include:
- The layer names in deploy.prototxt do not match those of the corresponding layers in train.prototxt. Caffe copies weights from the .caffemodel by layer name, so any mismatched layer silently keeps its random initialization. The following can be used for debugging:
```python
import cv2
import caffe

net = caffe.Net('deploy.prototxt', 'resnet18.caffemodel', caffe.TEST)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))  # HxWxC -> CxHxW
# channel_num, image_height, image_width are placeholders for your input size
net.blobs['data'].reshape(1, channel_num, image_height, image_width)

image = cv2.imread('image.jpg') / 255.0  # BGR, scaled to [0, 1]
transformed_image = transformer.preprocess('data', image)
net.blobs['data'].data[...] = transformed_image
output = net.forward()

# Inspect the loaded parameters: layers whose names do not match
# train.prototxt keep their random initialization.
print([(k, v[0].data) for k, v in net.params.items()])
w1 = net.params['Convolution_top'][0].data    # weights of one layer
b1 = net.params['Convolution_top'][1].data    # biases of the same layer
feature = net.blobs['Convolution_name'].data  # an intermediate feature map
```
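Besides eyeballing `net.params`, the mismatch can be caught directly by diffing the layer names of the two prototxt files. A minimal sketch, assuming pycaffe and protobuf are available and both files sit in the working directory:

```python
from caffe.proto import caffe_pb2
from google.protobuf import text_format

def layer_names(prototxt_path):
    """Parse a .prototxt and return the set of layer names it defines."""
    net_param = caffe_pb2.NetParameter()
    with open(prototxt_path) as f:
        text_format.Merge(f.read(), net_param)
    return {layer.name for layer in net_param.layer}

train_names = layer_names('train.prototxt')
deploy_names = layer_names('deploy.prototxt')
# Layers listed under "only in deploy" get no weights from the .caffemodel
# and keep random values, which explains the bad test-time results.
print('only in train :', sorted(train_names - deploy_names))
print('only in deploy:', sorted(deploy_names - train_names))
```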
- The image preprocessing at test time differs from the preprocessing used during training. For example, training reads images with OpenCV's `cv2.imread('image.jpg')` while testing uses `caffe.io.load_image('image.jpg')`. `cv2.imread()` returns pixels in the range [0, 255] in BGR channel order, whereas `caffe.io.load_image()` returns data in the range [0, 1] in RGB order:
```python
import caffe

net = caffe.Net('deploy.prototxt', 'resnet18.caffemodel', caffe.TEST)
transformer = caffe.io.Transformer({'data': net.blobs['data'].data.shape})
transformer.set_transpose('data', (2, 0, 1))     # HxWxC -> CxHxW
transformer.set_channel_swap('data', (2, 1, 0))  # RGB -> BGR, matching the cv2 training input
net.blobs['data'].reshape(1, channel_num, image_height, image_width)

image = caffe.io.load_image('image.jpg')  # RGB, [0, 1]
transformed_image = transformer.preprocess('data', image)
net.blobs['data'].data[...] = transformed_image
```
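To confirm that the two loaders differ only in scale and channel order, a quick comparison can be run. A minimal sketch; `image.jpg` is a placeholder, and the two libraries' JPEG decoders may disagree by a few gray levels, hence the tolerance:

```python
import cv2
import numpy as np
import caffe

bgr = cv2.imread('image.jpg').astype(np.float32) / 255.0  # [0, 1], BGR, HxWxC
rgb = caffe.io.load_image('image.jpg')                    # [0, 1], RGB, HxWxC
# Reversing the channel axis should make the two (nearly) identical.
print(np.allclose(bgr[:, :, ::-1], rgb, atol=2.0 / 255))
```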
Question: While training, the model loads about halfway and then reports the following error:
```
...
I0614 06:47:37.005009  6183 layer_factory.hpp:77] Creating layer res4b_res4b_0_split
I0614 06:47:37.005156  6183 net.cpp:84] Creating Layer res4b_res4b_0_split
I0614 06:47:37.005290  6183 net.cpp:406] res4b_res4b_0_split <- res4b
I0614 06:47:37.005432  6183 net.cpp:380] res4b_res4b_0_split -> res4b_res4b_0_split_0
I0614 06:47:37.005589  6183 net.cpp:380] res4b_res4b_0_split -> res4b_res4b_0_split_1
I0614 06:47:37.005784  6183 net.cpp:122] Setting up res4b_res4b_0_split
I0614 06:47:37.005924  6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.006058  6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.006186  6183 net.cpp:137] Memory required for data: 3266455552
# The two branches of a resnet_v2 residual block follow.
# Branch 1: 1x1 conv; channels widen (here by half, 256 -> 384), feature map stride 2 (6 x 18 -> 3 x 9)
I0614 06:47:37.006321  6183 layer_factory.hpp:77] Creating layer res5a_branch1
I0614 06:47:37.006469  6183 net.cpp:84] Creating Layer res5a_branch1
I0614 06:47:37.006597  6183 net.cpp:406] res5a_branch1 <- res4b_res4b_0_split_0
I0614 06:47:37.006743  6183 net.cpp:380] res5a_branch1 -> res5a_branch1
I0614 06:47:37.010433  6183 net.cpp:122] Setting up res5a_branch1
I0614 06:47:37.010591  6183 net.cpp:129] Top shape: 128 384 3 9 (1327104)
I0614 06:47:37.010730  6183 net.cpp:137] Memory required for data: 3271763968
# Branch 2: two basic building blocks; the first handles the channel widening and the stride-2 downsampling
# BN
I0614 06:47:37.010865  6183 layer_factory.hpp:77] Creating layer bn5a_branch2a
I0614 06:47:37.011039  6183 net.cpp:84] Creating Layer bn5a_branch2a
I0614 06:47:37.011173  6183 net.cpp:406] bn5a_branch2a <- res4b_res4b_0_split_1
I0614 06:47:37.011315  6183 net.cpp:380] bn5a_branch2a -> res5a_branch2a
I0614 06:47:37.011767  6183 net.cpp:122] Setting up bn5a_branch2a
I0614 06:47:37.011909  6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.012042  6183 net.cpp:137] Memory required for data: 3285919744
# Scale
I0614 06:47:37.012198  6183 layer_factory.hpp:77] Creating layer scale5a_branch2a
I0614 06:47:37.012343  6183 net.cpp:84] Creating Layer scale5a_branch2a
I0614 06:47:37.012476  6183 net.cpp:406] scale5a_branch2a <- res5a_branch2a
I0614 06:47:37.012619  6183 net.cpp:367] scale5a_branch2a -> res5a_branch2a (in-place)
I0614 06:47:37.012815  6183 layer_factory.hpp:77] Creating layer scale5a_branch2a
I0614 06:47:37.013126  6183 net.cpp:122] Setting up scale5a_branch2a
I0614 06:47:37.013267  6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.013396  6183 net.cpp:137] Memory required for data: 3300075520
# ReLU
I0614 06:47:37.013535  6183 layer_factory.hpp:77] Creating layer res5a_branch2a_relu
I0614 06:47:37.013675  6183 net.cpp:84] Creating Layer res5a_branch2a_relu
I0614 06:47:37.013814  6183 net.cpp:406] res5a_branch2a_relu <- res5a_branch2a
I0614 06:47:37.013959  6183 net.cpp:367] res5a_branch2a_relu -> res5a_branch2a (in-place)
I0614 06:47:37.014320  6183 net.cpp:122] Setting up res5a_branch2a_relu
I0614 06:47:37.014461  6183 net.cpp:129] Top shape: 128 256 6 18 (3538944)
I0614 06:47:37.014588  6183 net.cpp:137] Memory required for data: 3314231296
# The problem is here: conv 3x3, pad 1, stride 2, channels 256 -> 384, so the output blob (top)
# and the input blob (bottom) no longer have the same shape.
# But the prototxt still gives both the same name; when Caffe runs the layer in place,
# it tries to reshape the shared blob and fails.
I0614 06:47:37.014735  6183 layer_factory.hpp:77] Creating layer res5a_branch2a
I0614 06:47:37.014889  6183 net.cpp:84] Creating Layer res5a_branch2a
I0614 06:47:37.015050  6183 net.cpp:406] res5a_branch2a <- res5a_branch2a
I0614 06:47:37.015213  6183 net.cpp:367] res5a_branch2a -> res5a_branch2a (in-place)
F0614 06:47:37.033099  6183 cudnn_conv_layer.cpp:138] Check failed: status == CUDNN_STATUS_SUCCESS (3 vs. 0)  CUDNN_STATUS_BAD_PARAM
*** Check failure stack trace: ***
    @     0x7fb4b4a675cd  google::LogMessage::Fail()
    @     0x7fb4b4a69433  google::LogMessage::SendToLog()
    @     0x7fb4b4a6715b  google::LogMessage::Flush()
    @     0x7fb4b4a69e1e  google::LogMessageFatal::~LogMessageFatal()
    @     0x7fb4b5074598  caffe::CuDNNConvolutionLayer<>::Reshape()
    @     0x7fb4b519dddb  caffe::Net<>::Init()
    @     0x7fb4b51a061e  caffe::Net<>::Net()
    @     0x7fb4b51a9775  caffe::Solver<>::InitTrainNet()
    @     0x7fb4b51aaba5  caffe::Solver<>::Init()
    @     0x7fb4b51aaebf  caffe::Solver<>::Solver()
    @     0x7fb4b51bc3d1  caffe::Creator_AdamSolver<>()
    @           0x40bfb3  train()
    @           0x408660  main
    @     0x7fb4b3b9c830  __libc_start_main
    @           0x408fb9  _start
    @              (nil)  (unknown)
Aborted (core dumped)
```
Answer: As the comments in the log above explain, same-named blobs in different layers can be seen as those layers reusing an array that has already been allocated in the C++ backend (in-place computation), so the array's shape must be identical across all of them.
The fix is as follows:
```
layer {
  bottom: "res5a_branch2a"
  top: "res5a_branch2a"
  name: "scale5a_branch2a"
  type: "Scale"
  scale_param {
    bias_term: true
  }
}
layer {
  bottom: "res5a_branch2a"
  top: "res5a_branch2a"
  name: "res5a_branch2a_relu"
  type: "ReLU"
}
layer {
  bottom: "res5a_branch2a"
  top: "res5a_branch2b"     # was: top: "res5a_branch2a"
  name: "res5a_branch2a"
  type: "Convolution"
  convolution_param {
    num_output: 384
    kernel_size: 3
    pad: 1
    stride: 2
    weight_filler {
      type: "msra"
    }
    bias_term: false
  }
}
layer {
  bottom: "res5a_branch2b"  # was: bottom: "res5a_branch2a"
  top: "res5a_branch2b"
  name: "bn5a_branch2b"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: false
    moving_average_fraction: 0.95
  }
  include {
    phase: TRAIN
  }
}
layer {
  bottom: "res5a_branch2b"  # was: bottom: "res5a_branch2a"
  top: "res5a_branch2b"
  name: "bn5a_branch2b"
  type: "BatchNorm"
  batch_norm_param {
    use_global_stats: true
  }
  include {
    phase: TEST
  }
}
```
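As a sanity check (a minimal sketch, assuming the data sources referenced by train.prototxt are reachable), the net should now build, and the two renamed blobs keep the distinct shapes seen in the log:

```python
import caffe

# With the rename in place, net initialization no longer attempts the
# impossible in-place reshape that triggered CUDNN_STATUS_BAD_PARAM.
net = caffe.Net('train.prototxt', caffe.TRAIN)
print(net.blobs['res5a_branch2a'].data.shape)  # expected: (128, 256, 6, 18)
print(net.blobs['res5a_branch2b'].data.shape)  # expected: (128, 384, 3, 9)
```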