Paper Reading (2): VGGNet for Classification
2022-07-28 23:02:00 [leu_mon]
VGGNet
Authors: Karen Simonyan & Andrew Zisserman
Affiliation: University of Oxford and Google DeepMind
Result: runner-up in the ILSVRC 2014 classification task
Paper: Very Deep Convolutional Networks for Large-Scale Image Recognition
Abstract
In this work, we investigate the effect of convolutional network depth on accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3×3) convolution filters, which shows that a significant improvement on prior-art configurations can be achieved by pushing the depth to 16-19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured first and second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on deep visual representations in computer vision.
Background
As ConvNets have become increasingly popular in computer vision, many attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in pursuit of better accuracy. This paper addresses another important aspect of ConvNet architecture design: depth. To this end, the other architectural parameters are fixed, very small (3×3) convolution filters are used in all layers, and the depth of the network is steadily increased by adding more convolutional layers.
Model
[Table 1 of the paper: ConvNet configurations A-E]
All hidden layers are equipped with the ReLU activation function. None of the networks above (except for one, A-LRN) contain Local Response Normalisation (LRN). The configurations range from network A with 11 weight layers (8 convolutional and 3 fully connected) to network E with 19 weight layers (16 convolutional and 3 fully connected).
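For concreteness, here is a minimal PyTorch sketch of configuration D (VGG-16) assembled from its layer list; the helper name `make_features` and the config encoding are my own, not from the paper:

```python
import torch
import torch.nn as nn

# Configuration D (VGG-16): numbers are output channels of 3x3 convs,
# 'M' marks a 2x2 max-pool. Depth grows while every kernel stays 3x3.
CFG_D = [64, 64, 'M', 128, 128, 'M', 256, 256, 256, 'M',
         512, 512, 512, 'M', 512, 512, 512, 'M']

def make_features(cfg):
    layers, in_ch = [], 3
    for v in cfg:
        if v == 'M':
            layers.append(nn.MaxPool2d(kernel_size=2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, kernel_size=3, padding=1),
                       nn.ReLU(inplace=True)]
            in_ch = v
    return nn.Sequential(*layers)

model = nn.Sequential(
    make_features(CFG_D),
    nn.Flatten(),  # 224x224 input -> 512 x 7 x 7 after five pools
    nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
    nn.Linear(4096, 1000),
)
print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 1000])
```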
A stack of three 3×3 convolutional layers is used in place of a single 7×7 layer. First, it incorporates three non-linear rectification layers instead of one, which makes the decision function more discriminative. Second, it reduces the number of parameters: this can be seen as imposing a regularisation on the 7×7 convolution filters, forcing them to decompose through 3×3 filters (with non-linearities injected in between).
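A quick back-of-the-envelope check of the parameter saving, with C input and output channels and biases ignored:

```python
# One 7x7 conv vs. three stacked 3x3 convs, C channels in and out:
# 49*C^2 vs. 3*(9*C^2) = 27*C^2, roughly a 45% reduction, while the
# stack covers the same effective 7x7 receptive field.
C = 512
single_7x7 = 7 * 7 * C * C
stacked_3x3 = 3 * (3 * 3 * C * C)
print(single_7x7, stacked_3x3)  # 12845056 7077888
```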
Incorporating 1×1 convolutional layers (configuration C, Table 1 of the paper) is a way to increase the non-linearity of the decision function without affecting the receptive fields of the convolutional layers.
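A minimal illustration (my own, not from the paper) that a 1×1 convolution plus ReLU changes neither the spatial size nor the receptive field:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 56, 56)
# A 1x1 conv is a per-pixel linear map across channels; followed by
# ReLU it adds non-linearity while spatial resolution (and hence the
# receptive field of surrounding layers) is left untouched.
block = nn.Sequential(nn.Conv2d(256, 256, kernel_size=1), nn.ReLU())
print(block(x).shape)  # torch.Size([1, 256, 56, 56])
```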
Results
[Table: single-scale test error for configurations A-E]
- Using local response normalisation (the A-LRN network) does not improve on model A, which has no normalisation layers.
- The classification error decreases as the ConvNet depth increases; however, configuration C (which contains three 1×1 convolutional layers) performs worse than configuration D, which uses 3×3 convolutions throughout the network.
- The error rate saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets.
- A shallow network with larger (5×5) filters was measured to have a top-1 error rate (on center-crop images) 7% higher than that of network B, which confirms that a deep network with small convolution kernels outperforms a shallow network with large kernels.
- Training with scale jittering (S ∈ [256; 512]) gives better results than training with a fixed smallest side (S = 256 or S = 384), confirming that training-set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.

The table above reports multi-scale testing; comparing it with the previous table shows that scale jittering at test time leads to better performance.
Using multiple crops (multi-crop) performs slightly better than dense evaluation, and the two approaches are complementary, as their combination outperforms each of them alone. The authors attribute this to the different treatment of convolution boundary conditions.
- multi-crop: take multiple random crops of the image, run each sample through the network, and average all the resulting predictions.
- dense: following the FCN idea, feed the whole image into the network directly, with the final fully connected layers converted to 1×1 convolutions; this yields a prediction score map whose entries are then averaged (see the sketch below).
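Here is a minimal sketch of the dense-evaluation idea (my own illustration, not the authors' code): the first fully connected layer is re-interpreted as a 7×7 convolution and the other two as 1×1 convolutions, so the network can ingest images larger than 224×224 and produce a spatial class-score map that is then averaged:

```python
import torch
import torch.nn as nn

# VGG's fully connected head rewritten as convolutions:
# fc1 (512*7*7 -> 4096) becomes a 7x7 conv; fc2/fc3 become 1x1 convs.
conv_head = nn.Sequential(
    nn.Conv2d(512, 4096, kernel_size=7), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 4096, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(4096, 1000, kernel_size=1),
)

# Feature map from a 384x384 input after the conv stack: 512 x 12 x 12.
feats = torch.randn(1, 512, 12, 12)
score_map = conv_head(feats)                # 1 x 1000 x 6 x 6 class scores
class_scores = score_map.mean(dim=(2, 3))   # spatial average, as in dense eval
print(score_map.shape, class_scores.shape)
```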

Six single-scale networks and one multi-scale network were trained; the ensemble of these 7 networks achieves a 7.3% ILSVRC test error (using multiple test scales). Combining only the two best-performing multi-scale models (configurations D and E), with dense and multi-crop evaluation combined, reduces the test error to 6.8%.

The model proposed in this paper significantly outperforms the best-performing models of the ILSVRC-2012 and ILSVRC-2013 competitions.
Training & Testing
- Batch size 256, momentum 0.9, weight decay (L2 penalty multiplier set to 5·10⁻⁴).
- Dropout regularisation for the first two fully connected layers (dropout ratio set to 0.5).
- The learning rate is initially set to 0.01 and then decreased by a factor of 10, three times in total; training runs for 74 epochs.
- Weights are initialised using the procedure of Glorot & Bengio (2010), without pre-training.
- Fixed-size 224×224 input images are obtained by randomly cropping rescaled training images.
- During training, the crops undergo random horizontal flipping and random RGB colour shift; at test time, the test set is augmented by horizontal flipping of the images.
- Using a large number of crops can improve accuracy, as it results in a finer sampling of the input image compared to the fully convolutional net. A minimal training-setup sketch follows this list.
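Putting the hyper-parameters above together, a minimal PyTorch training setup might look as follows; `model` is assumed to be the VGG-16 sketch from the Model section, and the discrete scale set only approximates the paper's continuous sampling of S (the paper's RGB colour shift is omitted here):

```python
import torch
import torchvision.transforms as T

# Hyper-parameters from the paper: SGD with momentum 0.9, weight
# decay 5e-4, initial learning rate 0.01.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# Divide the learning rate by 10 when validation accuracy plateaus,
# mimicking the paper's schedule (three decreases over 74 epochs).
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode='max', factor=0.1)

# Scale jittering: resize the shorter side to a (here discretised)
# random S in [256, 512], then take a random 224x224 crop and a
# random horizontal flip.
train_transform = T.Compose([
    T.RandomChoice([T.Resize(s) for s in range(256, 513, 32)]),
    T.RandomCrop(224),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
```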
Highlights of the article
- The depth of the network is increased while the topology remains simple and homogeneous.
- Replacing large kernels with stacks of small kernels, as well as the use of 1×1 convolutional layers, had both been tried in work preceding this paper.
- Experiments show that the effect of the local response normalisation layer is not significant.
- Using multiple scales at test time improves performance.
Author's outlook
- The depth of the network is important for visual representations.