Architecture and performance analysis of convolutional neural network
2022-06-12 07:50:00 【FishPotatoChen】
Convolutional neural networks are now widely used in image recognition. First came the original CNN network, which later evolved into the R-CNN network; because R-CNN was not fast enough, the improved Fast R-CNN and the still faster Faster R-CNN framework followed. These raised detection speed, but accuracy was still limited. To improve accuracy, the number of network layers was increased, which made gradients vanish, so ResNet was proposed; pooling layers for different feature sizes were added, giving AlexNet. Building on ResNet came Mask R-CNN. To make detection faster still, the selection of candidate boxes was improved and the idea of regression was used to create YOLO. YOLO fixed shortcomings of the earlier methods but introduced new ones of its own; to remedy them, YOLO was combined with earlier ideas to create SSD. Because of their single-stage (one-shot) structure, YOLO and SSD gain speed at the cost of accuracy, and RetinaNet was proposed to address this problem.
This article analyzes each of these frameworks and compares their performance.
1 Overview
The original idea of neural networks comes from brain science. To recognize objects, a low-level neural network was built in imitation of the brain, using full connections, and the programmer adjusted the parameters by hand: guess values, run the experiment, observe the output, then adjust again until two objects could be distinguished. In essence this is a classifier. Obviously the method is slow, and once there are too many layers the required computation becomes unsupportable. For example, suppose we have a set of 5×5 images and one layer of 10 fully connected neurons. That layer requires 5×5×10 computations; the outputs of the neurons are then summed into a single number, which serves as the feature of the object, and the parameters of the 10 neurons are adjusted until the feature value of the same object is stable. If another layer of 10 fully connected neurons is added, the count rises to 5×5×10 + 10×10 computations and the parameters of 20 neurons must be tuned. This not only puts great pressure on the computer; the stability and correctness of the result also depend heavily on the person tuning the parameters. To solve the tuning problem, the backpropagation algorithm was proposed: the result is fed back into the network, a local optimum is found by taking derivatives, and the neuron parameters are updated until a local minimum is reached. Once a local optimum is reached the parameters get stuck there, while what we really want is the global optimum; hence dropout, which sets some neuron parameters to zero once the iteration stabilizes, effectively restarting the search from another position so the optimum is sought again.
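As a concrete illustration, here is a minimal sketch of the toy network above, assuming PyTorch; the layer sizes mirror the 5×5-input, 10-neuron example, and everything else is illustrative:

```python
# A minimal sketch (assuming PyTorch) of the fully connected classifier
# described above: a 5x5 input feeding two 10-neuron layers, trained by
# backpropagation, with dropout that randomly zeroes neurons.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),          # 5x5 image -> 25 values
    nn.Linear(25, 10),     # first fully connected layer: 5*5*10 weights
    nn.ReLU(),
    nn.Dropout(p=0.5),     # randomly zero some neurons during training
    nn.Linear(10, 10),     # second fully connected layer: 10*10 weights
    nn.ReLU(),
    nn.Linear(10, 1),      # combine the outputs into one feature value
)

x = torch.randn(1, 1, 5, 5)                          # one 5x5 "image"
loss = (model(x) - torch.tensor([[0.0]])).pow(2).mean()
loss.backward()        # backpropagation computes gradients for all parameters
```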
Every object has its own features, and the human brain uses them to tell objects apart. A computer is not that intelligent: it cannot distinguish objects directly from raw features, because pictures come in different sizes, and a resized picture looks like a different feature to the computer. We therefore need a way to scale features up and down so that photo size becomes independent of the features. To solve this, convolution kernels and pooling layers were added on top of the fully connected layers: they rescale the image and at the same time reduce the amount of computation, giving better results. Thus the convolutional neural network was born.
2 CNN
CNN is the most primitive convolutional neural network. On top of the original fully connected layers it adds convolution kernels that extract features; the extracted features are then classified by the fully connected layers, and the class with the highest score is taken as the recognition result. For example, to distinguish cats from dogs, build a training set in which pictures of cats are labeled 0 and pictures of dogs are labeled 1. After passing through the convolutional network, cat pictures produce outputs close to 0 and dog pictures outputs close to 1, which separates the two.
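A minimal sketch of such a binary cat/dog classifier, assuming PyTorch (the channel counts and the 32×32 input size are hypothetical):

```python
# Minimal CNN sketch: convolution extracts features, pooling shrinks them,
# and a fully connected layer scores the result near 0 (cat) or 1 (dog).
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),   # feature extraction
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 32x32 -> 16x16
    nn.Conv2d(8, 16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(2),                             # 16x16 -> 8x8
    nn.Flatten(),
    nn.Linear(16 * 8 * 8, 1),                    # classification head
    nn.Sigmoid(),                                # output in (0, 1)
)

score = cnn(torch.randn(1, 3, 32, 32))  # close to 0 -> cat, close to 1 -> dog
```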
The CNN network extracts the features of a photo by itself and then classifies them to get the result; this autonomous feature extraction is also the origin of the term deep learning. CNN's shortcomings are obvious: first, there can be only one object in the photo, since CNN does not segment the image; second, the gradient vanishes as network layers are added, making very deep networks redundant.
3 R-CNN
R-CNN is an improvement on CNN's defects. R-CNN creates multiple bounding boxes in the image and checks whether these boxes contain target objects; it uses selective search to extract the boxes from the image. Selective search determines the features of the objects in the picture and then highlights different regions based on these features (a usage sketch follows the list). The process is as follows:
- Input a photo;
- Generate an initial segmentation of the image;
- Merge similar regions into larger regions based on color, texture, size, and shape;
- Generate ROIs (Regions of Interest).
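A usage sketch of this selective-search step, assuming opencv-contrib-python is installed (the algorithm lives in OpenCV's ximgproc contrib module; the file name is hypothetical):

```python
# Selective search with OpenCV: segment the image, merge similar regions,
# and return region proposals (ROIs) as (x, y, w, h) boxes.
import cv2

img = cv2.imread("input.jpg")                       # hypothetical input photo
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchQuality()                 # merge regions by similarity
rects = ss.process()                                # ROIs as (x, y, w, h)
print(len(rects), "region proposals")               # typically ~2000 per image
```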
The steps for detecting target objects with R-CNN are as follows:
- Take a pre-trained convolutional neural network;
- Retrain its last layer according to the number of target categories to be detected;
- Obtain the ROIs of each picture and reshape these regions so they match the CNN's input dimensions;
- After obtaining the CNN output, train an SVM (support vector machine) to separate target objects from background. A binary SVM must be trained for each category.
In this way R-CNN can detect target objects, but its shortcoming is just as obvious. In step 3 above, more than 2000 ROIs are needed, and every ROI requires its own CNN pass, which makes training an R-CNN model very expensive: selective search extracts about 2000 separate regions per image and the CNN extracts features from each region. With N pictures, the number of CNN feature extractions is N×2000. A new photo takes 40-50 seconds to predict, which is essentially unusable for large amounts of data.
4 Fast R-CNN
R-CNN's basic flow is to generate ROIs first, extract features from each ROI, and finally classify those features to get the desired targets; most of the time is spent in the CNN. The creators of Fast R-CNN realized that if the CNN were run only once per photo, it would be much faster. They therefore rearranged R-CNN's steps: first run the CNN over the whole image to get a feature map (a matrix B obtained by convolving matrix A is called the feature map of A), delimit the ROIs on the feature map, use an ROI pooling layer to resize each ROI to a fixed size, and then feed it into the fully connected layers for classification. Each picture now needs only one CNN pass, reducing the time. However, Fast R-CNN does not solve R-CNN's other shortcomings: it still uses selective search to find the regions of interest, and that process is usually slow. Fast R-CNN takes about 2 seconds per image, which is still not ideal for large datasets.
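A minimal sketch of this shared-feature-map idea using torchvision's roi_pool; the tensor sizes and the assumed 400×400 input image are illustrative:

```python
# Run the CNN once over the whole image, then crop each ROI out of the
# shared feature map and pool it to a fixed size for the classifier head.
import torch
from torchvision.ops import roi_pool

feature_map = torch.randn(1, 256, 50, 50)          # B = one CNN pass over A
rois = torch.tensor([[0, 10., 10., 200., 200.],    # (batch_idx, x1, y1, x2, y2)
                     [0, 40., 60., 300., 240.]])   # boxes in image coordinates
pooled = roi_pool(feature_map, rois, output_size=(7, 7),
                  spatial_scale=50 / 400)          # feature map / input ratio
print(pooled.shape)                                # torch.Size([2, 256, 7, 7])
```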
5 Faster R-CNN
Faster R-CNN is an optimized version of Fast R-CNN, mainly optimizing ROI generation: Fast R-CNN uses selective search, while Faster R-CNN uses a Region Proposal Network (RPN). The RPN workflow is:
- Input the photos and the GT (ground truth, i.e. the correct bounding boxes of the targets to be found);
- Each pixel of the feature map is an anchor; centered on each anchor, generate k rectangles of different sizes and aspect ratios as anchor boxes, with k = 9 by default;
- Compute the IoU (Intersection over Union). Label as positive samples the anchor box with the highest IoU overlap with a GT box, and any anchor box whose IoU with some GT box exceeds 0.7; label as negative samples the anchor boxes whose IoU with every GT box is below 0.3 (a minimal IoU sketch follows this list). Return the object proposals (OP) and their scores.
- Apply an ROI pooling layer to fix all proposals to the same size, feed them back into the fully connected layers, and generate the target object's bounding box.
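The IoU sketch referenced above, in plain Python (boxes given as (x1, y1, x2, y2)):

```python
# Intersection over Union for two axis-aligned boxes.
def iou(box_a, box_b):
    # intersection rectangle
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

print(iou((0, 0, 10, 10), (5, 5, 15, 15)))  # 25/175 ≈ 0.14 < 0.3 -> negative
```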
At this point the OPs still differ considerably from the GT and need correction, by translation and regression: translation moves the center point of a positive-sample anchor, while regression keeps the anchor fixed and changes the box size. Both aim to bring the anchor box close to the GT.
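For reference, this translation-plus-scaling is conventionally written as the regression targets from the Faster R-CNN paper, where $(x_a, y_a, w_a, h_a)$ is the anchor box and $(x, y, w, h)$ the ground-truth box:

$$t_x = \frac{x - x_a}{w_a},\quad t_y = \frac{y - y_a}{h_a},\quad t_w = \log\frac{w}{w_a},\quad t_h = \log\frac{h}{h_a}$$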
Training then proceeds with a softmax loss (cross-entropy loss), supervised by the GT labels (1, 0), which keeps telling the network whether its results are right, until the final classification training is complete and the bounding box of the target object is generated.
In short, the RPN takes the feature map produced by the CNN as its input and is trained to find object boxes: it separates foreground from background but does not decide what the object is. The features inside each box are then classified to determine what the object is.
At this point, Faster R-CNN can reduce the detection time for one picture to about 0.2 seconds. The remaining problem is accuracy: it is not high enough, and the CNN's depth is generally limited to about 5 layers.
6 AlexNet
When a CNN's layer count grows, the gradient vanishes: backpropagation updates the parameters near the input layer more and more slowly until they stagnate. Like the exploding gradient, this is mainly caused by repeated multiplication of parameters. Vanishing gradients make training a deep network no more effective than using only its last few layers, which hindered the development of deep neural networks. AlexNet was the first method proposed against the vanishing gradient.
AlexNet contains several then-new techniques and was the first to successfully apply ReLU, Dropout, LRN, and other methods in a CNN. AlexNet also used GPUs to accelerate computation. It carried forward the ideas of LeNet, applying the basic principles of CNNs to a much deeper and wider network. Its main new techniques are the following (a sketch combining several of them appears after the list):
- Successfully used ReLU as the CNN activation function and verified that it works better than Sigmoid in deeper networks, solving Sigmoid's gradient-dispersion problem in deep networks. ReLU had been proposed long before, but only AlexNet's success popularized it.
- Used Dropout during training to randomly ignore some neurons and avoid overfitting the model. Dropout has its own dedicated paper, but AlexNet put it into practice and proved its effect; in AlexNet it is mainly used in the last few fully connected layers.
- Used overlapping max pooling in the CNN. Earlier CNNs commonly used average pooling; AlexNet uses max pooling throughout, avoiding average pooling's blurring effect. AlexNet also makes the stride smaller than the pooling kernel, so neighboring pooling outputs overlap, enriching the features.
- Proposed the LRN layer, which creates a competition mechanism among the activities of local neurons: values with larger responses become relatively larger while neurons with smaller feedback are suppressed, improving the model's generalization ability.
- Used CUDA to accelerate the training of the deep convolutional network, exploiting the GPU's powerful parallelism for the many matrix operations in neural network training. AlexNet was trained on two GTX 580 GPUs; a single GTX 580 has only 3GB of memory, which limits the maximum trainable network size, so the authors distributed AlexNet across the two GPUs, each holding half of the neuron parameters in its memory. Because the GPUs can access each other's memory directly without going through host memory, using multiple GPUs remains very efficient; AlexNet's design also restricts inter-GPU communication to certain layers, controlling the performance cost of communication.
- Data augmentation: randomly crop 224×224 regions from the 256×256 originals (plus their horizontally flipped mirrors), which multiplies the amount of data by 2×(256−224)² = 2048. Without augmentation, a CNN with this many parameters would overfit on the original data alone; augmentation greatly reduces overfitting and improves generalization. At prediction time, the four corners plus the center give 5 crops, each also flipped left-right, for 10 pictures in total, and the 10 predictions are averaged. The paper also mentions applying PCA to the RGB data and adding Gaussian perturbations of the principal components with standard deviation 0.1, injecting noise that lowers the error rate by 1%.
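The sketch referenced above, in PyTorch; the channel counts and sizes here are illustrative, not AlexNet's exact configuration:

```python
# ReLU activations, LRN, overlapping max pooling (3x3 window, stride 2),
# dropout before the fully connected layer, and random-crop/flip augmentation.
import torch
import torch.nn as nn
from torchvision import transforms

net = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=11, stride=4),  # 227x227 input -> 55x55
    nn.ReLU(),                                   # ReLU instead of Sigmoid
    nn.LocalResponseNorm(size=5),                # LRN: local "competition"
    nn.MaxPool2d(kernel_size=3, stride=2),       # stride < kernel: overlapping
    nn.Flatten(),
    nn.Dropout(p=0.5),                           # dropout in the FC part
    nn.Linear(64 * 27 * 27, 1000),
)

augment = transforms.Compose([                   # training-time augmentation
    transforms.RandomCrop(224),                  # random 224x224 from 256x256
    transforms.RandomHorizontalFlip(),           # plus horizontal mirrors
    transforms.ToTensor(),
])

out = net(torch.randn(1, 3, 227, 227))           # -> torch.Size([1, 1000])
```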
7 ResNet
ResNet is the second method proposed against the vanishing gradient. Through shortcut connections that skip across layers, information is passed forward to resist the vanishing gradient. Suppose ResNet adds a shortcut every two layers, as in the figure below.

[Figure: residual blocks with an identity shortcut spanning every two layers]
In this way the depth of the neural network can be increased to 152 layers, and this increase in depth brings a significant increase in accuracy.
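A minimal residual-block sketch in PyTorch; the channel count is illustrative:

```python
# The shortcut adds the input x to the output of two convolutional layers,
# so gradients can flow across layers and very deep networks stay trainable.
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)   # the skip connection: y = F(x) + x

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))
```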
8 Mask R-CNN
Mask R-CNN adds more powerful functionality on top of Faster R-CNN. As analyzed above, the RPN in Faster R-CNN can only separate objects; it cannot perform semantic segmentation or instance segmentation. Mask R-CNN combines Faster R-CNN with the semantic segmentation algorithm FCN: Faster R-CNN detects targets quickly and accurately, while FCN performs accurate semantic segmentation. Because ROI Pooling was found to introduce pixel misalignment, the corresponding ROIAlign strategy was proposed; together with FCN's pixel-accurate masks, this yields high accuracy. In other words, Mask R-CNN = Faster R-CNN + ROIAlign (an improvement on the RPN side) + FCN. Although Mask R-CNN is more complex than Faster R-CNN, its speed is still about the same as Faster R-CNN's.
The general process is as follows:
- Input the picture to be processed and apply the corresponding preprocessing, or input an already preprocessed picture;
- Feed it into a pre-trained neural network (ResNet, etc.) to obtain the corresponding feature map;
- For each point of the feature map, set a number of predefined ROIs, obtaining multiple candidate ROIs;
- Send these candidate ROIs into the RPN for binary classification (foreground or background) and bounding-box regression, filtering out some of the candidates;
- Perform the ROIAlign operation on the remaining ROIs (i.e. match pixels of the original picture to the feature map, then match the feature map to the fixed-size features);
- Classify these ROIs (N-category classification), regress the bounding boxes, and generate the masks (running the FCN inside each ROI).
Mask R-CNN's biggest improvement is ROI Align. In Faster R-CNN there are two rounding (quantization) steps:
- the x, y, w, h of a region proposal are usually fractional but are rounded to integers for convenience of computation;
- the rounded region is divided into k×k cells, and the boundary of each cell is rounded to integers again.
In fact, after these two roundings the candidate box has already drifted from its original position, and this deviation hurts detection and segmentation accuracy. To solve the problem, ROI Align cancels the rounding and keeps the decimals, using bilinear interpolation to obtain image values at pixels with floating-point coordinates. In practice, ROI Align is not simply a matter of adding coordinate points on the candidate region's boundary and then pooling; it is a redesign.
Using bilinear interpolation, the pixel values of the feature map are interpolated: 4 sample points determine each point inside a cell, and max pooling is then performed over the 4 points of each cell, as shown in the figure below.

[Figure: the dotted grid is the feature map; each point inside a cell (the small boxes divided by red lines within the black box) is obtained by bilinear interpolation from 4 feature-map points (4 arrows pointing to one point), and max pooling over each cell's points gives the cell's value]
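A sketch of ROI Align with torchvision, contrasting it with ROI Pooling; the sizes are illustrative:

```python
# roi_align keeps fractional box coordinates and samples them by bilinear
# interpolation instead of rounding to integer pixel positions.
import torch
from torchvision.ops import roi_align

feature_map = torch.randn(1, 256, 50, 50)
rois = torch.tensor([[0, 13.7, 9.2, 201.4, 188.9]])  # fractional coords kept
aligned = roi_align(feature_map, rois, output_size=(7, 7),
                    spatial_scale=0.125,  # feature map / input image ratio
                    sampling_ratio=2)     # 2x2 = 4 bilinear samples per cell
print(aligned.shape)                      # torch.Size([1, 256, 7, 7])
```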
9 YOLO
The previous networks separate object recognition and localization into two stages, hence the name two-stage object detection. YOLO uses one-stage detection: it merges the two processes and completes recognition and localization in a single pass. YOLO mainly improves detection speed, but its mAP is lower than Faster R-CNN's.
YOLO's main idea is to divide a picture into S×S cells, with all subsequent output done per cell:
- If the center of an object falls in a cell, that cell is responsible for predicting the object.
- Each cell predicts B bbox values (a bbox value includes coordinates plus width and height) and, for each bbox, a confidence score; that is, each cell predicts B×(4+1) values.
- Each cell also predicts C conditional class probabilities (C = number of object classes).
- The network's final output dimension is therefore S×S×(B×5+C). Although each cell is responsible for predicting only one object (which can be a problem with small objects), each cell can predict multiple bbox values (see the arithmetic right after this list).
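As a concrete check of the output dimension, using the original YOLOv1 configuration on PASCAL VOC (S = 7, B = 2, C = 20):

```python
# Output size for YOLOv1: S*S*(B*5+C) = 7*7*(2*5+20) = 7*7*30.
S, B, C = 7, 2, 20
print(S * S * (B * 5 + C))   # 1470 values, reshaped to a 7x7x30 tensor
```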
YOLO raises the speed to 45 pictures per second, but accuracy drops, and there is an obvious weakness: it performs poorly on small objects.
10 SSD
SSD adds Faster R-CNN's anchor-box idea on top of YOLO. Observing YOLO, one finds two fully connected layers at the end; after a fully connected layer, every output observes the entire image, which is not very reasonable. SSD removes the fully connected layers in favor of convolutional layers, so each output perceives only the information around its target, including context, which is more reasonable.
In YOLO, the output of the last layer is the input to the NMS (non-maximum suppression) stage; SSD adds several convolutional layers to YOLO and concatenates the outputs of these convolutional layers as the NMS input. SSD raises the speed to 58 pictures per second and improves mAP, but it still shares some of YOLO's shortcomings.
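For illustration, a minimal NMS call with torchvision; the boxes and scores here are made up:

```python
# Among overlapping boxes, keep the highest-scoring one and drop boxes
# whose IoU with it exceeds the threshold; detectors like YOLO and SSD
# run this as the final filtering step over their raw box outputs.
import torch
from torchvision.ops import nms

boxes = torch.tensor([[10., 10., 100., 100.],
                      [12., 12., 102., 102.],    # near-duplicate of box 0
                      [200., 200., 300., 300.]])
scores = torch.tensor([0.9, 0.8, 0.7])
keep = nms(boxes, scores, iou_threshold=0.5)
print(keep)   # tensor([0, 2]): the duplicate box is suppressed
```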
11 RetinaNet
RetinaNet's main idea is to change how the loss function is computed: focal loss is built on the cross-entropy loss. Observation and statistics show that the targets to be detected usually occupy far less of the image than the background, or the samples are otherwise skewed, for example positive samples far outnumbering negative ones; a network trained on such data performs very badly. The authors of RetinaNet therefore weight each sample's contribution to the loss function: when positives far outnumber negatives, increase the weight of the negative-sample loss and decrease that of the positives; when the background far exceeds the targets, decrease the weight of the background loss to reduce the background's influence. This not only copes with large imbalances between positive and negative samples but also remedies shortcomings of YOLO and SSD.
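A minimal focal-loss sketch following the formula from the RetinaNet paper, FL(p_t) = -α_t(1 - p_t)^γ log(p_t); the defaults α = 0.25 and γ = 2 match the paper:

```python
# Easy, abundant examples (p_t near 1) get down-weighted by (1 - p_t)^gamma,
# so the rare foreground samples dominate the loss.
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    # targets are 0/1 labels; logits are raw scores
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)           # prob of true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

loss = focal_loss(torch.randn(8), torch.randint(0, 2, (8,)).float())
```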