
CV Baseline Summary (Development History from AlexNet to SENet)

2022-06-12 23:35:00 A deep slag who loves learning

1. Preface

Deep learning has developed rapidly since around 2015, and models have been iterated and optimized continuously;

Many new models today stand on the shoulders of giants. Here I want to record the development process of the baseline models and keep the content updated. I will not dissect every layer of each model; instead I record the key innovations, along with some personal understanding and thinking. If anything is wrong or incomplete, you are welcome to discuss it; I also hope to deepen my understanding of these models in future work;

What does BaseLine mean?

We usually call classification models BaseLine models; implementing a classification task with them is often the first exercise when getting started. But these models do not appear only in classification tasks: they are the cornerstone of the whole CV field. Later tasks such as detection, segmentation, and keypoint regression all depend on one key step, feature extraction, and they are built on the feature maps it produces. BaseLine models are therefore essential for these tasks, and the choice of model has a large impact on the performance of the whole task (of course, you can also design your own feature-extraction model).

2. AlexNet

Paper link

Significance: the pioneering model that opened the era in which convolutional networks dominate computer vision;

This model is relatively simple, so let's look directly at the network structure diagram.

[Figure: AlexNet network architecture]

A notable feature of this network is that it was trained in parallel on two GPUs and the results were fused at the end, in order to use more training resources. This idea of splitting the network also reappears in later networks, so it is worth marking here;

Key concepts

1. The ReLU activation function: why was it proposed, and what are its advantages?

First, ReLU and Sigmoid are both nonlinear activation functions. If a linear activation function were used, a neural network with multiple hidden layers would be no different from a single-layer network, and the network would lose the expressive power of a deep convolutional network;

Here we compare the Sigmoid function and the ReLU function. Let's first look at the formulas; these must be remembered;

Sigmoid formula:

$$y = \frac{1}{1+e^{-x}}$$

Gradient:

$$y' = y(1-y)$$

ReLU formula:

$$y = \max(0, x)$$

Gradient:

$$y' = \begin{cases} 1, & x > 0 \\ \text{undefined}, & x = 0 \\ 0, & x < 0 \end{cases}$$
[Figures: Sigmoid and ReLU activation curves]

As the two figures above show, the gradient of the sigmoid function is almost 0 when x is very large or very small, which causes vanishing gradients;

ReLU has the following advantages:

1. It makes network training faster (the computation is simple, and the region where the gradient is greater than 0 is larger);

2. It prevents the gradient from vanishing;

3. It makes the network sparse (when x is negative the output and gradient are 0, and the neuron does not participate in training);

Reflection:

For x < 0, the gradient of ReLU is also 0. Is there room for improvement?

Leaky ReLU is an improvement on ReLU that addresses this problem;

Calculation formula:

$$y = \max(0.01x, x)$$
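
As a quick illustration of the points above, here is a minimal NumPy sketch (not from the original article) evaluating the three activations and the sigmoid gradient; note how the sigmoid gradient collapses toward 0 at both ends of the range:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.maximum(alpha * x, x)

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])
s = sigmoid(x)
print(s * (1 - s))      # sigmoid gradient: nearly 0 at both ends (vanishing gradient)
print(relu(x))          # [ 0.  0.  0.  1. 10.]
print(leaky_relu(x))    # [-0.1  -0.01  0.    1.   10.  ]
```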

2. How is the output size of a convolution layer calculated?

The parameters of a convolution layer are: input size I x I, kernel size K x K, stride S, and padding P.

Output size formula:

$$O = \frac{I - K + 2P}{S} + 1$$
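
To make the formula concrete, here is a small helper (a sketch, not part of the original post) applied to AlexNet's first layer: a 227x227 input with an 11x11 kernel, stride 4, and no padding:

```python
def conv_output_size(i, k, s, p):
    """O = (I - K + 2P) / S + 1"""
    return (i - k + 2 * p) // s + 1

print(conv_output_size(227, 11, 4, 0))  # 55 -> the 55x55 feature map of AlexNet's first layer
```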

3. The concept of Dropout

Dropout randomly deactivates neurons; its purpose is to reduce overfitting and improve the generalization ability of the network;

For a more detailed explanation, please refer to this article :https://zhuanlan.zhihu.com/p/77609689

[Figure: Dropout illustration]

Note: with the original formulation of dropout, the neuron outputs must be rescaled at test time to account for the dropout probability p (modern implementations use inverted dropout instead, scaling by 1/(1-p) during training so that nothing needs to be done at test time);
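
For reference, a minimal NumPy sketch of the inverted-dropout variant used by modern frameworks; the surviving activations are scaled by 1/(1-p) during training, where p is the drop probability:

```python
import numpy as np

def dropout(x, p=0.5, training=True):
    """Inverted dropout: drop units with probability p and rescale the survivors."""
    if not training:
        return x                                   # nothing to do at test time
    mask = (np.random.rand(*x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.ones((2, 4))
print(dropout(x, p=0.5))                # surviving entries become 2.0, dropped ones 0.0
print(dropout(x, training=False))       # unchanged at inference
```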

3. VGG

Paper link

Significance: VGG is still used as a backbone by other networks to this day; it opened the era of small convolution kernels and deep networks;

[Figure: VGG configurations A-E]

The figure from the paper shows the evolution of the VGG configurations; the most commonly used today are VGG16 [D] and VGG19 [E];

Key concepts

1. What is the purpose of stacking 3x3 convolutions? Why are small kernels better than large ones?

The role of convolution is to keep extracting features from the input image. AlexNet uses large convolution kernels, which downsample quickly but limit how many layers the model has. VGG uses 3x3 kernels and typically downsamples by a factor of 2 at a time, which helps deepen the network;

Advantages:

1. A large receptive field

Two stacked 3x3 convolutions are equivalent to one 5x5 convolution: their receptive fields are the same;

2. Less computation

Three stacked 3x3 kernels need 3 x (3 x 3 x C x C) = 27C² parameters;

One 7x7 kernel needs 7 x 7 x C x C = 49C² parameters;

Both have the same receptive field, but the 3x3 stack uses roughly 45% fewer parameters (a quick check is sketched below).
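
A quick sanity check of the parameter counts (a sketch assuming 64 input and output channels and no bias terms; PyTorch is used here only for counting):

```python
import torch.nn as nn

c = 64
stack_3x3 = nn.Sequential(*[nn.Conv2d(c, c, 3, padding=1, bias=False) for _ in range(3)])
single_7x7 = nn.Conv2d(c, c, 7, padding=3, bias=False)

n_stack = sum(p.numel() for p in stack_3x3.parameters())    # 3 * 3*3*64*64 = 110592
n_single = sum(p.numel() for p in single_7x7.parameters())  # 7*7*64*64     = 200704
print(n_stack, n_single, 1 - n_stack / n_single)            # reduction of ~45%
```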

Further thought: **what is a 1x1 convolution kernel used for?** This will be explained with the next network;

2. How can the network accept inputs of arbitrary size?

When the model contains fully connected layers, its input size is fixed. Can the fully connected layers be replaced?

Replace the final fully connected layers with convolution layers, as shown in the figure below:

[Figure: replacing fully connected layers with convolutions]

This idea is very important. As networks developed, fully connected layers were gradually replaced by convolutions because of their large computational cost and the constraint they place on the input size;

Most importantly, the later FCN also uses this fully convolutional form to build its encoder-decoder structure;
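
A hedged sketch of the idea, assuming a hypothetical VGG-style head over a 7x7x512 feature map (not code from the paper): the first fully connected layer can be rewritten as a 7x7 convolution by reshaping its weights, after which the head no longer forces a fixed input size:

```python
import torch
import torch.nn as nn

fc = nn.Linear(512 * 7 * 7, 4096)
conv = nn.Conv2d(512, 4096, kernel_size=7)

# Reuse the FC weights as a 7x7 convolution kernel.
conv.weight.data = fc.weight.data.view(4096, 512, 7, 7)
conv.bias.data = fc.bias.data

x = torch.randn(1, 512, 7, 7)
print(torch.allclose(fc(x.flatten(1)), conv(x).flatten(1), atol=1e-4))  # True: same computation
```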

4. GoogLeNet

Paper link

GoogLeNet has four versions: V1, V2, V3, and V4. Since it is not commonly used nowadays, we mainly introduce its key ideas here;

We will not go through the versions one by one, but focus on the important concepts: what matters most in GoogLeNet are its tricks;

The first thing to know is the Inception module.

[Figure: Inception module]

Characteristics: better utilization of computing resources; the depth and width of the network increase while the number of parameters grows only slightly;

Key concepts

1. What is a 1x1 convolution used for?

This is a very important concept: 1x1 convolutions are used extensively in the Inception module and are also used in ResNet;

Characteristic: it only changes the number of output channels, not the width and height of the output;

Effects:

1. It can increase or decrease the channel dimension: it can compress the channel "thickness" and can also be used for the final classification output;

2. It adds nonlinearity: a 1x1 convolution can be inserted while keeping the number of channels unchanged;

3. It reduces computation: reducing the number of channels naturally reduces the amount of computation (a minimal example follows);
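
A minimal sketch of the first two points (channel reduction plus an extra nonlinearity), assuming a 256-channel feature map:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 256, 28, 28)             # 256-channel feature map
reduce = nn.Sequential(
    nn.Conv2d(256, 64, kernel_size=1),      # 1x1 conv: only the channel count changes
    nn.ReLU(inplace=True),                  # extra nonlinearity
)
print(reduce(x).shape)                      # torch.Size([1, 64, 28, 28])
```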

2. What is the auxiliary loss?

This concept is not common in later networks; its main purpose is to use intermediate-layer information for classification;

Implementation: a classification loss is computed on the output of an intermediate layer and combined with the loss of the final output using a weight;

Personal understanding:

This trick is not really necessary. If you want to use intermediate-layer information in later networks, feature fusion, via concat or add, is more effective than an auxiliary loss;
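
A minimal sketch of how an auxiliary loss is weighted into the total loss; the 0.3 weight is the value reported in the GoogLeNet paper, and the logits here are dummy tensors standing in for the network outputs:

```python
import torch
import torch.nn.functional as F

aux_logits = torch.randn(8, 1000)       # output of an auxiliary classifier on a middle layer
main_logits = torch.randn(8, 1000)      # output of the final classifier
targets = torch.randint(0, 1000, (8,))

loss = F.cross_entropy(main_logits, targets) + 0.3 * F.cross_entropy(aux_logits, targets)
```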

3. What is a sparse matrix?

We often hear about making a network sparse. So what is sparsity, and what are its advantages?

Sparse matrix:

A matrix in which the number of zero elements far exceeds the number of non-zero elements, with no regular pattern;

Dense matrix:

A matrix in which the number of non-zero elements far exceeds the number of zero elements;

Advantage: a sparse matrix can be decomposed into dense sub-matrices to speed up computation; the short conclusion is that sparsity both reduces memory and improves training speed;

4. What is the role of the BN layer?

Reference article :https://blog.csdn.net/weixin_42080490/article/details/108849715

Its full name is Batch Normalization. It mainly addresses the problem that, as the network gets deeper, training becomes slower and slower and convergence becomes harder;

The cause of the problem:

As network layers are stacked, each layer's parameter updates change the input distribution of the layer above it, so the input distribution of higher layers shifts drastically and the higher layers have to keep adapting to the lower layers' updates. This is called internal covariate shift (ICS);

BN's solution is to keep the input distribution of each layer the same throughout training;

In practice: the data are normalized to a distribution with mean 0 and standard deviation 1, so that the inputs to the activation function fall in its sensitive region. The network outputs then do not become very large, larger gradients are obtained, the vanishing-gradient problem is avoided, and training is accelerated;
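
For reference, the standard BN transform for a mini-batch can be written as follows, where $\mu_B$ and $\sigma_B^2$ are the batch mean and variance and $\gamma$, $\beta$ are learnable scale and shift parameters:

$$\hat{x} = \frac{x - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y = \gamma \hat{x} + \beta$$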

Main roles:

1. It speeds up network training and convergence (the GoogLeNet V2 paper reports training roughly 10x faster than the previous version);

2. It controls gradient explosion and prevents vanishing gradients;

3. It helps prevent overfitting;

Note:

With BN layers, dropout can often be dropped, weight initialization needs much less attention, and a larger learning rate can be used to speed up convergence. The benefits are substantial; essentially every modern network includes this structure;
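
A minimal sketch of the Conv-BN-ReLU pattern described above; the convolution's bias is disabled because BN's shift term makes it redundant:

```python
import torch.nn as nn

def conv_bn_relu(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )
```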

5. What is the convolution decomposition strategy?

1. Decompose a large convolution kernel into a stack of small kernels;

This is the same idea as replacing one 5x5 convolution with two 3x3 convolutions;

2. Decompose into asymmetric convolutions;

That is, decompose an n x n convolution into a 1 x n convolution stacked with an n x 1 convolution;

This mainly reduces the number of network parameters, but the strategy only helps when the feature map resolution is small, so it is not common in later networks.
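
A minimal sketch of the second decomposition, assuming 64 channels: a 7x7 convolution factored into a 1x7 followed by a 7x1, with a quick parameter-count comparison:

```python
import torch.nn as nn

asym = nn.Sequential(
    nn.Conv2d(64, 64, kernel_size=(1, 7), padding=(0, 3), bias=False),
    nn.Conv2d(64, 64, kernel_size=(7, 1), padding=(3, 0), bias=False),
)
full = nn.Conv2d(64, 64, kernel_size=7, padding=3, bias=False)

print(sum(p.numel() for p in asym.parameters()))   # 2 * 7*64*64 = 57344
print(sum(p.numel() for p in full.parameters()))   # 7*7*64*64  = 200704
```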

6. What is the label smoothing strategy?

Problem:

Conventional one-hot encoding has a problem: it is overconfident, which easily leads to overfitting;

Solution:

Smooth the labels: attenuate the entry with confidence 1 in the one-hot encoding to avoid overconfidence, and spread the removed confidence evenly over all classes;

Example:

The original label (0, 1, 0, 0) becomes (0.00025, 0.99925, 0.00025, 0.00025);

Then it is passed into the loss function ;
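
A minimal sketch that reproduces the example above (eps = 0.001 over 4 classes; the helper name is my own):

```python
import torch

def smooth_labels(targets, num_classes, eps=0.001):
    """Turn integer labels into smoothed one-hot vectors."""
    smoothed = torch.full((targets.size(0), num_classes), eps / num_classes)
    smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - eps + eps / num_classes)
    return smoothed

print(smooth_labels(torch.tensor([1]), num_classes=4))
# values: 0.00025, 0.99925, 0.00025, 0.00025 (printed in scientific notation)
```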

5. ResNet

Paper link

ResNet, like VGG, is one of the most widely used convolutional neural network structures in industry. Its key module is shown in the figure below:

[Figure: residual block]

Its significance is to push networks to much greater depth: it was the first to successfully train networks with hundreds of layers;

Key concepts

1. What is the structure above called? What does it do?

Reference article :https://zhuanlan.zhihu.com/p/80226180

The structure in the figure above is called the residual structure (residual learning).

What problems arise when the network is made deeper?

1. Vanishing or exploding gradients easily occur (this can be addressed with BN layers and regularization);

2. Network degradation: as the number of layers increases, performance gradually saturates and then drops rapidly (the network becomes hard to optimize);

The residual structure realizes an identity mapping in the network, with the following formulas:
$$F(x) = W_2 \cdot \mathrm{ReLU}(W_1 x)$$

$$H(x) = F(x) + x = W_2 \cdot \mathrm{ReLU}(W_1 x) + x$$

When F(x) = 0, H(x) = x, so the identity mapping is realized;

Effects:

1. It helps gradient propagation, so gradients neither vanish nor explode and networks can be stacked to thousands of layers;

2. The skip connection brings in information from earlier layers, and can also be viewed from an ensemble-learning perspective;

Of course, to reduce computation, ResNet also uses the 1x1 convolution introduced earlier to reduce and then restore the channel dimension (the bottleneck block):

[Figure: bottleneck residual block with 1x1 convolutions]
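
A minimal sketch of a basic residual block (identity shortcut, channel count unchanged), following the H(x) = F(x) + x formula above:

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)              # H(x) = F(x) + x

print(BasicResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```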

6. ResNeXt

Paper link

Main idea: built on the ResNet network, it introduces the idea of aggregated transformations from Inception;

[Figure: ResNeXt block]

Key concepts

1. What is grouped convolution? What does it do?

[Figure: grouped convolution]

The structure above is grouped convolution. The idea comes from AlexNet's splitting of the convolutions across two GPUs;

Implementation: split the input feature map into C groups along the channel dimension, perform an ordinary convolution within each group, and then concatenate the outputs along the channel dimension to get the output feature map;

Effects:

1. The same feature map is obtained with fewer parameters;

2. The network learns more diverse features and captures more information;

Note: although grouped convolution looks similar to depthwise separable convolution, they are not the same; the latter will be introduced later;
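
A minimal sketch comparing an ordinary 3x3 convolution with a grouped one via the groups argument (32 groups, the cardinality used in ResNeXt-50 32x4d):

```python
import torch.nn as nn

ordinary = nn.Conv2d(128, 128, kernel_size=3, padding=1, bias=False)
grouped = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=32, bias=False)

print(sum(p.numel() for p in ordinary.parameters()))  # 147456
print(sum(p.numel() for p in grouped.parameters()))   # 4608, i.e. 1/32 of the above
```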

7. DenseNet

Paper link

Significance: it rethinks short paths and feature reuse in convolutional networks and achieves better results;

The whole DenseNet network has three parts: a head convolution, stacked Dense Blocks, and a fully connected layer that outputs the classification probabilities;

The head convolution is a convolution-plus-pooling stage that controls the output dimensions. Here we mainly explain the Dense Block structure;

[Figure: Dense Block structure]

Key concepts

1. What is a dense connection? What are its advantages?

The most important structure in DenseNet, shown in the figure above, is called the dense connection;

Within a Block, the input of each layer comes from the feature maps of all preceding layers, and the output of each layer is connected directly to the inputs of all following layers;

Effects:

1. More features are obtained with fewer parameters: DenseNet reaches accuracy similar to ResNet with only about 1/3 of the parameters;

2. Low-level features are reused, giving richer features;

3. Stronger gradient flow: with more skip connections, gradients flow back to earlier layers more easily;

Note:

On small datasets, dense connections effectively act as a regularizer, so it is said that dense connections are better suited to small datasets;

Shortcoming:

Training DenseNet consumes a lot of memory, because many feature maps must be kept for the backward pass (a sketch of a dense layer follows);
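
A minimal sketch of one dense layer (BN-ReLU-Conv with growth rate 32) and of how the channel count grows as each layer concatenates its input with its output:

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    def __init__(self, in_channels, growth_rate=32):
        super().__init__()
        self.bn = nn.BatchNorm2d(in_channels)
        self.conv = nn.Conv2d(in_channels, growth_rate, 3, padding=1, bias=False)

    def forward(self, x):
        out = self.conv(torch.relu(self.bn(x)))
        return torch.cat([x, out], dim=1)      # dense connection: keep all previous feature maps

x = torch.randn(1, 64, 28, 28)
for _ in range(4):
    x = DenseLayer(x.shape[1])(x)
print(x.shape)                                 # channels grow 64 -> 96 -> 128 -> 160 -> 192
```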

2. What is the difference between add and concat for feature fusion? Which is better?

add: the values of the feature maps are added element-wise; the number of channels stays the same;

concat: the feature maps are concatenated along the channel dimension; the number of channels increases;

Personal understanding: add sums two feature maps, so there is a risk of losing information; concat keeps all of the original information by concatenation, so concat usually works better, but add needs less memory and fewer parameters than concat;
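
A minimal sketch of the difference, assuming two feature maps of the same shape:

```python
import torch

a = torch.randn(1, 64, 28, 28)
b = torch.randn(1, 64, 28, 28)

print((a + b).shape)                    # add:    torch.Size([1, 64, 28, 28]), channels unchanged
print(torch.cat([a, b], dim=1).shape)   # concat: torch.Size([1, 128, 28, 28]), channels stacked
```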

8. SENet

Paper link

Significance: it was the first to introduce the attention mechanism into convolutional neural networks, and the mechanism is a plug-and-play module;

[Figure: SE block structure]

The figure above shows the most important module in the network, the SE block, which embodies a process of compression (squeeze) and channel re-weighting (excitation);

Key concepts

1. What is the core idea of the attention mechanism?

The attention mechanism can be understood as follows: the network produces a set of weight values, which are combined with the feature map to transform it, so that the model strengthens the feature values of the regions it attends to;

2. How does the SE block work? How does it implement the attention mechanism?

As the figure above shows, the SE block has three parts: Squeeze, Excitation, and Scale;

Squeeze: global average pooling (GAP) compresses the feature map into a vector of shape 1x1xC;

Excitation: two fully connected layers map and transform the feature vector, and a final sigmoid limits the range to [0, 1];

Scale: each original channel is multiplied by its weight to obtain the re-weighted feature map;
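
A minimal sketch of the SE block following the three steps above (reduction ratio 16, as in the paper):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc1 = nn.Linear(channels, channels // reduction)
        self.fc2 = nn.Linear(channels // reduction, channels)

    def forward(self, x):
        b, c, _, _ = x.shape
        w = x.mean(dim=(2, 3))                                   # Squeeze: GAP -> (b, c)
        w = torch.sigmoid(self.fc2(torch.relu(self.fc1(w))))     # Excitation: weights in [0, 1]
        return x * w.view(b, c, 1, 1)                            # Scale: re-weight each channel

print(SEBlock(64)(torch.randn(1, 64, 32, 32)).shape)             # torch.Size([1, 64, 32, 32])
```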

Summary:

The above is how the attention mechanism is implemented. Of course, attention can act not only on channels but also on the HxW dimensions; attention is also used in cutting-edge work such as ViT. If you are interested, look into the development and applications of attention;

Conclusion

This covers the basic models in the CV baseline lineage, but only from the perspective of each network's key characteristics; some small tricks are not explained in this article. If you are interested, read the original papers;

Of course, understanding the principles of a model is not the same as mastering it; implementing the code is also a key step. I hope to implement each module and structure with a familiar framework, especially by studying the official open-source code. Only with the principles plus fluent code can a model really be mastered;

