Deep Learning Classification Networks -- AlexNet
Deep learning classification network series:
- AlexNet
- VGGNet
Contents
- Deep learning classification network series
- Preface
- I. Network Structure
- II. Highlights
- References
- To be filled in later...
Preface
AlexNet was designed by Hinton and his student Alex Krizhevsky. It won the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) with a top-5 error rate of 15.3%, far ahead of the runner-up at 26.2%, demonstrating the superiority of convolutional neural networks for image recognition tasks [1].
I. Network Structure

The network structure diagram in the original paper shows that AlexNet has 8 layers in total: 5 convolutional layers and 3 fully connected layers. The torchstat tool can print the input/output dimensions, parameter count (parameters), and computation (FLOPs, MACs) of each layer of the official PyTorch (torchvision) implementation of AlexNet:
import torchvision.models as models
from torchstat import stat
alexnet = models.alexnet()
stat(alexnet, (3,224,224))
# For reference, the AlexNet structure as implemented in torchvision (it differs from the original paper)
AlexNet(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
(1): ReLU(inplace=True)
(2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
(3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(4): ReLU(inplace=True)
(5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
(6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(7): ReLU(inplace=True)
(8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): ReLU(inplace=True)
(10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
(classifier): Sequential(
(0): Dropout(p=0.5, inplace=False)
(1): Linear(in_features=9216, out_features=4096, bias=True)
(2): ReLU(inplace=True)
(3): Dropout(p=0.5, inplace=False)
(4): Linear(in_features=4096, out_features=4096, bias=True)
(5): ReLU(inplace=True)
(6): Linear(in_features=4096, out_features=1000, bias=True)
)
)

Below is a brief summary of how the input/output dimensions, parameter count, and computation of each layer are calculated (a small verification sketch follows the three items):
1. Input and output dimensions
The output size of a convolutional or pooling layer is given by $W_{out}=\frac{W_{in}-K+2P}{S}+1$, where $W_{in}$ is the input feature-map size, $K$ is the kernel size, $P$ is the padding, $S$ is the stride, and $W_{out}$ is the output feature-map size.
Take the first convolutional layer as an example: the input shape is (3, 224, 224), the kernel is 11×11, the stride is 4, and the padding is (2, 2). Substituting gives $W_{out}=\frac{224-11+4}{4}+1=55.25$, which is rounded down to 55. There are 64 kernels, so the output shape is (64, 55, 55).
2. Parameter count
The parameter count of a convolutional layer is $(K \times K \times C_{in}) \times C_{out} + C_{out}$; a fully connected layer can be treated as a convolutional layer with $K=1$.
First convolutional layer: params = 11×11×3×64 + 64 = 23,296.
First fully connected layer: params = 1×1×9216×4096 + 4096 = 37,752,832.
3. Computation (FLOPs, MACs/MAdd)
- FLOPs (floating point operations): the number of floating-point operations, used to measure the complexity of an algorithm or model; every addition, subtraction, multiplication, or division counts as one floating-point operation.
- MACs (multiply-accumulate operations): the number of multiply-accumulate operations; one MAC consists of one multiplication and one addition, so usually FLOPs = 2 × MACs.
How FLOPs are calculated, taking a convolutional layer as an example: suppose the layer outputs $C_{out}$ feature maps of size $H_{out} \times W_{out}$. Each value in a feature map is produced by one convolution, so the layer's total FLOPs equal $H_{out} \times W_{out} \times C_{out}$ times the FLOPs of a single convolution. A single convolution can be written as $y = wx + b$, where $y$ is one value of the output feature map and $w$ is a $K \times K \times C_{in}$ weight tensor; $wx$ involves $K \times K \times C_{in}$ multiplications and $K \times K \times C_{in} - 1$ additions, and $+b$ contributes one more addition, so one convolution costs $(K \times K \times C_{in}) + (K \times K \times C_{in} - 1) + 1 = 2K^{2}C_{in}$ FLOPs. The total FLOPs of a convolutional layer can therefore be computed as
$FLOPs(conv) = H_{out} \times W_{out} \times C_{out} \times 2 \times K^{2} \times C_{in} \approx 2 \times params \times H_{out} \times W_{out}$
(torchstat appears to swap the meanings of FLOPs and MAdd.)
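As a quick sanity check of the three formulas, the following hand calculation (mirroring the text above, not torchstat's internals) reproduces the numbers for the first convolutional layer:
# Hand calculation for the first convolutional layer
k, c_in, c_out = 11, 3, 64                      # kernel size, input/output channels
w_in, stride, pad = 224, 4, 2                   # input size, stride, padding
w_out = (w_in - k + 2 * pad) // stride + 1      # (224 - 11 + 4) / 4 + 1 -> 55 (rounded down)
params = k * k * c_in * c_out + c_out           # 23,296
macs = (k * k * c_in) * w_out * w_out * c_out   # k*k*c_in MACs for each of the w_out*w_out*c_out output values
flops = 2 * macs                                # FLOPs = 2 * MACs ~= 2 * params * H_out * W_out
print(w_out, params, macs, flops)               # 55 23296 70276800 140553600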
II. Highlights
1. ReLU activation function
The paper notes that, in terms of training time with gradient descent, saturating nonlinearities such as sigmoid and tanh are much slower than the non-saturating nonlinearity ReLU. The slowdown comes from the exponential operations in sigmoid and tanh, so AlexNet uses ReLU as its activation function.
# Plot the activation function curves
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font',family='Times New Roman', size=15)
x = np.linspace(-10,10,500)
sigmoid = 1 / (1+np.exp(-x))
tanh = (np.exp(x)-np.exp(-x)) / (np.exp(x)+np.exp(-x))
relu = np.where(x<0, 0, x)
fig = plt.figure()
ax = fig.add_subplot(211)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_position(('data',0))
ax.spines['bottom'].set_position(('data',0))
plt.plot(x, sigmoid, label='sigmoid')
plt.plot(x, tanh, label='tanh')
plt.grid(linestyle='-.')
plt.legend()
ax2 = fig.add_subplot(212)
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.spines['left'].set_position(('data',0))
ax2.spines['bottom'].set_position(('data',0))
plt.plot(x, relu, label='ReLU')
plt.grid(linestyle='-.')
plt.legend()
plt.tight_layout()
plt.show()

- The authors show experimentally that ReLU is faster than tanh: a four-layer CNN with ReLU reaches a 25% training error rate on CIFAR-10 about 6 times faster than the same network with tanh. (In the figure in the paper, the solid line is ReLU and the dashed line is tanh.)
- Besides speed, ReLU also avoids the vanishing-gradient problem caused by saturating activation functions.

2. Parallel training on two GPUs
The paper mentions that a single GTX 580 GPU has only 3 GB of memory, which limits the maximum size of network that can be trained on it, so the network is split across two GPUs. The GPUs can read from and write to each other's memory directly without going through host memory, and the two-GPU network trains in less time than a single-GPU network. The parallelization scheme puts half of the neurons (kernels) on each GPU, with one additional trick: the GPUs communicate only in certain layers.
3. Local response normalization (LRN)
Inspired by the neurobiological concept of "lateral inhibition" (an excited neuron suppresses the activity of its neighbours), the authors propose local response normalization, computed as
$b_{x,y}^{i} = a_{x,y}^{i} \Big/ \Big(k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} (a_{x,y}^{j})^{2}\Big)^{\beta}$
where:
- $b_{x,y}^{i}$ is the value of the $i$-th feature map at position $(x,y)$ after local response normalization;
- $a_{x,y}^{i}$ is the value of the $i$-th feature map at position $(x,y)$ before normalization;
- $k, \alpha, \beta$ are hyperparameters;
- $N$ is the total number of feature maps, and $n$ is the number of adjacent feature maps whose values at position $(x,y)$ (the $a_{x,y}^{j}$ in the formula) are summed.
The PyTorch implementation differs from the original paper in one respect: $\alpha$ is replaced by $\frac{\alpha}{n}$:
# Test the LRN layer in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
lrn = nn.LocalResponseNorm(size=3, alpha=1, beta=1, k=1)
input_tensor = F.relu(torch.randn((1,4,3,3))) # (batch_size, C, H, W): 4 feature maps of size 3x3
output_tensor = lrn(input_tensor)
# Before normalization
tensor([[[[0.0000, 0.0000, 0.0000],
[1.5463, 1.2684, 1.5114],
[0.6285, 1.6448, 0.0000]],
[[0.5272, 0.0000, 0.3121],
[0.9505, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000]],
[[1.1392, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000],
[0.1394, 0.0000, 0.5774]],
[[1.0331, 1.0747, 0.0000],
[1.0267, 0.9921, 0.0000],
[0.0000, 0.0000, 0.0000]]]])
# After normalization
tensor([[[[0.0000, 0.0000, 0.0000],
[0.7370, 0.8256, 0.8580],
[0.5554, 0.8649, 0.0000]],
[[0.3457, 0.0000, 0.3023],
[0.4530, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000]],
[[0.6056, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000],
[0.1385, 0.0000, 0.5197]],
[[0.5777, 0.7760, 0.0000],
[0.7598, 0.7470, 0.0000],
[0.0000, 0.0000, 0.0000]]]])
Working through the LRN calculation by hand: take the value 0.5272 at position (0,0) of the second feature map (channel index $i=1$). The lower summation bound is $\max(0, 1-3/2)=0$ and the upper bound is $\min(4-1, 1+3/2)=2$, so channels 0, 1, and 2 are included. Substituting into the (PyTorch) formula:
$b_{0,0}^{1} = \frac{0.5272}{1+(0^{2}+0.5272^{2}+1.1392^{2})/3}=0.3457$
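To confirm the formula (including the $\alpha/n$ scaling), here is a minimal NumPy re-implementation checked against torch.nn.functional.local_response_norm; treating out-of-range channels as zero and always dividing the sum of squares by `size` are assumptions about PyTorch's boundary handling:
# Minimal NumPy re-implementation of PyTorch-style LRN
import numpy as np
import torch
import torch.nn.functional as F
def lrn_manual(a, size=3, alpha=1.0, beta=1.0, k=1.0):
    # a: (N, C, H, W) array
    out = np.empty_like(a)
    channels = a.shape[1]
    for i in range(channels):
        lo = max(0, i - size // 2)                    # lowest channel in the window
        hi = min(channels, i + (size - 1) // 2 + 1)   # highest channel (exclusive)
        sq_sum = (a[:, lo:hi] ** 2).sum(axis=1)       # out-of-range channels contribute 0
        out[:, i] = a[:, i] / (k + alpha / size * sq_sum) ** beta
    return out
x = F.relu(torch.randn(1, 4, 3, 3))
ref = F.local_response_norm(x, size=3, alpha=1.0, beta=1.0, k=1.0)
print(np.allclose(lrn_manual(x.numpy()), ref.numpy(), atol=1e-6))   # expected: True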
4. Overlapping pooling
Using a stride smaller than the pooling window size lets adjacent pooling windows overlap, which yields richer features; the top-1 and top-5 error rates drop by 0.4% and 0.3%, respectively. A small illustration follows.
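As an assumed example (not code from the paper), compare AlexNet's overlapping pooling with a non-overlapping variant: with kernel size 3 and stride 2 the output spatial size happens to be the same here, but each output value sees a 3×3 window and neighbouring windows share a row/column of inputs:
# Overlapping pooling (kernel > stride) vs. non-overlapping pooling
import torch
import torch.nn as nn
x = torch.randn(1, 64, 55, 55)                       # e.g. the output of the first conv layer
overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # AlexNet: adjacent windows overlap by one pixel
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # conventional non-overlapping pooling
print(overlap(x).shape)       # torch.Size([1, 64, 27, 27])
print(non_overlap(x).shape)   # torch.Size([1, 64, 27, 27])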
5. Preventing overfitting
5.1 Data augmentation
- Randomly crop 224×224 patches from the 256×256 images, and apply translations and horizontal flips to the crops. At test time, TenCrop is used and the softmax outputs of the 10 crops are averaged to produce the final prediction.
- Altering the intensities of the RGB channels: perform PCA on the set of RGB pixel values, then add to each pixel of the original image (across its three RGB channels) the quantity
$[\mathbf{p}_{1}, \mathbf{p}_{2}, \mathbf{p}_{3}][\alpha_{1}\lambda_{1}, \alpha_{2}\lambda_{2}, \alpha_{3}\lambda_{3}]^{T}$
where $p_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values. Each $p_i$ is 3×1, so the result is a 3×1 vector whose three values are added to the R, G, and B channels of the pixel. Each $\alpha_i$ is a Gaussian random variable with mean 0 and standard deviation 0.1, drawn anew each time a training image is presented. A minimal sketch of this augmentation is given below.
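Here is a minimal sketch of this PCA colour augmentation; for simplicity the covariance is computed over the pixels of the single input image (the paper computes it over the entire ImageNet training set), and the image is assumed to be a float H×W×3 array in [0, 1]:
# "Fancy PCA" colour augmentation: perturb every pixel along the principal
# components of the RGB pixel distribution.
import numpy as np
def fancy_pca(img, sigma=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    flat = img.reshape(-1, 3)                    # all RGB pixels as rows
    cov = np.cov(flat, rowvar=False)             # 3x3 covariance of the colour channels
    eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i and p_i (columns of eigvecs)
    alphas = rng.normal(0.0, sigma, size=3)      # alpha_i ~ N(0, 0.1), drawn once per presentation
    delta = eigvecs @ (alphas * eigvals)         # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return np.clip(img + delta, 0.0, 1.0)        # the 3-vector is broadcast over every pixel
augmented = fancy_pca(np.random.rand(224, 224, 3))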
5.2 Dropout
Dropout: in each training iteration, neurons are randomly dropped with probability $p$, so they take part in neither the forward nor the backward pass of that iteration.
- Averaging the predictions of several models is an effective way to reduce test error, but training multiple models is too expensive, especially on large datasets.
- With Dropout, a new network architecture is sampled every time a training example is presented, so each iteration trains one sub-network, and many iterations effectively train an ensemble of sub-networks. At test time all neurons are active, but their outputs are multiplied by the retain probability $1-p$ (0.5 in AlexNet, since $p=0.5$). A sketch of PyTorch's dropout behaviour follows.
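The sketch below shows dropout in PyTorch. Note that nn.Dropout implements "inverted" dropout: surviving activations are scaled by 1/(1-p) during training, so no rescaling is needed at test time (unlike the scheme described in the paper, which rescales the outputs at test time):
# Dropout in train vs. eval mode (PyTorch uses inverted dropout)
import torch
import torch.nn as nn
torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)
drop.train()
print(drop(x))   # roughly half the entries are zeroed, survivors are scaled to 1/(1-p) = 2.0
drop.eval()
print(drop(x))   # identity: all ones, no scaling at test time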

6. Training strategy
6.1 SGD with momentum and weight decay
Mini-batch stochastic gradient descent with momentum and weight decay is used: batch size 128, momentum 0.9, weight decay 0.0005. A possible PyTorch configuration is sketched below.
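This is a sketch with the same hyperparameters applied to the torchvision model (an approximation, not the paper's exact update rule):
# SGD with momentum and weight decay, using the hyperparameters from the paper
import torch
import torchvision.models as models
model = models.alexnet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# training batches would then be drawn with batch_size=128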
6.2 Weight initialization
Proper weight initialization speeds up learning in the early stages of training. In the paper, the weights of every layer are initialized from a zero-mean Gaussian with standard deviation 0.01, $w \sim N(0, 0.01)$. The biases are initialized layer by layer: the biases of the second, fourth, and fifth convolutional layers and of all fully connected layers are initialized to 1, and those of the remaining layers to 0. A small sketch follows.
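Below is a sketch of this initialization applied to the torchvision model; the indices 3, 8, and 10 used for the 2nd, 4th, and 5th convolutional layers are read off the structure printed earlier and are specific to this implementation:
# AlexNet-style initialization: N(0, 0.01) weights; biases 1 for conv2/4/5 and
# all fully connected layers, 0 elsewhere
import torch.nn as nn
import torchvision.models as models
model = models.alexnet()
ones_bias = {3, 8, 10}                            # conv2, conv4, conv5 in model.features
for idx, m in enumerate(model.features):
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0 if idx in ones_bias else 0.0)
for m in model.classifier:
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0)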
6.3 Learning rate decay
All layers start with a learning rate of 0.01, and the learning rate is divided by 10 whenever the error rate on the validation set stops improving. This can be approximated in PyTorch as sketched below.
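A sketch using ReduceLROnPlateau (the patience value is an assumption; the paper adjusts the rate by hand when the validation error stops improving):
# Divide the learning rate by 10 when the validation error plateaus
import torch
import torchvision.models as models
model = models.alexnet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=3)
for epoch in range(90):
    val_error = 1.0                  # stand-in for the validation error measured this epoch
    scheduler.step(val_error)        # lr <- lr * 0.1 once val_error stops decreasing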
7. Convolution kernel visualization
References
[1] Krizhevsky A., Sutskever I., Hinton G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates Inc., 2012: 1097-1105.
[2] [Deep Learning Theory: Convolutional Neural Networks 02] Basics of convolution (computing the output shape from the kernel size and stride).
[3] What are the FLOPs required by a CNN model, and how are they computed?
[4] Local Response Normalization.
[5] Srivastava N., Hinton G., Krizhevsky A., et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
To be filled in later...
- Summary of activation functions in neural networks
- Summary of normalization techniques in deep learning
- Summary of methods for preventing overfitting
- Commonly used optimizers in deep learning
- Common weight initialization methods in neural networks
- Summary of learning rate decay strategies