Deep Learning Classification Networks -- AlexNet
Deep learning classification network series:
- AlexNet
- VGGNet
Contents
- Deep learning classification network series
- Preface
- I. Network Structure
- II. Highlights
- References
- To be filled in later...
Preface
AlexNet was designed by Hinton and his student Alex Krizhevsky. It won the 2012 ILSVRC (ImageNet Large Scale Visual Recognition Challenge) with a top-5 error rate of 15.3%, far ahead of the runner-up at 26.2%, demonstrating the superiority of convolutional neural networks for image recognition tasks [1].
I. Network Structure

The network structure diagram in the original paper shows that AlexNet has 8 layers in total: 5 convolutional layers and 3 fully connected layers. The torchstat tool can print the input/output dimensions, parameter count (parameters), and computation (FLOPs, MACs) of each layer of the official PyTorch (torchvision) implementation of AlexNet:
import torchvision.models as models
from torchstat import stat
alexnet = models.alexnet()
stat(alexnet, (3,224,224))
# For reference, the AlexNet structure as implemented in torchvision (it differs from the original paper)
AlexNet(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(11, 11), stride=(4, 4), padding=(2, 2))
(1): ReLU(inplace=True)
(2): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
(3): Conv2d(64, 192, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
(4): ReLU(inplace=True)
(5): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
(6): Conv2d(192, 384, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(7): ReLU(inplace=True)
(8): Conv2d(384, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(9): ReLU(inplace=True)
(10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(11): ReLU(inplace=True)
(12): MaxPool2d(kernel_size=3, stride=2, padding=0, dilation=1, ceil_mode=False)
)
(avgpool): AdaptiveAvgPool2d(output_size=(6, 6))
(classifier): Sequential(
(0): Dropout(p=0.5, inplace=False)
(1): Linear(in_features=9216, out_features=4096, bias=True)
(2): ReLU(inplace=True)
(3): Dropout(p=0.5, inplace=False)
(4): Linear(in_features=4096, out_features=4096, bias=True)
(5): ReLU(inplace=True)
(6): Linear(in_features=4096, out_features=1000, bias=True)
)
)

Below is a brief summary of how the input/output dimensions, parameter count, and computation of each layer are calculated (a small verification sketch follows the three items):
1. Input and output dimensions
The output size of a convolutional or pooling layer is given by $W_{out}=\frac{W_{in}-K+2P}{S}+1$, where $W_{in}$ is the input feature-map size, $K$ is the kernel size, $P$ is the padding, $S$ is the stride, and $W_{out}$ is the output feature-map size.
Take the first convolutional layer as an example: the input shape is (3, 224, 224), the kernel is 11×11, the stride is 4, and the padding is (2, 2). Substituting gives $W_{out}=\frac{224-11+4}{4}+1=55.25$, which is rounded down to 55. There are 64 kernels, so the output shape is (64, 55, 55).
2. Parameter count
The parameter count of a convolutional layer is $(K \times K \times C_{in}) \times C_{out} + C_{out}$; a fully connected layer can be treated as a convolutional layer with $K=1$.
First convolutional layer: params = 11×11×3×64 + 64 = 23,296.
First fully connected layer: params = 1×1×9216×4096 + 4096 = 37,752,832.
3. Computation (FLOPs, MACs/MAdd)
- FLOPs (floating point operations): the number of floating-point operations, used to measure the complexity of an algorithm or model; every addition, subtraction, multiplication, or division counts as one floating-point operation.
- MACs (multiply-accumulate operations): the number of multiply-accumulate operations; one MAC consists of one multiplication and one addition, so usually FLOPs = 2 × MACs.
How FLOPs are calculated, taking a convolutional layer as an example: suppose the layer outputs $C_{out}$ feature maps of size $H_{out} \times W_{out}$. Each value in a feature map is produced by one convolution, so the layer's total FLOPs equal $H_{out} \times W_{out} \times C_{out}$ times the FLOPs of a single convolution. A single convolution can be written as $y = wx + b$, where $y$ is one value of the output feature map and $w$ is a $K \times K \times C_{in}$ weight tensor; $wx$ involves $K \times K \times C_{in}$ multiplications and $K \times K \times C_{in} - 1$ additions, and $+b$ contributes one more addition, so one convolution costs $(K \times K \times C_{in}) + (K \times K \times C_{in} - 1) + 1 = 2K^{2}C_{in}$ FLOPs. The total FLOPs of a convolutional layer can therefore be computed as
$FLOPs(conv) = H_{out} \times W_{out} \times C_{out} \times 2 \times K^{2} \times C_{in} \approx 2 \times params \times H_{out} \times W_{out}$
(torchstat appears to swap the meanings of FLOPs and MAdd.)
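As a quick sanity check of the three formulas, the following hand calculation (mirroring the text above, not torchstat's internals) reproduces the numbers for the first convolutional layer:
# Hand calculation for the first convolutional layer
k, c_in, c_out = 11, 3, 64                      # kernel size, input/output channels
w_in, stride, pad = 224, 4, 2                   # input size, stride, padding
w_out = (w_in - k + 2 * pad) // stride + 1      # (224 - 11 + 4) / 4 + 1 -> 55 (rounded down)
params = k * k * c_in * c_out + c_out           # 23,296
macs = (k * k * c_in) * w_out * w_out * c_out   # k*k*c_in MACs for each of the w_out*w_out*c_out output values
flops = 2 * macs                                # FLOPs = 2 * MACs ~= 2 * params * H_out * W_out
print(w_out, params, macs, flops)               # 55 23296 70276800 140553600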
II. Highlights
1. ReLU activation function
The paper notes that, in terms of training time with gradient descent, saturating nonlinearities such as sigmoid and tanh are much slower than the non-saturating nonlinearity ReLU. The slowdown comes from the exponential operations in sigmoid and tanh, so AlexNet uses ReLU as its activation function.
# Plot the activation function curves
import numpy as np
import matplotlib.pyplot as plt
plt.rc('font',family='Times New Roman', size=15)
x = np.linspace(-10,10,500)
sigmoid = 1 / (1+np.exp(-x))
tanh = (np.exp(x)-np.exp(-x)) / (np.exp(x)+np.exp(-x))
relu = np.where(x<0, 0, x)
fig = plt.figure()
ax = fig.add_subplot(211)
ax.spines['right'].set_visible(False)
ax.spines['top'].set_visible(False)
ax.spines['left'].set_position(('data',0))
ax.spines['bottom'].set_position(('data',0))
plt.plot(x, sigmoid, label='sigmoid')
plt.plot(x, tanh, label='tanh')
plt.grid(linestyle='-.')
plt.legend()
ax2 = fig.add_subplot(212)
ax2.spines['right'].set_visible(False)
ax2.spines['top'].set_visible(False)
ax2.spines['left'].set_position(('data',0))
ax2.spines['bottom'].set_position(('data',0))
plt.plot(x, relu, label='ReLU')
plt.grid(linestyle='-.')
plt.legend()
plt.tight_layout()
plt.show()

- The authors show experimentally that ReLU is faster than tanh: a four-layer CNN with ReLU reaches a 25% training error rate on CIFAR-10 about 6 times faster than the same network with tanh. (In the figure in the paper, the solid line is ReLU and the dashed line is tanh.)
- Besides speed, ReLU also avoids the vanishing-gradient problem caused by saturating activation functions.

2. Parallel training on two GPUs
The paper mentions that a single GTX 580 GPU has only 3 GB of memory, which limits the maximum size of network that can be trained on it, so the network is split across two GPUs. The GPUs can read from and write to each other's memory directly without going through host memory, and the two-GPU network trains in less time than a single-GPU network. The parallelization scheme puts half of the neurons (kernels) on each GPU, with one additional trick: the GPUs communicate only in certain layers.
3. Local response normalization (LRN)
Inspired by the neurobiological concept of "lateral inhibition" (an excited neuron suppresses the activity of its neighbours), the authors propose local response normalization, computed as
$b_{x,y}^{i} = a_{x,y}^{i} \Big/ \Big(k + \alpha \sum_{j=\max(0,\,i-n/2)}^{\min(N-1,\,i+n/2)} (a_{x,y}^{j})^{2}\Big)^{\beta}$
where:
- $b_{x,y}^{i}$ is the value of the $i$-th feature map at position $(x,y)$ after local response normalization;
- $a_{x,y}^{i}$ is the value of the $i$-th feature map at position $(x,y)$ before normalization;
- $k, \alpha, \beta$ are hyperparameters;
- $N$ is the total number of feature maps, and $n$ is the number of adjacent feature maps whose values at position $(x,y)$ (the $a_{x,y}^{j}$ in the formula) are summed.
The PyTorch implementation differs from the original paper in one respect: $\alpha$ is replaced by $\frac{\alpha}{n}$:
# Test the LRN layer in PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
lrn = nn.LocalResponseNorm(size=3, alpha=1, beta=1, k=1)
input_tensor = F.relu(torch.randn((1,4,3,3))) # (batch_size, C, H, W): 4 feature maps of size 3x3
output_tensor = lrn(input_tensor)
# Before normalization
tensor([[[[0.0000, 0.0000, 0.0000],
[1.5463, 1.2684, 1.5114],
[0.6285, 1.6448, 0.0000]],
[[0.5272, 0.0000, 0.3121],
[0.9505, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000]],
[[1.1392, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000],
[0.1394, 0.0000, 0.5774]],
[[1.0331, 1.0747, 0.0000],
[1.0267, 0.9921, 0.0000],
[0.0000, 0.0000, 0.0000]]]])
# After normalization
tensor([[[[0.0000, 0.0000, 0.0000],
[0.7370, 0.8256, 0.8580],
[0.5554, 0.8649, 0.0000]],
[[0.3457, 0.0000, 0.3023],
[0.4530, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000]],
[[0.6056, 0.0000, 0.0000],
[0.0000, 0.0000, 0.0000],
[0.1385, 0.0000, 0.5197]],
[[0.5777, 0.7760, 0.0000],
[0.7598, 0.7470, 0.0000],
[0.0000, 0.0000, 0.0000]]]])
Working through the LRN calculation by hand: take the value 0.5272 at position (0,0) of the second feature map (channel index $i=1$). The lower summation bound is $\max(0, 1-3/2)=0$ and the upper bound is $\min(4-1, 1+3/2)=2$, so channels 0, 1, and 2 are included. Substituting into the (PyTorch) formula:
$b_{0,0}^{1} = \frac{0.5272}{1+(0^{2}+0.5272^{2}+1.1392^{2})/3}=0.3457$
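To confirm the formula (including the $\alpha/n$ scaling), here is a minimal NumPy re-implementation checked against torch.nn.functional.local_response_norm; treating out-of-range channels as zero and always dividing the sum of squares by `size` are assumptions about PyTorch's boundary handling:
# Minimal NumPy re-implementation of PyTorch-style LRN
import numpy as np
import torch
import torch.nn.functional as F
def lrn_manual(a, size=3, alpha=1.0, beta=1.0, k=1.0):
    # a: (N, C, H, W) array
    out = np.empty_like(a)
    channels = a.shape[1]
    for i in range(channels):
        lo = max(0, i - size // 2)                    # lowest channel in the window
        hi = min(channels, i + (size - 1) // 2 + 1)   # highest channel (exclusive)
        sq_sum = (a[:, lo:hi] ** 2).sum(axis=1)       # out-of-range channels contribute 0
        out[:, i] = a[:, i] / (k + alpha / size * sq_sum) ** beta
    return out
x = F.relu(torch.randn(1, 4, 3, 3))
ref = F.local_response_norm(x, size=3, alpha=1.0, beta=1.0, k=1.0)
print(np.allclose(lrn_manual(x.numpy()), ref.numpy(), atol=1e-6))   # expected: True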
4. Overlapping pooling
Using a stride smaller than the pooling window size lets adjacent pooling windows overlap, which yields richer features; the top-1 and top-5 error rates drop by 0.4% and 0.3%, respectively. A small illustration follows.
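As an assumed example (not code from the paper), compare AlexNet's overlapping pooling with a non-overlapping variant: with kernel size 3 and stride 2 the output spatial size happens to be the same here, but each output value sees a 3×3 window and neighbouring windows share a row/column of inputs:
# Overlapping pooling (kernel > stride) vs. non-overlapping pooling
import torch
import torch.nn as nn
x = torch.randn(1, 64, 55, 55)                       # e.g. the output of the first conv layer
overlap = nn.MaxPool2d(kernel_size=3, stride=2)      # AlexNet: adjacent windows overlap by one pixel
non_overlap = nn.MaxPool2d(kernel_size=2, stride=2)  # conventional non-overlapping pooling
print(overlap(x).shape)       # torch.Size([1, 64, 27, 27])
print(non_overlap(x).shape)   # torch.Size([1, 64, 27, 27])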
5. Preventing overfitting
5.1 Data augmentation
- Randomly crop 224×224 patches from the 256×256 images, and apply translations and horizontal flips to the crops. At test time, TenCrop is used and the softmax outputs of the 10 crops are averaged to produce the final prediction.
- Altering the intensities of the RGB channels: perform PCA on the set of RGB pixel values, then add to each pixel of the original image (across its three RGB channels) the quantity
$[\mathbf{p}_{1}, \mathbf{p}_{2}, \mathbf{p}_{3}][\alpha_{1}\lambda_{1}, \alpha_{2}\lambda_{2}, \alpha_{3}\lambda_{3}]^{T}$
where $p_i$ and $\lambda_i$ are the $i$-th eigenvector and eigenvalue of the 3×3 covariance matrix of RGB pixel values. Each $p_i$ is 3×1, so the result is a 3×1 vector whose three values are added to the R, G, and B channels of the pixel. Each $\alpha_i$ is a Gaussian random variable with mean 0 and standard deviation 0.1, drawn anew each time a training image is presented. A minimal sketch of this augmentation is given below.
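Here is a minimal sketch of this PCA colour augmentation; for simplicity the covariance is computed over the pixels of the single input image (the paper computes it over the entire ImageNet training set), and the image is assumed to be a float H×W×3 array in [0, 1]:
# "Fancy PCA" colour augmentation: perturb every pixel along the principal
# components of the RGB pixel distribution.
import numpy as np
def fancy_pca(img, sigma=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    flat = img.reshape(-1, 3)                    # all RGB pixels as rows
    cov = np.cov(flat, rowvar=False)             # 3x3 covariance of the colour channels
    eigvals, eigvecs = np.linalg.eigh(cov)       # lambda_i and p_i (columns of eigvecs)
    alphas = rng.normal(0.0, sigma, size=3)      # alpha_i ~ N(0, 0.1), drawn once per presentation
    delta = eigvecs @ (alphas * eigvals)         # [p1 p2 p3] [a1*l1, a2*l2, a3*l3]^T
    return np.clip(img + delta, 0.0, 1.0)        # the 3-vector is broadcast over every pixel
augmented = fancy_pca(np.random.rand(224, 224, 3))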
5.2 Dropout
Dropout: in each training iteration, neurons are randomly dropped with probability $p$, so they take part in neither the forward nor the backward pass of that iteration.
- Averaging the predictions of several models is an effective way to reduce test error, but training multiple models is too expensive, especially on large datasets.
- With Dropout, a new network architecture is sampled every time a training example is presented, so each iteration trains one sub-network, and many iterations effectively train an ensemble of sub-networks. At test time all neurons are active, but their outputs are multiplied by the retain probability $1-p$ (0.5 in AlexNet, since $p=0.5$). A sketch of PyTorch's dropout behaviour follows.
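The sketch below shows dropout in PyTorch. Note that nn.Dropout implements "inverted" dropout: surviving activations are scaled by 1/(1-p) during training, so no rescaling is needed at test time (unlike the scheme described in the paper, which rescales the outputs at test time):
# Dropout in train vs. eval mode (PyTorch uses inverted dropout)
import torch
import torch.nn as nn
torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)
drop.train()
print(drop(x))   # roughly half the entries are zeroed, survivors are scaled to 1/(1-p) = 2.0
drop.eval()
print(drop(x))   # identity: all ones, no scaling at test time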

6. Training strategy
6.1 SGD with momentum and weight decay
Mini-batch stochastic gradient descent with momentum and weight decay is used: batch size 128, momentum 0.9, weight decay 0.0005. A possible PyTorch configuration is sketched below.
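This is a sketch with the same hyperparameters applied to the torchvision model (an approximation, not the paper's exact update rule):
# SGD with momentum and weight decay, using the hyperparameters from the paper
import torch
import torchvision.models as models
model = models.alexnet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
# training batches would then be drawn with batch_size=128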
6.2 Weight initialization
Proper weight initialization speeds up learning in the early stages of training. In the paper, the weights of every layer are initialized from a zero-mean Gaussian with standard deviation 0.01, $w \sim N(0, 0.01)$. The biases are initialized layer by layer: the biases of the second, fourth, and fifth convolutional layers and of all fully connected layers are initialized to 1, and those of the remaining layers to 0. A small sketch follows.
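Below is a sketch of this initialization applied to the torchvision model; the indices 3, 8, and 10 used for the 2nd, 4th, and 5th convolutional layers are read off the structure printed earlier and are specific to this implementation:
# AlexNet-style initialization: N(0, 0.01) weights; biases 1 for conv2/4/5 and
# all fully connected layers, 0 elsewhere
import torch.nn as nn
import torchvision.models as models
model = models.alexnet()
ones_bias = {3, 8, 10}                            # conv2, conv4, conv5 in model.features
for idx, m in enumerate(model.features):
    if isinstance(m, nn.Conv2d):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0 if idx in ones_bias else 0.0)
for m in model.classifier:
    if isinstance(m, nn.Linear):
        nn.init.normal_(m.weight, mean=0.0, std=0.01)
        nn.init.constant_(m.bias, 1.0)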
6.3 Learning rate decay
All layers start with a learning rate of 0.01, and the learning rate is divided by 10 whenever the error rate on the validation set stops improving. This can be approximated in PyTorch as sketched below.
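A sketch using ReduceLROnPlateau (the patience value is an assumption; the paper adjusts the rate by hand when the validation error stops improving):
# Divide the learning rate by 10 when the validation error plateaus
import torch
import torchvision.models as models
model = models.alexnet()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min',
                                                       factor=0.1, patience=3)
for epoch in range(90):
    val_error = 1.0                  # stand-in for the validation error measured this epoch
    scheduler.step(val_error)        # lr <- lr * 0.1 once val_error stops decreasing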
7. Convolution kernel visualization
References
[1] Krizhevsky A., Sutskever I., Hinton G. E. ImageNet Classification with Deep Convolutional Neural Networks. In Advances in Neural Information Processing Systems. Curran Associates Inc., 2012: 1097-1105.
[2] [Deep Learning Theory: Convolutional Neural Networks 02] Basics of convolution (computing the output shape from the kernel size and stride).
[3] What are the FLOPs required by a CNN model, and how are they computed?
[4] Local Response Normalization.
[5] Srivastava N., Hinton G., Krizhevsky A., et al. Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine Learning Research, 2014, 15(1): 1929-1958.
To be filled in later...
- Summary of activation functions in neural networks
- Summary of normalization techniques in deep learning
- Summary of methods for preventing overfitting
- Commonly used optimizers in deep learning
- Common weight initialization methods in neural networks
- Summary of learning rate decay strategies