CV fully connected neural network
2022-06-23 18:57:00 【Bachuan Xiaoxiaosheng】
Fully connected neural networks
A fully connected network cascades multiple transforms, interleaving linear layers with nonlinearities.
For example, a two-layer fully connected network computes
$f = W_{2}\max(0, W_{1}x + b_{1}) + b_{2}$
The nonlinear operation is essential.
Unlike a linear classifier, the network can handle linearly non-separable cases.
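As a quick illustration, here is a minimal NumPy sketch of this two-layer forward pass; the layer sizes and random values are made up for the example.

```python
import numpy as np

def two_layer_forward(x, W1, b1, W2, b2):
    """f = W2 * max(0, W1 x + b1) + b2 for a single input vector x."""
    h = np.maximum(0.0, W1 @ x + b1)   # hidden layer with ReLU nonlinearity
    return W2 @ h + b2                 # output scores, no nonlinearity on top

# Made-up sizes: 4 inputs, 8 hidden units, 3 output classes.
rng = np.random.default_rng(0)
x = rng.standard_normal(4)
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)
print(two_layer_forward(x, W1, b1, W2, b2))
```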
Structure
- Input layer
- Hidden layer(s)
- Output layer
- Weights
A network is named by its number of layers, counting every layer except the input layer (e.g., one hidden layer plus the output layer makes a two-layer network).
Activation functions
Without the nonlinear operation (the activation function), the neural network degenerates into a linear classifier.
Commonly used activation functions:
- sigmoid: $\frac{1}{1+e^{-x}}$, compresses values into the range 0~1
- ReLU: $\max(0, x)$
- tanh: $\frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}$, compresses values into the range -1~1
- Leaky ReLU: $\max(0.1x, x)$
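For reference, a minimal NumPy sketch of these four activations (the 0.1 slope in Leaky ReLU follows the formula above; other slopes are also common):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)          # zero for negative inputs, identity otherwise

def tanh(x):
    return np.tanh(x)                  # squashes values into (-1, 1)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)    # small slope keeps a gradient for x < 0

x = np.linspace(-3, 3, 7)
for fn in (sigmoid, relu, tanh, leaky_relu):
    print(fn.__name__, np.round(fn(x), 3))
```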
Structure design
- Width
- Depth
The more neurons, the stronger the nonlinearity and the more complex the decision boundary.
But more is not always better, nor is more complex always better.
Choose the width and depth according to the difficulty of the task.
Softmax
Exponentiate first, then normalize.
This transforms the output into a probability distribution.
Cross entropy
A measure of the gap between the predicted distribution and the true distribution is needed.
The true distribution is generally in one-hot form.
$H(p,q) = -\sum_{x} p(x)\log q(x)$
$H(p,q) = KL(p\|q) + H(p)$
Since the ground truth is generally one-hot, $H(p)$ is usually 0.
Sometimes the KL divergence is used instead.
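Below is a minimal sketch of softmax followed by cross entropy against a one-hot target; the max subtraction is a standard numerical-stability detail not mentioned above, and the scores are made up.

```python
import numpy as np

def softmax(scores):
    # Exponentiate first, then normalize; subtracting the max avoids overflow.
    exp = np.exp(scores - np.max(scores))
    return exp / np.sum(exp)

def cross_entropy(p_onehot, q_pred, eps=1e-12):
    # H(p, q) = -sum_x p(x) log q(x); eps guards against log(0).
    return -np.sum(p_onehot * np.log(q_pred + eps))

scores = np.array([2.0, 1.0, 0.1])     # made-up classifier outputs
target = np.array([1.0, 0.0, 0.0])     # one-hot ground truth
q = softmax(scores)
print(q, cross_entropy(target, q))
```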
Computational graph
A directed graph expressing the computational relationships among inputs, outputs, and intermediate variables; each node corresponds to one operation.
With the chain rule, the computer can evaluate the gradient at every position of an arbitrarily complex function.
Computational graphs also involve a choice of granularity (how coarse or fine each node's operation is).
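As a toy illustration, here is manual backpropagation through a tiny made-up graph f = (x + y) * z, with one local derivative per node multiplied along the chain rule:

```python
# Manual backpropagation for the made-up function f = (x + y) * z.
x, y, z = -2.0, 5.0, -4.0

# Forward pass: one operation per node.
q = x + y            # add node
f = q * z            # multiply node

# Backward pass: multiply local derivatives along the chain rule.
df_dq = z            # d(q*z)/dq
df_dz = q            # d(q*z)/dz
df_dx = df_dq * 1.0  # dq/dx = 1
df_dy = df_dq * 1.0  # dq/dy = 1

print(f, df_dx, df_dy, df_dz)   # -12.0 -4.0 -4.0 3.0
```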
Vanishing and exploding gradients
The derivative of sigmoid is very small over a large range; multiplied together by the chain rule, the backpropagated gradient shrinks toward 0. This is called the vanishing gradient problem.
Gradients can also explode, which can be mitigated by gradient clipping.
The derivative of (Leaky) ReLU is always 1 when the input is greater than 0, so it does not cause gradients to vanish or explode; gradient flow is smoother and convergence is faster.
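A minimal sketch of gradient clipping by global norm; the threshold is an arbitrary value chosen for illustration.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads]

grads = [np.array([30.0, -40.0])]    # norm 50, well above the threshold
print(clip_by_global_norm(grads))    # rescaled to norm 5
```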
Improved gradient algorithms
- Gradient descent: computes the gradient over all samples at once, which takes too much time.
- Stochastic gradient descent: a single sample is noisy, so it is inefficient.
- Mini-batch gradient descent: a compromise between the two.
Remaining problems
When the loss surface has a valley, the iterates oscillate along one direction while descending slowly along the other.
Much of the work in the oscillating direction is wasted.
Adjusting the step size alone cannot solve this.
Solutions
Momentum
Use the accumulated gradient history.
The oscillating components cancel each other out, while the flat descent direction is accelerated.
It can also help break through local minima and saddle points.
Pseudocode
Initialize the velocity v = 0
Loop:
---- compute the gradient g
---- update the velocity: $v = \mu v + g$
---- update the weights: $w = w - \varepsilon v$
$\mu$ takes values in [0,1), commonly 0.9
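A minimal NumPy sketch of this momentum update; the quadratic "valley" loss and step size are made up for illustration.

```python
import numpy as np

def sgd_momentum(grad_fn, w, lr=0.1, mu=0.9, steps=100):
    """Follows the pseudocode above: v = mu*v + g, then w = w - lr*v."""
    v = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = mu * v + g
        w = w - lr * v
    return w

# Made-up 'valley' loss 0.5*(10*w0^2 + w1^2): steep in w0, flat in w1.
grad_fn = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])
print(sgd_momentum(grad_fn, np.array([1.0, 1.0])))   # both coordinates approach 0
```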
Adaptive gradient methods (AdaGrad)
Reduce the step size along the oscillating direction and increase it along the flat direction.
Directions where the squared gradient magnitude is larger are the oscillating ones; directions where it is smaller are the flat ones.
Pseudocode
Initialize the accumulator r = 0
Loop:
---- compute the gradient g
---- accumulate the squared gradient: $r = r + g \odot g$
---- update the weights: $w = w - \frac{\varepsilon}{\sqrt{r}+\delta} g$
$\delta$ prevents division by zero and is usually $10^{-5}$
The drawback is that after accumulating for a long time, the step size along the flat direction is also suppressed.
RMSProp
The fix is to blend historical information with the current gradient.
The improvement is to replace the accumulated squared gradient with $r = \rho r + (1-\rho)\, g \odot g$
$\rho$ takes values in [0,1) and is usually 0.999
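A minimal sketch contrasting the two accumulators on the same made-up "valley" gradient: AdaGrad keeps adding squared gradients forever, while RMSProp decays the old history.

```python
import numpy as np

def adagrad_step(w, r, g, lr=0.1, delta=1e-5):
    r = r + g * g                           # squared gradients accumulate forever
    w = w - lr / (np.sqrt(r) + delta) * g
    return w, r

def rmsprop_step(w, r, g, lr=0.01, rho=0.999, delta=1e-5):
    r = rho * r + (1.0 - rho) * g * g       # exponential moving average of g*g
    w = w - lr / (np.sqrt(r) + delta) * g
    return w, r

grad_fn = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])  # same made-up valley loss
w_a, r_a = np.array([1.0, 1.0]), np.zeros(2)
w_r, r_r = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(50):
    w_a, r_a = adagrad_step(w_a, r_a, grad_fn(w_a))
    w_r, r_r = rmsprop_step(w_r, r_r, grad_fn(w_r))
print(w_a, w_r)
```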
Adam
Uses both momentum and adaptive step sizes.
Pseudocode
Initialize the accumulators r = 0, v = 0
Loop:
---- compute the gradient g
---- accumulate the gradient: $v = \mu v + (1-\mu) g$
---- accumulate the squared gradient: $r = \rho r + (1-\rho)\, g \odot g$
---- correct the bias: $\hat{v} = \frac{v}{1-\mu^{t}},\ \hat{r} = \frac{r}{1-\rho^{t}}$
---- update the weights: $w = w - \frac{\varepsilon}{\sqrt{\hat{r}}+\delta}\hat{v}$
The suggested decay rate $\rho$ and momentum coefficient $\mu$ are 0.999 and 0.9, respectively.
Bias correction alleviates the cold-start problem at the beginning of training.
With Adam, the default settings of these coefficients usually work well, so little manual tuning is needed.
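A minimal sketch of the Adam update as written above, using the bias-corrected $\hat{r}$ in the denominator; the toy gradient function and step count are made up.

```python
import numpy as np

def adam(grad_fn, w, lr=0.01, mu=0.9, rho=0.999, delta=1e-5, steps=200):
    v = np.zeros_like(w)    # momentum (first moment) accumulator
    r = np.zeros_like(w)    # squared-gradient (second moment) accumulator
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = mu * v + (1.0 - mu) * g
        r = rho * r + (1.0 - rho) * g * g
        v_hat = v / (1.0 - mu ** t)         # bias correction against the cold start
        r_hat = r / (1.0 - rho ** t)
        w = w - lr / (np.sqrt(r_hat) + delta) * v_hat
    return w

grad_fn = lambda w: np.array([10.0 * w[0], 1.0 * w[1]])
print(adam(grad_fn, np.array([1.0, 1.0])))
```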
Weight initialization
All-zero initialization
All neurons produce the same output and receive the same parameter update, so the network cannot be trained.
Random initialization
Weights are drawn from a Gaussian distribution.
But with a poorly chosen scale, the outputs saturate or collapse toward zero as they propagate through the layers, and the network cannot be trained.
Xavier initialization
The goal is to keep the variance of the activations and local gradients as consistent as possible across layers during propagation: find a distribution for w that makes the input and output variances match.
When var(w) = 1/N, where N is the number of inputs to the neuron, the input and output variances match.
He initialization
Suited to ReLU; the weights are sampled from $\mathcal{N}(0, 2/N)$.
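A minimal sketch of Xavier and He initialization for one fully connected layer, where N (the fan-in) is the number of inputs to each neuron:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(fan_in, fan_out):
    # var(w) = 1/N keeps input and output variances consistent.
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(1.0 / fan_in)

def he_init(fan_in, fan_out):
    # For ReLU layers: w ~ N(0, 2/N).
    return rng.standard_normal((fan_out, fan_in)) * np.sqrt(2.0 / fan_in)

W = he_init(fan_in=256, fan_out=128)
print(W.std())   # close to sqrt(2/256) ≈ 0.088
```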
Batch normalization
Normalize the neuron outputs directly over each batch.
Usually placed after the fully connected layer and before the nonlinearity.
It keeps outputs away from the small-gradient (saturated) regions and thus counteracts vanishing gradients.
A learnable scale and shift after normalization let the network recover whatever mean and variance suit it.
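A minimal sketch of the batch-normalization forward pass at training time; gamma and beta are the learnable scale and shift, and the running statistics needed at test time are omitted for brevity.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x has shape (batch, features); normalize each feature, then rescale."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)   # zero mean, unit variance per feature
    return gamma * x_hat + beta               # learnable scale and shift

x = np.random.default_rng(0).standard_normal((32, 4)) * 10 + 3
out = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3), out.std(axis=0).round(3))
```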
Underfitting
The model's descriptive capacity is too weak; the model is too simple.
Overfitting
Good performance on the training set but poor performance in real scenarios.
The model memorizes the training samples instead of extracting generalizable features.
The fundamental tension in machine learning
- Optimization: get the best possible performance on the training set
- Generalization: get good performance on unseen data
Early in training: optimization and generalization improve together.
Later in training: generalization degrades and overfitting sets in.
Remedies
- Best option: obtain enough data
- Second-best option: limit the amount of information the model can store
- Adjust the model size
- Constrain the model weights with regularization terms
Dropout (random deactivation)
Give hidden neurons a chance of being deactivated: during training, Dropout randomly sets some neuron outputs to 0.
- Reduces the model capacity
- Encourages the weights to spread across features, which has a regularizing effect
- Can be viewed as training an ensemble of sub-networks
Remaining issue
The test phase does not use dropout, so the output scale differs from that seen during training.
Solution
Scale the outputs of the surviving neurons by a factor during training (inverted dropout), as in the sketch below.
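A minimal sketch of inverted dropout: dividing by the keep probability during training makes the expected activation match the test-time pass, which applies no dropout at all.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(x, p_keep=0.5, train=True):
    if not train:
        return x                                    # test time: no dropout, no scaling
    mask = (rng.random(x.shape) < p_keep) / p_keep  # zero out units, rescale survivors
    return x * mask

h = np.ones((2, 6))
print(dropout_forward(h, p_keep=0.5, train=True))   # surviving entries become 2.0
print(dropout_forward(h, train=False))              # unchanged at test time
```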
Hyperparameters
Learning rate
- Too large: fails to converge
- Somewhat large: oscillates and cannot reach the optimum
- Too small: convergence takes too long
- Moderate: converges quickly with good results
Hyperparameter search methods
- Grid search
- Random search: covers more distinct values per hyperparameter, so it is generally preferred
Search coarsely first, then refine.
The learning rate is generally searched in log space, as in the sketch below.
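A minimal sketch of random search over the learning rate in log space; the exponent range and the evaluation function are placeholders, not recommendations.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_learning_rates(n, low_exp=-5, high_exp=-1):
    # Sample the exponent uniformly so candidates spread evenly across decades.
    return 10.0 ** rng.uniform(low_exp, high_exp, size=n)

def evaluate(lr):
    # Placeholder for "train briefly and return a validation score".
    return -abs(np.log10(lr) + 3.0)   # pretends lr = 1e-3 is best

candidates = sample_learning_rates(20)
best = max(candidates, key=evaluate)
print(np.sort(candidates).round(6), best)
```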