
CV: Fully Connected Neural Networks

2022-06-23 18:57:00 Bachuan Xiaoxiaosheng

Fully connected neural networks

A fully connected network is a cascade of multiple transformations.

For example, a two-layer fully connected network:
f = w_2 \max(0, w_1 x + b_1) + b_2

The nonlinear operation is essential: it is what distinguishes the network from a linear classifier and lets it handle linearly non-separable cases.
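
As a minimal NumPy sketch of the two-layer network above (the layer sizes 4, 8, and 3 are illustrative assumptions):

```python
import numpy as np

# Two-layer fully connected network: f = W2 @ max(0, W1 @ x + b1) + b2.
# Layer sizes (input 4, hidden 8, output 3) are arbitrary, for illustration only.
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((8, 4)), np.zeros(8)
W2, b2 = rng.standard_normal((3, 8)), np.zeros(3)

def two_layer_fc(x):
    h = np.maximum(0.0, W1 @ x + b1)  # hidden layer with ReLU nonlinearity
    return W2 @ h + b2                # output scores, one per class

scores = two_layer_fc(rng.standard_normal(4))
```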

Structure

  • Input layer
  • Hidden layer(s)
  • Output layer
  • Weights

The depth of a network is counted excluding the input layer: a network with N layers besides the input is called an N-layer neural network.

Activation function

If the nonlinear operation (activation function) is removed, the neural network degenerates into a linear classifier.
Commonly used activation functions:

  • sigmoid
    \frac{1}{1+e^{-x}}
    Compresses values into the range 0 to 1
  • ReLU
    \max(0, x)
  • tanh
    \frac{e^{x}-e^{-x}}{e^{x}+e^{-x}}
    Compresses values into the range -1 to 1
  • Leaky ReLU
    \max(0.1x, x)
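
A small NumPy sketch of the four activation functions listed above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)                 # squashes values into (-1, 1)

def leaky_relu(x, slope=0.1):
    return np.maximum(slope * x, x)   # slope 0.1 as in the definition above
```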

Structure design

  • Width
  • Depth

The more neurons there are, the stronger the nonlinearity and the more complex the decision boundary the network can represent.

But more neurons and more complexity are not always better; the architecture should be chosen according to the difficulty of the task.

softmax

Exponentiate first, then normalize. This transforms the raw output scores into a probability distribution.
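
Written out for a score vector s with components s_i:

\text{softmax}(s)_i = \frac{e^{s_i}}{\sum_{j} e^{s_j}}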

Cross entropy

A measure of the gap between the predicted distribution and the true distribution is needed.

The true distribution is generally in one-hot form.

H(p, q) = -\sum_{x} p(x) \log q(x)

H(p, q) = KL(p||q) + H(p)
Since the ground truth is generally one-hot, H(p) is usually 0, so minimizing the cross-entropy is equivalent to minimizing the KL divergence; sometimes the KL divergence is used directly.
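
A minimal NumPy sketch of softmax followed by the cross-entropy loss for a one-hot label (function names are illustrative):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

def cross_entropy(probs, label):
    # with a one-hot ground truth, only the log-probability of the true class remains
    return -np.log(probs[label])

scores = np.array([2.0, 1.0, 0.1])
loss = cross_entropy(softmax(scores), label=0)
```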

Computational graph

A directed graph that expresses the computational relationships among inputs, outputs, and intermediate variables; each node corresponds to an operation.

With a computational graph, a computer can use the chain rule to compute the gradient at every position of an arbitrarily complex function.

Computational graphs involve a choice of granularity: how coarse or fine the operation at each node is.
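
A toy sketch of how the chain rule is applied node by node on a tiny computational graph; the function f(x, y) = (x + y)^2 is an arbitrary example:

```python
# Forward pass: record the intermediate value of each node for f(x, y) = (x + y)**2.
x, y = 3.0, -1.0
a = x + y               # addition node
f = a * a               # squaring node

# Backward pass: each local derivative is multiplied by the upstream gradient.
df_df = 1.0
df_da = 2 * a * df_df   # d(a*a)/da = 2a
df_dx = 1.0 * df_da     # d(x+y)/dx = 1
df_dy = 1.0 * df_da     # d(x+y)/dy = 1
```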

Vanishing and exploding gradients

The derivative of sigmoid is very small over a large part of its domain; combined with the repeated multiplications of the chain rule, the gradient propagated back approaches 0. This is called the vanishing gradient problem.

Gradients can also explode; this can be mitigated by gradient clipping.
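
One common form of gradient clipping rescales the gradients when their combined norm exceeds a threshold; a minimal sketch (the threshold value is an arbitrary choice):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # rescale all gradients jointly so their combined L2 norm stays at or below max_norm
    total_norm = np.sqrt(sum(float((g * g).sum()) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```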

The derivative of (Leaky) ReLU is always 1 for inputs greater than 0, so it does not cause gradients to vanish or explode; gradient flow is smoother and convergence is faster.

Improved gradient descent algorithms

  • Batch gradient descent: the gradient is computed over all samples at once, which takes too much time
  • Stochastic gradient descent: updates are noisy and inefficient
  • Mini-batch gradient descent: a compromise between the two

Remaining problems

When the loss surface contains a narrow valley, the iterates oscillate along one direction while descending slowly along the other.
Much of the work in the oscillating direction is wasted, and adjusting the step size alone cannot fix this.

Solutions

Momentum method

Accumulate the history of gradients: components along the oscillating direction cancel out, while the flat descent direction is accelerated.

Momentum can also help escape local minima and saddle points.

Pseudocode:

Initialize the velocity v = 0
Loop:
---- compute the gradient g
---- update the velocity: v = \mu v + g
---- update the weights: w = w - \varepsilon v

\mu takes values in [0, 1), commonly 0.9.
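
A minimal NumPy sketch of the momentum update; the gradient function grad_fn is assumed to be provided by the training loop:

```python
import numpy as np

def sgd_momentum(w, grad_fn, lr=1e-2, mu=0.9, steps=100):
    v = np.zeros_like(w)     # velocity: accumulated history of gradients
    for _ in range(steps):
        g = grad_fn(w)       # gradient on the current mini-batch
        v = mu * v + g       # velocity update
        w = w - lr * v       # weight update
    return w
```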

Adaptive gradient method (AdaGrad)

Reduce the step size along the oscillating direction and increase it along the flat direction.
Directions with a larger accumulated squared gradient magnitude are oscillating directions; those with a smaller one are flat directions.
Pseudocode:

Initialize the accumulator r = 0
Loop:
---- compute the gradient g
---- accumulate the squared gradient: r = r + g * g (elementwise)
---- update the weights: w = w - \frac{\varepsilon}{\sqrt{r}+\delta} g

\delta prevents division by zero and is usually 10^{-5}.

Its drawback: after accumulating for a long time, r grows so large that the step size along the flat direction is also suppressed.
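
The same loop with the AdaGrad-style accumulator, as a sketch (grad_fn is again an assumed gradient function):

```python
import numpy as np

def adagrad(w, grad_fn, lr=1e-2, delta=1e-5, steps=100):
    r = np.zeros_like(w)                       # running sum of squared gradients
    for _ in range(steps):
        g = grad_fn(w)
        r = r + g * g                          # per-coordinate accumulation
        w = w - lr / (np.sqrt(r) + delta) * g  # larger r shrinks the step size
    return w
```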

RMSProp

The fix is to mix in historical information with an exponential decay: the accumulated squared gradient is replaced by r = \rho r + (1-\rho) g * g, where \rho \in [0, 1) is usually 0.999.
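
In code, only the accumulator line changes relative to the AdaGrad sketch above:

```python
import numpy as np

def rmsprop(w, grad_fn, lr=1e-2, rho=0.999, delta=1e-5, steps=100):
    r = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        r = rho * r + (1.0 - rho) * g * g      # decayed average instead of a raw sum
        w = w - lr / (np.sqrt(r) + delta) * g
    return w
```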

Adam

Adam uses both momentum and adaptive step sizes.
Pseudocode:

Initialize the accumulators v = 0, r = 0
Loop over steps t = 1, 2, ...:
---- compute the gradient g
---- accumulate the gradient: v = \mu v + (1-\mu) g
---- accumulate the squared gradient: r = \rho r + (1-\rho) g * g
---- correct the bias: \hat{v} = \frac{v}{1-\mu^{t}}, \hat{r} = \frac{r}{1-\rho^{t}}
---- update the weights: w = w - \frac{\varepsilon}{\sqrt{\hat{r}}+\delta} \hat{v}

The suggested decay rate \rho and momentum coefficient \mu are 0.999 and 0.9, respectively.
The bias correction alleviates the cold-start problem at the beginning of training.

With Adam, there is usually little need to tune these hyperparameters by hand.
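
A minimal sketch of the Adam update following the pseudocode above (the default step size 1e-3 is a common choice, not from the original text):

```python
import numpy as np

def adam(w, grad_fn, lr=1e-3, mu=0.9, rho=0.999, delta=1e-5, steps=100):
    v = np.zeros_like(w)            # first-moment (momentum) accumulator
    r = np.zeros_like(w)            # second-moment (squared-gradient) accumulator
    for t in range(1, steps + 1):
        g = grad_fn(w)
        v = mu * v + (1 - mu) * g
        r = rho * r + (1 - rho) * g * g
        v_hat = v / (1 - mu ** t)   # bias correction against the cold start
        r_hat = r / (1 - rho ** t)
        w = w - lr / (np.sqrt(r_hat) + delta) * v_hat
    return w
```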

Weight initialization

All-zero initialization
Every neuron produces the same output and receives the same parameter update, so the network cannot be trained.

Random weight initialization
Weights are drawn from a Gaussian distribution.

But as activations propagate through the layers, the outputs either saturate or collapse toward zero, so deep networks still cannot be trained.

Xavier initialization
The goal is to keep the variance of the activations and of the local gradients roughly constant across layers during propagation; find a distribution for w that makes the input and output variances equal.
When var(w) = 1/N (with N the number of inputs to the layer), the input and output variances match.
He initialization
Suited to ReLU; the weights are sampled from \mathcal{N}(0, 2/N).
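
A sketch of Xavier and He initialization for a layer with N = n_in inputs, following the variance rules above:

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # var(w) = 1 / n_in keeps the input and output variances roughly equal
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def he_init(n_in, n_out):
    # var(w) = 2 / n_in compensates for ReLU zeroing out half the activations
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))
```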

Batch normalization

Batch-normalize the neuron outputs directly, usually after the fully connected layer and before the nonlinearity.
This keeps the outputs from falling into small-gradient regions and thus helps with the vanishing-gradient problem.
A learnable scale and shift let the network effectively learn the mean and variance of the normalized output.
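
A minimal sketch of batch normalization at training time; gamma and beta are the learnable scale and shift mentioned above:

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    # x has shape (batch, features); normalize each feature over the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta      # learnable rescaling and shift
```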

Underfitting

The model's descriptive power is too weak; the model is too simple.

Overfitting

The model performs well on the training set but poorly in the real scenario; it memorizes the training samples instead of extracting generalizable features.

The fundamental problem of machine learning

  • Optimization: achieve the best possible performance on the training set
  • Generalization: achieve good performance on unseen data

Early in training: optimization and generalization improve together.
Later in training: generalization degrades and overfitting sets in.

Solutions

  • Best option: obtain more data
  • Second-best option: adjust how much information the model can store
    • Adjust the model size
    • Constrain the model weights with regularization terms

Dropout (random deactivation)

Give hidden neurons a chance to be inactive: during training, Dropout randomly sets some neurons' outputs to 0.

  • Reduces the model capacity
  • Encourages the weights to spread out, which acts as regularization
  • Can be viewed as training an ensemble of models

Remaining problem
Dropout is not applied at test time, so the output scale differs from that seen during training.
Solution
Rescale the outputs of the kept neurons by a compensating factor during training, so the test-time forward pass needs no change.
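
A sketch of one common variant (inverted dropout), where the kept activations are rescaled during training so the test-time forward pass stays unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(x, keep_prob=0.5):
    # zero out units with probability 1 - keep_prob and rescale the survivors
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    return x * mask

def dropout_test(x):
    return x   # no change needed at test time with inverted dropout
```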

Hyperparameters

Learning rate

  • Too large: fails to converge
  • Somewhat large: oscillates and never reaches the optimum
  • Too small: takes too long to converge
  • Moderate: converges quickly to a good result

Hyperparameter search methods

  • Grid search
  • Random search: covers more hyperparameter combinations, usually the preferred choice
    Do a coarse search first, then refine
    Generally search in log space
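
A sketch of random search with the learning rate and regularization strength sampled in log space; the ranges and the train_and_evaluate function are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_search(train_and_evaluate, n_trials=20):
    best_params, best_score = None, -np.inf
    for _ in range(n_trials):
        lr = 10 ** rng.uniform(-5, -1)    # learning rate, log-uniform in [1e-5, 1e-1]
        reg = 10 ** rng.uniform(-6, -2)   # regularization strength, also in log space
        score = train_and_evaluate(lr=lr, reg=reg)
        if score > best_score:
            best_params, best_score = (lr, reg), score
    return best_params, best_score
```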