2.4 Activation functions
2022-07-01 09:08:00 【Enzo tried to smash the computer】
Contents
- I. Activation functions
- II. Common activation functions
- 1. Sigmoid function
- 2. Tanh/ Hyperbolic tangent activation function
- 3. ReLU Activation function
- 4. Leaky ReLU
- 5. Parametric ReLU Activation function
- 6. ELU Activation function
- 7. SeLU Activation function
- 8. Softmax Activation function
- 9. Swish Activation function
- 10. Maxout Activation function
- 11. Softplus Activation function
I. Activation functions
Activation functions serve several purposes in a neural network; their main role is to give the network nonlinear modeling capability. Without activation functions, a multilayer neural network can only handle linearly separable problems.
Therefore, neural networks use activation functions to introduce nonlinearity and improve the expressive power of the model.
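To make this concrete, here is a minimal sketch (assuming NumPy; the layer sizes and random weights are arbitrary choices of my own) showing that two stacked linear layers without an activation collapse into a single linear map, while inserting a nonlinearity such as ReLU breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "hidden layers" with purely linear transforms (hypothetical sizes).
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)
x = rng.normal(size=3)

# Stacking linear layers without an activation ...
h = W1 @ x + b1
y_linear = W2 @ h + b2

# ... is exactly one linear layer with merged weights: no extra expressive power.
W_merged = W2 @ W1
b_merged = W2 @ b1 + b2
assert np.allclose(y_linear, W_merged @ x + b_merged)

# Inserting a nonlinearity (ReLU here) breaks this collapse.
y_nonlinear = W2 @ np.maximum(h, 0) + b2
print(y_linear, y_nonlinear)
```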
II. Common activation functions
1. Sigmoid function
The Sigmoid function, also called the Logistic function, is used for hidden-layer neuron outputs. Its range is (0, 1): it maps any real number into the interval (0, 1), and it can be used for binary classification. It works well when the feature relationships are complex or the differences between features are not especially large. Sigmoid is a very common activation function; its expression is:
$$f(x) = \frac{1}{1 + e^{-x}}$$
Its graph looks like an S-shaped curve.
Under what circumstances is the Sigmoid activation function a good choice?
- The output range of the Sigmoid function is 0 to 1. Since the output values are bounded between 0 and 1, it normalizes the output of each neuron;
- It suits models that predict a probability as the output, because probabilities also lie between 0 and 1, so Sigmoid is a natural fit;
- The gradient is smooth, avoiding "jumping" output values;
- The function is differentiable, meaning the slope of the sigmoid curve can be found at any two points;
- It gives clear predictions, i.e. values very close to 1 or 0.
Shortcomings of the Sigmoid activation function:
- Vanishing gradient: note that the Sigmoid function flattens out as it approaches 0 and 1, that is, its gradient there is close to 0. When a network with Sigmoid activations performs backpropagation, neurons whose output is close to 0 or 1 receive gradients near 0; these are called saturated neurons. Their weights are therefore not updated, and the weights of neurons connected to them are updated very slowly as well. This problem is called the vanishing gradient. So if a large neural network contains many saturated Sigmoid neurons, it cannot effectively perform backpropagation (see the sketch after this list).
- Not zero-centered: the Sigmoid output is not zero-centered; it is always greater than 0. A non-zero-centered output biases the inputs of the next layer of neurons (bias shift), which further slows the convergence of gradient descent.
- Computationally expensive: compared with other nonlinear activation functions, exp() is costly to compute, so it runs more slowly.
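As a minimal sketch of the points above (NumPy; the function names are my own), the snippet below implements sigmoid and its derivative and shows how the gradient collapses toward 0 for large positive or negative inputs, i.e. the saturation behind the vanishing-gradient problem.

```python
import numpy as np

def sigmoid(x):
    """Logistic function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    """Derivative of sigmoid: sigma(x) * (1 - sigma(x)), at most 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(sigmoid(x))       # outputs squashed into (0, 1), always positive
print(sigmoid_grad(x))  # gradients near 0 at the saturated ends (x = -10, 10)
```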
2. Tanh / Hyperbolic tangent activation function
The Tanh activation function is also called the hyperbolic tangent activation function. Like the Sigmoid function, it takes real values as input, but Tanh squashes them into the range -1 to 1. Unlike Sigmoid, the output of Tanh is zero-centered, because its interval lies between -1 and 1.
The function expression is:
$$f(x) = \tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}} = \frac{2}{1 + e^{-2x}} - 1$$
We can see that the Tanh function is a scaled and shifted Logistic function, with range (-1, 1). The relationship between Tanh and sigmoid is:
$$\tanh(x) = 2\,\mathrm{sigmoid}(2x) - 1$$
The graph of the tanh activation function is also S-shaped. As a hyperbolic tangent function, its curve is quite similar to that of sigmoid, but it has a few advantages over the sigmoid function.
You can think of the Tanh function as two Sigmoid functions put together. In practice, Tanh is preferred over Sigmoid. Negative inputs are mapped to negative outputs, inputs near zero are mapped close to zero, and positive inputs are mapped to positive outputs:
- When the input is very large or very small, the output is almost flat and the gradient is small, which is not good for weight updates; this holds for both functions. The difference lies in the output range: tanh's output interval is (-1, 1) and the whole function is zero-centered, which makes it better than sigmoid;
- In the tanh graph, strongly negative inputs are mapped to strongly negative outputs, and zero inputs are mapped to values near zero.
Shortcomings of tanh:
- Like sigmoid, the Tanh function also suffers from the vanishing-gradient problem, so it likewise "kills" gradients when saturated (x very large or very small).
Note: in a typical binary classification problem, the tanh function is used for the hidden layers and the sigmoid function for the output layer, but this is not a fixed rule; it should be adapted to the specific problem.
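To check the relationship above numerically, here is a small NumPy sketch (function naming is my own) that verifies tanh(x) = 2 * sigmoid(2x) - 1 and compares mean outputs to show that tanh is zero-centered while sigmoid is not.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5.0, 5.0, 201)

# tanh is a scaled and shifted sigmoid: tanh(x) = 2*sigmoid(2x) - 1
assert np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0)

# tanh outputs are centered around 0, sigmoid outputs are always positive.
print(np.tanh(x).mean())    # ~0.0  (zero-centered)
print(sigmoid(x).mean())    # ~0.5  (not zero-centered)
```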
3. ReLU Activation function
The ReLU function, also known as the Rectified Linear Unit, is a piecewise-linear function. It alleviates the vanishing-gradient problem of the sigmoid and tanh functions and is widely used in today's deep neural networks. The ReLU function is essentially a ramp function; its formula is:
$$f(x) = \max(0, x)$$
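A minimal NumPy sketch of ReLU and its derivative (the function names are my own); note that the gradient is exactly 1 for positive inputs and 0 for negative inputs, so the positive side does not saturate the way sigmoid and tanh do.

```python
import numpy as np

def relu(x):
    """Rectified Linear Unit: max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def relu_grad(x):
    """Derivative of ReLU: 1 for positive inputs, 0 for negative inputs (0 at x = 0 here)."""
    return (x > 0).astype(float)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))       # [0.  0.  0.  0.5 3. ]
print(relu_grad(x))  # [0. 0. 0. 1. 1.]
```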
4. Leaky ReLU
5. Parametric ReLU Activation function
6. ELU Activation function
7. SeLU Activation function
8. Softmax Activation function
Softmax is an activation function for multi-class classification problems, in which class membership must be assigned over more than two class labels. For any real vector of length K, Softmax compresses it into a real vector of length K whose values lie in (0, 1) and whose elements sum to 1.
The function expression is as follows:
$$S_i = \frac{e^{i}}{\sum_{j \in \text{group}} e^{j}}$$
Softmax differs from the ordinary max function: max outputs only the largest value, whereas Softmax ensures that smaller values receive smaller probabilities rather than being discarded outright. We can think of it as a probabilistic or "soft" version of the argmax function.
The denominator of the Softmax function combines all factors of the original output values, which means that the probabilities produced by Softmax are related to each other.
Shortcomings of the Softmax activation function:
- Because it exponentiates its inputs, it can overflow numerically when the inputs are large; implementations usually subtract the maximum input before exponentiating (as in the sketch below);
- It is normally used only in the output layer of a multi-class classifier, not as a hidden-layer activation.
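Below is a minimal sketch of a numerically stable softmax in NumPy (the function name and test values are my own): subtracting the maximum input before exponentiating leaves the result unchanged but prevents overflow, and the outputs sum to 1, as described above.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    # Subtracting the max shifts the inputs but leaves the output ratios unchanged.
    shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs)         # approx. [0.659 0.242 0.099]
print(probs.sum())   # 1.0: a valid probability distribution over the classes
```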
9. Swish Activation function
10. Maxout Activation function
11. Softplus Activation function