Summary of the activation functions sigmoid, tanh, ReLU, and GELU in machine learning and deep learning
2022-06-24 04:05:00 【Goose】
0. Background
This post summarizes the formulas of commonly used activation functions and their advantages and disadvantages, covering sigmoid, tanh, ReLU, and GELU.
1. sigmoid
- The sigmoid function smoothly maps the real line into (0, 1): sigmoid(x) = 1/(1+e^{-x}).
- Its output can be interpreted as a probability of the positive class (probabilities range from 0 to 1), with the center at 0.5.
- Sigmoid is monotonically increasing and continuously differentiable, and its derivative has a very simple form, sigmoid'(x) = sigmoid(x)(1 - sigmoid(x)), which makes it a convenient function to work with.
Advantages: smooth and easy to differentiate.
Disadvantages:
- The activation function is computationally expensive (both forward propagation and backpropagation involve exponentiation and division);
- When computing the error gradient by backpropagation, the derivative involves division;
- The derivative of sigmoid lies in (0, 0.25]. Because of the chain rule in neural network backpropagation, the gradient easily vanishes. For example, in a 10-layer network, 0.25^10 ≈ 9.5e-7 is extremely small, so the gradient of the layer-10 error with respect to the first convolution layer's weights W1 will be a very small value; this is the so-called "vanishing gradient".
- The output of sigmoid is not zero-mean (i.e., not zero-centered), so neurons in the next layer receive the previous layer's non-zero-mean signal as input; as the network deepens, this keeps shifting the original distribution of the data.
Derivation: https://zhuanlan.zhihu.com/p/24967776
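To make these points concrete, here is a minimal NumPy sketch (the function names are mine, for illustration only). It shows the exponentiation and division in the forward pass, the 0.25 cap on the derivative, and why a chain of such factors shrinks toward zero:

```python
import numpy as np

def sigmoid(x):
    """Sigmoid: smoothly maps the real line into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))  # exponentiation + division on every call

def sigmoid_grad(x):
    """Derivative sigmoid(x) * (1 - sigmoid(x)); its maximum is 0.25 at x = 0."""
    s = sigmoid(x)
    return s * (1.0 - s)

x = np.linspace(-6.0, 6.0, 7)
print(sigmoid(x))         # all outputs positive: not zero-centered
print(sigmoid_grad(0.0))  # 0.25, the largest possible gradient factor
print(0.25 ** 10)         # ~9.5e-07: worst-case chain of factors over 10 layers
```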
2. tanh
tanh is the hyperbolic tangent function: tanh(x) = (e^x - e^{-x})/(e^x + e^{-x}). Like sigmoid, it belongs to the saturating activation functions; the difference is that the output range changes from (0, 1) to (-1, 1). You can view tanh as the result of shifting and stretching sigmoid: tanh(x) = 2·sigmoid(2x) - 1.
Characteristics of tanh as an activation function, compared with sigmoid:
- The output range of tanh is (-1, 1), which solves the problem that sigmoid's output is not zero-centered;
- The cost of exponentiation remains;
- The derivative of tanh lies in (0, 1], versus sigmoid's (0, 0.25], so the vanishing gradient problem is alleviated, although it still exists.
A common DNN pattern is to use tanh in the earlier layers and sigmoid in the final layer.
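A small comparison sketch (NumPy, names mine for illustration) of the properties above: zero-centered outputs, the (0, 1] derivative range, and the relation to sigmoid:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def tanh_grad(x):
    """Derivative of tanh: 1 - tanh(x)^2, which lies in (0, 1]."""
    return 1.0 - np.tanh(x) ** 2

x = np.linspace(-3.0, 3.0, 7)
print(np.tanh(x))      # outputs in (-1, 1), symmetric about 0: zero-centered
print(tanh_grad(0.0))  # 1.0 at x = 0, versus sigmoid's maximum of 0.25
# tanh as a shifted and stretched sigmoid:
print(np.allclose(np.tanh(x), 2.0 * sigmoid(2.0 * x) - 1.0))  # True
```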
3. relu
ReLU (Rectified Linear Unit), the rectified linear unit, has a relatively simple form.
Formula: ReLU(x) = max(0, x)
Characteristics of ReLU as an activation function:
- Compared with sigmoid and tanh, ReLU drops the expensive computation and is much faster to evaluate.
- It greatly reduces the vanishing gradient problem (the gradient is exactly 1 for x > 0) and converges faster than sigmoid and tanh, but beware of gradient explosion with ReLU.
- It makes it easy to obtain a good model, but the model must also be kept from falling into the 'dead' state during training.
ReLU forces the output of the x < 0 part to 0 (setting it to 0 masks the feature), which may prevent the model from learning effective features. If the learning rate is set too large, most of the network's neurons may end up in the 'dead' state, so a network using ReLU should not use too large a learning rate.
In Leaky ReLU, the negative-side slope in the formula is a constant, usually set to 0.01: LeakyReLU(x) = max(0.01x, x). This function usually works somewhat better than ReLU, but the improvement is not very stable, so Leaky ReLU is not used much in practice.
PReLU (parametric rectified linear unit) treats the slope as a learnable parameter that is updated during training.
RReLU (randomized rectified linear unit) is another variant of Leaky ReLU. In RReLU, the negative-side slope is random during training and becomes fixed at test time. The highlight of RReLU is that during training, the slope a_ji is a random number drawn from a uniform distribution U(l, u). A sketch of these variants follows below.
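Below is a minimal NumPy sketch of the ReLU family described above. The function names and the U(1/8, 1/3) bounds are illustrative assumptions, not taken from this post; PReLU is omitted because its learnable slope is framework-specific (e.g., an optimizer-updated parameter):

```python
import numpy as np

def relu(x):
    """ReLU(x) = max(0, x): no exponentiation, cheap to compute."""
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: a small constant slope alpha for x < 0 (typically 0.01)."""
    return np.where(x > 0, x, alpha * x)

def rrelu(x, lower=1.0 / 8, upper=1.0 / 3, training=True, seed=0):
    """RReLU sketch: negative-side slope drawn from U(lower, upper) during
    training, fixed to the midpoint (lower + upper) / 2 at test time."""
    if training:
        a = np.random.default_rng(seed).uniform(lower, upper, size=np.shape(x))
    else:
        a = (lower + upper) / 2.0
    return np.where(x > 0, x, a * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))        # negative inputs masked to 0 -> risk of 'dead' units
print(leaky_relu(x))  # keeps a small gradient alive for x < 0
print(rrelu(x))       # random negative-side slopes while training
```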
4. Gelu
GELU (Gaussian Error Linear Units) is what we often call the Gaussian error linear unit. It is a high-performance neural network activation function whose nonlinearity is the expectation of a stochastic zero-or-identity transformation of the input. The formula is as follows:
GELU(x) = x · Φ(x)
where Φ(x) is the cumulative distribution function of the standard Gaussian normal distribution evaluated at x; its complete form is:
Φ(x) = P(X ≤ x) = (1/√(2π)) ∫_{-∞}^{x} e^{-t²/2} dt
The result can be approximated as:
GELU(x) ≈ 0.5 · x · (1 + tanh(√(2/π) · (x + 0.044715 · x³)))
Or it can be expressed as:
GELU(x) ≈ x · sigmoid(1.702 · x)
From this we can see that the probability P(X ≤ x) (where x can be regarded as the pre-activation input of the current neuron), i.e. the cumulative distribution Φ(x) of the Gaussian normal distribution ϕ(X), changes with x: as x increases, Φ(x) increases, and as x decreases, Φ(x) decreases. The smaller x is, the more likely the activation result is 0 after passing through the activation function, i.e. the neuron is dropped out; the larger x is, the more likely the neuron is to be kept.
Usage tips:
1. When training with GELU as the activation function, it is recommended to use an optimizer with momentum, which is standard practice for deep networks.
2. When using GELU, the choice of the σ function in the approximation above is important. It should be a function that approximates the cumulative distribution of the normal distribution; a common choice is sigmoid(x) = 1/(1+e^{-x}), which gives the x · sigmoid(1.702x) form.
Advantages:
- Compared with ReLU, it adds nonlinearity to the network model in a smoother way.
- ReLU maps inputs below 0 to 0 and maps inputs above 0 to themselves (an identity mapping). Although it performs better than sigmoid, it ignores the statistical characteristics of the data, whereas GELU adds these statistical properties on top of ReLU. The paper reports that GELU outperforms ReLU on several deep learning tasks. See the sketch after this list.
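A minimal NumPy sketch of the exact GELU and the two approximations above (it assumes SciPy is available for the error function; the function names are mine, for illustration):

```python
import numpy as np
from scipy.special import erf  # standard normal CDF via the error function

def gelu_exact(x):
    """GELU(x) = x * Phi(x), with Phi the standard normal CDF."""
    return x * 0.5 * (1.0 + erf(x / np.sqrt(2.0)))

def gelu_tanh(x):
    """Tanh approximation: 0.5*x*(1 + tanh(sqrt(2/pi)*(x + 0.044715*x^3)))."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

def gelu_sigmoid(x):
    """Coarser sigmoid approximation: x * sigmoid(1.702 * x)."""
    return x / (1.0 + np.exp(-1.702 * x))

x = np.linspace(-3.0, 3.0, 61)
print(np.max(np.abs(gelu_exact(x) - gelu_tanh(x))))     # very small gap
print(np.max(np.abs(gelu_exact(x) - gelu_sigmoid(x))))  # larger, but cheaper
```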