On weight decay and dropout
2022-07-28 03:14:00 【LiterMa】
These are my study notes on the videos by Li Mu and Wang Mu.
Weight decay
In general, the more parameters a model has, the larger its capacity (the degree to which it can fit the data). To prevent overfitting we sometimes need to reduce that capacity, for example by limiting the range of values the parameters can take.
$$\min\; \mathscr{l}(w,b) \quad \text{subject to} \quad \|w\|^2 \leq \theta$$
The smaller $\theta$ is, the stronger the regularization.
This is a constrained extremum problem with the constraint $\|w\|^2\leq \theta$. It can be solved with the method of Lagrange multipliers, so we construct:
$$\min\;\mathscr{l}(w,b)+\cfrac{\lambda}{2}\big(\|w\|^2-\theta\big)$$
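Given either $\lambda$ or $\theta$, the other can be solved for, and for a fixed $\lambda$ the term $-\cfrac{\lambda}{2}\theta$ is a constant that does not change the minimizer. The problem is therefore equivalent to: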
$$\min\;\mathscr{l}(w,b)+\cfrac{\lambda}{2}\|w\|^2$$
It can be shown that as $\lambda\rightarrow\infty$, $w^*\rightarrow 0$, so $\lambda$ takes over the role of $\theta$ in controlling how strongly the weights are constrained.
In the accompanying figure, $C$ stands for $\theta$ and the two coordinate axes are the components of $w$. The smaller $C$ is, the smaller $w$ is forced to be, which is how the constraint limits the capacity of the model.
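The relation between $\lambda$ and the size of $w^*$ can be checked numerically. The following is a minimal sketch, assuming a linear model with squared loss and no bias term; that choice is an assumption made here only because it has a closed-form minimizer, and the notes themselves do not fix a particular loss. The function `ridge_solution` and the synthetic data `X`, `y` are hypothetical names for illustration.

```python
import numpy as np

def ridge_solution(X, y, lam):
    """Minimizer of 0.5 * ||X w - y||^2 + (lam / 2) * ||w||^2.

    Assumption for illustration: squared loss, linear model, no bias term.
    """
    d = X.shape[1]
    # Setting the gradient X^T (X w - y) + lam * w to zero gives a linear system.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Synthetic data (hypothetical): 50 samples, 5 features, small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([1.0, -2.0, 0.5, 3.0, -1.0]) + 0.1 * rng.normal(size=50)

lams = [0.0, 1.0, 10.0, 100.0, 1e6]
norms = [float(np.sum(ridge_solution(X, y, lam) ** 2)) for lam in lams]
for lam, n in zip(lams, norms):
    # The squared norm of the minimizer shrinks as lam grows.
    print(f"lambda={lam:>9}: ||w*||^2 = {n:.8f}")
```

Each value of $\lambda$ in this sketch corresponds to the constraint level $\theta = \|w^*\|^2$ that its minimizer attains, which is one way to read the statement that knowing one of the two determines the other.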
The gradient computation and the parameter update then become:
Gradient:
$$\cfrac{\partial}{\partial w}\Big(\mathscr{l}(w,b)+\cfrac{\lambda}{2}\|w\|^2\Big)\;=\;\cfrac{\partial \mathscr{l}(w,b)}{\partial w}+\lambda w$$
Substituting this gradient into the gradient descent step $w_{t+1}=w_t-\eta\,(\cdot)$ and factoring out $w_t$ gives the parameter update formula:
$$w_{t+1}=(1-\eta\lambda)\,w_t-\eta\,\cfrac{\partial \mathscr{l}(w_t,b_t)}{\partial w_t}$$
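Usually $\eta\lambda<1$, so every step first shrinks $w_t$ by the factor $(1-\eta\lambda)$ and then applies the ordinary gradient step. This is why the method is called weight decay.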
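The equivalence between a gradient step on the penalized loss and the "shrink, then step" form can be verified directly. This is a small sketch under assumptions: the loss is a mean squared error on a linear model, chosen only for illustration, and `loss_grad`, `step_penalized` and `step_weight_decay` are hypothetical helper names.

```python
import numpy as np

def loss_grad(w, X, y):
    """Gradient of the illustrative loss l(w) = 0.5 * mean((X w - y)^2)."""
    return X.T @ (X @ w - y) / len(y)

def step_penalized(w, X, y, eta, lam):
    # Plain gradient step on l(w) + (lam / 2) * ||w||^2.
    return w - eta * (loss_grad(w, X, y) + lam * w)

def step_weight_decay(w, X, y, eta, lam):
    # Same step after factoring out w: shrink by (1 - eta * lam),
    # then descend on the unpenalized loss l alone.
    return (1 - eta * lam) * w - eta * loss_grad(w, X, y)

# Random data and weights (hypothetical values).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
eta, lam = 0.1, 0.5  # eta * lam = 0.05 < 1

w_a = step_penalized(w, X, y, eta, lam)
w_b = step_weight_decay(w, X, y, eta, lam)
print(np.allclose(w_a, w_b))  # the two forms agree up to rounding
```

Writing the update in the second form makes the role of $\lambda$ visible: it only changes the multiplier applied to the old weights, while the gradient of the original loss is untouched.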
Dropout
Dropout applies the following perturbation to each output element $x_i$:
$$x_i^{'}=\begin{cases}0, & \text{with probability } p \\ \cfrac{x_i}{1-p}, & \text{otherwise}\end{cases}$$
After this perturbation, the expectation of each perturbed element $x_i^{'}$ is still $x_i$:
$$E(x_i^{'})=0\cdot p+(1-p)\cdot \cfrac{x_i}{1-p}=x_i$$
Dropout randomly sets some of the hidden-layer outputs to 0, which limits the complexity of the model. The drop probability $p$ is a hyperparameter that controls how strongly the complexity is limited.
Dropout is enabled only during training, where it affects how the parameters are updated. It is not used at inference.
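The perturbation and its two properties (expectation preserved during training, no effect at inference) can be sketched in a few lines of NumPy. This is an illustrative sketch, not the implementation of any particular framework; the function name `dropout` and its `training` and `rng` parameters are assumptions introduced here.

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Zero each element with probability p and scale the rest by 1 / (1 - p)."""
    if not training or p == 0.0:
        return x  # inference: pass the activations through unchanged
    if p >= 1.0:
        return np.zeros_like(x)  # everything is dropped; avoids dividing by zero
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(x.shape) >= p  # True with probability 1 - p
    return keep * x / (1.0 - p)

rng = np.random.default_rng(0)
x = np.ones(100_000)
out = dropout(x, p=0.3, rng=rng)

print(out.mean())       # close to 1.0, the value of every x_i
print(np.unique(out))   # only 0 and 1 / (1 - p) occur
print(dropout(x, p=0.3, training=False) is x)  # inference leaves x untouched
```

Scaling the kept elements by $1/(1-p)$ at training time is what allows inference to skip dropout entirely without rescaling the activations.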