From Tsinghua University: penalizing the gradient norm improves the generalization of deep learning models
2022-07-04 02:44:00 【Ghost road 2022】
1 Introduction
If the structure of a neural network is too simple or the training set is too small, the trained model tends to have low classification accuracy; if the network is overly complex relative to the available training data, the model tends to overfit. How to train a neural network so that it generalizes well is therefore a core problem in artificial intelligence. I recently read a paper on this problem, in which the authors improve the generalization of deep learning models by adding a regularization term to the loss function that penalizes the gradient norm. The authors explain and verify the method in detail, from both the theoretical and the experimental side. Lipschitz continuity is a very important and widely used mathematical tool in the theoretical analysis of deep learning, and the derivations in the paper start from the assumption that the neural network's loss function is Lipschitz continuous. To help readers follow the authors' elegant proofs more smoothly, this post fills in the mathematical details that the paper does not spell out.
Paper link: https://arxiv.org/abs/2202.03599
2 Lipschitz continuity
Given a training data set $\mathcal{S}=\{(x_i,y_i)\}_{i=1}^N$ drawn from a distribution $\mathcal{D}$, and a neural network $f(\cdot;\theta)$ with parameters $\theta \in \Theta$, the empirical loss is

$$L_{\mathcal{S}}(\theta)=\frac{1}{N}\sum_{i=1}^{N} l\big(f(x_i;\theta),\, y_i\big)$$

When the gradient norm is constrained inside the loss function, we obtain the penalized objective

$$L(\theta)=L_{\mathcal{S}}(\theta)+\lambda \cdot \|\nabla_\theta L_{\mathcal{S}}(\theta)\|_p$$

where $\|\cdot\|_p$ denotes the $p$-norm and $\lambda\in \mathbb{R}^{+}$ is the gradient penalty coefficient. In general, introducing this gradient regularization term gives the loss function a smaller local Lipschitz constant during optimization. The smaller the Lipschitz constant, the smoother the loss surface; in flat regions of the loss surface the weight parameters are easier to optimize, which in turn gives the trained deep learning model better generalization.
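As a concrete illustration, the following is a minimal PyTorch sketch (not the paper's released code) of this penalized objective computed naively by double backprop; `model`, `criterion`, `x`, and `y` are hypothetical placeholder names.

```python
import torch

def penalized_loss(model, criterion, x, y, lam=0.1):
    """L(theta) = L_S(theta) + lam * ||grad_theta L_S(theta)||_2 (naive version)."""
    params = [p for p in model.parameters() if p.requires_grad]
    loss = criterion(model(x), y)                         # L_S(theta)
    grads = torch.autograd.grad(loss, params,
                                create_graph=True)        # keep graph so the norm stays differentiable
    grad_norm = torch.cat([g.reshape(-1) for g in grads]).norm(p=2)
    return loss + lam * grad_norm
```

Calling `.backward()` on this loss differentiates through the gradient norm, which implicitly requires Hessian-vector products; this cost is exactly what the approximation derived in Section 3 below avoids.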
A very important and common concept in deep learning is Lipschitz continuity. Given a space $\Omega \subset \mathbb{R}^n$ and a function $h:\Omega \rightarrow \mathbb{R}^m$, if there exists a constant $K$ such that for all $\theta_1,\theta_2 \in \Omega$

$$\|h(\theta_1)-h(\theta_2)\|_2 \le K \cdot \|\theta_1 - \theta_2\|_2$$

then $h$ is called Lipschitz continuous, and $K$ is the Lipschitz constant. If for a parameter $\theta \in \Theta \subset \Omega$ there exists a neighborhood $\mathcal{A}$ of $\theta$ such that the restriction $h|_{\mathcal{A}}$ is Lipschitz continuous, then $h$ is said to be locally Lipschitz continuous. Intuitively, the Lipschitz constant is an upper bound on the rate of change of the output with respect to the input: for a small local Lipschitz constant, the outputs at any two points of the neighborhood $\mathcal{A}$ differ only within a small range.
By the mean value theorem, given a minimum point $\theta_i$ and any point $\theta_i^{\prime}\in \mathcal{A}$, the following holds:

$$|L(\theta_i^{\prime})-L(\theta_i)| = |\nabla L (\zeta)^{\top} (\theta_i^{\prime}-\theta_i)|$$

where $\zeta=c \theta_i + (1-c)\theta^\prime_i$ for some $c \in [0,1]$. By the Cauchy-Schwarz inequality,

$$|L(\theta_i^{\prime})-L(\theta_i)| \le \|\nabla L (\zeta)\|_2 \, \|\theta_i^{\prime}-\theta_i\|_2$$

As $\theta_i^{\prime}\rightarrow \theta_i$, the corresponding local Lipschitz constant approaches $\|\nabla L(\theta_i)\|_2$. Therefore, reducing the value of $\|\nabla L(\theta_i)\|_2$ makes the loss surface flatter around the solution and lets the model converge more smoothly.
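The bound above is easy to check numerically. The snippet below is a self-contained toy illustration (an assumed quadratic-plus-sine loss, not anything from the paper): for small perturbations, the change in the loss is essentially bounded by the local gradient norm times the step size.

```python
import torch

def toy_loss(theta):
    # A smooth scalar loss on a small parameter vector (toy example).
    return (theta ** 2).sum() + torch.sin(theta).sum()

theta = torch.randn(5, requires_grad=True)
loss = toy_loss(theta)
(grad,) = torch.autograd.grad(loss, theta)
grad_norm = grad.norm().item()

with torch.no_grad():
    for _ in range(3):
        delta = 1e-3 * torch.randn(5)        # a nearby point theta' = theta + delta
        lhs = abs(toy_loss(theta + delta).item() - loss.item())
        rhs = grad_norm * delta.norm().item()
        # For small delta, |L(theta') - L(theta)| is (approximately) bounded by ||grad L(theta)|| * ||delta||.
        print(f"|L(theta')-L(theta)| = {lhs:.3e},  ||grad L(theta)|| * ||delta|| = {rhs:.3e}")
```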
3 The paper's method
Taking the gradient of the loss function with the gradient-norm constraint gives
$$\nabla_\theta L(\theta)=\nabla_\theta L_{\mathcal{S}}(\theta)+\nabla_\theta\big(\lambda \cdot \|\nabla_\theta L_{\mathcal{S}}(\theta)\|_p\big)$$

In the paper the authors set $p=2$, in which case

$$\begin{aligned}\nabla_\theta \|\nabla_\theta L_\mathcal{S}(\theta)\|_2&=\nabla_\theta\big[\nabla^{\top}_\theta L_{\mathcal{S}}(\theta)\cdot \nabla_\theta L_\mathcal{S}(\theta)\big]^{\frac{1}{2}}\\&=\frac{1}{2}\big[\nabla^{\top}_\theta L_{\mathcal{S}}(\theta)\, \nabla_\theta L_\mathcal{S}(\theta)\big]^{-\frac{1}{2}}\cdot 2\,\nabla^2_\theta L_{\mathcal{S}}(\theta)\,\nabla_\theta L_\mathcal{S}(\theta)\\&=\nabla^2_\theta L_{\mathcal{S}}(\theta)\,\frac{\nabla_\theta L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}\end{aligned}$$

Substituting this result into the gradient of the constrained loss gives the following formula.
$$\nabla_\theta L(\theta)=\nabla_\theta L_{\mathcal{S}}(\theta)+\lambda \cdot \nabla^2_\theta L_{\mathcal{S}}(\theta)\, \frac{\nabla_\theta L_{\mathcal{S}}(\theta)}{\|\nabla_\theta L_{\mathcal{S}}(\theta)\|_2}$$

Note that this formula involves the Hessian matrix. In deep learning, computing the Hessian with respect to the parameters is very expensive, so an approximation is needed. The authors use a Taylor expansion of the loss function. Let $H=\nabla^2_\theta L_\mathcal{S}(\theta)$; then

$$L_\mathcal{S}(\theta+\Delta \theta)=L_\mathcal{S}(\theta)+\nabla^{\top}_{\theta}L_\mathcal{S}(\theta)\cdot \Delta \theta + \frac{1}{2} \Delta \theta^{\top} H \Delta \theta +\mathcal{O}(\|\Delta \theta\|_2^3)$$

Differentiating both sides with respect to $\Delta\theta$ (equivalently, with respect to $\theta$) gives

$$\nabla_\theta L_\mathcal{S}(\theta+\Delta \theta)=\nabla_\theta L_{\mathcal{S}}(\theta)+ H \Delta \theta + \mathcal{O}(\|\Delta \theta\|^2_2)$$

Now let $\Delta \theta=r v$, where $r$ is a small scalar and $v$ is a vector. Substituting into the formula above yields

$$H v =\frac{\nabla_\theta L_{\mathcal{S}}(\theta + r v)-\nabla_\theta L_{\mathcal{S}}(\theta)}{r}+\mathcal{O}(r)$$

Finally, choosing $v=\frac{\nabla_{\theta}L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}$ gives

$$H\, \frac{\nabla_{\theta}L_{\mathcal{S}}(\theta)}{\|\nabla_\theta L_{\mathcal{S}}(\theta)\|_2}\approx \frac{\nabla_\theta L_{\mathcal{S}}\!\left(\theta + r\frac{\nabla_\theta L_{\mathcal{S}}(\theta)}{\|\nabla_\theta L_{\mathcal{S}}(\theta)\|_2}\right)-\nabla_\theta L_{\mathcal{S}}(\theta)}{r}$$
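The finite-difference Hessian-vector product above can be computed without ever forming $H$. Below is an illustrative PyTorch sketch under the same hypothetical placeholder names as before (`model`, `criterion`, batch `(x, y)`); it is a sketch of the approximation, not the authors' released implementation.

```python
import torch

def flat_grad(params, model, criterion, x, y):
    """Flattened gradient of L_S at the model's current parameter values."""
    loss = criterion(model(x), y)
    grads = torch.autograd.grad(loss, params)
    return torch.cat([g.reshape(-1) for g in grads])

@torch.no_grad()
def add_to_params_(params, direction, scale):
    """In-place update theta <- theta + scale * direction (direction is a flat vector)."""
    offset = 0
    for p in params:
        n = p.numel()
        p.add_(scale * direction[offset:offset + n].view_as(p))
        offset += n

def hvp_finite_difference(model, criterion, x, y, v, r=1e-2):
    """Approximate H v ~= (grad L_S(theta + r*v) - grad L_S(theta)) / r."""
    params = [p for p in model.parameters() if p.requires_grad]
    g0 = flat_grad(params, model, criterion, x, y)   # grad L_S(theta)
    add_to_params_(params, v, r)                     # move to theta + r*v
    g1 = flat_grad(params, model, criterion, x, y)   # grad L_S(theta + r*v)
    add_to_params_(params, v, -r)                    # restore theta
    return (g1 - g0) / r                             # error is O(r)
```

Setting `v = g0 / g0.norm()` recovers exactly the Hessian-vector product that appears in the penalized gradient above.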
Putting everything together and rearranging, we obtain
$$\begin{aligned}\nabla_\theta L(\theta)&=\nabla_\theta L_\mathcal{S} (\theta)+\frac{\lambda}{r}\cdot \left(\nabla_\theta L_{\mathcal{S}}\!\left(\theta + r \frac{\nabla_\theta L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}\right)-\nabla_\theta L_\mathcal{S}(\theta)\right)\\&=(1-\alpha)\,\nabla_\theta L_\mathcal{S} (\theta)+\alpha\, \nabla_\theta L_\mathcal{S}\!\left(\theta+r \frac{\nabla_\theta L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}\right)\end{aligned}$$

where $\alpha=\frac{\lambda}{r}$ is called the balance coefficient, with value range $0 \le \alpha \le 1$. The second term would still require a Hessian if the chain rule were applied to it, because the perturbation $r \frac{\nabla_\theta L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}$ itself depends on $\theta$. To avoid this, the authors make one further approximation: the perturbation is treated as a constant, so the second gradient is simply $\nabla_\theta L_\mathcal{S}$ evaluated at the perturbed point,

$$\nabla_\theta L_\mathcal{S}\!\left(\theta+r \frac{\nabla_\theta L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}\right)\approx \nabla_\theta L_\mathcal{S} (\theta)\Big|_{\theta =\theta +r \frac{\nabla_\theta L_\mathcal{S}(\theta)}{\|\nabla_\theta L_\mathcal{S}(\theta)\|_2}}$$

The algorithm flow chart in the paper summarizes the resulting training procedure; a code sketch of one training step is given below.
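To make the procedure concrete, here is a minimal sketch of one training step under the final approximation, written in PyTorch with hypothetical placeholder names (`model`, `criterion`, `optimizer`, batch `(x, y)`); `r` and `alpha` correspond to the step size and the balance coefficient in the text. This is an illustrative reading of the algorithm, not the authors' released code.

```python
import torch

def gradient_norm_penalty_step(model, criterion, optimizer, x, y,
                               r=0.01, alpha=0.5):
    params = [p for p in model.parameters() if p.requires_grad]

    # First pass: g = grad L_S(theta).
    loss = criterion(model(x), y)
    grads = torch.autograd.grad(loss, params)
    grad_norm = torch.sqrt(sum((g * g).sum() for g in grads)) + 1e-12

    # Perturb the parameters: theta <- theta + r * g / ||g||_2.
    with torch.no_grad():
        for p, g in zip(params, grads):
            p.add_(r * g / grad_norm)

    # Second pass: g' = grad L_S at the perturbed point
    # (the perturbation is treated as a constant, matching the approximation above).
    loss_perturbed = criterion(model(x), y)
    grads_perturbed = torch.autograd.grad(loss_perturbed, params)

    # Restore the parameters and use the combined gradient
    # (1 - alpha) * g + alpha * g' for the optimizer update.
    with torch.no_grad():
        for p, g, g_p in zip(params, grads, grads_perturbed):
            p.sub_(r * g / grad_norm)
            p.grad = (1 - alpha) * g + alpha * g_p

    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Note that each update requires two forward-backward passes, so the per-step cost is comparable to SAM.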
4 Experimental results
The following table (from the paper) compares test error rates on the CIFAR-10 and CIFAR-100 datasets for different CNN architectures under three training methods: standard training, SAM, and the gradient-norm-constrained training of this paper. It is easy to see that the proposed method achieves the lowest test error rate in most cases, which corroborates that training with the proposed method improves the generalization of CNN models.
The authors also ran experiments on the currently very popular Vision Transformer architecture. The following table (from the paper) compares test error rates on CIFAR-10 and CIFAR-100 for different ViT architectures under standard training, SAM, and the gradient-norm-constrained training of this paper. Again, the proposed method achieves the lowest test error rate in all cases, which shows that it can also improve the generalization of Vision Transformer models.