GAN Training Tips: An Alchemist's Training Plan - Training, Tuning, and Improving Generative Adversarial Networks
2022-07-27 02:08:00 【Medium coke with ice】
Contents

- I. Mode collapse: the generator produces only a limited set of patterns
- II. Slow training: vanishing gradients
- III. Non-convergence: unstable training and slow convergence
- IV. Overfitting
- V. Detect failure as early as possible
- VI. Assorted training tips
- Final words
Generative adversarial networks (GAN: Generative adversarial networks) are an important class of generative models in deep learning. Two networks (a generator and a discriminator) are trained at the same time and compete in a minimax game. This adversarial setup avoids several difficulties that traditional generative models run into in practice, cleverly approximating otherwise intractable loss functions through adversarial learning.

We introduced the principles of GANs earlier: an accessible explanation of the mathematics behind GANs. The heart of a GAN is finding the Nash equilibrium between D and G, but in practice GAN training is unstable, and a poor training setup easily leads to problems such as mode collapse. This article records some training tips. They will not necessarily suit your model, and there may be omissions and mistakes; treat them as a learning reference, and corrections and additions are welcome.
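For reference, the two networks compete over the standard minimax value function from the original GAN formulation, where D tries to maximize V and G tries to minimize it:

$$\min_{G} \max_{D} V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}\left[\log D(x)\right] + \mathbb{E}_{z \sim p_{z}(z)}\left[\log \left(1 - D(G(z))\right)\right]$$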
I. Mode collapse: the generator produces only a limited set of patterns

In a narrow sense, mode collapse means the generator produces only a single pattern, or a limited set of patterns, to fool the discriminator. It does this only to drive the discriminator loss D_loss down, while ignoring the overall distribution of the dataset. For example, on an animal image dataset a GAN may discover during training that it generates cats and dogs very well but cattle, sheep, and monkeys very poorly, so G ends up producing only cats and dogs and never learns to generate images of the other animals, and the output becomes monotonous. Mode collapse is essentially an optimization problem of GAN training, and even the best GAN researchers still struggle with it.

There are many ways to tackle mode collapse, as follows:

1.1 Improve the training procedure
- Minibatch discriminator (mini-batch discrimination): because the discriminator normally processes one sample at a time, the gradient information the generator receives for each sample lacks coordination across samples, and all gradients end up pointing in the same direction. A minibatch discriminator therefore no longer considers a sample in isolation but looks at all samples in a minibatch at once; for a concrete implementation see: how the minibatch discriminator solves mode collapse.
- Experience replay: show the discriminator old fake samples at regular intervals; this minimizes jumping between modes. It keeps the discriminator from being exploited too easily, but only for the modes the generator has already explored in the past.
- Adjust the GAN's learning rate: this obstacle can sometimes be overcome by changing this one hyperparameter, using a smaller learning rate and retraining from scratch. The learning rate is one of the most important hyperparameters, if not the most important; even small changes to it can fundamentally change the course of training.
- Feature matching: feature matching changes the generator's cost function to minimize the statistical difference between the features of real and generated images, measured as the L2 distance between the means of their feature vectors.
- Pack several samples belonging to the same category together before passing them to the discriminator network D.
- Unrolled ('lookahead') updates: when updating the generator, consider not only the discriminator's current state but also its state after K further updates, and combine the two to choose the update. In other words, the discriminator parameters are first advanced by K consecutive gradient steps, which gives the generator some 'foresight' and avoids short-sighted behaviour. The K-step update is as follows (a code sketch appears after the equations):
$$\begin{aligned} \theta_{D}^{0} &= \theta_{D} \\ &\;\;\vdots \\ \theta_{D}^{K} &= \theta_{D}^{K-1} + \eta \, \frac{\partial f\left(\theta_{G}, \theta_{D}^{K-1}\right)}{\partial \theta_{D}^{K-1}} \end{aligned}$$

The generator's optimization target becomes $\theta_{G}=\arg \min_{\theta_{G}} f\left(\theta_{G}, \theta_{D}^{K}\left(\theta_{G}, \theta_{D}\right)\right)$, and its gradient becomes:

$$\frac{d f_{K}\left(\theta_{G}, \theta_{D}\right)}{d \theta_{G}} = \frac{\partial f\left(\theta_{G}, \theta_{D}^{K}\left(\theta_{G}, \theta_{D}\right)\right)}{\partial \theta_{G}} + \frac{\partial f\left(\theta_{G}, \theta_{D}^{K}\left(\theta_{G}, \theta_{D}\right)\right)}{\partial \theta_{D}^{K}\left(\theta_{G}, \theta_{D}\right)} \, \frac{\partial \theta_{D}^{K}\left(\theta_{G}, \theta_{D}\right)}{\partial \theta_{G}}$$
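As a rough illustration, here is a minimal PyTorch sketch of this lookahead update. It makes a throwaway copy of the discriminator, takes K plain gradient steps on it, and then updates the generator against the copy. It is a simplification: only the first term of the gradient above is used (we do not back-propagate through the copy's updates, which the full unrolled-GAN method does), the discriminator is assumed to output raw logits, and all names and hyperparameter values are illustrative.

```python
import copy
import torch
import torch.nn.functional as F

def unrolled_generator_step(G, D, g_opt, real, z, k=5, eta=1e-4):
    """Update G against a discriminator unrolled k extra steps (simplified sketch)."""
    D_k = copy.deepcopy(D)                       # theta_D^0 = theta_D
    fake = G(z)
    for _ in range(k):                           # theta_D^k = theta_D^{k-1} + eta * grad
        d_real, d_fake = D_k(real), D_k(fake.detach())
        # Standard discriminator loss; minimizing it is equivalent to ascending f.
        d_loss = (F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) +
                  F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake)))
        grads = torch.autograd.grad(d_loss, list(D_k.parameters()))
        with torch.no_grad():
            for p, g in zip(D_k.parameters(), grads):
                p -= eta * g
    # Generator loss measured against the unrolled discriminator copy.
    d_fake = D_k(fake)
    g_loss = F.binary_cross_entropy_with_logits(d_fake, torch.ones_like(d_fake))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return g_loss.item()
```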
1.2 Improve the objective function

- Feature matching: change the generator's loss function (see 1.1);
- Use the Wasserstein distance instead of the JS divergence;
- Add a gradient penalty term: WGAN-GP, DRAGAN (a minimal sketch of the penalty appears after this list);
- Introduce pixel-level losses such as L1 or L2, especially in the early stages of training;
- Add a regularization term to the loss function to help the GAN find more diversity;
- Use a mean squared loss instead of a log loss.
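For the gradient-penalty item above, a minimal WGAN-GP-style penalty might look like the following sketch (the coefficient 10 and the image-shaped broadcasting of `alpha` are conventional assumptions, not from the original post):

```python
import torch

def gradient_penalty(D, real, fake, lambda_gp=10.0):
    """Penalize the discriminator's gradient norm at points interpolated
    between real and generated samples (WGAN-GP style sketch)."""
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)   # per-sample mixing weight
    interp = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_out = D(interp)
    grads = torch.autograd.grad(outputs=d_out, inputs=interp,
                                grad_outputs=torch.ones_like(d_out),
                                create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return lambda_gp * ((grad_norm - 1.0) ** 2).mean()

# Typical use inside the discriminator step:
# d_loss = -(D(real).mean() - D(fake).mean()) + gradient_penalty(D, real, fake.detach())
```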
1.3 Improve the network architecture

- Use multiple generators: accept that a single GAN covers only a subset of the modes in the dataset, and train multiple generators for different modes instead of fighting mode collapse directly; together they generate the images, and the results are more diverse;
- Self-attention: global information (long-range dependencies) is used to generate better images (a minimal self-attention layer is sketched below).
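As an illustration, a SAGAN-style self-attention layer could look like the following sketch (the channel-reduction factor of 8 and the learnable `gamma` initialized to zero are conventional choices, assumed here rather than taken from the original post):

```python
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Minimal SAGAN-style self-attention block for image feature maps."""
    def __init__(self, in_channels):
        super().__init__()
        self.query = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.key = nn.Conv2d(in_channels, in_channels // 8, kernel_size=1)
        self.value = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))   # starts as an identity mapping

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).view(b, -1, h * w).permute(0, 2, 1)   # (b, hw, c/8)
        k = self.key(x).view(b, -1, h * w)                       # (b, c/8, hw)
        attn = torch.softmax(torch.bmm(q, k), dim=-1)            # (b, hw, hw)
        v = self.value(x).view(b, -1, h * w)                     # (b, c, hw)
        out = torch.bmm(v, attn.permute(0, 2, 1)).view(b, c, h, w)
        return self.gamma * out + x                              # residual connection
```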
II. Slow training: vanishing gradients

- Use residual structures in the network: they make the effective depth adaptive while avoiding vanishing gradients (a small residual block is sketched after this list);
- Softmax + cross-entropy loss: this loss counteracts the vanishing-gradient effect caused by differentiating the activation function;
- Use the Adam optimizer;
- Do not train the discriminator too well, to avoid training breaking down later from vanishing gradients. The discriminator's task is to help learn a distance between the dataset's underlying probability distribution and the implicit distribution defined by the generator; the generator's task is to minimize that distance;
- For models with many layers, try to avoid fully connected layers.
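A small illustration of the residual-structure item above: a plain residual block for a generator or discriminator might look like this (layer sizes and the LeakyReLU slope are illustrative assumptions):

```python
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    """Residual block: the skip connection lets gradients bypass the conv path,
    which helps against vanishing gradients in deeper GANs."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm1 = nn.BatchNorm2d(channels)
        self.norm2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        h = F.leaky_relu(self.norm1(self.conv1(x)), 0.2)
        h = self.norm2(self.conv2(h))
        return F.leaky_relu(x + h, 0.2)   # identity skip keeps the gradient path open
```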
III. Non-convergence: unstable training and slow convergence

- When the generator or discriminator loss suddenly rises or falls, do not stop training on a whim. The loss curves tend to rise and fall somewhat randomly, and there is nothing necessarily wrong with that. When you hit a sudden instability, train for longer and focus on the quality of the generated images; visual inspection is usually more meaningful than a few loss numbers;
- Add noise: injecting noise helps the overall diversity and stability of the system. Add noise to both the real data and the synthetic data (for example, the images produced by the generator). Mathematically this should help, because it lends some stability to the data distributions of the two competing networks;
- Soft or noisy labels: if real images are labelled 1, change the label to a lower value such as 0.9. This keeps the discriminator from becoming too confident in its classification, or put another way, from relying on a very limited set of features to decide whether an image is real or fake (both this and the noise trick above are sketched below).
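The last two items (instance noise and soft labels) are easy to combine in the discriminator loss; a minimal sketch, assuming the discriminator outputs raw logits (the values 0.05 and 0.9 are illustrative):

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, real, fake, noise_std=0.05, real_label=0.9):
    """Soft real labels plus Gaussian instance noise on both inputs (sketch)."""
    real_in = real + noise_std * torch.randn_like(real)   # noise on real images
    fake_in = fake + noise_std * torch.randn_like(fake)   # noise on generated images
    d_real, d_fake = D(real_in), D(fake_in)
    loss_real = F.binary_cross_entropy_with_logits(d_real, torch.full_like(d_real, real_label))
    loss_fake = F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))
    return loss_real + loss_fake
```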
IV. Overfitting

In a GAN, if the discriminator relies on a small set of features to detect real images, the generator can exploit it by producing only those features. The optimization can become too greedy and produce no long-term benefit;

- Use regularization to avoid overfitting; L1 and L2 are the two common choices. If one is already in use, adjust its strength;
- Dropout: let some neurons stop working with a certain probability. Randomly select a subset of hidden-layer neurons and temporarily remove them, update only the parameters that were not removed in that training step, keep the removed neurons' parameters at their previous values, and repeat the process (see the sketch after this list);
- Soft or noisy labels (as in section III).
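A minimal sketch of the two regularization items above: dropout layers inside the discriminator and an L2 penalty via the optimizer's weight_decay (the architecture and the values 0.3 / 1e-4 are illustrative assumptions):

```python
import torch
import torch.nn as nn

discriminator = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),                 # randomly silence feature maps during training
    nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1),
    nn.LeakyReLU(0.2),
    nn.Dropout2d(0.3),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(128, 1),                 # single real/fake logit
)
# weight_decay applies an L2 penalty to the discriminator's weights
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=2e-4,
                               betas=(0.5, 0.999), weight_decay=1e-4)
```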
V. Detect failure as early as possible

- D's loss stays close to 0: declare failure right away. The discriminator is too strong and the generator can no longer produce better fakes; you can also regard this as vanishing gradients. The situation is very common, because telling real samples from fake ones is usually easier than forging realistic samples;
- D's loss stays high and does not come down, and the generated images are very blurry: the run has most likely failed. The discriminator judges poorly, telling real from fake almost at random or even mistaking real for fake and fake for real, so the generator cannot learn anything from D;
- The generated images all look alike: mode collapse has occurred. Either the generator happens to be especially good at producing one particular kind of realistic sample, or the discriminator is relatively weak at judging that kind of sample, so the generator plays to its strength, avoids its weakness, and produces as many of those samples as possible;
- After a certain number of epochs the generated images are blurry or pure noise: the run has most likely failed, the gradient updates have become meaningless, and further training will not improve things; it is an ill-conditioned gradient update, so do not waste more time on it;
- In a GAN the loss reflects the discriminator's ability to tell real from fake. Overall it should fall, then rise, and finally stabilize: the fall comes from the discriminator improving, and the rise comes from the generator's ability improving.
VI. Assorted training tips

- Scale image pixel values to between -1 and 1 and use tanh as the generator's output layer;
- The Adam optimizer usually works better than the alternatives;
- Use PixelShuffle and transposed convolutions for upsampling;
- Use batch normalization; it improves generalization, and with BN you can worry less about tuning dropout and L2-regularization parameters against overfitting;
- Add noise to both the real and the generated images before feeding them to the discriminator;
- Draw the input noise from a normal distribution rather than a uniform one whenever possible;
- Gradient penalty;
- Use LeakyReLU as the activation function;
- Two Time-scale Update Rule (TTUR): use different learning rates, a slow update rule for the generator G and a fast one for the discriminator D; choosing 0.0004 for the discriminator and 0.0001 for the generator may give good results (a sketch appears after this list);
- Flip labels on purpose for some samples; this deliberate bit of slack may keep the GAN from running itself into a dead end;
- Shuffle the dataset when appropriate, otherwise the network's learning can become biased;
- Priority order: tune hyperparameters > change the loss function > change the network architecture;
- Do not use early stopping; believe in miracles, unless the discriminator loss is rapidly approaching 0;
- Do not give up; a few small changes may decide whether your GAN trains successfully.
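Finally, a minimal sketch of the TTUR item above, together with normally distributed latent noise (the Adam betas and the tensor sizes are illustrative assumptions; G and D stand for your own generator and discriminator modules):

```python
import torch
import torch.nn as nn

def make_ttur_optimizers(G: nn.Module, D: nn.Module):
    """Two time-scale update rule: a slow optimizer for G, a faster one for D."""
    g_opt = torch.optim.Adam(G.parameters(), lr=1e-4, betas=(0.0, 0.9))   # generator: 0.0001
    d_opt = torch.optim.Adam(D.parameters(), lr=4e-4, betas=(0.0, 0.9))   # discriminator: 0.0004
    return g_opt, d_opt

# Latent noise drawn from a normal distribution rather than a uniform one
z = torch.randn(64, 128)   # batch of 64, latent dimension 128 (illustrative sizes)
```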
Some of the material above is referenced from:
https://arxiv.org/pdf/1606.03498.pdf
https://towardsdatascience.com/gan-ways-to-improve-gan-performance-acf37f9f59b
https://www.zhihu.com/people/xiaomizhou94/posts
Final words

About the author: a graduate student in artificial intelligence, currently focusing on text generation and text-to-image (text to image) generation.
Personal homepage: Medium coke with more ice
Limited-time free subscription: the text-to-image (T2I) column
Support me: like + bookmark + comment
If this article helped you a lot, I hope you will click below and treat me to a coke, with extra ice!