Machine Learning Notes - Deep Learning Tips Checklist
2022-06-11 09:08:00 【Sit and watch the clouds rise】
I. Data processing
1. Data augmentation
Deep learning models usually need a lot of data to be properly trained. It is often useful to get more data from the existing ones using data augmentation techniques. Given an input image, common techniques include horizontal flips, rotations, random crops, color/brightness shifts, and noise addition; a minimal sketch follows below.
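For instance, a minimal NumPy sketch of a few of these transformations, assuming `image` is an `(H, W, 3)` uint8 array (the 0.9 crop ratio and noise scale are arbitrary illustration values):

```python
import numpy as np

def augment(image, rng=np.random.default_rng()):
    """Return a randomly augmented copy of an (H, W, 3) uint8 image."""
    out = image.copy()
    # Random horizontal flip
    if rng.random() < 0.5:
        out = out[:, ::-1, :]
    # Random crop covering 90% of the original size
    h, w = out.shape[:2]
    ch, cw = int(0.9 * h), int(0.9 * w)
    top, left = rng.integers(0, h - ch + 1), rng.integers(0, w - cw + 1)
    out = out[top:top + ch, left:left + cw, :]
    # Random brightness shift
    out = np.clip(out.astype(np.float32) * rng.uniform(0.8, 1.2), 0, 255)
    # Additive Gaussian noise
    out = np.clip(out + rng.normal(0, 5, out.shape), 0, 255)
    return out.astype(np.uint8)
```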

2. Batch normalization
Batch normalization is a step of hyperparameters $\gamma, \beta$ that normalizes the batch $\{x_i\}$. By noting $\mu_B$ and $\sigma_B^2$ the mean and variance of the batch that we want to correct, the normalization is done as follows:

$$x_i \leftarrow \gamma\,\frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}} + \beta$$

It is usually applied after a fully connected or convolutional layer and before a nonlinearity, and aims at allowing higher learning rates and reducing the strong dependence on initialization.
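A minimal NumPy sketch of this forward pass (training-time batch statistics only; running averages for inference are omitted):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize a batch x of shape (batch_size, features) with learnable gamma, beta."""
    mu = x.mean(axis=0)                      # per-feature batch mean
    var = x.var(axis=0)                      # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)    # normalize
    return gamma * x_hat + beta              # scale and shift
```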
II. Training a neural network
1. Definitions
Epoch: In the context of training a model, an epoch is a term used to refer to one iteration in which the model sees the entire training set to update its weights.
Mini-batch gradient descent: During the training phase, updating the weights is usually not based on the whole training set at once or on a single data point, due to computational complexity or noise issues. Instead, the update step is done on mini-batches, where the number of data points in a batch is a hyperparameter that we can tune.
Loss function: In order to quantify how a given model performs, the loss function $L$ is usually used to evaluate to what extent the actual outputs $y$ are correctly predicted by the model outputs $z$.
Cross-entropy loss: In the context of binary classification in neural networks, the cross-entropy loss $L(z,y)$ is commonly used and defined as follows:

$$L(z,y) = -\Big[y\log(z) + (1-y)\log(1-z)\Big]$$
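A small NumPy sketch of this loss; the clipping constant is an added assumption for numerical stability, not part of the definition above:

```python
import numpy as np

def binary_cross_entropy(z, y, eps=1e-12):
    """z: predicted probabilities in (0, 1); y: true labels in {0, 1}."""
    z = np.clip(z, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(z) + (1 - y) * np.log(1 - z))

# Example: confident correct predictions yield a small loss
print(binary_cross_entropy(np.array([0.9, 0.1]), np.array([1, 0])))
```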
2. Finding optimal weights
Backpropagation: Backpropagation is a method to update the weights in a neural network by taking into account the actual output and the desired output. The derivative of the loss with respect to each weight $w$ is computed using the chain rule, propagating through the model output $z$:

$$\frac{\partial L(z,y)}{\partial w} = \frac{\partial L(z,y)}{\partial z}\times\frac{\partial z}{\partial w}$$

Using this method, each weight is updated with the following rule:

$$w \leftarrow w - \alpha\,\frac{\partial L(z,y)}{\partial w}$$
Updating weights: In a neural network, weights are updated as follows:
Step 1: Take a batch of training data and perform forward propagation to compute the loss.
Step 2: Backpropagate the loss to get the gradient of the loss with respect to each weight.
Step 3: Use the gradients to update the weights of the network.
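To make the three steps concrete, here is a toy NumPy sketch of the loop for a logistic-regression model (a stand-in for a real network; data and hyperparameters are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 10))          # one mini-batch of 32 examples, 10 features
y = (X[:, 0] > 0).astype(float)        # toy binary labels
w, b, alpha = np.zeros(10), 0.0, 0.1   # weights, bias, learning rate

for epoch in range(100):
    # Step 1: forward propagation and loss
    z = 1.0 / (1.0 + np.exp(-(X @ w + b)))                 # sigmoid output
    loss = -np.mean(y * np.log(z + 1e-12) + (1 - y) * np.log(1 - z + 1e-12))
    # Step 2: backpropagate to get the gradient of the loss w.r.t. each weight
    dz = (z - y) / len(y)
    dw, db = X.T @ dz, dz.sum()
    # Step 3: update the weights with the gradients
    w -= alpha * dw
    b -= alpha * db
```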

III. Parameter tuning
1. Weight initialization
Xavier initialization: Instead of initializing the weights in a purely random manner, Xavier initialization makes it possible to have initial weights that take into account characteristics that are unique to the architecture.
Transfer learning: Training a deep learning model requires a lot of data and, more importantly, a lot of time. It is often useful to take advantage of pre-trained weights on huge datasets that took days or weeks to train, and leverage them for our own use case. Depending on how much data we have at hand, the pre-trained network can be reused in different ways, from freezing all layers and retraining only the final classifier (little data) to fine-tuning the whole network starting from the pre-trained weights (lots of data); a minimal sketch of the layer-freezing case follows below.
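For example, a minimal PyTorch sketch of the low-data case, freezing a pre-trained backbone and training only a new final layer (assumes torchvision is installed; the exact `weights` argument depends on the torchvision version, and `num_classes` is a placeholder):

```python
import torch.nn as nn
import torch.optim as optim
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")   # pre-trained backbone
for param in model.parameters():
    param.requires_grad = False                    # freeze all pre-trained layers

num_classes = 5                                    # hypothetical number of classes
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new trainable head

# Only the parameters of the new head are updated
optimizer = optim.Adam(model.fc.parameters(), lr=1e-3)
```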

2. Optimizing convergence
Learning rate: The learning rate, often noted $\alpha$ or sometimes $\eta$, indicates at which pace the weights get updated. It can be fixed or adaptively changed. The current most popular method in practice is called Adam, which adapts the learning rate.
Adaptive learning rates: Letting the learning rate vary while training the model can reduce the training time and improve the numerical optimal solution. While the Adam optimizer is the most commonly used technique, others can also be useful. They are summarized in the table below:
| Method | Explanation | Update of $w$ | Update of $b$ |
| --- | --- | --- | --- |
| Momentum | Dampens oscillations; improvement over SGD; 2 parameters to tune | $v_{dw} \leftarrow \beta v_{dw} + (1-\beta)\,dw$, then $w \leftarrow w - \alpha v_{dw}$ | $v_{db} \leftarrow \beta v_{db} + (1-\beta)\,db$, then $b \leftarrow b - \alpha v_{db}$ |
| RMSprop | Root Mean Square propagation; speeds up the learning algorithm by controlling oscillations | $s_{dw} \leftarrow \beta s_{dw} + (1-\beta)\,dw^2$, then $w \leftarrow w - \alpha \frac{dw}{\sqrt{s_{dw}}}$ | $s_{db} \leftarrow \beta s_{db} + (1-\beta)\,db^2$, then $b \leftarrow b - \alpha \frac{db}{\sqrt{s_{db}}}$ |
| Adam | Adaptive Moment estimation; most popular method; 4 parameters to tune | $w \leftarrow w - \alpha \frac{v_{dw}}{\sqrt{s_{dw}}+\epsilon}$ (with bias-corrected $v_{dw}$, $s_{dw}$) | $b \leftarrow b - \alpha \frac{v_{db}}{\sqrt{s_{db}}+\epsilon}$ (with bias-corrected $v_{db}$, $s_{db}$) |
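A minimal NumPy sketch of one Adam step following the table above, with the usual bias correction (hyperparameter defaults are the commonly used ones):

```python
import numpy as np

def adam_step(w, dw, v, s, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. v, s are running first/second moment estimates, t is the step count (>= 1)."""
    v = beta1 * v + (1 - beta1) * dw            # momentum-like first moment
    s = beta2 * s + (1 - beta2) * dw ** 2       # RMSprop-like second moment
    v_hat = v / (1 - beta1 ** t)                # bias correction
    s_hat = s / (1 - beta2 ** t)
    w = w - alpha * v_hat / (np.sqrt(s_hat) + eps)
    return w, v, s
```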
IV. Regularization
Dropout: Dropout is a technique used in neural networks to prevent overfitting the training data by dropping out neurons with probability $p > 0$. It forces the model to avoid relying too much on particular sets of features.
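A minimal NumPy sketch of (inverted) dropout at training time, where `p` is the probability of dropping a unit as described above:

```python
import numpy as np

def dropout(x, p=0.5, rng=np.random.default_rng()):
    """Inverted dropout: zero units with probability p and rescale the rest."""
    mask = (rng.random(x.shape) >= p).astype(x.dtype)
    return x * mask / (1.0 - p)   # rescaling keeps the expected activation unchanged
# At test time, dropout is simply not applied.
```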

Weight regularization: In order to make sure that the weights are not too large and that the model is not overfitting the training set, regularization techniques are usually performed on the model weights. The main ones are summarized in the table below:
| LASSO | Ridge | Elastic Net |
| --- | --- | --- |
| Shrinks coefficients to 0; good for variable selection | Makes coefficients smaller | Tradeoff between variable selection and small coefficients |
| $... + \lambda\lVert\theta\rVert_1$, $\lambda\in\mathbb{R}$ | $... + \lambda\lVert\theta\rVert_2^2$, $\lambda\in\mathbb{R}$ | $... + \lambda\big[(1-\alpha)\lVert\theta\rVert_1 + \alpha\lVert\theta\rVert_2^2\big]$, $\lambda\in\mathbb{R},\ \alpha\in[0,1]$ |
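As an illustration, a small NumPy sketch that adds one of these penalties to an arbitrary data loss (`data_loss`, `theta` and the default $\lambda$, $\alpha$ values are placeholders):

```python
import numpy as np

def regularized_loss(data_loss, theta, lam=0.01, alpha=0.5, kind="elastic_net"):
    """Add a LASSO, Ridge or Elastic Net penalty on the weights theta."""
    if kind == "lasso":
        penalty = lam * np.sum(np.abs(theta))
    elif kind == "ridge":
        penalty = lam * np.sum(theta ** 2)
    else:  # elastic net: tradeoff controlled by alpha in [0, 1]
        penalty = lam * ((1 - alpha) * np.sum(np.abs(theta)) + alpha * np.sum(theta ** 2))
    return data_loss + penalty
```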
Early stopping: This regularization technique stops the training process as soon as the validation loss reaches a plateau or starts to increase.
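A minimal sketch of patience-based early stopping; `train_one_epoch` and `validation_loss` are hypothetical callables standing in for a real training and evaluation step:

```python
def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=1000, patience=5):
    """Stop training once the validation loss stops improving for `patience` epochs."""
    best_loss, wait = float("inf"), 0
    for _ in range(max_epochs):
        train_one_epoch()                  # hypothetical: one epoch of training
        val_loss = validation_loss()       # hypothetical: evaluate on the validation set
        if val_loss < best_loss:
            best_loss, wait = val_loss, 0  # improvement: reset the counter
        else:
            wait += 1
            if wait >= patience:
                break                      # validation loss has plateaued or increased
    return best_loss
```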

V. Good practices
Overfitting a small batch: When debugging a model, it is often useful to make quick tests to see if there is any major issue with the architecture of the model itself. In particular, in order to make sure that the model can be properly trained, a single mini-batch is passed through the network to see whether the model can overfit it. If it cannot, it means that the model is either too complex or not complex enough to even overfit a small batch, let alone a normal-sized training set; a minimal sketch of this check follows below.
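A minimal sketch of this sanity check; `train_step` and `compute_loss` are hypothetical callables that repeatedly update and evaluate the model on the same fixed mini-batch:

```python
def can_overfit_one_batch(train_step, compute_loss, steps=500, tol=1e-2):
    """Sanity check: repeatedly train on one fixed mini-batch; the loss should go to ~0."""
    for _ in range(steps):
        train_step()              # hypothetical: one gradient update on the same batch
    return compute_loss() < tol   # True if the model managed to overfit the batch
```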
Gradient checking: Gradient checking is a method used during the implementation of the backward pass of a neural network. It compares the value of the analytical gradient to the numerical gradient at given points and plays the role of a sanity check for correctness.
| Type | Numerical gradient | Analytical gradient |
| --- | --- | --- |
| Formula | $\frac{df}{dx}(x) \approx \frac{f(x+h) - f(x-h)}{2h}$ | $\frac{df}{dx}(x) = f'(x)$ |
| Comments | Expensive; the loss has to be computed two times per dimension. Used to verify the correctness of the analytical implementation. Trade-off in choosing $h$: not too small (numerical instability) nor too large (poor gradient approximation) | 'Exact' result. Direct computation. Used in the final implementation |
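A minimal sketch of this check on a toy scalar function:

```python
def f(x):
    return x ** 3                  # toy function

def analytical_grad(x):
    return 3 * x ** 2              # hand-derived derivative of f

def numerical_grad(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2 * h)   # centered finite difference

x = 2.0
num, ana = numerical_grad(f, x), analytical_grad(x)
print(abs(num - ana) / max(abs(num), abs(ana)))   # relative error should be tiny
```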