
CS231n notes (Part 1) - suitable for readers with zero background

2022-07-05 15:55:00 Small margin, rush

This blog contains notes on the first 13 lectures of Fei-Fei Li's CS231n computer vision course.

Catalog

Image classification

k-nearest neighbor classifier (k is a hyperparameter)

Linear classifier

Loss function

Softmax classifier

Optimization - finding the gradient

Backpropagation - the chain rule

Neural networks

Activation functions

Neural network structure

Regularization - preventing overfitting by controlling the capacity of the neural network

Gradient check

Visualizing the first layer

Parameter updates

Learning rate annealing

Per-parameter adaptive learning rate methods

Hyperparameter tuning - random search beats grid search

Model ensembles

 

Image classification

Train on a dataset -> given a new image -> predict its class; the classifier should be robust.

k-fold cross-validation: split the dataset, using part of it for training and part for testing.

k-nearest neighbor classifier (k is a hyperparameter)

The principle of KNN: to predict a new sample x, look at its K nearest neighbors and assign x to the category that is most common among them.

Distance metrics: Manhattan (L1) distance and Euclidean (L2) distance.

Choosing k: via cross-validation, start from a small K, gradually increase K, compute the validation error for each value, and finally pick the K that works best.
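A minimal sketch of the nearest-neighbor idea with an L2 distance; Xtr, ytr and x_test are placeholder names for a training set and a query image:

    import numpy as np

    def knn_predict(Xtr, ytr, x_test, k=5):
        """Predict the label of x_test from its k nearest training points (L2 distance)."""
        # Euclidean (L2) distance from x_test to every training example
        dists = np.sqrt(np.sum((Xtr - x_test) ** 2, axis=1))
        # indices of the k closest training examples
        nearest = np.argsort(dists)[:k]
        # majority vote among their labels
        labels = ytr[nearest]
        return np.bincount(labels).argmax()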

Linear classifier

f(x_i, W, b) = W x_i + b

The linear classifier computes a matrix multiplication between the weights and the values of all pixels across the 3 color channels, which yields the class scores.
Depending on the values we set for the weights, the function expresses a preference for or dislike of certain colors at certain positions in the image (depending on the sign of each weight).
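As a sketch of the score function f(x_i, W, b) = W x_i + b, with CIFAR-10-style shapes used purely for illustration (3072 = 32x32x3 pixels, 10 classes):

    import numpy as np

    x = np.random.rand(3072)                 # one image stretched into a column of 3072 values
    W = np.random.randn(10, 3072) * 0.0001   # 10 classes, one row of weights per class
    b = np.zeros(10)                         # one bias per class

    scores = W.dot(x) + b                    # 10 class scores; the highest score is the predicted class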

Loss function

The parameter of the scoring function is the weight matrix W. We can tune W so that the scoring function gives the highest score to the correct class. The loss function measures our dissatisfaction with the results: intuitively, the larger the difference between the scores and the true labels, the larger the loss, and the smaller the difference, the smaller the loss.

SVM (support vector machine) loss

  • The SVM loss wants the score of the correct class to be higher than the scores of the incorrect classes by at least a margin Δ.

Regularization, based only on the weights

  • Suppose there is a dataset and a weight matrix W that correctly classifies every example with zero loss (there may be many similar W that do so); then for any λ > 1, the scaled weights λW also give a loss of 0.
  • We therefore add another term, the regularization term, to the original loss. The most common regularization penalty is the L2 norm, which imposes a high penalty on large weight values:

L = \frac{1}{N}\sum_i\sum_{j\neq y_i}\max(0, s_j - s_{y_i} + \Delta) + \lambda\sum_k\sum_l W_{k,l}^2, where λ is a hyperparameter obtained via cross-validation and N is the number of training examples.

About the hyperparameter Δ: Δ and λ look like two different hyperparameters, but they actually control the same trade-off: the trade-off between the data loss and the regularization loss.
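A sketch of the full multiclass SVM loss plus the L2 regularization term above (unvectorized names kept for clarity; delta and reg stand in for Δ and λ):

    import numpy as np

    def svm_loss(W, X, y, delta=1.0, reg=1e-3):
        """Multiclass SVM (hinge) loss over N examples plus L2 regularization."""
        N = X.shape[0]
        scores = X.dot(W.T)                        # shape (N, num_classes)
        correct = scores[np.arange(N), y]          # score of the correct class for each example
        margins = np.maximum(0, scores - correct[:, None] + delta)
        margins[np.arange(N), y] = 0               # do not count the correct class itself
        data_loss = margins.sum() / N
        reg_loss = reg * np.sum(W * W)             # lambda times the sum of squared weights
        return data_loss + reg_loss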

Softmax classifier

Uses the cross-entropy loss; the last layer of the classifier outputs normalized probabilities for each class (in contrast to the SVM, which outputs uncalibrated score values).
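A sketch of the cross-entropy loss for a single example's raw scores, with the usual shift for numerical stability (variable names are illustrative):

    import numpy as np

    def softmax_loss(scores, correct_class):
        """Cross-entropy loss for one example given its raw class scores."""
        shifted = scores - np.max(scores)                     # shift for numerical stability
        probs = np.exp(shifted) / np.sum(np.exp(shifted))     # normalized class probabilities
        return -np.log(probs[correct_class])                  # negative log-probability of the true class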

Optimization - finding the gradient

Probe the slope along each dimension one at a time, iterating over every element of W (the numerical gradient approach).

Mini-batch: sample a small batch of data from the dataset to compute the loss, then fine-tune the parameters via backpropagation.
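A minimal sketch of mini-batch sampling; X_train and y_train here are small random stand-ins for a real training set:

    import numpy as np

    # toy stand-ins for a real training set
    X_train = np.random.randn(5000, 100)
    y_train = np.random.randint(0, 10, size=5000)

    batch_size = 256
    idx = np.random.choice(X_train.shape[0], batch_size, replace=False)  # random sample of indices
    X_batch, y_batch = X_train[idx], y_train[idx]
    # the loss and gradient are computed on this small batch only,
    # and the parameters are then nudged via backpropagation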

Backpropagation - the chain rule

Gate units communicate with each other through gradient signals: by letting their inputs change along the gradient direction (regardless of whether their own output rises or falls), they make the final output value of the whole network higher.

The essence: repeated application of the chain rule for partial derivatives.

In the full computational circuit diagram, each gate unit takes some inputs and immediately computes two things: 1. its output value, and 2. the local gradient of that output with respect to its inputs. Once the forward pass is complete, during backpropagation each gate unit eventually obtains the gradient of the network's final output value with respect to its own output value.
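A toy circuit, with illustrative values, that shows the forward pass and the chain-rule backward pass for f = (x + y) * z:

    # forward pass
    x, y, z = -2.0, 5.0, -4.0
    q = x + y            # add gate, q = 3
    f = q * z            # multiply gate, f = -12

    # backward pass (chain rule), starting from df/df = 1
    dfdq = z             # local gradient of the multiply gate with respect to q
    dfdz = q             # local gradient of the multiply gate with respect to z
    dfdx = 1.0 * dfdq    # the add gate routes the gradient unchanged to both inputs
    dfdy = 1.0 * dfdq
    # dfdx = -4, dfdy = -4, dfdz = 3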

Neural networks

Add a nonlinearity; the formula becomes s = W_2 \max(0, W_1 x), where the parameters W_1, W_2 are learned by stochastic gradient descent.
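A sketch of that two-layer score function in numpy; the layer sizes are illustrative:

    import numpy as np

    x = np.random.randn(3072)                 # input image as a vector
    W1 = np.random.randn(100, 3072) * 0.01    # first-layer weights (learned by SGD)
    W2 = np.random.randn(10, 100) * 0.01      # second-layer weights (learned by SGD)

    h = np.maximum(0, W1.dot(x))              # hidden layer with the ReLU nonlinearity
    s = W2.dot(h)                             # 10 class scores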

Activation functions

Each activation function (or nonlinearity) takes a single number as input and performs a fixed mathematical operation on it.

  • Sigmoid: a nonlinearity that squashes real numbers into the range [0,1]
  • tanh: squashes real numbers into the range [-1,1]
  • ReLU: greatly accelerates the convergence of stochastic gradient descent, but the learning rate needs to be set carefully
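The three activations in the list above, written out elementwise as a small sketch:

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))   # squashes values into (0, 1)

    def tanh(x):
        return np.tanh(x)                 # squashes values into (-1, 1)

    def relu(x):
        return np.maximum(0, x)           # keeps positive values, zeros out negatives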

Neural network structure

Fully connected layer

Neurons in a fully connected layer are pairwise connected to all neurons in the previous and next layers, but neurons within the same fully connected layer have no connections to each other.

The forward pass of a fully connected layer is generally a matrix multiplication, followed by adding the bias and applying the activation function.
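A sketch of one fully connected layer's forward pass (matrix multiply, add bias, apply activation); the shapes and the choice of ReLU are illustrative:

    import numpy as np

    def fc_forward(x, W, b):
        """Forward pass of a fully connected layer with a ReLU activation."""
        return np.maximum(0, W.dot(x) + b)   # matrix multiply, add bias, then nonlinearity

    x = np.random.randn(3072)
    W = np.random.randn(100, 3072) * 0.01
    b = np.zeros(100)
    h = fc_forward(x, W, b)                  # 100 hidden activations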

Determine the network size

  • The first network has 4+2=6 neurons (not counting the input layer), [3x4]+[4x2]=20 weights and 4+2=6 biases, for a total of 26 learnable parameters.
  • The second network has 4+4+1=9 neurons, [3x4]+[4x4]+[4x1]=32 weights and 4+4+1=9 biases, for a total of 41 learnable parameters.

Output layer

Neurons in the output layer generally do not have an activation function; the last output layer usually represents the class scores, which are arbitrary real numbers, or some kind of real-valued target.

Set the number and size of layers

Neural networks with more neurons can express more complex functions and classify more complex data; the drawback is that they may overfit the training data. (Regularization techniques can be used to control overfitting.)

Data preprocessing

  • Mean subtraction: subtract the mean separately from each individual feature in the data.
  • Normalization: normalize each dimension of the data so that their numerical ranges are approximately equal. (In image processing the pixel values already share roughly the same range of 0-255, so this is less critical.)
  • PCA and whitening: first zero-center the data, then compute the covariance matrix, which reveals the correlation structure of the data. The input to the whitening operation is the data in the eigenbasis, and each dimension is divided by its eigenvalue to normalize the scale.

 

The data should be split into training/validation/test sets; compute the mean image from the training set only, and then subtract that mean from the images in every split (training/validation/test).

The recommended preprocessing is to zero-center each feature of the data, then normalize its numerical range to [-1,1].
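A sketch of that preprocessing, assuming X_train, X_val, X_test are the three splits; here the scaling divides by the per-feature standard deviation computed on the training set (one common way to normalize the range):

    import numpy as np

    # toy stand-ins for the three splits
    X_train = np.random.rand(500, 3072) * 255
    X_val = np.random.rand(100, 3072) * 255
    X_test = np.random.rand(100, 3072) * 255

    mean = X_train.mean(axis=0)          # per-feature mean from the training set only
    std = X_train.std(axis=0) + 1e-8     # avoid division by zero

    X_train = (X_train - mean) / std     # zero-center and scale every split...
    X_val = (X_val - mean) / std         # ...using the training-set statistics
    X_test = (X_test - mean) / std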

Initialize the weights with a Gaussian distribution whose standard deviation is \sqrt{2/n}, where n is the number of input neurons. In numpy this can be written as: w = np.random.randn(n) * sqrt(2.0/n).

Batch normalization (Batch Normalization)

The idea is to force the activations throughout the network to take on a standard Gaussian distribution at the beginning of training: a normalization layer processes the activation data so that it follows this distribution.
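A minimal sketch of what a batch normalization layer does to a batch of activations during training; gamma and beta stand for the layer's learnable scale and shift:

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        """Normalize each feature over the batch to zero mean / unit variance, then scale and shift."""
        mu = x.mean(axis=0)                     # per-feature mean over the batch
        var = x.var(axis=0)                     # per-feature variance over the batch
        x_hat = (x - mu) / np.sqrt(var + eps)   # approximately unit Gaussian activations
        return gamma * x_hat + beta             # learnable scale and shift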

Regularization - preventing overfitting by controlling the capacity of the neural network

L1, L2, and max norm constraints

Dropout: during the forward pass, let each neuron's activation be dropped (set to zero) with a certain probability p, which makes the model generalize better.
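A sketch of (inverted) dropout in the forward pass, with p as the probability of dropping a unit as described above:

    import numpy as np

    def dropout_forward(h, p=0.5, train=True):
        """Zero out each activation with probability p; rescale so expected values match at test time."""
        if not train:
            return h                                          # no dropout at test time
        mask = (np.random.rand(*h.shape) >= p) / (1.0 - p)    # keep with probability 1-p, then rescale
        return h * mask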

Gradient check

  • Use the centered difference formula \frac{df(x)}{dx}=\frac{f(x+h)-f(x-h)}{2h}
  • Use relative error
  • Use double precision
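A sketch that combines the three points above for a single parameter: a centered-difference numerical gradient compared against an analytic gradient via the relative error (f stands for any scalar function of x):

    import numpy as np

    def grad_check_scalar(f, x, analytic_grad, h=1e-5):
        """Compare an analytic gradient against the centered-difference numerical gradient."""
        numeric_grad = (f(x + h) - f(x - h)) / (2.0 * h)   # centered difference
        rel_error = abs(analytic_grad - numeric_grad) / max(abs(analytic_grad), abs(numeric_grad), 1e-12)
        return numeric_grad, rel_error

    # example: f(x) = x^2 at x = 3, analytic gradient 2x = 6
    print(grad_check_scalar(lambda x: x ** 2, 3.0, 6.0))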

Visualizing the first layer

An example of visualizing the first-layer weights of a neural network. On the left, the features are full of noise, which suggests a problem with the network: it has not converged, the learning rate is set improperly, or the regularization penalty is too low. The image on the right shows nice features: smooth, clean and diverse, indicating that training is going well.

Parameter updates

Use backpropagation to compute the analytic gradient, which is then used to update the parameters. The simplest form of update changes the parameters along the negative gradient direction (because the gradient points in the direction of increase, while we usually want to minimize the loss function).
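The simplest (vanilla) update in code; W, the gradient dW, and the learning rate here are placeholders:

    import numpy as np

    W = np.random.randn(10, 3072) * 0.001   # parameters
    dW = np.random.randn(10, 3072)          # gradient from backpropagation (placeholder)
    learning_rate = 1e-3

    W += -learning_rate * dW                # step along the negative gradient direction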

Learning rate annealing

When training deep networks, annealing the learning rate over time is usually helpful. If the learning rate is too high, the system has too much kinetic energy: the parameter vector bounces around chaotically and cannot settle into the deeper, narrower parts of the loss function.
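A sketch of one common annealing scheme, step decay; the decay factor and schedule are illustrative:

    learning_rate = 1e-3
    decay = 0.5           # halve the learning rate...
    decay_every = 20      # ...every 20 epochs

    for epoch in range(100):
        if epoch > 0 and epoch % decay_every == 0:
            learning_rate *= decay   # anneal the learning rate over time
        # ... run one epoch of mini-batch SGD with the current learning_rate ...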

Per-parameter adaptive learning rate methods

The two recommended update methods are SGD+Nesterov momentum, or Adam.
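A sketch of the Adam update for a single parameter tensor; the hyperparameter values are the commonly used defaults, and x, dx, m, v, t are assumed to persist across iterations:

    import numpy as np

    def adam_update(x, dx, m, v, t, learning_rate=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        """One Adam step: momentum-like and RMSProp-like moments with bias correction."""
        m = beta1 * m + (1 - beta1) * dx            # first moment (running mean of gradients)
        v = beta2 * v + (1 - beta2) * (dx ** 2)     # second moment (running mean of squared gradients)
        m_hat = m / (1 - beta1 ** t)                # bias correction (t starts at 1)
        v_hat = v / (1 - beta2 ** t)
        x += -learning_rate * m_hat / (np.sqrt(v_hat) + eps)
        return x, m, v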

Hyperparameter tuning - random search beats grid search

Use random search (not grid search) to find the optimal hyperparameters. Search in stages, from coarse (wide ranges, training for only 1-5 epochs) to fine (narrow ranges, training for many epochs).
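A sketch of random hyperparameter search, sampling the learning rate and regularization strength on a log scale; train_and_validate is a placeholder for whatever training loop returns validation accuracy:

    import numpy as np

    def train_and_validate(lr, reg):
        """Placeholder: train briefly and return validation accuracy."""
        return np.random.rand()   # stand-in result

    best = (None, -1.0)
    for _ in range(20):
        lr = 10 ** np.random.uniform(-6, -2)    # sample the learning rate on a log scale
        reg = 10 ** np.random.uniform(-5, 0)    # sample the regularization strength on a log scale
        acc = train_and_validate(lr, reg)
        if acc > best[1]:
            best = ((lr, reg), acc)
    print("best hyperparameters:", best)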

Model ensembles

In practice, there is one reliable way to gain a few extra percentage points of accuracy with neural networks: train several independent models, then average their predictions at test time.
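A sketch of that test-time averaging; probs_per_model stands in for the class-probability outputs of several independently trained models on the same test set:

    import numpy as np

    # one (num_test, num_classes) probability array per trained model (random stand-ins here)
    probs_per_model = [np.random.rand(100, 10) for _ in range(5)]
    probs_per_model = [p / p.sum(axis=1, keepdims=True) for p in probs_per_model]

    avg_probs = np.mean(probs_per_model, axis=0)     # average the models' predictions
    predictions = avg_probs.argmax(axis=1)           # final class for each test example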
