Chapters 6 and 7 of Huawei Deep Learning Course
2022-08-01 07:28:00 【swl.raven】
Table of Contents
Chapter 6 Initialization
Chapter 7 Parameter Adjustment
First of all, what is initialization?
Initialization simply means choosing the starting values of the network's parameters W before training begins.
Then why initialize, and why is initialization so important?
A neural network optimizes a highly complex nonlinear model, and the choice of the starting point plays an important role in finding a good solution:
- The choice of initial point can determine whether the algorithm converges at all
- When it does converge, the initial point determines how quickly learning proceeds
- Initial values that are too large lead to exploding gradients; values that are too small lead to vanishing gradients
Now that we know how important initialization is, we need to find a good initialization method.
First, the criteria for a good initialization:
- The activations of each layer's neurons do not saturate
- The activations of each layer are not all zero
Next, there are several initialization methods:
- All-zero initialization: all parameters are initialized to zero.
Disadvantage: neurons in the same layer receive identical updates and learn identical features; the symmetry between different neurons is never broken.
- Random initialization: initialize the parameters to small random numbers, typically sampled from a Gaussian distribution, so that each dimension of the parameters comes from a multidimensional Gaussian.
Disadvantage: optimization gets bogged down if the random distribution is not chosen properly:
If the initial values are too small, the gradients during backpropagation are small; in a deep network the gradient signal decays layer by layer, and convergence slows down.
If the initial values are too large, the activations easily saturate.
- Xavier initialization: keep the variance of each layer's input and output consistent by initializing the parameters with variance 1/n, where n is the layer's fan-in (one common form: W ~ N(0, 1/n)).
Disadvantage: it does not consider the effect of the activation function on the data distribution.
- He initialization: divide n by 2 on the basis of Xavier, i.e., initialize the parameters with variance 2/n (W ~ N(0, 2/n)).
Advantages: it takes into account the influence of the ReLU function (which zeroes out half of its inputs) on the output data distribution, keeping the variance of input and output consistent. A sketch comparing these initializations follows this list.
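To make the contrast concrete, here is a minimal NumPy sketch (the layer width, depth, and data are made up for illustration, not taken from the course) that pushes random inputs through a 10-layer ReLU stack under each initialization and prints the spread of the final activations:

```python
import numpy as np

# Minimal sketch (layer width, depth, and data are hypothetical): push random
# inputs through a 10-layer ReLU network under each initialization and report
# the spread of the final activations.
rng = np.random.default_rng(0)
n = 256                                     # neurons per layer (made up)
x = rng.standard_normal((512, n))           # 512 random input samples

def forward(init, layers=10):
    h = x
    for _ in range(layers):
        if init == "zero":
            W = np.zeros((n, n))                                # symmetry never broken
        elif init == "small_random":
            W = rng.standard_normal((n, n)) * 0.01              # tiny Gaussian values
        elif init == "xavier":
            W = rng.standard_normal((n, n)) / np.sqrt(n)        # Var(W) = 1/n
        elif init == "he":
            W = rng.standard_normal((n, n)) * np.sqrt(2.0 / n)  # Var(W) = 2/n
        h = np.maximum(h @ W, 0.0)                              # ReLU activation
    return h

for init in ("zero", "small_random", "xavier", "he"):
    print(f"{init:>12}: std of layer-10 activations = {forward(init).std():.6f}")
```

Under ReLU, the zero and tiny-random schemes collapse the activations toward zero, Xavier shrinks them steadily layer by layer, and He keeps their spread roughly constant, matching the criteria above.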
Chapter 7 Parameter Adjustment
The weights W are parameters updated automatically during model training, but some parameters must be set and adjusted by hand; tuning these is hyperparameter adjustment.
So what are the hyperparameters?
Learning rate: determines the step size of each parameter update.
Choosing an appropriate learning rate is what lets the optimizer find a good solution: too large a rate overshoots or diverges, while too small a rate makes training crawl, as the toy sketch below shows.
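As a toy illustration (a made-up one-dimensional objective, not from the course), gradient descent on f(w) = w² shows all three regimes:

```python
# Minimal sketch (toy objective): gradient descent on f(w) = w^2 with three
# learning rates, showing divergence, a crawl, and a reasonable choice.
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w -= lr * 2.0 * w          # f'(w) = 2w
    return w

for lr in (1.1, 0.001, 0.1):       # too large / too small / about right
    print(f"lr={lr:<6} w after 20 steps = {descend(lr):.6f}")
```

With lr=1.1 the iterate blows up, with lr=0.001 it barely moves from the start, and with lr=0.1 it converges quickly toward the minimum at w=0.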
Minibatch: the batch size. If it is too small, training is slow; if it is too large, training runs faster but takes more memory and may also reduce accuracy.
Empirically, 32, 64, 128, or 256 are good choices (usually a power of 2, because computers process data in binary and memory sizes are generally powers of 2). A sketch of minibatch iteration follows.
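Here is a minimal sketch (the dataset and sizes are hypothetical) of how a training loop splits data into shuffled minibatches of a chosen size:

```python
import numpy as np

# Minimal sketch (hypothetical data): yield shuffled minibatches the way an
# SGD training loop consumes them, one gradient update per batch.
def minibatches(X, y, batch_size=64, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    for start in range(0, len(X), batch_size):
        sel = idx[start:start + batch_size]
        yield X[sel], y[sel]

X = np.random.rand(1000, 10)   # 1000 samples, 10 features (made up)
y = np.random.rand(1000)
n_batches = sum(1 for _ in minibatches(X, y, batch_size=128))
print("batches per epoch with batch_size=128:", n_batches)  # -> 8
```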
Other hyperparameters: the momentum decay parameter β, the number of neurons in the hidden layers, the hyperparameters of the Adam optimization algorithm, layers (the number of network layers), and decay_rate (the learning-rate decay rate); one common decay scheme is sketched below.
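For decay_rate, one common scheme (an assumption here; the course may use a different one) is inverse-time decay, where the learning rate shrinks as epochs accumulate:

```python
# Minimal sketch (one common decay scheme, assumed): inverse-time learning-rate
# decay controlled by decay_rate.
def decayed_lr(lr0, decay_rate, epoch):
    return lr0 / (1.0 + decay_rate * epoch)

for epoch in (0, 5, 10):
    print(f"epoch {epoch:2d}: lr = {decayed_lr(0.1, 0.5, epoch):.4f}")
```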
Now that you know the main hyperparameters, you need to adjust them continually during training to find the best combination. Here are a few techniques for tuning them:
Trial and error: follow the entire experimental pipeline (from data collection to feature-map visualization), then iterate over the hyperparameters one at a time until time runs out.
Disadvantage: it takes too long and is inefficient.
Grid search: as the name suggests, lay the hyperparameters to be tuned out on a grid (one coordinate axis per hyperparameter) and evaluate the combinations to find the best point (hyperparameter combination); see the sketch below.
Disadvantage: it is only practical when few hyperparameters need tuning; each additional hyperparameter adds a dimension, and the number of grid points grows combinatorially.
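A minimal grid-search sketch (the objective function here is a hypothetical stand-in for training the model and measuring validation accuracy):

```python
from itertools import product

# Minimal sketch (hypothetical objective): exhaustive grid search over
# learning rate and batch size.
def validation_score(lr, batch_size):
    # stand-in for "train the model and measure validation accuracy"
    return -(lr - 0.01) ** 2 - (batch_size - 64) ** 2 * 1e-6

grid = {"lr": [0.001, 0.01, 0.1], "batch_size": [32, 64, 128]}
best = max(product(grid["lr"], grid["batch_size"]),
           key=lambda combo: validation_score(*combo))
print("best (lr, batch_size):", best)   # all 9 combinations are evaluated
```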
Random search: unlike grid search, candidate points are sampled at random from the configuration space. The advantage is that a good point can often be found quickly with fewer resources; experiments have shown this works well in practice. A sketch follows.
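The same evaluation budget spent on random sampling (same hypothetical objective as in the grid-search sketch):

```python
import random

# Minimal sketch (same hypothetical objective as above): random search samples
# hyperparameter combinations instead of enumerating a grid.
def validation_score(lr, batch_size):
    return -(lr - 0.01) ** 2 - (batch_size - 64) ** 2 * 1e-6

rng = random.Random(0)
trials = [(10 ** rng.uniform(-4, -1),          # lr sampled on a log scale
           rng.choice([32, 64, 128, 256]))
          for _ in range(9)]                   # same budget as the 3x3 grid
best = max(trials, key=lambda combo: validation_score(*combo))
print("best (lr, batch_size):", best)
```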
With both grid search and random search, each new trial is independent of the previous ones, which is what Bayesian optimization improves on:
Bayesian optimization: the method proceeds as follows:
Build the (surrogate) model -> select hyperparameters -> train and evaluate -> update the model, then return to the second step
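A minimal sketch of this loop (assuming scikit-learn for the Gaussian-process surrogate and a toy one-dimensional objective; the acquisition rule shown, upper confidence bound, is one common choice, not necessarily the course's):

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

# Minimal sketch (toy 1-D objective, scikit-learn assumed): Bayesian
# optimization of log10(learning rate) with a GP surrogate and a UCB
# acquisition; the true optimum of this stand-in objective is at -2.
def objective(log_lr):
    # stand-in for "train the model and measure validation accuracy"
    return float(-(log_lr + 2.0) ** 2 + np.random.normal(scale=0.01))

candidates = np.linspace(-4, -1, 200).reshape(-1, 1)   # search space: log10(lr)
X = [[-4.0], [-1.0]]                                   # two initial trials
y = [objective(x[0]) for x in X]

for _ in range(8):
    gp = GaussianProcessRegressor().fit(X, y)          # build/update the model
    mu, sigma = gp.predict(candidates, return_std=True)
    pick = candidates[np.argmax(mu + 1.96 * sigma)]    # select next point (UCB)
    X.append(list(pick))                               # record the trial
    y.append(objective(pick[0]))                       # train and evaluate
print("best log10(lr) found:", X[int(np.argmax(y))][0])
```

Because the surrogate is refit after every trial, each new guess exploits everything learned so far, which is exactly what grid and random search lack.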