Deep Learning Basics: batch_size
2022-08-02 05:29:00 【hello689】
Background knowledge: batch size is a parameter that only applies to batch learning. Statistical learning can be divided into two categories, online learning and batch learning; by that division, deep learning can likewise be done either way, and batch learning is by far the more common choice. For details, see "Statistical Learning Methods", 2nd edition, p. 13.
Online learning: accept one sample at a time, make a prediction, and update the model repeatedly;
Batch learning: an offline method (all samples must be available before training starts) that learns from all samples, or a subset of them, at once.
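A minimal sketch of the contrast, using least-squares regression in plain NumPy (all names and sizes here are illustrative assumptions, not from the original post):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
y = X @ np.array([1.0, -2.0, 0.5])     # targets from a known linear model
w = np.zeros(3)                        # parameters to learn
lr = 0.01                              # learning rate

# Online learning: update the model after every single sample.
for xi, yi in zip(X, y):
    grad = (xi @ w - yi) * xi          # gradient of 0.5 * (xi.w - yi)^2
    w -= lr * grad

# Batch learning: one update from the error over all samples at once.
grad = X.T @ (X @ w - y) / len(X)      # mean gradient over the whole set
w -= lr * grad
```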
1. Why do we need a batch size?
One answer is a textbook statement that appeared as an image in the original post (not reproduced here); as a beginner, it may be a little confusing at first sight.
The second answer is quoted below, though it is still a bit convoluted:
In the batch method of supervised learning, adjustment of the synaptic weights of the multilayer perceptron is performed after all N examples of the training set have been presented, which constitutes one epoch of training. The cost function for batch learning is defined by the average error energy. So batch training is required.
My summary:
The core reason for a batch size is that memory is limited. With unlimited memory we could use the error over all samples and compute the optimal gradient direction for the parameters, i.e. the gradient of the full training objective. In practice the data volume is huge and GPU memory is not, so training must proceed in batches. If the batch is too small, the gradient direction is highly random, the model struggles to converge stably (convergence is slow), and many iterations are needed. The batch size should therefore be chosen as a balance: large enough for high memory utilization and fast, stable convergence, but no larger than memory allows.
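To make the summary concrete, here is a minimal PyTorch-style sketch of mini-batch training; the dataset, model, and hyperparameters (batch_size=64, lr=0.01) are assumptions for illustration, not values from the post:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(10_000, 20)            # a dataset too large to treat as one batch
y = torch.randn(10_000, 1)
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

model = nn.Linear(20, 1)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.MSELoss()

for xb, yb in loader:                  # each step sees only one mini-batch
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()                    # gradient estimated from 64 samples only
    opt.step()
```

Each step's gradient is only an estimate of the full-data gradient; the smaller the batch, the noisier that estimate, which is exactly the randomness described above.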
In addition, batch normalization also needs a batch of data to compute the mean and variance; if the batch size is 1, BN is essentially useless.
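A small sketch of what BN computes over a batch (shapes are assumed for illustration):

```python
import torch

x = torch.randn(64, 8)                         # batch of 64 samples, 8 features
mean = x.mean(dim=0)                           # per-feature batch mean
var = x.var(dim=0, unbiased=False)             # per-feature batch variance
x_hat = (x - mean) / torch.sqrt(var + 1e-5)    # normalized activations

# With a batch size of 1, x - mean is all zeros, so normalization degenerates
# (PyTorch's BatchNorm1d even raises an error in training mode in that case).
```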
2. The benefits of increasing batch size (three points)
- Memory utilization improves, as does the parallel efficiency of large matrix multiplications.
- Fewer iterations are needed to run one epoch, so the same amount of data is processed faster (see the sketch after this list).
- Within a certain range, a larger batch size gives a more accurate gradient descent direction and more stable convergence.
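The second point is simple arithmetic: one epoch takes ceil(N / batch_size) iterations. A tiny sketch (N is an assumed dataset size):

```python
import math

N = 50_000                              # assumed number of training samples
for batch_size in (16, 64, 256, 1024):
    iters = math.ceil(N / batch_size)   # iterations in one epoch
    print(f"batch_size={batch_size:>5}: {iters} iterations per epoch")
```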
3. Disadvantages of increasing batch size (three points)
- Memory utilization improves, but memory capacity may no longer be enough (short of buying more memory).
- Fewer iterations are needed per epoch, but reaching the same accuracy requires more epochs, so total training time grows.
- Once the batch size is large enough, the descent direction barely changes between steps and training may settle into a local optimum (a big ship is hard to turn around), unless the batch is the whole dataset.
4. How does adjusting the batch size affect training?
- If the batch size is too small, model performance suffers (the error soars).
- As the batch size increases, the same amount of data is processed faster.
- As the batch size increases, more epochs are needed to reach the same accuracy.
- Because these two effects pull in opposite directions, there is some batch size at which total training time is minimal (the timing sketch after this list illustrates the trade-off).
- Because training converges into different local optima, there is likewise some batch size at which the final convergence accuracy is best.
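One way to find that time-optimal point empirically is to time one epoch at several batch sizes. The sketch below uses an assumed toy model and dataset; it illustrates the measurement, it is not a recipe from the post:

```python
import time
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

data = TensorDataset(torch.randn(20_000, 128), torch.randn(20_000, 1))

for bs in (32, 128, 512):
    model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 1))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    loader = DataLoader(data, batch_size=bs, shuffle=True)

    start = time.perf_counter()
    for xb, yb in loader:               # one full epoch at this batch size
        opt.zero_grad()
        nn.functional.mse_loss(model(xb), yb).backward()
        opt.step()
    print(f"batch_size={bs:>4}: {time.perf_counter() - start:.2f} s per epoch")
```

Keep in mind that wall-clock time per epoch alone is not the whole story: the larger batch sizes may need more epochs to reach the same accuracy.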
5. Why should batch_size be a power of 2?
Memory alignment and floating-point efficiency.
One of the main arguments for choosing batch sizes that are powers of 2 is that CPU and GPU memory architectures are themselves organized in powers of 2. More precisely, memory is managed in pages, each of which is a contiguous block of memory. On macOS or Linux you can check the page size by running getconf PAGESIZE in a terminal, which should return a power of 2. For details, please refer to this article.
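For what it's worth, the same page size can also be read from Python (a small aside, not from the original article):

```python
import mmap

print(mmap.PAGESIZE)   # typically 4096, i.e. 2**12, on Linux and macOS
```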
References
- https://www.cnblogs.com/Wanggcong/p/4699932.html
- https://cloud.tencent.com/developer/article/1358478
- "Statistical Learning Methods" Li Hang