Introduction to the Memory Wall
2022-07-27 09:26:00 【Dusk vending machine】
Memory wall
The following is excerpted from an article on the First-Class Technology website.
- In recent years, the amount of computation used in CV, NLP, and speech recognition has been growing at roughly 15x every two years, driving the development of AI hardware
- Communication bandwidth bottleneck: communication within a chip, between chips, and between AI accelerators has become the bottleneck for many AI applications
- Average model size grows by about 240x every two years, but the memory capacity of AI hardware grows by only about 2x every two years
- Training an AI model requires several times more memory than the model itself, since gradients and optimizer state must be stored alongside the parameters (see the rough estimate after this list)
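To make the last point concrete, here is a back-of-the-envelope estimate (my own illustration, not from the excerpted article), assuming plain fp32 training with the Adam optimizer and counting only parameters, gradients, and optimizer state while ignoring activations:

```python
# Rough per-parameter memory for fp32 training with Adam (illustrative assumption):
#   parameters: 4 bytes, gradients: 4 bytes, Adam momentum + variance: 4 + 4 bytes.
# Activation memory is ignored here, so real training needs even more.

def model_only_gb(num_params: float) -> float:
    return num_params * 4 / 1e9                   # just the fp32 weights

def training_state_gb(num_params: float) -> float:
    bytes_per_param = 4 + 4 + 4 + 4               # weights + grads + Adam m and v
    return num_params * bytes_per_param / 1e9

for n in (1e9, 10e9, 100e9):                      # 1B, 10B, 100B parameters
    print(f"{n/1e9:>4.0f}B params: model {model_only_gb(n):7.1f} GB, "
          f"training state {training_state_gb(n):8.1f} GB")
```

Even under these simple assumptions the training state is about 4x the size of the model itself, which is why memory capacity runs out long before compute does.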
The memory wall is not only a matter of memory capacity; it also involves memory bandwidth.
- It involves data transfer at multiple levels of the memory hierarchy:
for example, between compute units and on-chip memory, between compute units and main memory, or between processors sitting in different sockets.
In all of these cases, memory capacity and transfer speed lag far behind the hardware's compute capability.
- A distributed strategy can scale training out across multiple AI accelerators to get around the memory capacity and bandwidth limits of a single device, but the hardware then runs into communication bottlenecks, which are even slower than on-chip data movement
- Scaling out with a distributed strategy is only the right answer for compute-bound problems with little communication and data transfer (see the rough comparison after this list)
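The rough comparison below (all hardware constants are placeholders I chose for illustration, not measurements or figures from the article) contrasts, for a single bandwidth-bound operation (a matrix-vector product, as in batch-1 inference), the time spent on arithmetic versus the time spent streaming data from device memory, and adds the time to all-reduce the gradients of a 1B-parameter model across devices:

```python
# Back-of-the-envelope timings; every hardware constant below is an illustrative placeholder.
PEAK_FLOPS      = 300e12   # 300 TFLOP/s of dense fp16 compute
DRAM_BW         = 1.5e12   # 1.5 TB/s of device-memory bandwidth
INTERCONNECT_BW = 50e9     # 50 GB/s of effective inter-device bandwidth

# One (8192 x 8192) matrix times an 8192-long vector, fp16:
n = 8192
flops       = 2 * n * n                 # one multiply-add per weight
bytes_moved = n * n * 2                 # the weight matrix has to stream from DRAM
t_compute   = flops / PEAK_FLOPS
t_memory    = bytes_moved / DRAM_BW

# Ring all-reduce of 1B fp16 gradients moves roughly 2x the gradient bytes per device:
grad_bytes  = 1e9 * 2
t_allreduce = 2 * grad_bytes / INTERCONNECT_BW

print(f"arithmetic : {t_compute * 1e3:9.4f} ms")
print(f"DRAM       : {t_memory * 1e3:9.4f} ms")
print(f"all-reduce : {t_allreduce * 1e3:9.4f} ms")
```

With these numbers the memory traffic takes roughly 200x longer than the arithmetic it feeds, and the gradient all-reduce is slower by several more orders of magnitude: adding FLOPS does not help once bandwidth is the limit.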
Promising solutions for breaking the memory wall
To keep innovating and "break through the memory wall", we need to rethink the design of AI models. A few key points:
- Current approaches to designing AI models are mostly ad hoc, or rely only on very simple scaling rules. (I did not fully understand this point.)
- We need more data-efficient methods for training AI models; current network training is very inefficient, requiring huge amounts of data and hundreds of thousands of iterations
- Existing optimization and training methods require tuning many hyperparameters, and it often takes hundreds of trial-and-error runs to find settings that train successfully
- SOTA networks have grown so large that merely deploying them is a serious challenge, while AI hardware development has focused mainly on increasing compute power and much less on improving memory
Training algorithms
- Difficulty: hyperparameter tuning is largely brute-force, exploratory trial and error
- A promising line of work is Microsoft's ZeRO paper: redundant optimizer state can be removed or sharded across devices [21, 3], making it possible to train a model 8x larger with the same memory consumption. If the overhead these higher-order methods introduce can be addressed, the total cost of training large models could drop significantly.
- Improve the locality of optimization algorithms and reduce their memory usage, at the cost of extra computation
- Design optimization algorithms that are robust enough for low-precision training (a minimal sketch combining this with ZeRO-style sharding follows this list)
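A minimal sketch (my own, not code from the article) showing two of the ideas above together: sharding the Adam optimizer state across data-parallel ranks with PyTorch's built-in ZeroRedundancyOptimizer (an implementation of the ZeRO idea referenced above) and training in low precision with torch.cuda.amp. The toy model, the random data, and the hyperparameters are placeholders, and it assumes it is launched with torchrun on CUDA devices.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.optim import ZeroRedundancyOptimizer

def main():
    dist.init_process_group(backend="nccl")               # one process per GPU, via torchrun
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    model = torch.nn.Sequential(                           # placeholder model
        torch.nn.Linear(1024, 4096), torch.nn.ReLU(), torch.nn.Linear(4096, 1024)
    ).cuda()
    model = DDP(model, device_ids=[local_rank])

    # ZeRO-style sharding: Adam's momentum/variance tensors are partitioned across
    # ranks instead of replicated, so per-GPU optimizer memory shrinks roughly by
    # the number of ranks.
    optimizer = ZeroRedundancyOptimizer(
        model.parameters(), optimizer_class=torch.optim.Adam, lr=1e-3
    )
    scaler = torch.cuda.amp.GradScaler()                   # loss scaling for fp16 training

    for step in range(10):                                 # toy loop on random data
        x = torch.randn(32, 1024, device="cuda")
        target = torch.randn(32, 1024, device="cuda")
        optimizer.zero_grad(set_to_none=True)
        with torch.cuda.amp.autocast():                    # forward/backward mostly in fp16
            loss = torch.nn.functional.mse_loss(model(x), target)
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched as, for example, `torchrun --nproc_per_node=4 train_sketch.py` (the file name is hypothetical); with 4 ranks, each GPU holds only about a quarter of the optimizer state.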
Efficient deployment
- Pruning: remove redundant parameters from the model
- Quantization: reduce numerical precision (see the sketch below)
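A minimal sketch of both deployment techniques (again my own illustration, using standard PyTorch utilities rather than anything from the article): magnitude pruning with torch.nn.utils.prune, followed by post-training dynamic quantization of the Linear layers to int8. The model and the 50% sparsity level are arbitrary placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))

# 1) Pruning: zero out the 50% of weights with the smallest magnitude in each
#    Linear layer, then bake the mask permanently into the weight tensor.
for module in model.modules():
    if isinstance(module, nn.Linear):
        prune.l1_unstructured(module, name="weight", amount=0.5)
        prune.remove(module, "weight")

# 2) Dynamic quantization: store Linear weights as int8 and dequantize on the fly.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 512)
print(quantized(x).shape)   # torch.Size([1, 10])
```

Unstructured zeros only pay off if the runtime exploits sparsity, while int8 weights cut storage by about 4x versus fp32; both reduce the memory a deployed model has to move per inference.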
Conclusion
The compute requirements of SOTA Transformer models in NLP are currently growing at a rate of about 750x every two years, and their parameter counts at about 240x every two years. By comparison, peak hardware compute grows only about 3.1x every two years, and DRAM and interconnect bandwidth grow about 1.4x every two years, so both have gradually fallen behind demand. To put these numbers in perspective: over the past 20 years peak hardware compute has increased about 90,000x, while DRAM/interconnect bandwidth has increased only about 30x. If this trend continues, data movement, and in particular on-chip and chip-to-chip transfer, will quickly become the bottleneck for training large-scale AI models. We therefore need to rethink AI model training, deployment, and the models themselves, and to think about how to design AI hardware in the face of this increasingly challenging memory wall.
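As a quick sanity check on these figures (my own arithmetic, treating 20 years as ten two-year periods at the stated rates):

```python
# Compound the stated per-two-year growth rates over 20 years (ten periods).
periods = 20 // 2
peak_compute_growth = 3.1 ** periods   # ~82,000x, the same order as the ~90,000x figure
bandwidth_growth    = 1.4 ** periods   # ~29x, matching the ~30x figure
print(f"peak compute over 20 years: {peak_compute_growth:,.0f}x")
print(f"bandwidth over 20 years:    {bandwidth_growth:,.0f}x")
```

The compounded rates land close to the 20-year totals quoted above, so the two sets of numbers are consistent: compute capability has pulled away from bandwidth by a factor of roughly 3,000 over two decades.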