AI accelerators explained: why is this the golden age of AI accelerators?

2022-07-04 03:06:00 Computer Vision Research Institute

Selected from Medium

Author: Adi Fuchs

Compiled by Machine Heart

Machine Heart editorial department

In the previous article, Adi Fuchs, a former Apple engineer with a PhD from Princeton University, explained the motivation behind the birth of AI accelerators. In this article, we follow the author's line of thought to review the development history of processors and see why AI accelerators have become the focus of the industry.

From "Almost Human"

This is the second post in the blog series, and it brings us to the crux of the whole series. When a new company or project is being pitched, venture capitalists or executives often ask a basic question: "Why now?"

To answer this question, we need to briefly review the history of processors and look at the major changes this field has seen in recent years.

What is a processor?

In short, the processor is the part of a computer system responsible for the actual numerical computation. It receives the user's input data (represented as numbers) and generates new data according to the user's request, that is, it performs the set of arithmetic operations the user wants. The processor uses its arithmetic units to produce the results, which is what running a program means.

In the 1980s, processors were commercialized in personal computers. They gradually became an indispensable part of daily life: they are found in laptops, in mobile phones, and in the global computing infrastructure that connects billions of cloud and data center users. With compute-intensive applications growing ever more popular and new user data emerging in huge volumes, contemporary computing systems must meet a growing demand for processing power. We therefore always need better processors. For a long time, "better" meant "faster", but now it can also mean "more efficient": taking the same time while using less energy and leaving a smaller carbon footprint.

Processor evolution

The evolution of computer systems is one of humanity's most outstanding engineering achievements. It took us about 50 years to reach this height: an ordinary smartphone in a person's pocket has one million times the computing power of the room-sized computers used in the Apollo moon landing missions. The key to this evolution lies in the semiconductor industry and in how it improved the speed, power, and cost of processors.

[Figure: The Intel 4004, the first commercial microprocessor, released in 1971.]

Processors are made of electronic components called "transistors". Transistors are logic switches that serve as the building blocks for everything from primitive logic functions (such as AND, OR, and NOT) to complex arithmetic (floating-point addition, sine functions) and memory (such as ROM and DRAM). Over the years, transistors have kept shrinking.

In 1965, Gordon Moore observed that the number of transistors in an integrated circuit was doubling every year (later revised to every 18-24 months). He predicted that this trend would continue for at least ten years. Some argue that this is less a "law" than an "industry trend", but it did last for about 50 years, making it one of the longest-lasting man-made trends in history.

[Figure: The electrical characteristics of transistor scaling.]

But besides Moore's law, there is a less famous but equally important law. It is called "Dennard scaling" and was formulated by Robert Dennard in 1974. While Moore's law predicted that transistors would shrink year after year, Dennard asked: "Apart from being able to fit more transistors on a single chip, what are the practical benefits of smaller transistors?" His observation was that when a transistor shrinks by a factor of k, its current also decreases. In addition, because electrons travel a shorter distance, the resulting transistor is k times faster and, most importantly, its power drops to 1/k^2. Overall, then, we can pack in k^2 times more transistors and the logic runs about k times faster, yet the chip's power consumption does not increase.
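The arithmetic behind this observation can be sketched in a few lines of Python. This is only an illustration under the idealized assumption that both voltage and current scale by 1/k (the text mentions the current; the voltage assumption and the function name `dennard_scale` are added here for the example).

```python
def dennard_scale(k: float) -> dict:
    """Relative changes when a transistor shrinks by a factor k (idealized model)."""
    voltage = 1 / k                      # assumption: supply voltage scales with size
    current = 1 / k                      # current also decreases, as the text states
    speed = k                            # electrons travel a shorter distance
    power_per_transistor = voltage * current      # P = V * I, so 1/k^2
    transistors_per_area = k ** 2                 # both dimensions shrink by k
    power_per_area = power_per_transistor * transistors_per_area
    return {
        "speed": speed,
        "power_per_transistor": power_per_transistor,
        "transistors_per_area": transistors_per_area,
        "power_per_area": power_per_area,
    }

if __name__ == "__main__":
    # With k = 2: four times the transistors, twice the speed, same chip power.
    print(dennard_scale(2))
```

The point of the sketch is the last line of the function: the k^2 gain in transistor count and the 1/k^2 drop in per-transistor power cancel, so power per unit area stays constant.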

The first stage of processor development: the frequency era (1970-2000)

[Figure: Evolution of microprocessor frequency.]

In the early days, the microprocessor industry focused mainly on the CPU, because the CPU was the workhorse of computer systems at the time. Microprocessor manufacturers took full advantage of the scaling law. Specifically, their goal was to raise CPU frequency, because faster transistors let a processor perform the same computation at a higher rate (higher frequency = more computations per second). This is a somewhat simplistic view, since processors also saw many architectural innovations, but in the end frequency contributed a great deal to performance in those early years: from 0.5 MHz for the Intel 4004, to 50 MHz for the 486, 500 MHz for the Pentium, and 3-4 GHz for the Pentium 4 series.

[Figure: Evolution of power density.]

Around 2000, Dennard scaling began to break down. Specifically, as frequency rose, voltage stopped falling at the same rate, so power density started to climb. Had this trend continued, chip heating would have become impossible to ignore, and powerful cooling solutions were not yet mature. Vendors therefore could no longer rely on raising CPU frequency to obtain higher performance and had to look for other ways.

The second stage of processor development: the multi-core era (2000s to mid-2010s)

A stagnant CPU frequency means it becomes very hard to speed up a single application, because a single application is written as one sequential instruction stream. However, as Moore's law says, the transistors on our chips double every 18 months. So the solution this time was not to speed up a single processor but to divide the chip into multiple identical processing cores, each executing its own instruction stream.

[Figure: Evolution of CPU and GPU core counts.]

For CPUs, having multiple cores is natural, because a CPU already executes several independent tasks concurrently, such as your Internet browser, word processor, and music player (more precisely, the operating system does a good job of creating this abstraction of concurrent execution). So one application can run on one core while another application runs on a different core. In this way, a multi-core chip can complete more tasks in a given time. To speed up a single program, however, the programmer needs to parallelize it, which means decomposing the original program's instruction stream into multiple instruction "sub-streams", or "threads". Simply put, a set of threads can run concurrently on multiple cores in any order, with no thread interfering with the execution of another. This practice is called "multithreaded programming", and it is the most common way for a single program to gain performance from multi-core execution.
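As a rough illustration of splitting one instruction stream into independent threads, the sketch below divides a list into chunks and hands each chunk to a worker thread. The function names are hypothetical. Note also that in standard CPython builds the global interpreter lock keeps threads from executing Python bytecode simultaneously, so this shows the decomposition pattern rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor

def sum_of_squares(chunk):
    """Work done by one thread: it touches only its own chunk."""
    return sum(x * x for x in chunk)

def parallel_sum_of_squares(values, num_threads=4):
    """Split the work into sub-streams, run them concurrently, combine the results."""
    chunk_size = max(1, (len(values) + num_threads - 1) // num_threads)
    chunks = [values[i:i + chunk_size] for i in range(0, len(values), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        # The chunks are independent, so the order in which threads finish does not matter.
        partial_sums = list(pool.map(sum_of_squares, chunks))
    return sum(partial_sums)

if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(10))))  # prints 285
```

Because no thread reads or writes another thread's chunk, no locking is needed; the only sequential step is the final sum over the partial results.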

A common form of multi-core execution is found in GPUs. While a CPU consists of a small number of fast, complex cores, a GPU relies on a large number of simpler cores. Generally speaking, GPUs focus on graphics applications, because graphics and images (for example, the frames of a video) consist of thousands of pixels that can be processed independently by a series of simple, predetermined computations. Conceptually, each pixel can be assigned a thread that executes a simple "mini program" to compute its behavior (such as color and brightness level). This high degree of pixel-level parallelism makes it natural to develop thousands of processing cores. So in the next round of processor evolution, CPU and GPU vendors did not speed up individual tasks; instead, they used Moore's law to increase the number of cores, because they could still obtain and use more transistors on a single chip.
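The per-pixel "mini program" idea can be sketched as follows. This is plain sequential Python, not GPU code; the names `brighten` and `shade_image` and the RGB tuple layout are assumptions made for the example. What matters is that the function for one pixel never looks at any other pixel, which is what would let a GPU assign one thread per pixel.

```python
def brighten(pixel, gain=2):
    """Mini program for one pixel: scale each color channel and clamp to 255."""
    red, green, blue = pixel
    return tuple(min(255, int(channel * gain)) for channel in (red, green, blue))

def shade_image(pixels, gain=2):
    """Apply the same mini program to every pixel independently.

    On a GPU each call could run on its own thread; here they simply run in turn.
    """
    return [brighten(pixel, gain) for pixel in pixels]

if __name__ == "__main__":
    image = [(10, 100, 200), (0, 0, 0), (128, 128, 128)]
    print(shade_image(image))
```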

Unfortunately, around 2010 things became more complicated: Dennard scaling came to an end, because transistor voltage was approaching its physical limit and could not keep shrinking. Previously it had been possible to increase the number of transistors while keeping the same power budget, but now doubling the number of transistors meant doubling the power consumption. The demise of Dennard scaling means that contemporary chips run into the "utilization wall". At this point it no longer matters how many transistors a chip has: as long as there is a power limit (set by the chip's cooling capacity), we cannot use more than a given fraction of the chip's transistors. The rest of the chip must be powered off, a phenomenon also known as "dark silicon".
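A toy model makes the utilization wall concrete. It assumes, as a simplification, that every transistor draws the same fixed power and that the power budget is a single number; the function names and the units are hypothetical.

```python
def active_fraction(num_transistors, power_per_transistor, power_budget):
    """Fraction of transistors that can be powered on within a fixed power budget."""
    if num_transistors <= 0 or power_per_transistor <= 0:
        raise ValueError("transistor count and per-transistor power must be positive")
    demand = num_transistors * power_per_transistor
    return min(1.0, power_budget / demand)

def dark_fraction(num_transistors, power_per_transistor, power_budget):
    """Fraction of the chip that must stay powered off ("dark silicon")."""
    return 1.0 - active_fraction(num_transistors, power_per_transistor, power_budget)

if __name__ == "__main__":
    budget = 100.0  # arbitrary power units, fixed by cooling capacity
    for count in (100, 200, 400):
        print(count, dark_fraction(count, 1.0, budget))
```

In this model, each doubling of the transistor count at a fixed budget halves the usable fraction of the chip, which is the sense in which extra transistors stop helping.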

The third stage of processor development: the accelerator era (2010s to present)

Dark silicon is essentially a preview of "the end of Moore's law", and times have become challenging for processor manufacturers. On the one hand, computing demand is growing rapidly: smartphones have become ubiquitous and pack powerful computing capabilities, cloud servers need to handle more and more services, and, "worst of all", artificial intelligence is back on the stage of history and devouring computing resources at an astonishing rate. On the other hand, in this unfortunate era, dark silicon has become an obstacle to the progress of transistor-based chips. So just when we need more processing power than ever, delivering it has become harder than ever.

[Figure: The amount of compute required to train SOTA AI models.]

Since new generations of chips became bound by dark silicon, the computer industry has turned its attention to hardware accelerators. The idea is this: if you cannot add more transistors, make better use of the ones you have. How? The answer is specialization.

Conventional CPUs are designed to be general-purpose. They use the same hardware structures to run the code of all our applications (operating systems, word processors, calculators, Internet browsers, email clients, media players, and so on). These hardware structures need to support a large number of logical operations and capture many possible patterns and program-induced behaviors. The result is hardware that is highly usable but fairly inefficient. If we focus on only certain applications, we can narrow the problem domain and thereby remove a lot of structural redundancy from the chip.

[Figure: General-purpose CPU vs. application-specific accelerators.]

Accelerators are chips designed specifically for particular applications or domains. In other words, they do not run every application (for example, they do not run the operating system); instead, they target a very narrow scope at the hardware design level, because 1) their hardware structures only serve the operations of specific tasks, and 2) the interface between hardware and software is simpler. Specifically, because an accelerator operates within a given domain, accelerator programs should be more compact, since they encode less.

Here is an example. Suppose you want to open a restaurant, but your floor space and electricity budget are limited. Now you have to decide what this restaurant will serve: (a) pizza, vegetarian food, hamburgers, and sushi all together, or (b) only pizza?

If you choose (a), your restaurant can indeed satisfy many customers with different tastes, but your chef has to cook many kinds of dishes and will not be good at all of them. You may also need to buy several refrigerators to store the different ingredients and keep close track of which ingredients have run out and which have gone bad. Different ingredients may also get mixed up, and management costs rise sharply.

If you choose (b), you can hire a top pizza expert, stock a small set of ingredients, and buy a custom oven for making pizza. Your kitchen will be very tidy and efficient: one table for making dough, one table for sauce and cheese, and one table for toppings. But this approach also carries risk: what if nobody wants pizza tomorrow? What if the pizza customers want cannot be made in your custom oven? You have spent a lot of money building this specialized kitchen, and now you face a dilemma: if you do not remodel the kitchen you may have to close the restaurant, but remodeling costs a lot of money, and by the time you finish, customers' tastes may have changed again.

Back to the processor world: in terms of the example above, the CPU corresponds to option (a), a domain-specific accelerator corresponds to option (b), and the limit on the restaurant's size corresponds to the silicon budget. How will you design your chip? Obviously, reality is not this polarized; there is a spectrum in between. Along this spectrum, designers trade more or less generality for efficiency. Early hardware accelerators were designed for specific domains such as digital signal processing and network processing, or served as auxiliary coprocessors to the main CPU.

The first major shift from CPUs toward accelerated applications was the GPU. A CPU has a few complex processing cores, each of which uses various techniques, such as branch predictors and out-of-order execution engines, to speed up single-threaded work as much as possible. GPUs are structured differently. A GPU consists of many simple cores, which have simple control flow and run simple programs. Initially, GPUs were used for graphics applications such as computer games, because these applications contain images made up of thousands or millions of pixels, each of which can be computed independently and in parallel. A GPU program usually consists of a number of core functions called "kernels". Each kernel contains a series of simple computations and is executed thousands of times on different pieces of data (such as a single pixel or a patch of several pixels). These properties make graphics applications a natural target for hardware acceleration. Their behavior is simple, so they do not need complex instruction control flow in the form of branch predictors; they require only a few kinds of operations, so they do not need complex arithmetic units (for example, units that compute sine functions or 64-bit floating-point division). People later discovered that these properties are not unique to graphics applications, and that the GPU's applicability extends to other domains, such as linear algebra and scientific applications. Today, accelerated computing is no longer limited to GPUs. From fully programmable but inefficient CPUs to efficient ASICs with limited programmability, the concept of accelerated computing is everywhere.

[Figure: Processing alternatives for deep neural networks. Source: Microsoft.]

Today, as more and more applications with "good" properties become targets for acceleration, accelerators are attracting growing attention: video codecs, database processors, cryptocurrency miners, molecular dynamics, and of course artificial intelligence.

What makes AI an acceleration target?

Commercial feasibility

Designing a chip is laborious and expensive: you need to hire industry experts, use costly tools for chip design and verification, develop prototypes, and manufacture the chips. If you want to use a cutting-edge process (for example, today's 5nm CMOS), the cost will reach tens of millions of dollars, whether you succeed or fail. Fortunately, for AI, money is not a problem. The potential gains from AI are enormous, and AI platforms are expected to generate trillions of dollars in revenue in the near future. If your idea is good enough, you should be able to find funding for this work fairly easily.

AI is an "acceleratable" application domain

AI programs have all the properties that make them suitable for hardware acceleration. First and foremost, they are massively parallel: most of the computation is spent on tensor operations such as convolutions or self-attention operators. Where possible, you can also increase the batch size so that the hardware processes multiple samples at a time, which improves hardware utilization and further increases parallelism. Parallel computation is the main factor that enables hardware processors to run fast. Second, AI computation is limited to a few kinds of operations: mainly the multiplications and additions of linear algebra kernels, some nonlinear operators such as ReLU, which mimics synaptic activation, and the exponentiation used in softmax-based classification. This narrow problem space lets us simplify the compute hardware and focus on certain operators.
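To show how small this set of operations is, here is a minimal pure-Python sketch of the three kinds mentioned: multiply-add (a matrix product), a ReLU nonlinearity, and the exponentiation inside softmax. The layer structure and the name `tiny_classifier` are invented for illustration; real frameworks implement these operators very differently.

```python
import math

def matmul(a, b):
    """Matrix product: nothing but multiplications and additions."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]

def relu(row):
    """Nonlinear operator: negative values become zero."""
    return [max(0.0, x) for x in row]

def softmax(row):
    """Exponentiate and normalize so the outputs sum to 1."""
    largest = max(row)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(x - largest) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def tiny_classifier(batch, weights):
    """Each row of `batch` is one sample; all rows are processed independently."""
    hidden = [relu(row) for row in matmul(batch, weights)]
    return [softmax(row) for row in hidden]

if __name__ == "__main__":
    batch = [[1.0, 2.0], [0.5, -1.0]]          # two samples (batch size 2)
    weights = [[1.0, -1.0], [0.5, 2.0]]        # hypothetical 2x2 weight matrix
    print(tiny_classifier(batch, weights))
```

Every sample in the batch and every output element of the matrix product can be computed independently, which is the parallelism the paragraph above describes.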

Finally, because an AI program can be expressed as a computation graph, the control flow is known at compile time, much like a for loop with a known number of iterations. Communication and data reuse patterns are also fairly constrained, so we can characterize which network topologies are needed to move data between the different compute units, and use software-defined scratchpad memory to control how data is stored and arranged.
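A small sketch can illustrate what "control flow known at compile time" means for a computation graph. The three-node graph, the node names, and the helper functions below are hypothetical; the only library call used is `graphlib.TopologicalSorter` from the Python standard library (3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical graph: each node lists the inputs it consumes, in argument order.
DEPS = {
    "scale": ["x", "w"],
    "shift": ["scale", "b"],
    "relu": ["shift"],
}
OPS = {
    "scale": lambda x, w: x * w,
    "shift": lambda s, b: s + b,
    "relu": lambda v: max(0, v),
}

def compile_schedule(deps):
    """Fix the execution order from the graph alone, before any data is seen."""
    return list(TopologicalSorter(deps).static_order())

def run(schedule, feeds):
    """Execute the precomputed schedule; there are no data-dependent branches."""
    values = dict(feeds)
    for node in schedule:
        if node in OPS:  # input nodes such as "x" already have values
            values[node] = OPS[node](*(values[d] for d in DEPS[node]))
    return values

SCHEDULE = compile_schedule(DEPS)

if __name__ == "__main__":
    print(SCHEDULE)
    print(run(SCHEDULE, {"x": 2, "w": 3, "b": 1})["relu"])
```

The schedule is computed once and reused for every input, so a compiler (or hardware designer) can see in advance which values move between which operations.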

AI algorithms are built in a hardware-friendly way

Not long ago, if you wanted to innovate in computer architecture, you might have said: "I have a new idea for an architectural improvement that can significantly improve something, but all I need to do is change the programming interface slightly and have programmers use this feature." Back then, that idea would not fly. The programmer's API was untouchable, and it was hard to burden programmers with low-level details that disrupt the "clean" semantic flow of a program.

Moreover, mixing underlying architectural details with programmer-facing code is not good practice. First, it is not portable, because some architectural features change between chip generations. Second, it is prone to programming errors, because most programmers do not have a deep understanding of the underlying hardware.

You could argue that GPUs and multi-core CPUs have already departed from the traditional programming model because of multithreading (and sometimes even the memory wall), but since single-thread performance was no longer growing exponentially, we had to resort to multithreaded programming because it was our only option. Multithreaded programming is still hard to master and takes a lot of education. Fortunately, when people write AI programs, they build computation graphs out of neural layers and other well-defined blocks.

High-level program code (for example, code in TensorFlow or PyTorch) is already written in a way that marks parallel blocks and builds a dataflow graph. So in theory you can build a rich software library and a sufficiently sophisticated compiler toolchain that understand the semantics of the program and lower it efficiently to a hardware representation, without any involvement from the programmers developing the application. Data scientists can do their work without caring what hardware the task runs on. In practice, it will take time for compilers to fully mature.

There are few other options

AI is everywhere: it can be found in large data centers, smartphones, sensors, robots, and autonomous vehicles. Each system has different practical constraints: people certainly do not want an autonomous vehicle to fail to detect obstacles because it has too little computing power, and it is equally unacceptable to spend thousands of dollars a day training a very large pre-trained model because of poor efficiency. In AI, no single chip fits every scenario. Compute demand is enormous, and every bit of efficiency translates into large amounts of time, energy, and cost. Without suitable acceleration hardware to meet your AI needs, your ability to experiment and make discoveries with AI will be limited.

Link to the original text:

https://medium.com/@adi.fu7/ai-accelerators-part-ii-transistors-and-pizza-or-why-do-we-need-accelerators-75738642fdaa
