21 | Pipeline-Oriented Instruction Design (Part 2): How Did the Pentium 4 Fail?

2022-06-13 07:41:00 【luoganttcc】
Last time, I gave you a first introduction to CPU pipelining. At first glance, pipelining looks like a cure-all for performance. It splits the execution of an instruction into finer steps, which keeps parts of the CPU from sitting idle. Because each pipeline step is very simple, a single clock cycle can be made shorter, and that is what allowed CPU clock frequencies to climb so quickly.
This series of advantages also led to the last great battle among modern desktop CPUs: the competition between Intel's Pentium 4 and AMD's Athlon. On the technical side, Intel lost this war outright. The NetBurst architecture used by the Pentium 4 and the later Pentium D was abandoned completely and left the stage of history. On the business side, however, Intel relied on financial resources far greater than AMD's, a larger market share, and every competitive means available. After finally giving up on NetBurst altogether, it defeated AMD with the new Core brand.
After that, competition in the CPU field no longer centered on the desktop battle between Intel and AMD. Once ARM had ridden the rapid spread of smartphones to come from behind and overtake Intel, the CPU battle of the mobile era became a "Romance of the Three Kingdoms" among Qualcomm, Huawei's Kirin, and Samsung.
The "Clock Frequency War" and the Super-Long Pipeline
As I mentioned in Lecture 3, we cannot measure the performance of a CPU, let alone a whole computer, by clock frequency alone. Different CPUs have different architectures and implementations, so two CPUs with the same clock frequency can differ greatly in actual performance. For that reason, the industry usually prefers benchmark programs such as SPEC, which measure a computer's performance across many different real-world workloads.
However, benchmark scores are still too complicated for consumers. Before the Pentium 4 came out, the vast majority of consumers did not judge CPU performance by benchmark results. They usually looked only at the clock frequency. CPU manufacturers, in turn, kept raising clock frequencies and treated them as the core metric of technical competition.
Intel had always stayed ahead in the "clock frequency war", but at the turn of the century, between 1999 and 2000, things changed.
In 1999, AMD released the Athlon processor based on the K7 architecture, whose overall performance exceeded that of the Pentium III of the same year. In 2000, while most CPUs were still running at 500 to 850 MHz, AMD launched the first-generation Athlon 1000, the first consumer CPU with a 1 GHz clock. Around 2000, AMD's CPUs not only beat Intel's in performance and clock frequency, but often cost only two-thirds as much.
Under this heavy external pressure, Intel launched CPUs based on the new NetBurst architecture in late 2000: the Pentium 4 and, later, the Pentium D. The Pentium 4's most prominent feature was its high clock frequency. The Athlon 1000 of 2000 had the highest clock frequency of its day, 1 GHz, yet the Pentium 4 was designed for a maximum clock frequency of 10 GHz.
To reach that 10 GHz target, Intel's engineers made a seriously wrong decision: they gave the NetBurst architecture a super-long pipeline. How long is super-long? Compare it with the CPUs that came before and after the Pentium 4 and you will see.
The Pentium III, which preceded the Pentium 4, has a pipeline depth of 11 stages. That is, an instruction is split into at most 11 smaller steps, and the CPU can have at most 11 different instructions in flight at once, each in a different stage. Even with today's technology, the ARM CPU in the phone you use every day, or an Intel i7 CPU in a server, has a pipeline depth of 14 stages.
As you can see, almost 20 years of technical progress have added only a little pipeline depth to modern CPUs. So what was the pipeline depth of the Pentium 4 released in 2000? The answer is 20 stages, almost double that of the Pentium III. In the 90 nm Pentium 4 code-named Prescott, Intel went further and increased the pipeline depth to 31 stages.
Keep in mind that, at the same clock frequency, increasing pipeline depth actually lowers CPU performance. Each pipeline stage takes one clock cycle. If we split a task into 31 stages, it takes 31 clock cycles to complete; if we split it into 11 stages, it takes only 11 clock cycles. Under these conditions, a 3 GHz CPU with 31 stages performs about the same as a 1 GHz CPU with 11 stages. In fact, because every stage carries the overhead of its own pipeline register, the deeper pipeline may well perform worse.
As I also said in the last lecture, pipelining does not shorten the response time of a single instruction, but it does raise throughput when many instructions are running. Different instructions take different amounts of time to execute. Consider the following example, in which we execute three instructions in sequence:
 An integer addition, which takes 200 ps.
 An integer multiplication, which takes 300 ps.
 A floating-point multiplication, which takes 600 ps.
If we run them on a single-cycle CPU, where every instruction takes one clock cycle, the clock cycle has to accommodate the most complex instruction, the floating-point multiplication, which needs 600 ps. So each of the three instructions takes 600 ps, and executing all three takes 1800 ps.
If instead we use a CPU with a 6-stage pipeline, each pipeline stage needs only 100 ps. While executing these three instructions, the second instruction starts as soon as instruction 1 has finished its first 100 ps stage, and the third starts as soon as the second has finished its first 100 ps stage. In this case, the total time needed to execute the three instructions in order is 800 ps. So within 1800 ps, the pipelined CPU can execute more than twice as many instructions as the single-cycle CPU.
The time from start to finish for each instruction has not changed; that is, the response time is the same. But the number of instructions completed in the same amount of time has increased; that is, throughput has gone up.
New Challenges: Hazards and Branch Prediction
At this point you may be asking: doesn't all this look great? The instruction set supported by Intel's CPUs is very large; as we said earlier, it has more than 2000 instructions. Some are simple and execute quickly. An unconditional jump, for example, needs no ALU computation at all and only has to update the PC register. Others are complex. A floating-point operation, for example, has to compare and align the exponents, shift the significands, and only then do the calculation. It is perfectly normal for the execution times of the two to differ by a factor of 20 or 30.
Given that, the Pentium 4's super-long pipeline looks quite reasonable. So why did the Pentium 4 end up as Intel's great failure at the architecture level?
The first reason, naturally, is the power consumption problem we discussed in Lecture 3. Increasing the pipeline depth has to go hand in hand with raising the CPU's clock frequency. Each pipeline stage does simpler work, which means less gets done in a single clock cycle. So only by raising the clock frequency can the CPU keep the response time of an instruction at the same level as before.
At the same time, a deeper pipeline needs more circuitry, which means more transistors.
A higher clock frequency and a larger transistor count both increase the CPU's power consumption. This problem made the Pentium 4 a heavy consumer of power and a heavy producer of heat throughout its life cycle. The Pentium 4 was on the market as Intel's flagship CPU from 2000 to 2004, exactly the period in which the laptop market was growing rapidly. Power consumption and heat dissipation are more serious problems on a laptop than on a desktop. However good the performance, if other people's laptops run for 2 hours and yours runs for only 30 minutes, nobody is going to love it!
Worse still, the Pentium 4's performance fell short as well. This brings us to the second point: the performance gain from pipelining described above is an ideal case. In real program execution, it may not be achievable.
Let's go back to the three-instruction example we just gave. What happens if the three instructions are the following three lines of code?
int a = 10 + 5;     // Instruction 1
int b = a * 2;      // Instruction 2
float c = b * 1.0f; // Instruction 3
We find that instruction 2 cannot start right after instruction 1 finishes its first stage, because instruction 2 depends on the result of instruction 1. Likewise, instruction 3 depends on the result of instruction 2. As a result, even with pipelining, the three instructions take 200 + 300 + 600 = 1100 ps to complete rather than the 800 ps given earlier. And if instructions 1 and 2 were also floating-point operations needing 600 ps each, this dependency would push the total to 1800 ps, the same as on the single-cycle CPU.
This dependency problem is what computer organization calls a hazard. Here we have listed only a dependency at the data level, that is, a data hazard. In practice there are other dependency problems as well, such as structural hazards and control hazards.
To deal with these hazards, we have solutions such as out-of-order execution and branch prediction. I will explain them in detail in later lectures.
However, the longer the pipeline, the harder these hazards are to resolve, because too many instructions are in flight at the same time. If we have only a 3-stage pipeline, we can move instructions that have no dependencies forward and execute them first. This is what we call out-of-order execution. For example, we can extend the 3 lines of code above with a few more lines.
int a = 10 + 5;     // Instruction 1
int b = a * 2;      // Instruction 2
float c = b * 1.0f; // Instruction 3
int x = 10 + 5;     // Instruction 4
int y = a * 2;      // Instruction 5
float z = b * 1.0f; // Instruction 6
int o = 10 + 5;     // Instruction 7
int p = a * 2;      // Instruction 8
float q = b * 1.0f; // Instruction 9
We do not have to execute instructions 1, 2, and 3 first. Instead, the pipeline can first execute instructions 1, 4, and 7, which have no dependencies on one another, and then execute 2, 5, 8 and finally 3, 6, 9. That way we make full use of the CPU's computing power.
However, with a 20-stage pipeline, we would have to make sure that 20 instructions have no dependencies on one another. The challenge suddenly becomes much greater. After all, in the programs we normally write, nearby lines of code usually depend on each other to some degree, and dozens of mutually independent instructions are hard to find. This is an important reason why super-long pipelines execute less efficiently.
Summary and Extension
By now, I believe you have a deeper understanding of CPU pipelining. You will find that pipelining, like other technologies, is all about trade-offs. A reasonable pipeline depth improves the throughput with which the CPU executes instructions. We usually use IPC (Instructions Per Cycle) to measure how efficiently a CPU executes instructions.
IPC is simply the reciprocal of the CPI (Cycles Per Instruction) we discussed in Lecture 3. In other words, IPC = 3 corresponds to CPI = 0.33. The IPC of both the Pentium 4 and the Pentium D was far lower than that of the previous-generation Pentium III and of the competing AMD Athlon CPUs.
A pipeline that is too deep not only fails to improve instruction throughput but also increases power consumption and heat output. In the laptop market, Intel itself soon gave up on the Pentium 4 and instead pushed Tualatin CPUs based on the Pentium III architecture.
The throughput gain from pipelining is only a theoretical value for the ideal case. In practice, we also have to deal with dependencies between instructions, and these make pipelines, especially super-long ones, execute very inefficiently. To resolve the dependencies behind these hazards, we need techniques such as out-of-order execution and branch prediction, which I will explain in detail in the next few lectures.
Recommended Reading
In addition to the textbooks recommended earlier, I suggest reading the article Modern Microprocessors, A 90-Minute Guide! It is written in an accessible way and covers many aspects of modern CPU design, which makes it a good weekend read for getting up to speed quickly.
Questions for Thought
Besides the data-level dependency discussed here, what other dependencies can you find in the execution of a program? What kind of hazard does each of them belong to?
You are welcome to share your questions and thoughts with me. You can also share today's content with your friends, and learn and improve together.