当前位置:网站首页>We should make clear the branch prediction
We should make clear the branch prediction
2022-07-02 01:13:00 【Basketball fans who write code】
The English name of branch prediction is 「Branch Prediction」
You can Google Search for this keyword on , You can see a lot about branch prediction , But find out how branch prediction works , That's the point .

The impact of branch prediction on the program
Let's take a look at the following two pieces of code
Code 1
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
//std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i) {
for (unsigned c = 0; c < arraySize; ++c) { // Primary loop
if (data[c] >= 128) sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << '\n';
std::cout << "sum = " << sum << '\n';
}Execution results
@ubuntu:/data/study$ g++ fenzhi.cpp && ./a.out
21.6046
sum = 314931600000 Code 2
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i) {
for (unsigned c = 0; c < arraySize; ++c) { // Primary loop
if (data[c] >= 128) sum += data[c];
}
}
double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << '\n';
std::cout << "sum = " << sum << '\n';
}Execution results :
@ubuntu:/data/study$ g++ fenzhi.cpp && ./a.out
8.52157
sum = 314931600000After the first code generates a random array , No sorting , The second code sorts random arrays , There is a great difference in the execution time .

therefore , What happened to them ?
The reasons for their different results , Namely Branch prediction , Branch prediction is CPU A prediction of a program by a processor , and CPU Architecture related , Many processors now have branch prediction function .
CPU While executing this code
if (data[c] >= 128) sum += data[c];CPU There will be a prediction mechanism in advance , For example, the previous execution results are true, Then next time, judge if When , It will default to be true To deal with it , Let the following instructions enter the pre installation in advance .
Of course , This judgment will not affect the actual result output , This judgment is only for CPU Execute code in parallel .
CPU The execution of an instruction is divided into several stages

Since it is implemented in stages , That is what we normally say pipeline( Pipeline execution ).
The assembly line workers only need to complete the content they are responsible for , There is no need to care about other people to deal with .
If I have a piece of code , as follows :
int a = 0;
a += 1;
a += 2;
a += 3;

From this picture we can see , We think it's implementation a = 0 After the end , Will execute a+=1.
But the actual CPU It's execution a=0 After the implementation of the first article of , Go ahead immediately a+=1 The first instruction of .
That's why , The execution speed has been greatly improved .
But for if() Language , When there is no branch prediction , We need to wait if() The next code can only be executed after the execution results appear .

If there is branch prediction

By comparison, we can find that , If there is branch prediction , It makes the execution speed faster .

So if you predict Failure , Will it affect the execution time , The answer is yes .
In the previous example , Without sorting the array , Most branch predictions will fail , At this time, the instruction will be retrieved and executed again after the execution , It will seriously affect the execution efficiency .
And in the example after sorting , Branch prediction has been in a successful state ,CPU The execution speed of has been greatly improved .

If the performance degradation caused by branch prediction is solved
There must be a certain decrease in branch prediction , The way to improve performance is not to use this damn if sentence .
such as , The above code , We can change it to this
#include <algorithm>
#include <ctime>
#include <iostream>
int main()
{
// Generate data
const unsigned arraySize = 32768;
int data[arraySize];
for (unsigned c = 0; c < arraySize; ++c)
data[c] = std::rand() % 256;
// !!! With this, the next loop runs faster.
//std::sort(data, data + arraySize);
// Test
clock_t start = clock();
long long sum = 0;
for (unsigned i = 0; i < 100000; ++i) {
for (unsigned c = 0; c < arraySize; ++c) { // Primary loop
int t = (data[c] - 128) >> 31;
sum += ~t & data[c];
}
}
double elapsedTime = static_cast<double>(clock()-start) / CLOCKS_PER_SEC;
std::cout << elapsedTime << '\n';
std::cout << "sum = " << sum << '\n';
} such as , The absolute value code we see , It also uses such ideas
/**
* abs - return absolute value of an argument
* @x: the value. If it is unsigned type, it is converted to signed type first.
* char is treated as if it was signed (regardless of whether it really is)
* but the macro's return type is preserved as char.
*
* Return: an absolute value of x.
*/
#define abs(x) __abs_choose_expr(x, long long, \
__abs_choose_expr(x, long, \
__abs_choose_expr(x, int, \
__abs_choose_expr(x, short, \
__abs_choose_expr(x, char, \
__builtin_choose_expr( \
__builtin_types_compatible_p(typeof(x), char), \
(char)({ signed char __x = (x); __x<0?-__x:__x; }), \
((void)0)))))))
#define __abs_choose_expr(x, type, other) __builtin_choose_expr( \
__builtin_types_compatible_p(typeof(x), signed type) || \
__builtin_types_compatible_p(typeof(x), unsigned type), \
({ signed type __x = (x); __x < 0 ? -__x : __x; }), other)Of course , You can write like this
int abs(int i){
if(i<0)
return ~(--i);
return i;
} So , At the end of the computer is mathematics
Reference resources :
https://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-processing-an-unsorted-array/11227902#11227902
https://blog.csdn.net/loongshawn/article/details/118339009
https://blog.csdn.net/DBC_121/article/details/105360658
边栏推荐
- How can programmers better plan their career development?
- ACM tutorial - quick sort (regular + tail recursion + random benchmark)
- What skills does an excellent software tester need to master?
- gradle
- Bubble Sort Graph
- Global and Chinese market of safety detection systems 2022-2028: Research Report on technology, participants, trends, market size and share
- XMIND mind map
- Docker安装Oracle_11g
- 2022 safety officer-a certificate examination questions and online simulation examination
- Mitsubishi PLC FX3U pulse axis jog function block (mc_jog)
猜你喜欢

Since I understand the idea of dynamic planning, I have opened the door to a new world

Upgraded wechat tool applet source code for mobile phone detection - supports a variety of main traffic modes

Datawhale 社区黑板报(第1期)

2022 safety officer-a certificate examination questions and online simulation examination

【八大排序①】插入排序(直接插入排序、希尔排序)

SQL injection for Web Security (2)

【图像增强】基于Frangi滤波器实现血管图像增强附matlab代码

Docker安装Oracle_11g

ACM教程 - 快速排序(常规 + 尾递归 + 随机基准数)

New version of free mobile phone, PC, tablet, notebook four terminal Website thumbnail display diagram online one click to generate website source code
随机推荐
ACM tutorial - quick sort (regular + tail recursion + random benchmark)
[WesternCTF2018]shrine writeup
【八大排序④】归并排序、不基于比较的排序(计数排序、基数排序、桶排序)
DTL dephossite | prediction method of dephosphorylation sites based on Transfer Learning
PLC Analog input analog conversion FB s_ ITR (Mitsubishi FX3U)
Based on Simulink and FlightGear, the dynamic control of multi rotor UAV in equilibrium is modeled and simulated
S32Kxxx bootloader之UDS bootloader
Excel PivotTable
AIX存储管理之总结篇
MySQL application day02
Deb file installation
AIX存储管理之卷组属性的查看和修改(二)
UDS bootloader of s32kxxx bootloader
教你白嫖Amazon rds一年并搭建MySQL云数据库(只需10分钟,真香)
Leetcode 45 Jumping game II (2022.02.14)
AIX存储管理之卷组的创建(一)
2022 operation of simulated examination platform for melting welding and thermal cutting work license
How does schedulerx help users solve the problem of distributed task scheduling?
How does schedulerx help users solve the problem of distributed task scheduling?
Global and Chinese market of aircraft MRO software 2022-2028: Research Report on technology, participants, trends, market size and share