当前位置:网站首页>After reading the average code written by Microsoft God, I realized that I was still too young
After reading the average code written by Microsoft God, I realized that I was still too young
2022-07-05 01:15:00 【QbitAl】
Bowen From the Aofei temple
qubits | official account QbitAI
Rounding The average of unsigned integers , Flowers can come out ?
see , Microsoft God Raymond Chen A recent long article directly detonated the extranet technology platform , Sparked countless discussions :
Countless people click in with great confidence : Isn't it a simple programming problem for primary school students who divide by two after addition ?
unsigned average(unsigned a, unsigned b)
{
return (a + b) / 2;
}
But follow the great God and dig deep , But gradually stare at the dog ……
It's not that simple to average
Let's start with the method that primary school students will know at the beginning , This simple method has a fatal flaw :
If the length of an unsigned integer is 32 position , So if the two added values are half the maximum length , Then only when adding in the first step , It will happen out of memory .
That is to say average(0x80000000U, 0x80000000U)=0.
But there are many solutions , Most experienced developers First Can think of , Is to limit the length of the number added in advance , Avoid spillovers .
There are two ways :
1、 When you know the larger value of two unsigned integers added , Subtract the smaller value and divide by two , In advance Reduce the length :
unsigned average(unsigned low, unsigned high)
{
return low + (high - low) / 2;
}
2、 Divide two unsigned integers in advance , At the same time through Bitwise AND Correct the lower digits , Ensure that when both integers are odd , The results are still correct .
( By the way , This is a patented method ,2016 Year out )
unsigned average(unsigned a, unsigned b)
{
return (a / 2) + (b / 2) + (a & b & 1);
}
These two are common ideas , Many netizens also said , The fastest thing I can think of is 2016 Patented method .
There are also ways that can be quickly thought of by the majority of netizens SWAR(SIMD within a register):
unsigned average(unsigned a, unsigned b)
{
return (a & b) + (a ^ b) / 2;// variant (a ^ b) + (a & b) * 2
as well as C++ 20 In version std: : midpoint function .
Next , The author puts forward The second way of thinking :
If the unsigned integer is 32 Bit and the size of the local register is 64 position , Or the compiler supports multi word operation , You can force the addition value into long integer data .
unsigned average(unsigned a, unsigned b)
{
// Suppose "unsigned" is a 32-bit type and
// "unsigned long long" is a 64-bit type.
return ((unsigned long long)a + b) / 2;
}
however , Here's a point that needs special attention :
It must be ensured that 64 The first bit of the register 32 All for 0, It won't affect the rest 32 A value .
Like x86-64 and aarch64 These architectures will automatically 32 A value Zero expansion by 64 A value :
// x86-64: Assume ecx = a, edx = b, upper 32 bits unknown
mov eax, ecx ; rax = ecx zero-extended to 64-bit value
mov edx, edx ; rdx = edx zero-extended to 64-bit value
add rax, rdx ; 64-bit addition: rax = rax + rdx
shr rax, 1 ; 64-bit shift: rax = rax >> 1
; result is zero-extended
; Answer in eax
// AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown
uxtw x0, w0 ; x0 = w0 zero-extended to 64-bit value
uxtw x1, w1 ; x1 = w1 zero-extended to 64-bit value
add x0, x1 ; 64-bit addition: x0 = x0 + x1
ubfx x0, x0, 1, 32 ; Extract bits 1 through 32 from result
; (shift + zero-extend in one instruction)
; Answer in x0
and Alpha AXP、mips64 And other architectures will 32 A value Symbol extension by 64 A value .
This time , You need to add an additional zeroing instruction , For example, through the deletion instruction of carry left two words rldicl:
// Alpha AXP: Assume a0 = a, a1 = b, both in canonical form
insll a0, #0, a0 ; a0 = a0 zero-extended to 64-bit value
insll a1, #0, a1 ; a1 = a1 zero-extended to 64-bit value
addq a0, a1, v0 ; 64-bit addition: v0 = a0 + a1
srl v0, #1, v0 ; 64-bit shift: v0 = v0 >> 1
addl zero, v0, v0 ; Force canonical form
; Answer in v0
// MIPS64: Assume a0 = a, a1 = b, sign-extended
dext a0, a0, 0, 32 ; Zero-extend a0 to 64-bit value
dext a1, a1, 0, 32 ; Zero-extend a1 to 64-bit value
daddu v0, a0, a1 ; 64-bit addition: v0 = a0 + a1
dsrl v0, v0, #1 ; 64-bit shift: v0 = v0 >> 1
sll v0, #0, v0 ; Sign-extend result
; Answer in v0
// Power64: Assume r3 = a, r4 = b, zero-extended
add r3, r3, r4 ; 64-bit addition: r3 = r3 + r4
rldicl r3, r3, 63, 32 ; Extract bits 63 through 32 from result
; (shift + zero-extend in one instruction)
; result in r3
Or directly access larger than the native register SIMD register , Of course , From general register to SIMD Registers must also increase memory consumption .
If the computer's processor supports carry addition , Then you can also use The third way of thinking .
At this time , If the register size is n position , Well, the two one. n The sum of unsigned integers of bits can be understood as n+1 position , adopt RCR( Shift right with carry cycle ) Instructions , You can get the correct average , Without losing the overflow bit .
△ Shift right with carry cycle
// x86-32
mov eax, a
add eax, b ; Add, overflow goes into carry bit
rcr eax, 1 ; Rotate right one place through carry
// x86-64
mov rax, a
add rax, b ; Add, overflow goes into carry bit
rcr rax, 1 ; Rotate right one place through carry
// 32-bit ARM (A32)
mov r0, a
adds r0, b ; Add, overflow goes into carry bit
rrx r0 ; Rotate right one place through carry
// SH-3
clrt ; Clear T flag
mov a, r0
addc b, r0 ; r0 = r0 + b + T, overflow goes into T bit
rotcr r0 ; Rotate right one place through carry
What if the processor does not support the right shift operation with carry cycle ?
Internal circulation can also be used (rotation intrinsic):
unsigned average(unsigned a, unsigned b)
{
#if defined(_MSC_VER)
unsigned sum;
auto carry = _addcarry_u32(0, a, b, &sum);
sum = (sum & ~1) | carry;
return _rotr(sum, 1);
#elif defined(__clang__)
unsigned carry;
sum = (sum & ~1) | carry;
auto sum = __builtin_addc(a, b, 0, &carry);
return __builtin_rotateright32(sum, 1);
#else
#error Unsupported compiler.
#endif
}
The result is ,x86 The code generation under the architecture has not changed much ,MSCver Code generation under architecture gets worse , and arm-thumb2 Of clang Better code generation .
// _MSC_VER
mov ecx, a
add ecx, b ; Add, overflow goes into carry bit
setc al ; al = 1 if carry set
and ecx, -2 ; Clear bottom bit
movzx ecx, al ; Zero-extend byte to 32-bit value
or eax, ecx ; Combine
ror ear, 1 ; Rotate right one position
; Result in eax
// __clang__
mov ecx, a
add ecx, b ; Add, overflow goes into carry bit
setc al ; al = 1 if carry set
shld eax, ecx, 31 ; Shift left 64-bit value
// __clang__ with ARM-Thumb2
movs r2, #0 ; Prepare to receive carry
adds r0, r0, r1 ; Calculate sum with flags
adcs r2, r2 ; r2 holds carry
lsrs r0, r0, #1 ; Shift sum right one position
lsls r1, r2, #31 ; Move carry to bit 31
adds r0, r1, r0 ; Combine
The thinking of Microsoft God
Raymond Chen1992 Joined Microsoft in 1996 , So far, he has served in 25 year , do UEX-Shell, Participate in Windows Development ,Windows Many aspects of the system are initially UI He started the architecture .
He was in MSDN Built on blogThe Old New Thing It is also a well-known pure technology output website in the industry .
The comment area of this blog is also haunted by various gods of Microsoft , Continue to explore .
Someone has proposed a new method , stay MIPS ASM share 36 Cycle :
Some people are aiming at 2016 The patent law of , Instead of using (a / 2) + (b / 2) + (a & b & 1) Methods , Why not just put (a & 1) & ( b & 1 ) ) Put it into the adder as carry to calculate ?
Someone else recommended in the comment area TopSpeed compiler , You can define an inline function by specifying appropriate code bytes and calling conventions , To solve the problem “ The result of multiplication and division is 16 position , The intermediate calculated value is not ” The situation of .
It can only be said , There is no end to learning .
original text :
https://devblogs.microsoft.com/oldnewthing/20220207-00/?p=106223
Reference link :
https://news.ycombinator.com/item?id=30252263
边栏推荐
- I was beaten by the interviewer because I didn't understand the sorting
- SAP ui5 application development tutorial 106 - how to improve the readability of SAP ui5 application routing URL trial version
- Grabbing and sorting out external articles -- status bar [4]
- 视频网站手绘
- 107. Some details of SAP ui5 overflow toolbar container control and resize event processing
- Roads and routes -- dfs+topsort+dijkstra+ mapping
- “薪资倒挂”、“毕业生平替” 这些现象说明测试行业已经...
- BGP comprehensive experiment
- 全栈开发提效神器——ApiFox(Postman + Swagger + Mock + JMeter)
- [wave modeling 3] three dimensional random real wave modeling and wave generator modeling matlab simulation
猜你喜欢
Remote control service
微信小程序:独立后台带分销功能月老办事处交友盲盒
Arbitrum: two-dimensional cost
Pandora IOT development board learning (RT thread) - Experiment 4 buzzer + motor experiment [key external interrupt] (learning notes)
There is a new Post-00 exam king in the testing department. I really can't do it in my old age. I have
Single step debugging of master data reading of SAP commerce cloud products
整理混乱的头文件,我用include what you use
dotnet-exec 0.6.0 released
I was beaten by the interviewer because I didn't understand the sorting
Arbitrum:二维费用
随机推荐
Query for Boolean field as "not true" (e.g. either false or non-existent)
Robley's global and Chinese markets 2022-2028: technology, participants, trends, market size and share Research Report
Discrete mathematics: propositional symbolization of predicate logic
如果消费互联网比喻成「湖泊」的话,产业互联网则是广阔的「海洋」
Pycharm professional download and installation tutorial
Introduction to the gtid mode of MySQL master-slave replication
Hedhat firewall
Global and Chinese markets of radiation linear accelerators 2022-2028: Research Report on technology, participants, trends, market size and share
[Chongqing Guangdong education] National Open University spring 2019 1042 international economic law reference questions
【FPGA教程案例9】基于vivado核的时钟管理器设计与实现
[development of large e-commerce projects] performance pressure test - Optimization - impact of middleware on performance -40
Pandora IOT development board learning (RT thread) - Experiment 4 buzzer + motor experiment [key external interrupt] (learning notes)
SAP UI5 应用开发教程之一百零六 - 如何提高 SAP UI5 应用路由 url 的可读性试读版
Expose testing outsourcing companies. You may have heard such a voice about outsourcing
实战模拟│JWT 登录认证
19. Delete the penultimate node of the linked list
Armv8-a programming guide MMU (3)
Package What is the function of JSON file? What do the inside ^ angle brackets and ~ tilde mean?
【海浪建模1】海浪建模的理论分析和matlab仿真
Delaying wages to force people to leave, and the layoffs of small Internet companies are a little too much!