当前位置:网站首页>After reading the average code written by Microsoft God, I realized that I was still too young
After reading the average code written by Microsoft God, I realized that I was still too young
2022-07-05 01:15:00 【QbitAl】
Bowen From the Aofei temple
qubits | official account QbitAI
Rounding The average of unsigned integers , Flowers can come out ?
see , Microsoft God Raymond Chen A recent long article directly detonated the extranet technology platform , Sparked countless discussions :
Countless people click in with great confidence : Isn't it a simple programming problem for primary school students who divide by two after addition ?
unsigned average(unsigned a, unsigned b)
{
return (a + b) / 2;
}
But follow the great God and dig deep , But gradually stare at the dog ……
It's not that simple to average
Let's start with the method that primary school students will know at the beginning , This simple method has a fatal flaw :
If the length of an unsigned integer is 32 position , So if the two added values are half the maximum length , Then only when adding in the first step , It will happen out of memory .
That is to say average(0x80000000U, 0x80000000U)=0.
But there are many solutions , Most experienced developers First Can think of , Is to limit the length of the number added in advance , Avoid spillovers .
There are two ways :
1、 When you know the larger value of two unsigned integers added , Subtract the smaller value and divide by two , In advance Reduce the length :
unsigned average(unsigned low, unsigned high)
{
return low + (high - low) / 2;
}
2、 Divide two unsigned integers in advance , At the same time through Bitwise AND Correct the lower digits , Ensure that when both integers are odd , The results are still correct .
( By the way , This is a patented method ,2016 Year out )
unsigned average(unsigned a, unsigned b)
{
return (a / 2) + (b / 2) + (a & b & 1);
}
These two are common ideas , Many netizens also said , The fastest thing I can think of is 2016 Patented method .
There are also ways that can be quickly thought of by the majority of netizens SWAR(SIMD within a register):
unsigned average(unsigned a, unsigned b)
{
return (a & b) + (a ^ b) / 2;// variant (a ^ b) + (a & b) * 2
as well as C++ 20 In version std: : midpoint function .
Next , The author puts forward The second way of thinking :
If the unsigned integer is 32 Bit and the size of the local register is 64 position , Or the compiler supports multi word operation , You can force the addition value into long integer data .
unsigned average(unsigned a, unsigned b)
{
// Suppose "unsigned" is a 32-bit type and
// "unsigned long long" is a 64-bit type.
return ((unsigned long long)a + b) / 2;
}
however , Here's a point that needs special attention :
It must be ensured that 64 The first bit of the register 32 All for 0, It won't affect the rest 32 A value .
Like x86-64 and aarch64 These architectures will automatically 32 A value Zero expansion by 64 A value :
// x86-64: Assume ecx = a, edx = b, upper 32 bits unknown
mov eax, ecx ; rax = ecx zero-extended to 64-bit value
mov edx, edx ; rdx = edx zero-extended to 64-bit value
add rax, rdx ; 64-bit addition: rax = rax + rdx
shr rax, 1 ; 64-bit shift: rax = rax >> 1
; result is zero-extended
; Answer in eax
// AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown
uxtw x0, w0 ; x0 = w0 zero-extended to 64-bit value
uxtw x1, w1 ; x1 = w1 zero-extended to 64-bit value
add x0, x1 ; 64-bit addition: x0 = x0 + x1
ubfx x0, x0, 1, 32 ; Extract bits 1 through 32 from result
; (shift + zero-extend in one instruction)
; Answer in x0
and Alpha AXP、mips64 And other architectures will 32 A value Symbol extension by 64 A value .
This time , You need to add an additional zeroing instruction , For example, through the deletion instruction of carry left two words rldicl:
// Alpha AXP: Assume a0 = a, a1 = b, both in canonical form
insll a0, #0, a0 ; a0 = a0 zero-extended to 64-bit value
insll a1, #0, a1 ; a1 = a1 zero-extended to 64-bit value
addq a0, a1, v0 ; 64-bit addition: v0 = a0 + a1
srl v0, #1, v0 ; 64-bit shift: v0 = v0 >> 1
addl zero, v0, v0 ; Force canonical form
; Answer in v0
// MIPS64: Assume a0 = a, a1 = b, sign-extended
dext a0, a0, 0, 32 ; Zero-extend a0 to 64-bit value
dext a1, a1, 0, 32 ; Zero-extend a1 to 64-bit value
daddu v0, a0, a1 ; 64-bit addition: v0 = a0 + a1
dsrl v0, v0, #1 ; 64-bit shift: v0 = v0 >> 1
sll v0, #0, v0 ; Sign-extend result
; Answer in v0
// Power64: Assume r3 = a, r4 = b, zero-extended
add r3, r3, r4 ; 64-bit addition: r3 = r3 + r4
rldicl r3, r3, 63, 32 ; Extract bits 63 through 32 from result
; (shift + zero-extend in one instruction)
; result in r3
Or directly access larger than the native register SIMD register , Of course , From general register to SIMD Registers must also increase memory consumption .
If the computer's processor supports carry addition , Then you can also use The third way of thinking .
At this time , If the register size is n position , Well, the two one. n The sum of unsigned integers of bits can be understood as n+1 position , adopt RCR( Shift right with carry cycle ) Instructions , You can get the correct average , Without losing the overflow bit .
△ Shift right with carry cycle
// x86-32
mov eax, a
add eax, b ; Add, overflow goes into carry bit
rcr eax, 1 ; Rotate right one place through carry
// x86-64
mov rax, a
add rax, b ; Add, overflow goes into carry bit
rcr rax, 1 ; Rotate right one place through carry
// 32-bit ARM (A32)
mov r0, a
adds r0, b ; Add, overflow goes into carry bit
rrx r0 ; Rotate right one place through carry
// SH-3
clrt ; Clear T flag
mov a, r0
addc b, r0 ; r0 = r0 + b + T, overflow goes into T bit
rotcr r0 ; Rotate right one place through carry
What if the processor does not support the right shift operation with carry cycle ?
Internal circulation can also be used (rotation intrinsic):
unsigned average(unsigned a, unsigned b)
{
#if defined(_MSC_VER)
unsigned sum;
auto carry = _addcarry_u32(0, a, b, &sum);
sum = (sum & ~1) | carry;
return _rotr(sum, 1);
#elif defined(__clang__)
unsigned carry;
sum = (sum & ~1) | carry;
auto sum = __builtin_addc(a, b, 0, &carry);
return __builtin_rotateright32(sum, 1);
#else
#error Unsupported compiler.
#endif
}
The result is ,x86 The code generation under the architecture has not changed much ,MSCver Code generation under architecture gets worse , and arm-thumb2 Of clang Better code generation .
// _MSC_VER
mov ecx, a
add ecx, b ; Add, overflow goes into carry bit
setc al ; al = 1 if carry set
and ecx, -2 ; Clear bottom bit
movzx ecx, al ; Zero-extend byte to 32-bit value
or eax, ecx ; Combine
ror ear, 1 ; Rotate right one position
; Result in eax
// __clang__
mov ecx, a
add ecx, b ; Add, overflow goes into carry bit
setc al ; al = 1 if carry set
shld eax, ecx, 31 ; Shift left 64-bit value
// __clang__ with ARM-Thumb2
movs r2, #0 ; Prepare to receive carry
adds r0, r0, r1 ; Calculate sum with flags
adcs r2, r2 ; r2 holds carry
lsrs r0, r0, #1 ; Shift sum right one position
lsls r1, r2, #31 ; Move carry to bit 31
adds r0, r1, r0 ; Combine
The thinking of Microsoft God
Raymond Chen1992 Joined Microsoft in 1996 , So far, he has served in 25 year , do UEX-Shell, Participate in Windows Development ,Windows Many aspects of the system are initially UI He started the architecture .
He was in MSDN Built on blogThe Old New Thing It is also a well-known pure technology output website in the industry .
The comment area of this blog is also haunted by various gods of Microsoft , Continue to explore .
Someone has proposed a new method , stay MIPS ASM share 36 Cycle :
Some people are aiming at 2016 The patent law of , Instead of using (a / 2) + (b / 2) + (a & b & 1) Methods , Why not just put (a & 1) & ( b & 1 ) ) Put it into the adder as carry to calculate ?
Someone else recommended in the comment area TopSpeed compiler , You can define an inline function by specifying appropriate code bytes and calling conventions , To solve the problem “ The result of multiplication and division is 16 position , The intermediate calculated value is not ” The situation of .
It can only be said , There is no end to learning .
original text :
https://devblogs.microsoft.com/oldnewthing/20220207-00/?p=106223
Reference link :
https://news.ycombinator.com/item?id=30252263
边栏推荐
- Maximum number of "balloons"
- There is a new Post-00 exam king in the testing department. I really can't do it in my old age. I have
- Global and Chinese markets for industrial X-ray testing equipment 2022-2028: Research Report on technology, participants, trends, market size and share
- 微信小程序;胡言乱语生成器
- Yyds dry goods inventory kubernetes management business configuration methods? (08)
- Package What is the function of JSON file? What do the inside ^ angle brackets and ~ tilde mean?
- Database postragesq role membership
- Getting started with Paxos
- 整理混乱的头文件,我用include what you use
- Basic operations of database and table ----- create index
猜你喜欢
Yyds dry goods inventory kubernetes management business configuration methods? (08)
Postman automatically fills headers
26.2 billion! These universities in Guangdong Province have received heavy support
微信小程序:独立后台带分销功能月老办事处交友盲盒
How to use words to describe breaking change in Spartacus UI of SAP e-commerce cloud
【海浪建模1】海浪建模的理论分析和matlab仿真
微信小程序:微群人脉微信小程序源码下载全新社群系统优化版支持代理会员系统功能超高收益
【大型电商项目开发】性能压测-性能监控-堆内存与垃圾回收-39
[flutter topic] 64 illustration basic textfield text input box (I) # yyds dry goods inventory #
dotnet-exec 0.6.0 released
随机推荐
The performance of major mainstream programming languages is PK, and the results are unexpected
Digital DP template
Senior Test / development programmers write no bugs? Qualifications (shackles) don't be afraid of mistakes
Arbitrum: two-dimensional cost
[development of large e-commerce projects] performance pressure test - Optimization - impact of middleware on performance -40
Single step debugging of master data reading of SAP commerce cloud products
Global and Chinese market of veterinary thermometers 2022-2028: Research Report on technology, participants, trends, market size and share
There is a new Post-00 exam king in the testing department. I really can't do it in my old age. I have
Research Report on the overall scale, major producers, major regions, products and application segmentation of agricultural automatic steering system in the global market in 2022
Poap: the adoption entrance of NFT?
Jcenter () cannot find Alibaba cloud proxy address
Game 280 of leetcode week
Database postragesql lock management
Classification of performance tests (learning summary)
Intel sapphire rapids SP Zhiqiang es processor cache memory split exposure
Les phénomènes de « salaire inversé » et de « remplacement des diplômés » indiquent que l'industrie des tests a...
Talking about JVM 4: class loading mechanism
【纯音听力测试】基于MATLAB的纯音听力测试系统
Playwright之录制
node工程中package.json文件作用是什么?里面的^尖括号和~波浪号是什么意思?