当前位置:网站首页>After reading the average code written by Microsoft God, I realized that I was still too young
After reading the average code written by Microsoft God, I realized that I was still too young
2022-07-05 01:15:00 【QbitAl】
Bowen From the Aofei temple
qubits | official account QbitAI
Rounding The average of unsigned integers , Flowers can come out ?
see , Microsoft God Raymond Chen A recent long article directly detonated the extranet technology platform , Sparked countless discussions :

Countless people click in with great confidence : Isn't it a simple programming problem for primary school students who divide by two after addition ?
unsigned average(unsigned a, unsigned b)
{
return (a + b) / 2;
}But follow the great God and dig deep , But gradually stare at the dog ……
It's not that simple to average
Let's start with the method that primary school students will know at the beginning , This simple method has a fatal flaw :
If the length of an unsigned integer is 32 position , So if the two added values are half the maximum length , Then only when adding in the first step , It will happen out of memory .
That is to say average(0x80000000U, 0x80000000U)=0.
But there are many solutions , Most experienced developers First Can think of , Is to limit the length of the number added in advance , Avoid spillovers .
There are two ways :
1、 When you know the larger value of two unsigned integers added , Subtract the smaller value and divide by two , In advance Reduce the length :
unsigned average(unsigned low, unsigned high)
{
return low + (high - low) / 2;
}2、 Divide two unsigned integers in advance , At the same time through Bitwise AND Correct the lower digits , Ensure that when both integers are odd , The results are still correct .
( By the way , This is a patented method ,2016 Year out )
unsigned average(unsigned a, unsigned b)
{
return (a / 2) + (b / 2) + (a & b & 1);
}These two are common ideas , Many netizens also said , The fastest thing I can think of is 2016 Patented method .
There are also ways that can be quickly thought of by the majority of netizens SWAR(SIMD within a register):
unsigned average(unsigned a, unsigned b)
{
return (a & b) + (a ^ b) / 2;// variant (a ^ b) + (a & b) * 2as well as C++ 20 In version std: : midpoint function .
Next , The author puts forward The second way of thinking :
If the unsigned integer is 32 Bit and the size of the local register is 64 position , Or the compiler supports multi word operation , You can force the addition value into long integer data .
unsigned average(unsigned a, unsigned b)
{
// Suppose "unsigned" is a 32-bit type and
// "unsigned long long" is a 64-bit type.
return ((unsigned long long)a + b) / 2;
}however , Here's a point that needs special attention :
It must be ensured that 64 The first bit of the register 32 All for 0, It won't affect the rest 32 A value .
Like x86-64 and aarch64 These architectures will automatically 32 A value Zero expansion by 64 A value :
// x86-64: Assume ecx = a, edx = b, upper 32 bits unknown
mov eax, ecx ; rax = ecx zero-extended to 64-bit value
mov edx, edx ; rdx = edx zero-extended to 64-bit value
add rax, rdx ; 64-bit addition: rax = rax + rdx
shr rax, 1 ; 64-bit shift: rax = rax >> 1
; result is zero-extended
; Answer in eax
// AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown
uxtw x0, w0 ; x0 = w0 zero-extended to 64-bit value
uxtw x1, w1 ; x1 = w1 zero-extended to 64-bit value
add x0, x1 ; 64-bit addition: x0 = x0 + x1
ubfx x0, x0, 1, 32 ; Extract bits 1 through 32 from result
; (shift + zero-extend in one instruction)
; Answer in x0and Alpha AXP、mips64 And other architectures will 32 A value Symbol extension by 64 A value .
This time , You need to add an additional zeroing instruction , For example, through the deletion instruction of carry left two words rldicl:
// Alpha AXP: Assume a0 = a, a1 = b, both in canonical form
insll a0, #0, a0 ; a0 = a0 zero-extended to 64-bit value
insll a1, #0, a1 ; a1 = a1 zero-extended to 64-bit value
addq a0, a1, v0 ; 64-bit addition: v0 = a0 + a1
srl v0, #1, v0 ; 64-bit shift: v0 = v0 >> 1
addl zero, v0, v0 ; Force canonical form
; Answer in v0
// MIPS64: Assume a0 = a, a1 = b, sign-extended
dext a0, a0, 0, 32 ; Zero-extend a0 to 64-bit value
dext a1, a1, 0, 32 ; Zero-extend a1 to 64-bit value
daddu v0, a0, a1 ; 64-bit addition: v0 = a0 + a1
dsrl v0, v0, #1 ; 64-bit shift: v0 = v0 >> 1
sll v0, #0, v0 ; Sign-extend result
; Answer in v0
// Power64: Assume r3 = a, r4 = b, zero-extended
add r3, r3, r4 ; 64-bit addition: r3 = r3 + r4
rldicl r3, r3, 63, 32 ; Extract bits 63 through 32 from result
; (shift + zero-extend in one instruction)
; result in r3Or directly access larger than the native register SIMD register , Of course , From general register to SIMD Registers must also increase memory consumption .
If the computer's processor supports carry addition , Then you can also use The third way of thinking .
At this time , If the register size is n position , Well, the two one. n The sum of unsigned integers of bits can be understood as n+1 position , adopt RCR( Shift right with carry cycle ) Instructions , You can get the correct average , Without losing the overflow bit .

△ Shift right with carry cycle
// x86-32
mov eax, a
add eax, b ; Add, overflow goes into carry bit
rcr eax, 1 ; Rotate right one place through carry
// x86-64
mov rax, a
add rax, b ; Add, overflow goes into carry bit
rcr rax, 1 ; Rotate right one place through carry
// 32-bit ARM (A32)
mov r0, a
adds r0, b ; Add, overflow goes into carry bit
rrx r0 ; Rotate right one place through carry
// SH-3
clrt ; Clear T flag
mov a, r0
addc b, r0 ; r0 = r0 + b + T, overflow goes into T bit
rotcr r0 ; Rotate right one place through carryWhat if the processor does not support the right shift operation with carry cycle ?
Internal circulation can also be used (rotation intrinsic):
unsigned average(unsigned a, unsigned b)
{
#if defined(_MSC_VER)
unsigned sum;
auto carry = _addcarry_u32(0, a, b, &sum);
sum = (sum & ~1) | carry;
return _rotr(sum, 1);
#elif defined(__clang__)
unsigned carry;
sum = (sum & ~1) | carry;
auto sum = __builtin_addc(a, b, 0, &carry);
return __builtin_rotateright32(sum, 1);
#else
#error Unsupported compiler.
#endif
}The result is ,x86 The code generation under the architecture has not changed much ,MSCver Code generation under architecture gets worse , and arm-thumb2 Of clang Better code generation .
// _MSC_VER
mov ecx, a
add ecx, b ; Add, overflow goes into carry bit
setc al ; al = 1 if carry set
and ecx, -2 ; Clear bottom bit
movzx ecx, al ; Zero-extend byte to 32-bit value
or eax, ecx ; Combine
ror ear, 1 ; Rotate right one position
; Result in eax
// __clang__
mov ecx, a
add ecx, b ; Add, overflow goes into carry bit
setc al ; al = 1 if carry set
shld eax, ecx, 31 ; Shift left 64-bit value
// __clang__ with ARM-Thumb2
movs r2, #0 ; Prepare to receive carry
adds r0, r0, r1 ; Calculate sum with flags
adcs r2, r2 ; r2 holds carry
lsrs r0, r0, #1 ; Shift sum right one position
lsls r1, r2, #31 ; Move carry to bit 31
adds r0, r1, r0 ; CombineThe thinking of Microsoft God
Raymond Chen1992 Joined Microsoft in 1996 , So far, he has served in 25 year , do UEX-Shell, Participate in Windows Development ,Windows Many aspects of the system are initially UI He started the architecture .

He was in MSDN Built on blogThe Old New Thing It is also a well-known pure technology output website in the industry .
The comment area of this blog is also haunted by various gods of Microsoft , Continue to explore .
Someone has proposed a new method , stay MIPS ASM share 36 Cycle :
Some people are aiming at 2016 The patent law of , Instead of using (a / 2) + (b / 2) + (a & b & 1) Methods , Why not just put (a & 1) & ( b & 1 ) ) Put it into the adder as carry to calculate ?
Someone else recommended in the comment area TopSpeed compiler , You can define an inline function by specifying appropriate code bytes and calling conventions , To solve the problem “ The result of multiplication and division is 16 position , The intermediate calculated value is not ” The situation of .
It can only be said , There is no end to learning .

original text :
https://devblogs.microsoft.com/oldnewthing/20220207-00/?p=106223
Reference link :
https://news.ycombinator.com/item?id=30252263
边栏推荐
- Database postragesql client authentication
- Compare whether two lists are equal
- Basic operation of database and table ----- the concept of index
- [microprocessor] VHDL development of microprocessor based on FPGA
- Global and Chinese market of portable CNC cutting machines 2022-2028: Research Report on technology, participants, trends, market size and share
- 每日刷题记录 (十三)
- Arbitrum: two-dimensional cost
- To sort out messy header files, I use include what you use
- Global and Chinese markets of radiation linear accelerators 2022-2028: Research Report on technology, participants, trends, market size and share
- What is the current situation and Prospect of the software testing industry in 2022?
猜你喜欢

Basic operation of database and table ----- the concept of index
![Pandora IOT development board learning (RT thread) - Experiment 4 buzzer + motor experiment [key external interrupt] (learning notes)](/img/ad/5fbf8c3f71ea204bcd48473c3811f6.png)
Pandora IOT development board learning (RT thread) - Experiment 4 buzzer + motor experiment [key external interrupt] (learning notes)

整理混乱的头文件,我用include what you use

Database performance optimization tool

Nebula Importer 数据导入实践

Inventory of more than 17 typical security incidents in January 2022

Arbitrum: two-dimensional cost

程序员SQL数据脚本编码能力弱,BI做不出来怎么办?

Talking about JVM 4: class loading mechanism

SAP UI5 应用开发教程之一百零六 - 如何提高 SAP UI5 应用路由 url 的可读性试读版
随机推荐
【微处理器】基于FPGA的微处理器VHDL开发
How to use words to describe breaking change in Spartacus UI of SAP e-commerce cloud
Hand drawn video website
A simple SSO unified login design
多模输入事件分发机制详解
潘多拉 IOT 开发板学习(RT-Thread)—— 实验4 蜂鸣器+马达实验【按键外部中断】(学习笔记)
SAP UI5 应用开发教程之一百零七 - SAP UI5 OverflowToolbar 容器控件介绍的试读版
Which financial products with stable income are good
Introduction to the gtid mode of MySQL master-slave replication
Basic operation of database and table ----- phased test II
Robley's global and Chinese markets 2022-2028: technology, participants, trends, market size and share Research Report
Expansion operator: the family is so separated
[FPGA tutorial case 10] design and implementation of complex multiplier based on Verilog
To sort out messy header files, I use include what you use
Introduction to redis (1)
Are you still writing the TS type code
There is a new Post-00 exam king in the testing department. I really can't do it in my old age. I have
POAP:NFT的采用入口?
【FPGA教程案例9】基于vivado核的时钟管理器设计与实现
Mongodb series learning notes tutorial summary