After reading the averaging code written by a Microsoft guru, I realized I was still too naive

2022-07-05 01:15:00 QbitAl

By Bowen, reporting from Aofeisi
QbitAI | official account: QbitAI

Who knew that computing the rounded-down average of two unsigned integers could get this elaborate?

A recent long article by Microsoft guru Raymond Chen set overseas tech forums alight and sparked countless discussions.

Countless people clicked in full of confidence: isn't this just a grade-school programming exercise, add the two numbers and divide by two?

unsigned average(unsigned a, unsigned b)
{
    return (a + b) / 2;
}

But follow the guru as he digs deeper, and you gradually find yourself dumbfounded.

Averaging is not that simple

Start with the method every beginner knows. This simple approach has a fatal flaw:

If unsigned integers are 32 bits wide and both inputs are at least half of the maximum range (2^31), the addition in the very first step overflows.

In other words, average(0x80000000U, 0x80000000U) = 0.
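
The wraparound is easy to reproduce. The following is a minimal sketch, assuming a platform where int is 32 bits so that std::uint32_t arithmetic is not promoted to a wider type; the name naive_average is made up for illustration.

```cpp
#include <cstdint>

// Naive average on a fixed 32-bit type: the sum wraps modulo 2^32.
constexpr std::uint32_t naive_average(std::uint32_t a, std::uint32_t b)
{
    return (a + b) / 2;
}

// 0x80000000 + 0x80000000 = 2^32, which wraps to 0 before the division.
static_assert(naive_average(0x80000000u, 0x80000000u) == 0u, "sum wrapped around");

// Small inputs are still averaged (and rounded down) correctly.
static_assert(naive_average(3u, 4u) == 3u, "rounds down");
```

The function is only wrong when the true sum needs a 33rd bit, which is why the bug tends to survive casual testing.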

There are plenty of fixes, though. The first one most experienced developers think of is to shrink the operands before adding them, so that the addition cannot overflow.

There are two ways to do this:

1. If you know which of the two unsigned integers is larger, subtract the smaller from it and halve the difference, which reduces the magnitude up front:

unsigned average(unsigned low, unsigned high)
{
    return low + (high - low) / 2;
}
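
When the caller does not know which operand is larger, the same idea still works if the operands are ordered first. This is a sketch under that assumption; average_ordered is a hypothetical helper name, not something from the original article.

```cpp
#include <cstdint>

// Hypothetical helper: orders the operands, then applies low + (high - low) / 2.
constexpr std::uint32_t average_ordered(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t low  = (a < b) ? a : b;
    const std::uint32_t high = (a < b) ? b : a;
    // high - low cannot underflow, and low + half the gap cannot exceed high.
    return low + (high - low) / 2;
}

static_assert(average_ordered(0x80000000u, 0x80000000u) == 0x80000000u, "no overflow");
static_assert(average_ordered(7u, 2u) == 4u, "argument order does not matter");
```

The comparison costs an extra branch or conditional move, which is the price of not knowing the order in advance.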

2. Halve both unsigned integers first, and use a bitwise AND to correct the low bit, so that the result is still right when both integers are odd.

(Incidentally, this method was patented; the patent expired in 2016.)

unsigned average(unsigned a, unsigned b)
{
    return (a / 2) + (b / 2) + (a & b & 1);
}

These two are the common approaches. Many readers said the first thing that came to mind was the patented method, the one whose patent expired in 2016.

Another method that many readers came up with quickly is SWAR (SIMD within a register):
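
The low-bit correction can be checked exhaustively on a small type. Below is a sketch that instantiates the formula for 8-bit values and compares it against a reference computed in a wider type; the function names are made up for this illustration.

```cpp
#include <cstdint>

// The halve-then-correct formula, instantiated for 8-bit values.
constexpr std::uint8_t average_halves(std::uint8_t a, std::uint8_t b)
{
    // (a & b & 1) is 1 only when both inputs are odd, i.e. when the two
    // discarded half-units add up to a whole one.
    return static_cast<std::uint8_t>((a / 2) + (b / 2) + (a & b & 1));
}

// Compares the formula with (a + b) / 2 computed without overflow,
// for every pair of 8-bit inputs.
bool halves_formula_matches_reference()
{
    for (unsigned a = 0; a < 256; ++a) {
        for (unsigned b = 0; b < 256; ++b) {
            const unsigned expected = (a + b) / 2;
            const unsigned actual = average_halves(static_cast<std::uint8_t>(a),
                                                   static_cast<std::uint8_t>(b));
            if (actual != expected) {
                return false;
            }
        }
    }
    return true;
}
```

Calling halves_formula_matches_reference() covers all 65,536 input pairs, which is a cheap way to gain confidence before trusting the same formula at 32 bits.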

unsigned average(unsigned a, unsigned b)
{
    return (a & b) + (a ^ b) / 2; // works because a + b == (a ^ b) + (a & b) * 2
}

as well as the std::midpoint function in C++20.
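
Why the SWAR expression works can be written out in one line. The derivation below is a sketch using the usual identity for binary addition, where XOR gives the carry-less sum and AND marks the positions that generate a carry:

```latex
a + b = (a \oplus b) + 2\,(a \mathbin{\&} b)
\quad\Longrightarrow\quad
\left\lfloor \frac{a + b}{2} \right\rfloor
  = (a \mathbin{\&} b) + \left\lfloor \frac{a \oplus b}{2} \right\rfloor
```

The term 2(a & b) is even, so it divides by two exactly and only the XOR term is rounded down. For example, with a = 6 (110) and b = 3 (011): a & b = 2 and a ^ b = 5, so the result is 2 + 2 = 4, which matches the rounded-down value of 9 / 2. Neither term on the right can exceed the larger input, so no intermediate value overflows.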

Next, the author puts forward a second line of thinking.

If unsigned integers are 32 bits wide and the native register size is 64 bits, or if the compiler supports multiword arithmetic, you can cast the operands to a longer integer type before adding them.

unsigned average(unsigned a, unsigned b)
{
    // Suppose "unsigned" is a 32-bit type and
    // "unsigned long long" is a 64-bit type.
    return ((unsigned long long)a + b) / 2;
}
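
The listing above depends on the widths of unsigned and unsigned long long. A variant using fixed-width types removes that assumption; this is a minimal sketch, and average_wide is a name chosen here for illustration.

```cpp
#include <cstdint>

constexpr std::uint32_t average_wide(std::uint32_t a, std::uint32_t b)
{
    // Widen one operand first so the addition is carried out in 64 bits.
    const std::uint64_t sum = static_cast<std::uint64_t>(a) + b;
    // The halved sum always fits back into 32 bits.
    return static_cast<std::uint32_t>(sum / 2);
}

static_assert(average_wide(0x80000000u, 0x80000000u) == 0x80000000u, "no wraparound");
```

Casting only after the addition, as in static_cast<std::uint64_t>(a + b), would not help: the sum would already have wrapped in 32 bits.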

However, one point needs special attention:

The upper 32 bits of each 64-bit register must all be zero, so that they do not affect the 32-bit values in the lower half.

Architectures such as x86-64 and AArch64 automatically zero-extend 32-bit values to 64-bit values:

// x86-64: Assume ecx = a, edx = b, upper 32 bits unknown
    mov     eax, ecx        ; rax = ecx zero-extended to 64-bit value
    mov     edx, edx        ; rdx = edx zero-extended to 64-bit value
    add     rax, rdx        ; 64-bit addition: rax = rax + rdx
    shr     rax, 1          ; 64-bit shift:    rax = rax >> 1
                            ;                  result is zero-extended
                            ; Answer in eax

// AArch64 (ARM 64-bit): Assume w0 = a, w1 = b, upper 32 bits unknown
    uxtw    x0, w0          ; x0 = w0 zero-extended to 64-bit value
    uxtw    x1, w1          ; x1 = w1 zero-extended to 64-bit value
    add     x0, x1          ; 64-bit addition: x0 = x0 + x1
    ubfx    x0, x0, 1, 32   ; Extract bits 1 through 32 from result
                            ; (shift + zero-extend in one instruction)
                            ; Answer in x0

Architectures such as Alpha AXP and MIPS64, on the other hand, sign-extend 32-bit values to 64-bit values.

In that case an extra instruction is needed to clear the upper bits, for example rldicl (rotate left doubleword immediate then clear left) on Power64:

// Alpha AXP: Assume a0 = a, a1 = b, both in canonical form
    insll   a0, #0, a0      ; a0 = a0 zero-extended to 64-bit value
    insll   a1, #0, a1      ; a1 = a1 zero-extended to 64-bit value
    addq    a0, a1, v0      ; 64-bit addition: v0 = a0 + a1
    srl     v0, #1, v0      ; 64-bit shift:    v0 = v0 >> 1
    addl    zero, v0, v0    ; Force canonical form
                            ; Answer in v0

// MIPS64: Assume a0 = a, a1 = b, sign-extended
    dext    a0, a0, 0, 32   ; Zero-extend a0 to 64-bit value
    dext    a1, a1, 0, 32   ; Zero-extend a1 to 64-bit value
    daddu   v0, a0, a1      ; 64-bit addition: v0 = a0 + a1
    dsrl    v0, v0, #1      ; 64-bit shift:    v0 = v0 >> 1
    sll     v0, #0, v0      ; Sign-extend result
                            ; Answer in v0

// Power64: Assume r3 = a, r4 = b, zero-extended
    add     r3, r3, r4      ; 64-bit addition: r3 = r3 + r4
    rldicl  r3, r3, 63, 32  ; Extract bits 63 through 32 from result
                            ; (shift + zero-extend in one instruction)
                            ; result in r3

Alternatively, you can use SIMD registers that are wider than the native registers. Of course, moving values from general-purpose registers into SIMD registers and back adds overhead of its own.

If the processor supports addition with carry, a third approach is available.

Here, if the register size is n bits, the sum of two n-bit unsigned integers can be treated as an (n+1)-bit value. The RCR (rotate right through carry) instruction then produces the correct average without losing the overflow bit.

[Figure: rotate right through carry]

// x86-32
    mov     eax, a
    add     eax, b          ; Add, overflow goes into carry bit
    rcr     eax, 1          ; Rotate right one place through carry

// x86-64
    mov     rax, a
    add     rax, b          ; Add, overflow goes into carry bit
    rcr     rax, 1          ; Rotate right one place through carry

// 32-bit ARM (A32)
    mov     r0, a
    adds    r0, b           ; Add, overflow goes into carry bit
    rrx     r0              ; Rotate right one place through carry

// SH-3
    clrt                    ; Clear T flag
    mov     a, r0
    addc    b, r0           ; r0 = r0 + b + T, overflow goes into T bit
    rotcr   r0              ; Rotate right one place through carry
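
The same effect can be approximated in portable C++ without access to the carry flag. The sketch below recovers the carry by checking whether the wrapped sum is smaller than an operand, then places it in the top bit, which is where rotate-right-through-carry would put it. It assumes 32-bit values, and average_carry is a hypothetical name.

```cpp
#include <cstdint>

constexpr std::uint32_t average_carry(std::uint32_t a, std::uint32_t b)
{
    const std::uint32_t sum = a + b;                  // wraps modulo 2^32
    const std::uint32_t carry = (sum < a) ? 1u : 0u;  // 1 exactly when the add overflowed
    return (sum >> 1) | (carry << 31);                // carry becomes the new top bit
}

static_assert(average_carry(0x80000000u, 0x80000000u) == 0x80000000u, "carry restored");
static_assert(average_carry(3u, 4u) == 3u, "no carry, plain shift");
```

Unlike RCR, this version discards the bit shifted out at the bottom instead of moving it into the carry flag, which makes no difference for a rounded-down average.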

What if you have no access to a rotate-right-through-carry instruction?

You can fall back on a rotation intrinsic:

#if defined(_MSC_VER)
#include <intrin.h>                 // _addcarry_u32, _rotr
#endif

unsigned average(unsigned a, unsigned b)
{
#if defined(_MSC_VER)
    unsigned sum;
    auto carry = _addcarry_u32(0, a, b, &sum);
    sum = (sum & ~1) | carry;       // Stash the carry in the low bit
    return _rotr(sum, 1);           // Rotate it up into the top bit
#elif defined(__clang__)
    unsigned carry;
    auto sum = __builtin_addc(a, b, 0, &carry);
    sum = (sum & ~1) | carry;       // Stash the carry in the low bit
    return __builtin_rotateright32(sum, 1);
#else
#error Unsupported compiler.
#endif
}

The result: code generation on x86 does not change much, MSVC's code generation gets worse, and clang's code generation for arm-thumb2 gets better.

// _MSC_VER
    mov     ecx, a
    add     ecx, b          ; Add, overflow goes into carry bit
    setc    al              ; al = 1 if carry set
    and     ecx, -2         ; Clear bottom bit
    movzx   eax, al         ; Zero-extend byte to 32-bit value
    or      eax, ecx        ; Combine
    ror     eax, 1          ; Rotate right one position
                            ; Result in eax

// __clang__
    mov     ecx, a
    add     ecx, b          ; Add, overflow goes into carry bit
    setc    al              ; al = 1 if carry set
    shld    eax, ecx, 31    ; Shift left 64-bit value

// __clang__ with ARM-Thumb2
    movs    r2, #0          ; Prepare to receive carry
    adds    r0, r0, r1      ; Calculate sum with flags
    adcs    r2, r2          ; r2 holds carry
    lsrs    r0, r0, #1      ; Shift sum right one position
    lsls    r1, r2, #31     ; Move carry to bit 31
    adds    r0, r1, r0      ; Combine

The Microsoft guru's thinking

Raymond Chen joined Microsoft in 1992 and has so far served there for 25 years. He works on UEX-Shell and takes part in Windows development; much of the initial UI architecture of Windows was started by him.

His blog on MSDN, The Old New Thing, is also a well-known, purely technical site in the industry.

The comment section of this blog post is frequented by Microsoft experts as well, who keep the exploration going.

Someone proposed a new method that takes 36 cycles in MIPS assembly.

Others, looking at the method whose patent expired in 2016, asked: instead of computing (a / 2) + (b / 2) + (a & b & 1), why not feed (a & 1) & (b & 1) into the adder as the carry-in?

Someone else in the comments recommended the TopSpeed compiler, which lets you define an inline function by specifying the appropriate code bytes and calling convention, to handle the case where "the result of a multiply-then-divide fits in 16 bits, but the intermediate value does not".

All one can say is that there is no end to learning.

Original article:
https://devblogs.microsoft.com/oldnewthing/20220207-00/?p=106223

Reference link :
https://news.ycombinator.com/item?id=30252263
