Floating point number exploration

2022-07-25 09:21:00 【halazi100】

Floating-point numbers are used in computers to approximate arbitrary real numbers. Specifically, a real number is represented as an integer or fixed-point number (the mantissa) multiplied by an integer power of some base (usually 2 in computers).

How to convert decimal to binary

Integer part

Method 1: repeatedly divide the integer part by 2 and write the remainders in reverse order

59/2    **** ***1
29/2     *** ***1
14/2      ** ***0
 7/2       * ***1

 3/2         ***1
 1/2          **1
 0/2           *0
 0/2            0

The binary representation of 59 is 0011 1011.

Method 2: binary decomposition
Write the decimal number as a sum of integer powers of 2, convert each power to binary, and then combine the results.

54 = 2^5 + 2^4 + 2^2 + 2^1
   = 0010 0000 + 0001 0000 + 0000 0100 + 0000 0010
   = 0011 0110

Fractional part

Multiply by 2 and take the integer part

For example, converting 0.25 to binary:

0.25*2=0.5  0
0.5*2 =1.0  1

So 0.25 in binary is 0.01.

For example, converting 0.4 to binary:

0.4*2 =0.8  0
0.8*2 =1.6  1
0.6*2 =1.2  1
0.2*2 =0.4  0
...

So 0.4 in binary is 0.0110 0110 ..., a pattern that repeats forever. In other words, a binary representation of a decimal fraction cannot always be exact.

The representation of floating point numbers

According to the international standard IEEE 754, any binary floating-point number V can be expressed in the following form:
(-1)^S * M * 2^E
where

(-1)^S is the sign: when S=0, V is positive; when S=1, V is negative.
M is the significand (mantissa), in the range [1,2).
2^E is the exponent part, with base 2.

For example, decimal 5.0 is 101.0 in binary, which in this form is (-1)^0 * 1.01 * 2^2,
so S=0, M=1.01, E=2.

As another example, decimal -5.5 is -101.1 in binary, which in this form is (-1)^1 * 1.011 * 2^2,
so S=1, M=1.011, E=2.

The representation of floating-point numbers in memory

The IEEE 754 standard specifies:
For a 32-bit floating-point number (float), the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M.
For a 64-bit floating-point number (double), the highest bit is the sign bit S, the next 11 bits are the exponent E, and the remaining 52 bits are the significand M.

┌─────────────┬──────────────────┬────────────────────┬─────────────────────┐
│    type     │   S(sign bit)    │  E(Exponent area)  │  M(Mantissa area)   │
├─────────────┼──────────────────┼────────────────────┼─────────────────────┤
│    float    │   1 bit(31bit)   │  8 bits(23-30bit)  │  23 bits(0-22bit)   │
├─────────────┼──────────────────┼────────────────────┼─────────────────────┤
│    double   │   1 bit(63bit)   │ 11 bits(52-62bit)  │  52 bits(0-51bit)   │
└─────────────┴──────────────────┴────────────────────┴─────────────────────┘

float and double data are represented the same way inside the computer, but because they occupy different amounts of storage, the range and precision of the values they can represent differ.

Sign bit S

The sign bit has only two cases, 0 and 1, which mean positive and negative respectively.

Significand M

Because M lies in the range [1,2), its integer part is always 1. IEEE 754 therefore specifies that when M is stored, the leading 1 is taken for granted and dropped, and only the fraction after the binary point is saved.
For example, when storing 1.01, only the fraction bits 01 are saved and the integer part is discarded; when the value is read back, the leading 1 is restored.
The purpose of this is to gain 1 bit of precision.
A 32-bit floating-point number leaves only 23 bits for M, but with the leading 1 dropped, 24 significant bits can be represented.

The number of bits in the significand M determines the precision of the data

float: 2^23 = 8388608, which has 7 decimal digits, so at most 7 significant decimal digits can be held; the precision of float is 6-7 significant digits (6 are guaranteed).
double: 2^52 = 4503599627370496, which has 16 decimal digits, so at most 16 significant decimal digits can be held; the precision of double is 15-16 significant digits.

Exponent part E

The exponent field E is stored as an unsigned integer

If E has 8 bits (float), it can take values in the range 0-255;
if E has 11 bits (double), it can take values in the range 0-2047.
The true exponent can obviously be negative, but an unsigned E cannot be.
Therefore IEEE 754 specifies that the value stored in memory is the true exponent plus a bias (127 for an 8-bit E, 1023 for an 11-bit E).

For example, if a float has E=3, then 127 is added when it is stored, giving 130, which is written in binary as 1000 0010.

When E is neither all 0s nor all 1s
For the floating-point number 5.0:

S=0 is stored directly;
M=1.01: drop the integer part 1 and store the fraction bits 01, padding the remaining bits with 0;
E=2: add 127 to get 129 and store it in binary (1000 0001);

So the final binary representation of 5.0 is 0-100 0000 1-010 0000 0000 0000 0000 0000
In hexadecimal this is 40 A0 00 00

For the floating-point number -5.5:

S=1 is stored directly;
M=1.011: drop the integer part 1 and store the fraction bits 011, padding the remaining bits with 0;
E=2: add 127 to get 129 and store it in binary (1000 0001);

So the final binary representation of -5.5 is 1-100 0000 1-011 0000 0000 0000 0000 0000
In hexadecimal this is C0 B0 00 00

┌─────────────┬──────────────────┬────────────────────┬─────────────────────┐
│    value    │   S(sign bit)    │  E(Exponent area)  │  M(Mantissa area)   │
├─────────────┼──────────────────┼────────────────────┼─────────────────────┤
│     5.0     │        0         │   100 0000 1       │  010 0000 ...       │
├─────────────┼──────────────────┼────────────────────┼─────────────────────┤
│    -5.5     │        1         │   100 0000 1       │  011 0000 ...       │
└─────────────┴──────────────────┴────────────────────┴─────────────────────┘

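#include <stdio.h>
#include <string.h>
/* Copy the float's bytes into an integer; memcpy avoids the undefined
   behavior of reading a float through an unsigned int pointer. */
static unsigned int float_bits(float f)
{
    unsigned int bits; /* assumes unsigned int and float are both 32 bits */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
int main(void)
{
    float f1 = 5.0f;
    float f2 = -5.5f;
    printf("%f, 0x%x\n", f1, float_bits(f1)); // 5.000000, 0x40a00000
    printf("%f, 0x%x\n", f2, float_bits(f2)); // -5.500000, 0xc0b00000
    return 0;
}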

When E is all 0s
Take single-precision floating-point numbers as an example.
When the stored exponent field is all 0s, the significand M no longer has an implicit leading 1; it becomes a fraction of the form 0.xxx. IEEE 754 fixes the true exponent in this case at 1-127 = -126 (not 0-127 = -127), so these values continue smoothly below the smallest normalized number, 2^(-126).
This is how 0 and very small numbers close to 0 are represented.
The same applies to double precision, where the true exponent is 1-1023 = -1022.

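#include <stdio.h>
#include <string.h>
void show_binary(const float f) {
    unsigned int num; // assumes unsigned int and float are both 32 bits
    memcpy(&num, &f, sizeof num); // copy the bits instead of casting the pointer
    printf("%.6f, 0x%X: ", f, num);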

    const size_t max_size = 8 * sizeof(float);
    int i = (int)max_size;
    while (0 <= --i) {
        printf("%c", ((num >> i) & 0x1) + '0');
        if (0 == (i%4)) {
            printf(" ");
        }
    }
    printf("\n");
}
int main()
{
    float f21 = 5.0f;
    float f22= -5.5f;
    float f31 = 0.0f;
    float f32 = 0.000001f;
    show_binary(f21); //  5.000000, 0x40A00000: 0100 0000 1010 0000 0000 0000 0000 0000 
    show_binary(f22); // -5.500000, 0xC0B00000: 1100 0000 1011 0000 0000 0000 0000 0000 
    show_binary(f31); //  0.000000, 0x0: 0000 0000 0000 0000 0000 0000 0000 0000 
    show_binary(f32); //  0.000001, 0x358637BD: 0011 0101 1000 0110 0011 0111 1011 1101
    return 0;
}

When E is all 1s
Take single-precision floating-point numbers as an example.
When the stored exponent field is all 1s, subtracting the bias would give a true exponent of 128, i.e. 2^128, which is beyond the largest finite value. IEEE 754 reserves this pattern: if the significand bits are all 0, the value is positive or negative infinity (the sign is determined by S); if any significand bit is nonzero, the value is NaN (not a number).
The same applies to double precision.

The number of bits in the exponent part E determines the range of data that can be represented
Range of int (4 bytes): [-2^31, 2^31-1];
Range of float (4 bytes): approximately [-3.4*10^38, 3.4*10^38], which lies within (-2^128, +2^128);

Why can float represent a much larger range than int, when both occupy 4 bytes of memory?
The secret:

The number of distinct values float can represent is no greater than that of int (both have at most 2^32 bit patterns)
Representable float values are not contiguous; there are gaps between them
float is only an approximate representation and cannot be treated as an exact number
Because the memory representation is more complex, float arithmetic can be slower than int arithmetic

Summary

Floating-point types are represented in memory differently from integer types
The memory representation of floating-point types is more complex
Floating-point types can represent a wider range
Floating-point types are inexact
Floating-point operations can be slower

Copyright notice
This article was written by [halazi100]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/201/202207191245515008.html