Floating point notation (summarized from cs61c and CMU CSAPP)

2022-06-24 09:59:00 【Just a little advice】

This post is summarized from the floating point lectures of Berkeley's CS61C (Fall 2021, with Nick as the lecturer) and CMU's CS:APP course (2015). By comparison, the CMU lecture is the more detailed one; I went through CS61C first and then the CMU lecture. On the CS61C course home page, click "floating point" and then "slides" to download the PDF (it is very detailed and worth reading for yourself; I only picked a few slides to post). Course home: https://inst.eecs.berkeley.edu/~cs61c/fa21/
P&H (Patterson and Hennessy) is the corresponding textbook chapter; the Chinese edition is "Computer Organization and Design: The Hardware/Software Interface, RISC-V Edition".
There is also an interactive floating-point converter website you can play with: https://www.h-schmidt.net/FloatConverter/IEEE754.html
The floating-point representation was proposed at Berkeley. Because the binary point can float, precision is reduced but the representable range is enlarged ("Can represent a very large range with roughly the same 'precision'"). The design also aims to stay consistent with the two's complement representation of integers for zero (32 zero bits still mean 0), and to "Make it possible to sort without needing to do floating-point comparisons".
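float is single precision (32 bits, 4 bytes) and double is double precision (8 bytes). Master float first. The basic normalized form alone cannot express +∞, -∞, Not-A-Number (NaN), exponent overflow, exponent underflow, or +/- zero; each of these special cases gets a dedicated encoding. The ordinary values are divided into the normalized format and the denormalized (denorm) format. A rough estimate of the largest power of two, using $2^{10}\approx 10^3$: $2^{127}=2^{120}\cdot 2^{7}=128\cdot(2^{10})^{12}\approx 128\cdot 10^{36}\approx 10^{38}$. The exact value of $2^{127}$ is about $1.7\times 10^{38}$, and the largest finite float, $(2-2^{-23})\cdot 2^{127}$, is about $3.4\times 10^{38}$.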
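As a minimal sketch of the range estimate above, the following Python snippet reinterprets a 32-bit pattern as a single-precision value using the standard `struct` module. The helper name `bits_to_float` is made up for this illustration and does not come from either course.

```python
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Largest finite float: sign 0, exponent field 1111_1110, fraction all ones.
largest = bits_to_float(0x7F7FFFFF)

print(f"{largest:.3e}")      # largest finite single-precision value
print(f"{2.0 ** 127:.3e}")   # the power of two estimated in the text
print(f"{128 * 1e36:.3e}")   # the estimate obtained with 2^10 ~ 10^3
```

The three printed values all have the same order of magnitude, which is all the back-of-the-envelope estimate is meant to show.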
The bias is computed as bias = 2 ^ (k - 1) - 1, where k is the number of exponent bits; an 8-bit exponent gives bias = 127 (positive and negative exponents each take about half of the range: "bias gives us a balanced value"). The fractional part goes by several names in the two courses: fraction, F, mantissa, M, significand.
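The bias formula can be checked with a few lines of Python. This is only a sketch; the function names `bias` and `normalized_exponent_range` are assumptions introduced here, and the second function relies on the fact that exponent fields of all zeros and all ones are reserved for the special cases.

```python
def bias(k: int) -> int:
    """Bias for a k-bit exponent field: 2^(k-1) - 1."""
    return 2 ** (k - 1) - 1

def normalized_exponent_range(k: int) -> tuple:
    """Smallest and largest true exponent E of normalized numbers.

    Exponent fields of all zeros and all ones are reserved, so the usable
    fields run from 1 to 2^k - 2.
    """
    b = bias(k)
    return 1 - b, (2 ** k - 2) - b

# k = 4 is a toy format, k = 8 is float, k = 11 is double.
for k in (4, 8, 11):
    print(k, bias(k), normalized_exponent_range(k))
```

For k = 8 the true exponent runs from -126 to 127, which is the "balanced" split between negative and positive exponents mentioned above.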
Master the normalized format first, i.e. 1.xxx: the leading 1 is implicit, and only the xxx bits are stored in the fraction field.
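To make the 1.xxx form concrete, here is a small sketch that splits a single-precision value into its sign, exponent, and fraction fields and rebuilds the value from them. The function name `decode_normalized` is hypothetical, and the code assumes the usual float layout of 1 sign bit, 8 exponent bits, and 23 fraction bits.

```python
import struct

def decode_normalized(x: float):
    """Split x (rounded to single precision) into fields and rebuild it."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31
    exp_field = (bits >> 23) & 0xFF
    frac_field = bits & 0x7FFFFF
    if exp_field in (0, 255):
        raise ValueError("not a normalized number")
    # Implicit leading 1, then the stored fraction bits; bias is 127.
    value = (-1) ** sign * (1 + frac_field / 2 ** 23) * 2.0 ** (exp_field - 127)
    return sign, exp_field, frac_field, value

# -0.15625 = -1.25 * 2^-3, so the exponent field is -3 + 127 = 124.
print(decode_normalized(-0.15625))
print(decode_normalized(1.0))
```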
The denorm format is used to represent numbers near 0, of the form 0.xxx. The exponent field is 0000_0000; in this case the fraction does not get the implicit leading 1, and the exponent is E = 1 - Bias. CMU explains this part quite well: the choice gives a smooth transition between the two representations. You can work it through with an 8-bit format that has k = 4 exponent bits (so the bias is 7) and 3 fraction bits. The teacher commented that "those IEEE folks are really smart".
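The smooth transition can be seen by decoding the 8-bit toy format directly. The sketch below assumes the layout described above (1 sign bit, 4 exponent bits with bias 7, 3 fraction bits); the function name `tiny_float` is an illustration, not code from the course.

```python
def tiny_float(bits: int) -> float:
    """Decode an 8-bit toy float: 1 sign, 4 exponent (bias 7), 3 fraction bits."""
    sign = -1.0 if bits >> 7 else 1.0
    exp_field = (bits >> 3) & 0xF
    frac = bits & 0x7
    bias = 7
    if exp_field == 0xF:                       # all ones: Inf or NaN
        return sign * float("inf") if frac == 0 else float("nan")
    if exp_field == 0:                         # denorm: 0.xxx, E = 1 - bias
        return sign * (frac / 8) * 2.0 ** (1 - bias)
    return sign * (1 + frac / 8) * 2.0 ** (exp_field - bias)  # normalized: 1.xxx

largest_denorm = tiny_float(0b0_0000_111)
smallest_norm = tiny_float(0b0_0001_000)
print(largest_denorm, smallest_norm, smallest_norm - largest_denorm)
print(tiny_float(0b0_0000_001))   # smallest positive denorm
print(tiny_float(0b0_1110_111))   # largest finite value
```

The gap between the largest denorm and the smallest normalized value equals the spacing between neighboring denorms, which is exactly what choosing E = 1 - Bias (rather than 0 - Bias) buys.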
Overflow: when the exponent field is 1111_1111, a zero fraction means Inf and a nonzero fraction means NaN. NaN usually comes from sqrt(-1) or 0/0, and NaN cannot be ordered against other values. +Inf is 1.0/0.0 and -Inf is -1.0/0.0; Inf can be compared with other values.
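A short sketch of these encodings in Python. Note that Python itself raises `ZeroDivisionError` for `1.0/0.0` and `ValueError` for `math.sqrt(-1)` instead of returning Inf or NaN, so the values here are built from their single-precision bit patterns; `bits_to_float` is a helper name invented for this example.

```python
import math
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_inf = bits_to_float(0x7F800000)  # exponent all ones, fraction zero
neg_inf = bits_to_float(0xFF800000)  # same pattern with the sign bit set
nan = bits_to_float(0x7FC00000)      # exponent all ones, fraction nonzero

print(pos_inf, neg_inf, nan)
print(neg_inf < -1e38 < 1e38 < pos_inf)  # infinities are ordered
print(nan == nan, nan < 1.0, nan > 1.0)  # ==, < and > with NaN are all False
print(math.isnan(pos_inf - pos_inf))     # inf - inf produces NaN
```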
Floating-point numbers have both +0 and -0; a result that is too small in magnitude to represent underflows to a zero of the matching sign.
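The two zeros can be inspected the same way. This sketch assumes the single-precision layout for the bit patterns; the underflow part uses Python's native float, which is a double rather than a 32-bit float, so the threshold differs but the behavior is the same. `bits_to_float` is again a made-up helper name.

```python
import math
import struct

def bits_to_float(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

pos_zero = bits_to_float(0x00000000)  # 32 zero bits are still 0
neg_zero = bits_to_float(0x80000000)  # only the sign bit set

print(pos_zero, neg_zero)
print(pos_zero == neg_zero)           # the two zeros compare equal
print(math.copysign(1.0, neg_zero))   # but the sign is still recoverable

tiny = 1e-200
print(tiny * tiny, -tiny * tiny)      # too small to represent: underflows to signed zero
```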
To sum up
How to compare floating-point numbers: compare the sign bits first, then the exponents (as unsigned values; because the exponent is stored with a bias, the fields can be compared directly, which would not work if the exponent were stored in two's complement), and finally the fraction. The exponent matters most for how the numbers are distributed, which you can see on the number line.
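One way to see why this ordering works is to sort floats using only integer comparisons on their bit patterns. The key function below is a commonly used trick and an assumption of this sketch, not something taken from the lectures: for non-negative values the raw bits already sort correctly as unsigned integers, and for negative values all bits are flipped so that a larger magnitude sorts lower. NaN is left out, since it has no place in the ordering.

```python
import struct

def float_bits(x: float) -> int:
    """Bit pattern of x after rounding to single precision."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def sort_key(x: float) -> int:
    """Map a float to an unsigned integer with the same ordering."""
    b = float_bits(x)
    if b & 0x80000000:        # negative: flip every bit to reverse the order
        return b ^ 0xFFFFFFFF
    return b | 0x80000000     # non-negative: move above all negatives

values = [3.5, -2.0, 0.0, 1e-40, -1e30, 7.0,
          float("inf"), float("-inf"), 0.15625]

by_bits = sorted(values, key=sort_key)   # integer comparisons only
by_value = sorted(values)                # ordinary floating-point comparisons
print(by_bits)
print(by_bits == by_value)
```

The list includes a value (1e-40) that is a denorm in single precision, so the same integer ordering covers normalized numbers, denorms, and the infinities.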
The CMU course includes C code you can write yourself to see how NaN and Inf come about, starting from a direct definition such as a = 1e20. The CMU teacher put it well in the course: these 0s and 1s are not really numbers in themselves, it is just that we look at them from different angles (paraphrased).