Floating-Point Number Representation
This article shows how floating-point numbers are represented in binary formats.
Format
Real numbers are represented in computing as an array of bits, generally 32 or 64 bits long.
The first bit is the sign, followed by the exponent bits; the remaining bits hold the significand (mantissa). Floating-point arithmetic is specified by the IEEE 754 standard.
Single-precision values (the float data type in the C language) are stored as 4 bytes, or 32 bits: 1 sign bit, an 8-bit exponent, and a 23-bit significand.
Double-precision values (the double data type in the C language) are stored as 8 bytes, or 64 bits: 1 sign bit, an 11-bit exponent, and a 52-bit significand.
Other formats are 2-byte (half-precision), 10-byte (double-extended precision), and 16-byte (quadruple-precision). The single-precision format is described in detail in the Wikipedia article on single-precision floating-point format.
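As a quick illustration, the bit layouts above can be inspected directly. The following is a minimal C sketch (assuming the platform uses IEEE 754, as virtually all modern hardware does) that copies the raw bytes of a float and a double into integers and prints their bit patterns:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float  f = 1.625f;
    double d = 1.625;
    uint32_t fbits;
    uint64_t dbits;

    /* Copy the raw bytes into integers to view the stored bit patterns. */
    memcpy(&fbits, &f, sizeof fbits);
    memcpy(&dbits, &d, sizeof dbits);

    printf("sizeof(float)  = %zu bytes\n", sizeof(float));   /* 4 */
    printf("sizeof(double) = %zu bytes\n", sizeof(double));  /* 8 */

    /* float:  sign | 8-bit exponent | 23-bit significand  */
    printf("float  bits = 0x%08X\n", fbits);
    /* double: sign | 11-bit exponent | 52-bit significand */
    printf("double bits = 0x%016llX\n", (unsigned long long)dbits);
    return 0;
}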
Exponent
The exponent is a power of two and is stored in biased form: to get the real exponent, subtract 127 (for single precision) or 1023 (for double precision) from the stored exponent. Note: an exponent of all zeroes or all ones is reserved for special numbers.
For example, suppose the single-precision exponent is stored as 123. The real exponent is 123 - 127 = -4, so the exponent part encodes the value 2^-4 = 1/2^4 = 0.0625.
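A small C sketch of this calculation follows (again assuming IEEE 754 single precision). The value 0.0625 is chosen because it equals 2^-4, so its stored exponent is exactly 123:

#include <stdio.h>
#include <stdint.h>
#include <string.h>

int main(void)
{
    float f = 0.0625f;                 /* 2^-4, stored exponent should be 123 */
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    int stored = (int)((bits >> 23) & 0xFF);   /* 8 exponent bits */
    int real   = stored - 127;                 /* remove the single-precision bias */

    printf("stored exponent = %d\n", stored);  /* 123 */
    printf("real exponent   = %d\n", real);    /* -4  */
    return 0;
}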
Mantissa
The mantissa is stored as a binary fraction of the form 1.NNN..., but the leading 1 is not stored; it is implied. For example, the number 1.625 is stored as 1.625×2^0. Only the fractional part 0.625 is stored in binary, as 1010000 00000000 00000000 = 1/2^1 + 0/2^2 + 1/2^3 + 0/2^4 + ... + 0/2^23 = 1/2 + 1/8 = 0.625.
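The implied leading 1 can be demonstrated by taking the stored fields apart and rebuilding the value. This is a hypothetical C sketch, again assuming IEEE 754 single precision:

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <math.h>

int main(void)
{
    float f = 1.625f;
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);

    uint32_t sign     = bits >> 31;
    int      exponent = (int)((bits >> 23) & 0xFF) - 127;
    uint32_t mantissa = bits & 0x7FFFFF;       /* the 23 stored fraction bits */

    /* Rebuild the value as (-1)^sign * (1 + mantissa/2^23) * 2^exponent.
       The leading 1 is added back because it is implied, not stored. */
    double value = ldexp((sign ? -1.0 : 1.0) * (1.0 + (double)mantissa / (double)(1u << 23)),
                         exponent);

    printf("mantissa bits = 0x%06X\n", mantissa);  /* 0x500000 = binary 101 followed by 20 zeroes */
    printf("rebuilt value = %f\n", value);         /* 1.625 */
    return 0;
}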
This representation is inaccurate by design for some numbers, and the error accumulates when summing many values. Algorithms have been developed to reduce this kind of numerical error.
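One well-known technique of this kind is Kahan (compensated) summation. The following is a minimal C sketch, not a production implementation; it sums ten copies of 0.1, a case where naive summation drifts slightly away from 1.0:

#include <stdio.h>
#include <stddef.h>

/* Kahan (compensated) summation: keeps a running correction term so that
   low-order bits lost in each addition are fed back into the next one. */
double kahan_sum(const double *values, size_t n)
{
    double sum = 0.0;
    double c   = 0.0;                  /* compensation for lost low-order bits */
    for (size_t i = 0; i < n; i++) {
        double y = values[i] - c;      /* apply the correction to the next addend */
        double t = sum + y;            /* low-order bits of y may be lost here */
        c = (t - sum) - y;             /* recover what was lost */
        sum = t;
    }
    return sum;
}

int main(void)
{
    double v[10];
    for (int i = 0; i < 10; i++) v[i] = 0.1;

    double naive = 0.0;
    for (int i = 0; i < 10; i++) naive += v[i];

    printf("naive sum = %.17f\n", naive);            /* slightly below 1.0 */
    printf("kahan sum = %.17f\n", kahan_sum(v, 10));
    return 0;
}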
JavaScript represents all numbers in the 64-bit double-precision format.